Summary
- The encoder processes the input with non-causal/full self-attention; the resulting embeddings are fed to the decoder through cross-attention, i.e. query = Q(decoder_emb), key = K(encoder_emb), value = V(encoder_emb)
- From “UL2: Unifying Language Learning Paradigms”:
- Encoder-Decoder models process inputs and targets independently with different sets of parameters. This is a form of sparsity where different sets of parameters are used for different tokens. Encoder-Decoder models also have a cross-attention component that connects input tokens to target tokens. Meanwhile, decoder-only models process inputs and targets by concatenating them; hence, the representations of inputs and targets are built concurrently, layer by layer, as the inputs/targets propagate up the network. Conversely, the decoder in Encoder-Decoder models generally only looks at the fully processed encoder input. The distinct property is that Encoder-Decoder models generally have approximately 2x the parameters of a decoder-only model when compute-matched.
- From “Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder”
Relevant Literature
- Return of the Encoder: Maximizing Parameter Efficiency for SLMs
- Examining Scaling and Transfer of Language Model Architectures for Machine Translation
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
- UL2: Unifying Language Learning Paradigms
Summary
Architecture guidance
- Recent work found that a 2/3 – 1/3 split (encoder gets about two-thirds of total parameters, decoder one-third) often maximizes performance for small models
- deeper, narrower architectures tend to work well at sub-billion scales
- If constrained by parameter budget, one might prefer adding a couple of extra layers (especially to the encoder) over making layers excessively wide, as the former tends to yield better downstream task performance at small scale.
- use relative position encoding
- GQA, GLU, possibly MoE
Training best practices
- for the encoder, span masking seems to be the consensus choice
- masking contiguous spans of text (as opposed to individual tokens)
- random spans of the input are replaced with unique mask tokens and the decoder generates those missing spans in sequence. The loss function in these cases is the standard cross-entropy between the decoder’s output and the original text, averaged over all predicted tokens.
- In practice, choosing a rich pre-training objective that combines multiple masking strategies can lead to a more universally useful model.
- include a mix of bidirectional context training (e.g. masked tokens) for understanding and autoregressive training (e.g. next token prediction) for generation, so the encoder-decoder model learns both comprehension and fluency.
- UL2 (Tay et al., 2023) unified these ideas with a Mixture-of-Denoisers (MoD) objective
- an increasingly popular practice is knowledge distillation, where a larger teacher (often a decoder-only model) guides the training of a smaller encoder-decoder student. The “Return of the Encoder” authors used a combined loss: a KL-divergence between student and teacher token distributions (with the teacher logits softened by a temperature) plus the usual cross-entropy on the ground-truth text. This yielded substantial gains (+6 points on average across tasks).
Deep Research report
Encoder-Decoder Language Models: Architecture, Training, Inference, and Applications
Introduction
Encoder-decoder transformers are a class of sequence-to-sequence models that use a dedicated encoder to process input text and a decoder to generate output. They differ from decoder-only models (like GPT-style LMs) which use a single stack for both input and output. This report provides a comprehensive technical overview of encoder-decoder models for machine learning scientists. We cover architectural variants and parameter-efficient designs, training best practices (from loss functions to data and transfer learning), inference-time optimizations, downstream performance comparisons with decoder-only models, and considerations for small-scale models (<3B parameters). Throughout, we cite core references and integrate recent findings to offer both research insights and practical implementation guidance.
Architecture Guidance for Encoder-Decoder Models
Encoder-Decoder vs Decoder-Only Transformers: Encoder-decoder architectures separate the processing of inputs and outputs into two modules. The encoder reads the full input sequence (often bidirectionally, without causal masking) and produces a sequence of contextualized token representations, while the decoder generates the output sequence autoregressively, attending to the encoder’s representations. This separation offers a key efficiency advantage: the input is processed once, and that encoded representation is reused for all output tokens. In contrast, decoder-only models typically concatenate input and output and attend over the growing sequence for each generated token, leading to repeated reprocessing of the input context. The encoder-decoder design thereby avoids the expanding key-value caches and redundant computations of decoder-only transformers (Return of the Encoder: Maximizing Parameter Efficiency for SLMs). This makes encoder-decoder models particularly attractive when input sequences are long or when running on resource-constrained hardware.
Parameter Allocation and Transformer Variants: A practical question in designing encoder-decoder models is how to allocate model capacity between the encoder and decoder. Recent work found that a 2/3 – 1/3 split (encoder gets about two-thirds of total parameters, decoder one-third) often maximizes performance for small models (Return of the Encoder: Maximizing Parameter Efficiency for SLMs). Intuitively, giving more capacity to the encoder helps create richer input representations, which the decoder can then leverage with fewer parameters. This aligns with findings that deeper, narrower architectures tend to work well at sub-billion scales. For example, Liu et al. (2024b) show that smaller models benefit from increased depth and techniques like Grouped-Query Attention (GQA), which shares key and value projections across groups of query heads to reduce parameters while preserving multi-head diversity. GQA and similar attention variants improve efficiency by lowering the number of learned projections, a boon for parameter-limited models.
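As a back-of-the-envelope illustration of such a split, the sketch below converts a parameter budget into encoder/decoder layer counts. The per-layer costs are rough assumptions (standard d_ff = 4·d_model feed-forward layers, cross-attention adding roughly 4·d_model² per decoder layer, embeddings ignored), not figures from the cited papers:

```python
def allocate_layers(param_budget: int, d_model: int, encoder_share: float = 2 / 3):
    """Rough encoder/decoder layer counts for a ~2:1 parameter split.

    Assumed per-layer costs for a vanilla Transformer (embeddings ignored):
      encoder layer ~ 12 * d_model^2  (self-attention 4*d^2 + FFN 8*d^2)
      decoder layer ~ 16 * d_model^2  (adds a cross-attention block, ~4*d^2)
    """
    enc_layer_cost = 12 * d_model ** 2
    dec_layer_cost = 16 * d_model ** 2
    enc_layers = int(param_budget * encoder_share / enc_layer_cost)
    dec_layers = int(param_budget * (1 - encoder_share) / dec_layer_cost)
    return enc_layers, dec_layers

# e.g. ~300M non-embedding parameters at d_model = 1024
print(allocate_layers(300_000_000, 1024))  # -> (15, 5)
```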
Attention Structure Trade-offs: Encoder-decoder transformers use self-attention within the encoder (usually full bidirectional attention since the whole input can be attended) and within the decoder (usually causal, to maintain autoregressive generation order). Additionally, the decoder uses cross-attention to attend to encoder outputs. This structured separation yields benefits but also introduces an information bottleneck: the decoder only receives information through the encoder’s compressed representations. Historically, some hypothesized that this bottleneck could hurt performance at scale, but evidence suggests otherwise up to very large sizes. Raffel et al. (2020) demonstrated strong results with an 11B-parameter T5 model, and recent analysis up to 1B params found no inherent performance cliff for encoder-decoder models. In fact, the “Return of the Encoder” study observed a consistent 6-7% performance lead for encoder-decoder over decoder-only models as model size scaled from 330M to 1B. The supposed bottleneck may actually encourage more efficient representations, and strategies like adding residual connections between encoder and decoder layers have been proposed to further alleviate it.
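The query/key/value wiring described above can be made concrete with a minimal PyTorch sketch of one decoder block. This is only an illustration of the attention structure; pre/post-norm placement, dropout, relative position biases, and other details of real models such as T5 are omitted:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal decoder block: causal self-attention over the target prefix,
    then cross-attention whose queries come from the decoder states and whose
    keys/values come from the (fixed) encoder output."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, dec_x, enc_out):
        t = dec_x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # mask future positions
        h, _ = self.self_attn(dec_x, dec_x, dec_x, attn_mask=causal)
        dec_x = self.norm1(dec_x + h)
        # Cross-attention: query = decoder states, key/value = encoder output.
        h, _ = self.cross_attn(dec_x, enc_out, enc_out)
        dec_x = self.norm2(dec_x + h)
        return self.norm3(dec_x + self.ffn(dec_x))

enc_out = torch.randn(2, 128, 512)  # encoder output: computed once, reused at every step
dec_in = torch.randn(2, 16, 512)    # embeddings of the target prefix
print(DecoderBlock()(dec_in, enc_out).shape)  # torch.Size([2, 16, 512])
```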
Positional Encodings – Relative vs Absolute: Position encoding is another important architectural choice. Early transformers used fixed or learned absolute positional embeddings, but many encoder-decoder models have moved to relative position encodings for greater flexibility. T5 introduced a relative position bias scheme (with learned bias buckets) as part of its design (Raffel et al., 2020), and others have adopted Rotary Positional Embeddings (RoPE) for continuous relative positioning in both encoder and decoder. These relative schemes generally improve generalization to longer sequences and exhibit better training stability. By contrast, absolute position embeddings are simpler but may limit a model’s extrapolation to sequence lengths not seen in training. Recent improvements in positional encoding (e.g. Su et al., 2024) further refine relative approaches. In practice, using relative or rotary position encodings is recommended for modern encoder-decoder models, especially when long target sequences or transfer to longer texts are anticipated.
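For illustration, here is a compact sketch of rotary position embeddings in the "rotate-half" formulation (one of several equivalent conventions). In a real model this would be applied to the query and key tensors inside each self-attention layer, not to arbitrary activations:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding, 'rotate-half' variant.

    x: (batch, seq_len, n_heads, head_dim) with an even head_dim.
    Each channel pair is rotated by an angle that grows with position, so
    query/key dot products end up depending only on relative offsets.
    """
    b, s, h, d = x.shape
    half = d // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (s, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 8, 64)
print(apply_rope(q).shape)  # torch.Size([2, 16, 8, 64])
```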
Depth vs Width Scaling: How a model scales in layers (depth) versus hidden size (width) affects both performance and efficiency. Classic scaling law studies on decoder-only LMs suggested that within broad ranges, model “shape” (the aspect ratio of layers vs width) had minimal effect on cross-entropy loss. However, those studies focused on very large models and may not translate to smaller regimes. For encoder-decoders under a few billion parameters, evidence favors going deeper rather than wider. A deeper model (more layers) can allocate complexity across sequential transformations, which may benefit tasks requiring multi-step reasoning or representation refinement. In contrast, wider models (very large feed-forward dimensions or attention heads) might yield diminishing returns if the model capacity isn’t fully utilized by data. In practice, many encoder-decoder designs (e.g. T5) keep a balanced depth/width (e.g. 12 or 24 layers in each stack, moderate width) and rely on scaling both together for larger models. If constrained by parameter budget, one might prefer adding a couple of extra layers (especially to the encoder) over making layers excessively wide, as the former tends to yield better downstream task performance at small scale.
Architectural Variants and Modern Enhancements: Beyond the standard Transformer, researchers have explored variants to improve efficiency. Evolved Transformers (So et al., 2019) introduced convolutional sublayers, Universal Transformers (Dehghani et al., 2018) share parameters across time steps (recurrently updating representations), and Mixture-of-Experts (MoE) models (Shazeer et al., 2017; Fedus et al., 2022) use conditional routing to different feed-forward experts. Each comes with trade-offs. For example, MoE drastically increases the number of parameters but uses only a subset for each token, yielding a high parameter-to-FLOPs ratio. Such models can achieve higher quality per compute in theory, but can be harder to deploy due to sharding and communication overhead. Meanwhile, linearized attention mechanisms (Performers, etc.) promise speedups for long sequences by approximating softmax attention – yet studies found some perform poorly when scaling up, struggling to match the quality of full attention. Generally, for encoder-decoder architectures geared toward broad NLP tasks, the vanilla Transformer remains a robust choice, often yielding the best scaling behavior overall. Innovations like Grouped-Query Attention, shared feed-forward layers, or lightweight convolutions can be integrated for efficiency, but should be validated on target tasks. The key is to introduce inductive biases that help the model handle sequence data more effectively without hampering its ability to scale with more data and compute.
Training Best Practices
Pre-training Objectives and Loss Functions: Encoder-decoder models are typically pre-trained on large unlabeled text corpora using self-supervised objectives. The sequence-to-sequence denoising objective is common: the model is fed a corrupted input and trained to reconstruct the original text. For example, BART (Lewis et al., 2019) uses a noising function that shuffles sentences and applies span masking (replacing spans of text with a mask token), then trains the model to recover the original sequence. This approach generalizes the idea of BERT’s masked language modeling (bidirectional context) and GPT’s autoregressive prediction (unidirectional) into one encoder-decoder framework. Empirically, BART found that the combination of random sentence permutation and text infilling (span masking) yielded the best downstream performance. Similarly, T5 (Raffel et al., 2020) explored several pre-training schemes and landed on a span corruption objective (“span denoising”) where random spans of the input are replaced with unique mask tokens and the decoder generates those missing spans in sequence. The loss function in these cases is the standard cross-entropy between the decoder’s output and the original text, averaged over all predicted tokens.
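The span-corruption format is easy to see on a toy example. The sketch below uses hand-picked spans and T5-style <extra_id_N> sentinel strings purely to show how the encoder input and decoder target are laid out; real pipelines sample spans randomly (e.g. ~15% corruption with mean span length 3) and operate on token ids:

```python
def span_corrupt(tokens, spans):
    """Toy T5-style span corruption.

    tokens: list of string tokens.
    spans:  sorted, non-overlapping (start, length) spans to mask.
    Returns (encoder_input, decoder_target) as token lists.
    """
    inp, tgt, i, sid = [], [], 0, 0
    span_starts = {start: length for start, length in spans}
    while i < len(tokens):
        if i in span_starts:
            length = span_starts[i]
            inp.append(f"<extra_id_{sid}>")          # sentinel replaces the span
            tgt.append(f"<extra_id_{sid}>")          # target: sentinel, then hidden tokens
            tgt.extend(tokens[i:i + length])
            i += length
            sid += 1
        else:
            inp.append(tokens[i])
            i += 1
    tgt.append(f"<extra_id_{sid}>")                  # final sentinel ends the target
    return inp, tgt

tokens = "the quick brown fox jumps over the lazy dog".split()
src, tgt = span_corrupt(tokens, spans=[(1, 2), (6, 1)])
print(" ".join(src))  # the <extra_id_0> fox jumps over <extra_id_1> lazy dog
print(" ".join(tgt))  # <extra_id_0> quick brown <extra_id_1> the <extra_id_2>
```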
Aside from denoising, other objectives have been successfully used in encoder-decoders: prefix language modeling (a hybrid where the model sees the beginning of text and predicts the rest, useful for generative tasks) and sequenced language modeling (SLM) where encoder and decoder have different mask scopes. UL2 (Tay et al., 2023) unified these ideas with a Mixture-of-Denoisers (MoD) objective, which samples from a variety of tasks (regular masked spans, prefix LM, and others) during pre-training. This mixture helps the model handle diverse paradigms – from closed-book question answering (which benefits from seeing full input context) to story generation (which benefits from strong prefix-based generation ability). In practice, choosing a rich pre-training objective that combines multiple masking strategies can lead to a more universally useful model. The general guideline is to include a mix of bidirectional context training (e.g. masked tokens) for understanding and autoregressive training (e.g. next token prediction) for generation, so the encoder-decoder model learns both comprehension and fluency.
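Conceptually, a mixture-of-denoisers is just a per-example sample over denoiser configurations. The sketch below is loosely modeled on UL2's R/S/X taxonomy; the specific rates, span lengths, and mixing weights are illustrative placeholders, not the values from the paper:

```python
import random

# Illustrative denoiser configurations (not UL2's exact settings):
#   [R] regular span corruption, [S] sequential/prefix-LM denoising,
#   [X] extreme corruption (long spans and/or high corruption rate).
DENOISERS = [
    ("[R]", {"corruption_rate": 0.15, "mean_span": 3}),
    ("[X]", {"corruption_rate": 0.50, "mean_span": 32}),
    ("[S]", {"prefix_fraction": 0.75}),
]
WEIGHTS = [0.5, 0.25, 0.25]  # placeholder mixing weights

def sample_denoiser(rng: random.Random):
    """Pick a denoiser for one training example; the returned tag would be
    prepended to the input as a paradigm token, as UL2 does."""
    return rng.choices(DENOISERS, weights=WEIGHTS, k=1)[0]

rng = random.Random(0)
for _ in range(4):
    tag, cfg = sample_denoiser(rng)
    print(tag, cfg)
```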
Masking Strategies: The way text is masked or partitioned during pre-training greatly influences what the model learns. Encoder-decoder models allow flexible masking strategies since the encoder can ingest a partially corrupted input and the decoder generates the rest. Common strategies include:
- Fully Visible Input (No Causal Mask on Encoder): Typically, the encoder has no causal mask, meaning it can attend to all tokens in a corrupted input sequence (as in standard denoising). This leverages bidirectional context for understanding. Research has found that giving the encoder full visibility is crucial for certain tasks like translation – using a causal left-to-right mask on the source (as if it were language modeling the input) significantly hurt translation quality. In other words, for tasks where the input is provided entirely upfront (most seq2seq tasks), do not causally mask the encoder, as doing so discards useful context and acts like an unnecessary handicap.
- Causal Masking on Decoder: The decoder, when generating output, is usually a left-to-right language model conditioned on the encoder representation. This means at training time, the decoder uses causal masking (each position attends only to earlier output positions, plus all encoder positions via cross-attention). This setup naturally trains the model for autoregressive text generation, which is needed for any free-form text output.
- Span Masking / Infill Masking: Both BART and T5 have shown that masking contiguous spans of text (as opposed to individual tokens) is an effective pre-training noise. The model then treats each masked span as a gap to fill. T5’s span corruption randomly selects spans and replaces each with a unique sentinel token, whereas BART’s infilling replaces each span with a single <mask> token. Models trained with span masking can learn to generate cohesive chunks of text, improving downstream generation tasks (like summarization or question answering where whole phrases must be produced). As evidence, BART achieved up to a 6 ROUGE point improvement on abstractive summarization benchmarks using such a denoising strategy.
- Denoising Variations (Swap, Delete, etc.): BART also experimented with other noising functions (token deletion, rotation, etc.) though found the combination of sentence shuffle + span infill most effective. The general best practice is to apply diverse but task-relevant corruption – enough to force the model to learn syntax and semantics, but not so much that the task becomes too unnatural. For example, shuffling sentences in a paragraph teaches discourse rearrangement (useful for summarization tasks), and random span deletion teaches the model to connect context across gaps.
Pre-training Data: The choice of pre-training data is a major factor in model performance. Encoder-decoder models have been pre-trained on everything from general web crawls to specific mixtures of corpora. T5 introduced the C4 (Colossal Clean Crawled Corpus), a large cleaned web scrape, to pre-train a family of models from small (60M) to XXL (11B) and found that scale plus clean data yielded significant improvements (state-of-the-art on GLUE, SuperGLUE, SQuAD, summarization, etc.). BART was pre-trained on a combination of news articles and books, similar to the data used for RoBERTa, which helped it match RoBERTa’s performance on GLUE and SQuAD despite being a generative model. The key is to have a large, varied corpus that captures the diversity of language.
For encoder-decoder models, it can also be beneficial to include multiple languages or modalities if cross-lingual or multi-modal use is anticipated. For instance, mT5 extended T5’s approach to a multilingual corpus and achieved strong cross-lingual understanding. Recent models like FLAN-T5 go further by mixing in fine-tuning tasks (instruction-style data) during pre-training to steer the model towards following prompts. In all cases, ensuring a high-quality dataset (clean, deduplicated, representative of target domains) is critical – smaller models especially benefit from data quality as they cannot memorize as much noisy or rare information. Indeed, some efficient small-model efforts (e.g. “SmolLM”) attribute their gains to training on high-quality data subsets, underscoring that data quality can sometimes trump sheer data quantity for moderate model sizes.
Transfer Learning and Fine-tuning: Encoder-decoder models are typically fine-tuned on downstream tasks after pre-training. A best practice emerging from T5 and subsequent work is to cast each task into a text-to-text format – where inputs (and possibly a task-specific prefix) are given to the encoder and outputs are produced by the decoder. This unified approach allows a single pre-trained model to handle diverse tasks with minimal modifications (just different fine-tuning data). Raffel et al. dubbed this the “unified text-to-text” framework, and it has proven very powerful. During fine-tuning, one uses a standard supervised loss (cross-entropy on the target text) often starting from the pre-trained weights for initialization. Techniques like multi-task fine-tuning (finetuning on a mixture of tasks) or sequential transfer (pre-finetuning on an intermediate task, then on target task) can further improve generalization.
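As a minimal sketch of that fine-tuning step with the Hugging Face transformers API (the checkpoint name and the "summarize:" task prefix are just examples, and a real setup would add batching, padding, and a Trainer or optimizer loop):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")                 # example checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

src = "summarize: The encoder reads the full article once; the decoder then writes the summary."
tgt = "The encoder reads once; the decoder writes the summary."

batch = tok(src, return_tensors="pt")
labels = tok(tgt, return_tensors="pt").input_ids
labels[labels == tok.pad_token_id] = -100   # padding positions are ignored by the loss

# Passing `labels` runs the decoder with teacher forcing and returns the
# token-level cross-entropy loss described above.
out = model(**batch, labels=labels)
out.loss.backward()
print(float(out.loss))
```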
A notable case of transfer learning is instruction tuning. Models like T5 (Flan-T5, T0) were fine-tuned on hundreds of task instructions and their solutions, producing models that can follow natural language instructions to perform tasks zero-shot. This has proven that encoder-decoder models can be very effective general-purpose learners when given broad training. In fact, an 11B-parameter T5-based model (T0) demonstrated competitive zero-shot performance on unseen tasks, approaching GPT-3’s 175B performance in some benchmarks, thanks to transfer learning via multi-task instruction data. More recently, UL2 20B (an encoder-decoder model) was shown to outperform a 175B decoder-only GPT-3 on zero-shot SuperGLUE and achieve three times the one-shot summarization performance of T5-XXL. These results highlight the benefit of combining powerful pre-training with extensive transfer learning: the model leverages its pre-trained knowledge and adapts to the format of downstream tasks extremely effectively.
Optimization and Regularization: When training encoder-decoder models, standard Transformer training best practices apply. Learning rate schedules (linear warmup then decay), Adam-family optimizers, and regularization such as dropout are all standard. One must also manage the two-part architecture in terms of masking: ensure the encoder does not attend to padding tokens and the decoder masks out future tokens appropriately. Some advanced techniques include label smoothing (to prevent overconfidence in the softmax outputs) and balanced fine-tuning if dealing with multiple objectives. Another increasingly popular practice is knowledge distillation, where a larger teacher (often a decoder-only model) guides the training of a smaller encoder-decoder student. The “Return of the Encoder” paper introduced a knowledge distillation approach specifically to transfer knowledge from a large decoder-only LM to a small encoder-decoder model. They used a combined loss: a KL-divergence between student and teacher token distributions (with the teacher logits softened by a temperature) plus the usual cross-entropy on the ground-truth text. This yielded substantial gains (+6 points on average across tasks) for the student model while retaining the student’s inherent efficiency. When implementing such distillation, it’s important to carefully align the teacher and student inputs/outputs (e.g., pad sequences appropriately so that the teacher predicts for the same positions the student is responsible for). Overall, techniques that encourage the model to learn from richer signals (multiple tasks, teachers, etc.) tend to produce more capable encoder-decoders without changing the fundamental architecture.
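A sketch of such a combined distillation objective, written as a generic soft-target KD loss in PyTorch with the usual temperature-squared scaling (the paper's exact weighting and masking details may differ):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Generic soft-target KD loss: temperature-softened KL between student
    and teacher token distributions, mixed with cross-entropy on the
    ground-truth tokens (labels use -100 for positions to ignore)."""
    t = temperature
    vocab = student_logits.size(-1)
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab),
        F.log_softmax(teacher_logits / t, dim=-1).reshape(-1, vocab),
        reduction="batchmean",
        log_target=True,
    ) * (t * t)  # conventional temperature^2 scaling
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1),
                         ignore_index=-100)
    # Note: padded positions are not masked out of the KD term in this sketch;
    # a production implementation would mask them as well.
    return alpha * kd + (1 - alpha) * ce

# Toy shapes: batch 2, 5 decoder positions, vocab 100.
student = torch.randn(2, 5, 100, requires_grad=True)
teacher = torch.randn(2, 5, 100)
labels = torch.randint(0, 100, (2, 5))
print(float(distillation_loss(student, teacher, labels)))
```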
Inference Optimization Techniques
A well-designed encoder-decoder model can also be optimized for fast and efficient inference. Unlike training, where throughput and scalability are paramount, inference often demands low latency (especially for real-time applications) and efficiency at smaller batch sizes or single examples.
One-Time Encoding and Reuse: The primary architecture-level advantage of encoder-decoders is that the input is encoded once and reused. This means that for tasks like summarization or QA, where the input (e.g. a document or question) can be significantly longer than the output, the model saves a lot of computation compared to a decoder-only model. A decoder-only model would attend to the entire input prefix at every generation step, so the attention cost of each output token grows with the total sequence length. Encoder-decoder models avoid this by fixing the encoder’s output as a static memory that the decoder attends to. This leads to dramatically lower inference time for long inputs. For example, processing a 1024-token input, an encoder-decoder might use less than half the FLOPs of a decoder-only model. The gap widens with longer sequences – at 4096 tokens, the encoder-decoder required ~3.2× less compute in one study. Decoder-only inference time thus climbs steeply with longer inputs, whereas encoder-decoder inference grows much more slowly (since the decoder’s self-attention cost is independent of input length).
First-Token Latency and Throughput: Empirical measurements confirm that encoder-decoder models have lower latency for initial outputs and can achieve higher token throughput. Elfeki et al. (2025) report a 47% lower first-token latency and 4.7× higher throughput for an encoder-decoder model compared to a decoder-only model of the same size on edge hardware. The latency gain comes from avoiding the need to process a long prompt in the decoder stack before emitting the first token – the encoder has already processed it in parallel. The throughput gain comes from the fact that the decoder’s per-token work is reduced (it attends only to the encoder outputs and the previously generated tokens). These benefits held consistently across GPU, CPU, and even neural processing unit (NPU) deployments. In other words, for real-time applications like interactive assistants or on-device summarization, an encoder-decoder can be markedly more responsive.
KV Cache and Memory Considerations: Decoder-only models rely on caching key/value (KV) pairs for each transformer layer to avoid re-computation of attention on past tokens. While effective, this means memory usage grows with sequence length and each new token still involves attention over all past tokens’ keys. Encoder-decoder models only need a KV cache for the decoder’s self-attention (which is typically short, since it only covers generated tokens, not the long input). The encoded input acts like a fixed memory that doesn’t grow over the generation process (Return of the Encoder: Maximizing Parameter Efficiency for SLMs). This not only saves compute, but also stabilizes memory requirements – critical for deployment on devices with limited RAM. The fixed-memory property also enables optimizations like pre-computing and storing encodings for frequently used inputs (e.g., a long document that you may generate multiple queries from).
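A sketch of that reuse pattern with Hugging Face transformers, which lets generate() accept precomputed encoder_outputs for seq2seq models (the checkpoint and prompt are examples only, and API details can vary across library versions):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")                       # example checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

doc = "summarize: " + "a long report about edge deployment of language models. " * 40
enc_inputs = tok(doc, return_tensors="pt", truncation=True)

# Encode the long input exactly once and keep the result around.
with torch.no_grad():
    encoder_outputs = model.get_encoder()(**enc_inputs)

# Reuse the cached encoding for several decodes (e.g. multiple sampled
# summaries of the same document) without re-running the encoder.
for _ in range(3):
    out = model.generate(
        encoder_outputs=encoder_outputs,
        attention_mask=enc_inputs.attention_mask,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=32,
    )
    print(tok.decode(out[0], skip_special_tokens=True))
```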
Attention Pruning and Sparsity: To further speed up inference, one can prune parts of the model that contribute less. Research on attention head pruning has shown that not all heads are equally important – some can be removed after training with minor loss in quality, resulting in faster attention computations. Similarly, one can limit decoder cross-attention to only the top-k most relevant encoder tokens (perhaps identified via an upstream retrieval or by attention scores). This creates a sparse cross-attention pattern, reducing complexity from O(n·m) (with n output and m input tokens) to O(n·k) with k ≪ m. Such techniques need to be used carefully to avoid degrading model outputs, but for domains like very long documents, a retrieve-and-read strategy (encoder reads top passages instead of the whole text) effectively prunes the attention domain and improves speed.
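The sketch below shows the top-k cross-attention pattern in isolation. Note that it still computes the full score matrix before masking, so by itself it only changes the attention pattern; the real savings come from selecting the k candidates upstream (e.g. via retrieval) so the large matmul is never formed:

```python
import torch
import torch.nn.functional as F

def topk_cross_attention(q, k, v, k_keep: int = 16):
    """Cross-attention where each decoder query attends only to its k_keep
    highest-scoring encoder positions (illustrative; real systems usually
    pick passages with a retriever instead of per-query top-k).

    q: (batch, n_out, d)   k, v: (batch, m_in, d)
    """
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5      # (batch, n_out, m_in)
    topk = scores.topk(min(k_keep, k.size(1)), dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk, 0.0)                               # 0 where kept, -inf elsewhere
    attn = F.softmax(scores + mask, dim=-1)
    return attn @ v

q = torch.randn(2, 8, 64)        # 8 decoder positions
k = v = torch.randn(2, 512, 64)  # 512 encoder positions
print(topk_cross_attention(q, k, v).shape)  # torch.Size([2, 8, 64])
```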
Conditional Computation (Mixture-of-Experts and Beyond): Conditional computation refers to activating only parts of the model for a given inference, rather than the full network every time. In encoder-decoders, MoE layers can be used to route tokens through one of several expert sub-networks, effectively using a larger model’s capacity with the inference cost of a smaller one. This can improve the quality-latency trade-off (though it introduces complexity in implementation). Another form of conditional computation is early exiting or adaptive decoding, where the model might decide to terminate generation early or skip some layers for straightforward tokens. While less common, research into depth adaptive Transformers allows the model to use fewer decoder layers for tokens it is confident on, which can lower inference time for easy parts of the output. These kinds of strategies are an active area of research, but they hold promise especially for long sequences where not every token needs the full model’s attention.
Architecture-Level Choices for Speed: Some architectural tweaks inherently improve inference speed. For instance, using lighter feed-forward networks (smaller intermediate dimension or even dynamic linear units) can reduce computation per token. T5 introduced relative position biases which are added to attention scores without extra per-token computation, as opposed to explicit position embeddings that might require larger embedding lookups – such small choices can simplify the compute graph. Another example is using shared parameters (as in Universal Transformers or ALBERT for BERT models) to reduce memory footprint and perhaps cache certain computations across layers. One extreme case is the Retrieval-Transformer idea: instead of a gigantic monolithic model, use a smaller model plus a retrieved external memory – at inference you only compute with the smaller model and do cheap lookups in memory, which can be faster than carrying around huge internal weights. While not a pure encoder-decoder change, it reflects how different design decisions (external memory, etc.) can shift computation around to make inference more efficient.
In summary, to optimize encoder-decoder models for inference: maximize what can be precomputed or reused (the encoder output), minimize repeated work (via static memory and caches), and consider sparsity or conditional execution to reduce unnecessary computations. Properly leveraged, these models can achieve not only strong accuracy but also production-friendly speed. For many applications requiring understanding of a prompt and then generation, encoder-decoders offer a clear efficiency win.
Downstream Performance: Encoder-Decoder vs Decoder-Only
Encoder-decoder models have demonstrated strong performance across a wide range of NLP tasks, often outperforming decoder-only models of comparable size, especially when fine-tuned. We compare the two paradigms on some general-purpose tasks and highlight where encoder-decoders shine or lag:
- Summarization: This task exemplifies the strengths of encoder-decoders. Models like BART and T5 achieved state-of-the-art results on summarization benchmarks (CNN/DailyMail, XSum) at the time of their introduction. BART, for instance, improved ROUGE scores by up to +6 points over previous approaches, and T5-11B topped leaderboards by leveraging its massive pre-trained encoder-decoder architecture. The encoder’s ability to fully digest the input article yields a rich representation for the decoder to distill into a summary. Decoder-only models (like GPT-3/4) can also summarize via prompting or fine-tuning, but they require either processing the article as a very long prompt or being trained with specialized attention to handle such long contexts. At equal model sizes, encoder-decoders tend to produce more accurate and concise summaries. One reason is that bidirectional encoding handles long input dependencies better – the encoder can attend to all parts of the document when determining the context for a summary sentence, whereas a decoder-only LM generating a summary might struggle if important information appeared far back in the prompt (without very long context windows). That said, as model scale grows, decoder-only models have also become competitive in summarization through brute-force scale and training. Where encoder-decoders shine is in data efficiency: a moderately sized enc-dec model fine-tuned on summarization often beats a much larger LM that’s prompted to summarize. Where they might lag is in zero-shot settings – a large decoder-only model like GPT-4, with no fine-tuning, can produce decent summaries purely from its broad knowledge, whereas a smaller encoder-decoder usually needs fine-tuning or instruction training to do the same.
- Question Answering (QA): QA can refer to extractive QA (answer is a span from a provided text) or abstractive/open QA (answer is generated, possibly from knowledge). Encoder-decoder models excel at extractive QA and reading comprehension. BART and T5, when fine-tuned on SQuAD and related tasks, matched or exceeded the performance of encoder-only BERT-style models, which was notable because they also carry generative capability. On SQuAD 2.0, for example, Raffel et al. report T5 achieved near state-of-the-art with far fewer parameters than GPT-3 (which wasn’t even fine-tuned on it in the original case). The separate encoder allows the model to thoroughly encode a passage and question, and the decoder can be trained to output the exact answer span (or generate it). In open-domain QA (where the model must recall facts), very large decoder-only models (with enough training data) can retrieve from their internal knowledge. Encoder-decoder models can instead be combined with retrieval systems (e.g., Fusion-in-Decoder architecture or using an encoder to read retrieved documents) to produce answers. In practice, both architectures can perform well, but encoder-decoders often have an edge when the task format is providing an answer given explicit context, while decoder-only might have an edge in implicit knowledge recall. One study on small models found an encoder-decoder (330M) significantly outperformed a decoder-only counterpart on a QA-like reasoning benchmark (SQuAD v2), with a 12-point gain in exact match. This underscores that at smaller scales especially, having a dedicated encoder to fully understand the question and context yields better answers.
- Machine Translation: Translation was traditionally the forte of encoder-decoder Transformers (the original “Transformer” paper by Vaswani et al., 2017 was in fact an encoder-decoder for translation). For bilingual translation, encoder-decoder models remain highly effective and are typically used in state-of-the-art systems. However, interestingly, recent research examined if decoder-only LMs can perform translation when trained appropriately. Zhang et al. (2022) found that decoder-only models with full visibility on the source (no causal mask on input) and trained with a sequence-to-sequence objective can approach parity with enc-dec models on supervised translation. In their experiments, architectural differences had a significant impact at small scales – enc-dec outperformed – but the gap narrowed as models grew larger. Moreover, decoder-only models struggled with off-target translations in zero-shot multi-lingual settings, an area where enc-dec models did better. So for translation tasks, encoder-decoders generally shine, offering robust performance and easier training. Decoder-only approaches can work but require careful design (e.g., giving the source full visibility rather than treating it with a causal LM objective) and usually much larger scale to compensate. On low-resource language pairs, having the explicit encoder is particularly beneficial for capturing source nuances.
- Instruction Following and Conversational AI: Instruction-following tasks (where the model is given a task or question in natural language and must follow it) have been tackled by both decoder-only and encoder-decoder models. Large decoder-only models (GPT-3, GPT-4) fine-tuned with methods like Reinforcement Learning from Human Feedback (RLHF) currently dominate public perception due to their fluent and knowledgeable responses. However, encoder-decoder models such as T5 (e.g., Flan-T5 XXL) and UL2 have also shown excellent instruction-following ability when fine-tuned on broad instruction datasets (like FLAN or NATURAL-INSTRUCTIONS). T5-based models like T0 demonstrated that with only 3-11B parameters (and appropriate training on instructions), they could outperform or match much larger LMs on many tasks zero-shot. UL2 20B, as noted, even surpassed GPT-3 in some benchmarks. Encoder-decoder models tend to require finetuning on instructions to be as generally helpful, whereas very large decoder-only models sometimes rely on in-context learning (prompting with few examples) due to their training on internet-scale data. In scenarios like multi-turn dialogue, decoder-only models have an advantage in that the entire conversation can be threaded as a single generated sequence. Encoder-decoder models can also handle dialogue by treating the conversation history as input (encoder) and the reply as output (decoder), but this requires careful formatting and can be less straightforward for very long dialogues (where the encoder might run out of capacity). Where encoder-decoder might lag in this domain is unbounded generation: if you want a model to keep generating a long continuation without a well-defined input, decoder-only models are naturally suited. Still, for most practical instruction-following (where you have a user query that can be encoded), encoder-decoders perform robustly and often with less computational cost.
- Open-Ended Text Generation (Creative Writing, Story Generation): This is one area where decoder-only LMs traditionally excel because they are explicitly trained to predict long sequences of text and can be primed with a prompt to continue in a flexible way. An encoder-decoder model is usually given a prompt in the encoder and then generates – which works, but if the prompt is very short (e.g., “Once upon a time,”), the encoder doesn’t have much to encode, and the heavy lifting is all on the decoder. In such settings, a decoder-only model of comparable size might generate more coherent or longer continuations because its entire architecture and training objective were focused on that mode of generation. However, at smaller scales, even for creative tasks, encoder-decoders can hold their own. In the 330M model comparison by Elfeki et al., an encoder-decoder significantly outperformed a decoder-only model on a creative writing task when both were fine-tuned (and the gap widened further with cross-architecture knowledge distillation). This suggests that when fine-tuned specifically for a task, encoder-decoders can surpass decoder-only models in quality, even for open-ended generation. The trade-off is that decoder-only models might be better at few-shot improvisation, whereas encoder-decoders might need examples or fine-tuning to align with the style. For many applications (story generation, dialogue) that rely on sheer model size and broad pretraining (like GPT-3.5’s chat ability), decoder-only has been the go-to simply because the largest models have that form. But as research shows, a well-designed and trained encoder-decoder can be extremely effective, possibly more efficient, in these tasks too.
In summary, encoder-decoder models tend to shine in tasks where a substantial input context needs to be understood, such as translation, summarization, and reading comprehension QA. They use parameters efficiently to digest input and can often achieve stronger performance per parameter on these tasks. They also do very well in structured generation tasks (where output closely depends on input). Decoder-only models may excel in tasks that are closer to unconditional language modeling – extremely open-ended generation or tasks where the prompt is brief and the output is long. They also became popular for few-shot settings due to their large-scale pretraining allowing implicit meta-learning (e.g., GPT-3’s in-context learning). However, the gap is closing: with instruction tuning and creative pretraining mixtures (like UL2’s MoD), encoder-decoder models have proven they can handle instruction following and even some zero-shot reasoning on par with much larger decoder-only LMs. The choice may ultimately come down to constraints: if you need the absolute best single-model performance and have unlimited compute, a giant decoder-only model (100B+ parameters) might currently lead. But if you care about performance on a budget or need a model in the few-hundred-million to few-billion range, encoder-decoder architectures are often the superior choice.
Encoder-Decoder Models at Small Scale (<3B Parameters)
While a lot of attention goes to extremely large models, there are many use cases for small-scale encoder-decoder models (under 3B parameters) – from running on consumer devices to powering applications where latency and cost are critical. We examine current models in this range, their performance, and best practices for designing and training them.
Notable Models <3B: Several encoder-decoder models have variants in this size regime:
- T5 Small, Base, Large: The T5 family includes models of 60M (Small), 220M (Base), and 770M (Large) which all fall well below 3B. These models, pre-trained on C4 with a span-corruption objective, have been widely used in research and applications. For instance, T5-Base and T5-Large fine-tuned on specific tasks like summarization or GLUE benchmarks often outperform much larger pre-Transformer models and serve as strong baselines for efficiency. Google’s T5 paper noted that even T5-Large (770M) matched or beat prior state-of-the-art on many NLP tasks after fine-tuning.
- BART Base and Large: BART has a base model (~140M) and a large model (~400M). BART-large became a go-to model for summarization (e.g., used in many winning solutions for summarization competitions) because of its strong pre-training on denoising and relatively moderate size that can be fine-tuned on a single GPU. These models show that even with <0.5B parameters, an encoder-decoder can achieve high performance – BART-large reached ~44 ROUGE-1 on CNN/DM, near the state-of-the-art at the time, which was comparable to models several times its size.
- PEGASUS: Another seq2seq model (Pegasus, Zhang et al. 2020) targeted summarization specifically, with ~568M params. It used a tailored pre-training objective (Gap Sentence Generation) and performed excellently on summaries, proving that carefully designed small enc-dec models can excel in niche tasks.
- mT5 Small to Large: For multilingual uses, mT5 models range from ~300M (Small) through ~580M (Base) to ~1.2B (Large), all within this budget (the XL and XXL variants exceed it). The 1.2B Large version provides strong cross-lingual capabilities, showing that scaling encoder-decoder models to around a billion parameters can yield very competent multi-language systems.
- FLAN-T5 and T0: These are models built on T5 but fine-tuned on broad instruction data. Notably, T0-3B (which sits right at our 3B cutoff) demonstrated that an enc-dec model in this range can generalize to many unseen tasks through multi-task prompted training. Flan-T5 XXL (11B) is beyond 3B, but Flan-T5 Large (780M) is within and has been shown to follow certain instructions or perform zero-shot tasks reasonably well, considering its size.
- Mobile/Edge-focused models: Research specifically targeting smaller model deployments includes MobileBERT-type ideas for seq2seq. “MobileLLM” and “SmolLM”, referenced by Elfeki et al., focus on data or architecture optimizations for small LMs, though these particular examples were decoder-only. However, the Return of the Encoder (RoE) approach itself produced a series of encoder-decoder models at 330M and 500M that, with distillation, achieved surprisingly strong performance (closing much of the gap to a teacher model that was 10× larger). These models are designed for use on edge devices and demonstrate the viability of <1B encoder-decoders in real applications.
Performance and Usage: Small encoder-decoder models, when well-trained, can perform a wide range of tasks with fine-tuning. They are especially useful in scenarios where deploying a multi-billion parameter model is infeasible. For example:
- On-device summarization of a webpage or email can be done with a 300M-parameter BART model, providing near real-time results.
- Virtual assistants can use a ~1B parameter T5-based model to handle instructions or answer queries locally, preserving privacy and reducing latency by avoiding server calls.
- Translation systems on mobile (e.g., offline translators) often use compact seq2seq models in the 100M-600M range for reasonable quality.
A key observation from recent studies is that at these small scales, encoder-decoder models often have a larger quality advantage over similarly sized decoder-only models than they do at larger scales. Zhang et al. (2022) noted architectural differences are most pronounced when model capacity is limited, and Elfeki et al. (2025) found 330M enc-dec models beat 330M dec-only models by several points on complex tasks (with the enc-dec even rivaling a 3× larger decoder-only LM after applying knowledge distillation). This is encouraging for practitioners: if you can only afford a smaller model, using an encoder-decoder and investing in good training (and possibly distillation from a larger teacher) can yield outsize returns in performance.
Design and Training Practices for <3B Models: When building models in this size range, consider the following best practices:
- Emphasize Encoder Capacity: Given a fixed parameter budget, allocate a healthy fraction to the encoder. As mentioned, a 2:1 or similar ratio (encoder:decoder) can be beneficial (Return of the Encoder: Maximizing Parameter Efficiency for SLMs). This ensures the model can build a strong representation of inputs, which is valuable since smaller models can’t rely on sheer param count to “memorize” or brute-force generate correct outputs.
- Use Modern Efficient Attention: Incorporate features like relative positional encodings or rotary embeddings to maximize generalization without increasing parameters significantly. These help smaller models not overfit to a specific sequence length.
- Knowledge Distillation: It is highly effective to train a small encoder-decoder with guidance from a larger model. One can use a large decoder-only model (which might be available off-the-shelf, like GPT-3 family or LLaMA) to generate soft targets or enrich the training data for the small model. As demonstrated, an encoder-decoder student can learn from a decoder teacher and still maintain its speed advantage, resulting in the best of both worlds – high performance and efficiency.
- High-Quality Data: Smaller models are more sensitive to the signal-to-noise ratio in training data. It’s often worth curating a better dataset or doing additional cleaning. For example, if building a 500M model for conversational AI, filtering pre-training data to dialogue-like interactions and high-quality text will help it learn the right patterns more than it would for a 50B model (which might muddle through the noise by scale).
- Regularization and Training Stability: Small models can overfit or diverge if the learning rate is too high or if the model capacity is barely enough for the task. Techniques like grad clipping, longer warmup, or even training objectives that mix in some simpler tasks can stabilize training. UL2’s idea of mixing denoising tasks at various difficulties could be useful – a small model might benefit from an easier MLM objective in addition to a harder prefix-LM objective to not overwhelm it early in training.
- Efficient Inference Techniques: Since these models are likely to be deployed in constrained environments, consider quantization (8-bit or 4-bit weights) and efficient runtime implementations. Encoder-decoder models quantize well in many cases (especially the encoder part) and can run on CPU at reasonable speeds. The fixed encoder computation means that if multiple outputs are needed from the same input (like multiple questions on one document), you pay the cost once – which is a pattern that can be leveraged in system design.
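For the quantization point above, a minimal sketch using PyTorch post-training dynamic quantization of the linear layers (the checkpoint is just an example; task accuracy and latency should be re-validated after quantizing, and other toolchains such as ONNX Runtime are also common in practice):

```python
import io
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")                       # example checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

# Swap nn.Linear weights to int8 (activations quantized on the fly): a
# CPU-oriented, post-training scheme that needs no calibration data.
q_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def serialized_mb(m: torch.nn.Module) -> float:
    """Approximate on-disk size by serializing the state dict."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: ~{serialized_mb(model):.0f} MB, int8 linears: ~{serialized_mb(q_model):.0f} MB")

ids = tok("summarize: quantization keeps the model small enough for on-device use.",
          return_tensors="pt")
print(tok.decode(q_model.generate(**ids, max_new_tokens=20)[0], skip_special_tokens=True))
```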
Use Cases of <3B Models: Despite the hype around large LMs, small encoder-decoder models are quietly ubiquitous. They power translation apps, summarization features in word processors and email clients, and even some search engines for query understanding. Their appeal is in being good enough for many tasks while being efficient enough to serve millions of users or run under strict latency constraints. For instance, a 600M parameter multilingual enc-dec model can run in real-time on a modern smartphone for translation – something a 6B decoder-only model could not feasibly do without cloud support. As edge AI grows, we see a “return of the encoder” where the architectural efficiency of these models makes them the preferred choice for smaller deployments.
In conclusion, designing an encoder-decoder model under 3B parameters involves smart architecture and training choices to make the most of limited capacity. When done right, these models punch above their weight, sometimes rivaling models several times larger in downstream performance. They underscore an important point: progress in NLP is not only about scaling up, but also about innovating in model design to achieve more with less. Encoder-decoder transformers, with their parameter-efficient separation of duties, are a prime example of such innovation – enabling powerful language systems at all scales.