Language models have become increasingly successful in recent years, especially large language models (LLMs) like GPT-4.
These models have shown remarkable abilities in various natural language processing (NLP) tasks, such as text generation, language translation, question-answering, and more.
Their success can be attributed to their ability to learn from large amounts of text data, together with sophisticated architectures and training methods.
Moreover, LLMs have opened up new possibilities for various applications in artificial intelligence (AI) and NLP. For example, they have been used to improve chatbots, automated content generation, and voice assistants.
In this blog post, we discuss the architecture design of large language models (LLMs), including the mainstream architectures, pre-training objectives, and detailed configurations.
The Transformer architecture is widely used for LLMs due to its parallelizability and capacity, enabling the scaling of language models to billions or even trillions of parameters.
Existing LLMs can be broadly classified into three types: encoder-decoder, causal decoder, and prefix decoder.
Figure 1: Examining the attention patterns in three prominent architectures reveals distinct differences. The attention between prefix tokens, attention between prefix and target tokens, attention between target tokens, and masked attention are represented by rounded rectangles in blue, green, yellow, and grey colors, respectively.
Based on the vanilla Transformer model, the encoder-decoder architecture consists of two stacks of Transformer blocks: an encoder and a decoder.
The encoder utilizes stacked multi-head self-attention layers to encode the input sequence and generate latent representations. The decoder performs cross-attention on these representations and generates the target sequence.
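To make the cross-attention step concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: a single head, no learned query/key/value projections, and toy two-dimensional states. The names `cross_attention`, `encoder_states`, and `decoder_states` are ours for illustration and do not come from any library.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Single-head scaled dot-product cross-attention.

    Assumption: learned projections are omitted, so decoder states act as
    queries and encoder states act as both keys and values.
    """
    d = len(encoder_states[0])
    outputs = []
    for q in decoder_states:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in encoder_states]
        weights = softmax(scores)
        # Weighted average of the encoder states (the values).
        outputs.append([sum(w * v[j] for w, v in zip(weights, encoder_states))
                        for j in range(d)])
    return outputs

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # latent representations
decoder_states = [[1.0, 0.0], [0.0, 2.0]]              # one query per target position
context = cross_attention(decoder_states, encoder_states)
print(context)
```

Each decoder position receives a weighted average of the encoder's representations, which is how the target sequence is conditioned on the input.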
Causal Decoder Architecture
The causal decoder architecture incorporates a unidirectional attention mask, allowing each input token to attend only to past tokens and itself. Both the input and output tokens are processed in the same manner within the decoder.
The GPT-series models, including GPT-1, GPT-2, and GPT-3, are representative language models built on this architecture. GPT-3 has shown remarkable in-context learning capabilities.
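The unidirectional mask described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `causal_mask` is ours, and real implementations typically build the same pattern as a tensor.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Print the pattern for a sequence of 4 tokens ("x" = allowed, "." = masked).
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

Each row is one query token: it sees itself and everything to its left, and nothing to its right.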
Prefix Decoder Architecture
The prefix decoder architecture, also known as the non-causal decoder, modifies the masking mechanism of causal decoders to enable bidirectional attention over prefix tokens and unidirectional attention on generated tokens.
Like the encoder-decoder architecture, prefix decoders can encode the prefix sequence bidirectionally and predict output tokens autoregressively using shared parameters.
Rather than training a prefix decoder from scratch, a practical approach is to first train a causal decoder and then convert it into a prefix decoder, which speeds up convergence. LLMs based on prefix decoders include GLM-130B and U-PaLM.
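The change to the masking mechanism is small, as the following plain-Python sketch suggests. The helper name `prefix_mask` and its `prefix_len` parameter are our own illustrative choices, not part of any particular model's code.

```python
def prefix_mask(n, prefix_len):
    """mask[i][j] is True when position i may attend to position j.

    The first `prefix_len` tokens attend to each other bidirectionally;
    the remaining (generated) tokens attend causally, and also see the
    whole prefix.
    """
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]

# 2 prefix tokens followed by 2 generated tokens.
for row in prefix_mask(4, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

With `prefix_len=0` the pattern reduces to the ordinary causal mask, which is one way to see why a causal decoder is a natural starting point for a prefix decoder.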
All three architecture types can be extended using the mixture-of-experts (MoE) scaling technique, which sparsely activates a subset of neural network weights for each input.
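As a rough illustration of sparse activation, the sketch below routes one token vector to its top-k experts and mixes their outputs. Everything here is an assumption made for the example: the toy experts simply scale their input, the router is a plain dot product, and renormalizing the gate over only the selected experts is one common choice among several.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, experts, router_weights, k=2):
    """Route `x` to the k experts with the highest router logits.

    Only the selected experts are evaluated; the others stay inactive.
    """
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    selected = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)[:k]
    gates = softmax([logits[e] for e in selected])  # assumption: gate over selected experts only
    out = [0.0] * len(x)
    for gate, e in zip(gates, selected):
        y = experts[e](x)
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out, selected

# Hypothetical toy experts: expert s multiplies its input by s.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, selected = moe_layer([2.0, 1.0], experts, router_weights, k=2)
print(selected, output)
```

Only two of the four experts run for this input, so total parameter count can grow without a matching growth in per-token computation.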
Since introducing the Transformer model, researchers have made several advancements to improve its training stability, performance, and computational efficiency.
In this section, we will examine the configurations related to four crucial components of the Transformer: normalization, position embeddings, activation functions, and attention and bias.
The position of layer normalization (LN) is crucial for LLM performance. While the original Transformer model used post-LN, most LLMs employ pre-LN to achieve more stable training, even though it may slightly decrease performance.
Additional LN layers, known as Sandwich-LN, have been introduced before residual connections to prevent value explosion. However, it has been observed that Sandwich-LN sometimes fails to stabilize LLM training and can lead to training collapse.
Adding an extra LN after the embedding layer can also stabilize LLM training but often results in a significant performance drop, leading to its exclusion in recent LLMs.
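The difference between post-LN and pre-LN comes down to where the normalization sits relative to the residual connection. The sketch below shows both orderings in plain Python. It assumes an LN without learnable gain and bias, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (learnable gain/bias omitted).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    # Original Transformer: normalize AFTER adding the residual.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path stays untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(v):
    # Hypothetical stand-in for attention or a feed-forward network.
    return [0.5 * t for t in v]

x = [1.0, 2.0, 3.0, 4.0]
print(post_ln_block(x, toy_sublayer))
print(pre_ln_block(x, toy_sublayer))
```

In the pre-LN variant the input passes through the residual path unchanged, which is often given as the intuition for its more stable training.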
To achieve good performance in feed-forward networks, selecting activation functions is crucial. GeLU activations are commonly used in existing LLMs.
Furthermore, recent LLMs such as PaLM and LaMDA have employed variants of GLU activation, particularly the SwiGLU and GeGLU variants, which have demonstrated better performance in practical applications.
Figure 2: Activation Function Comparison
However, these variants require additional parameters (approximately 50%) in the feed-forward networks compared to GeLU activations.
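The roughly 50% figure follows from counting weight matrices. The sketch below defines the activations and compares parameter counts. It assumes the gated and non-gated feed-forward networks use the same hidden width and no biases; the function names are ours.

```python
import math

def gelu(x):
    # Exact GeLU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    # Swish (SiLU): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def swiglu_unit(a, b):
    # a and b are the outputs of two separate linear projections of the input.
    return swish(a) * b

def geglu_unit(a, b):
    return gelu(a) * b

def ffn_params(d_model, d_ff, gated):
    # Non-gated FFN: up- and down-projection (2 matrices).
    # Gated FFN: an additional gate projection (3 matrices). Biases omitted.
    return (3 if gated else 2) * d_model * d_ff

plain = ffn_params(512, 2048, gated=False)
gated = ffn_params(512, 2048, gated=True)
print(plain, gated, gated / plain)
```

Under these assumptions the gated variant has 1.5 times the parameters. In practice, some models shrink the hidden width of gated networks to keep the total comparable.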
Position embeddings are used in Transformers to incorporate absolute or relative position information in modeling sequences since the self-attention modules are permutation equivariant.
The vanilla Transformer offers two variants of absolute position embeddings: sinusoidal and learned. LLMs commonly utilize learned position embeddings.
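For reference, the sinusoidal variant can be computed directly from the position index. The sketch below follows the formulation from the original Transformer, with sine on even dimensions and cosine on odd ones. The function name is ours.

```python
import math

def sinusoidal_embedding(pos, d_model):
    """Absolute position embedding for a single position.

    Dimension pair (i, i+1), for even i, shares the frequency 1 / 10000^(i / d_model).
    """
    emb = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        emb.append(math.sin(angle))
        emb.append(math.cos(angle))
    return emb[:d_model]

print(sinusoidal_embedding(0, 4))
print(sinusoidal_embedding(3, 4))
```

A learned position embedding replaces this fixed function with a trainable lookup table that has one row per position.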
On the other hand, relative positional encodings generate embeddings based on the offsets between keys and queries. This allows them to perform well on longer sequences, even ones beyond the lengths encountered during training, enabling extrapolation.
ALiBi introduces a penalty based on the distance between keys and queries to bias attention scores, resulting in better zero-shot generalization and stronger extrapolation capabilities than other position embeddings.
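A minimal sketch of the ALiBi penalty is shown below. It assumes a causal setting, the geometric slope schedule described for head counts that are powers of two, and `-inf` for masked future positions. The helper names are ours.

```python
def alibi_slopes(num_heads):
    # Geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads.
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to the attention scores before the softmax.

    The penalty grows linearly with the query-key distance (i - j);
    future positions (j > i) are masked out.
    """
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
for row in alibi_bias(4, slopes[0]):
    print(row)
```

Because the bias depends only on distance, the same rule applies unchanged to sequences longer than those seen during training.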
RoPE, on the other hand, uses rotation matrices based on absolute positions to compute scores between keys and queries, incorporating relative position information for modeling long sequences. Consequently, RoPE has been widely adopted in recent LLMs.
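The sketch below illustrates how rotating by absolute position yields scores that depend only on the relative offset. It assumes adjacent dimensions are paired for rotation (some implementations pair the two halves of the vector instead) and a base of 10000. The names are ours.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by an angle proportional to `pos`."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.1]

# Same offset (2) at different absolute positions gives the same score.
print(dot(rope(q, 5), rope(k, 3)))
print(dot(rope(q, 12), rope(k, 10)))
```

The two printed scores agree up to floating-point error, since the product of the two rotations depends only on the difference between the positions.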
Figure 3: Positional Embedding
Attention and Bias
In addition to the full self-attention mechanism in the original Transformer, GPT-3 utilizes sparse attention, specifically Factorized Attention, to reduce computational complexity.
Various approaches have been explored to effectively model longer sequences, such as introducing special attention patterns or optimizing GPU memory access, as FlashAttention does.
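To give a feel for what a special attention pattern looks like, the sketch below builds two masks loosely modeled on strided factorized attention: one covering a local window and one covering every `stride`-th earlier position. This is an illustrative assumption, not the exact pattern used in GPT-3, and the helper name is ours.

```python
def strided_sparse_masks(n, stride):
    """Return (local, strided) causal masks for a sequence of length n.

    local[i][j]   is True for the `stride` most recent positions up to i.
    strided[i][j] is True for earlier positions at multiples of `stride` from i.
    """
    local = [[0 <= i - j < stride for j in range(n)] for i in range(n)]
    strided = [[j <= i and (i - j) % stride == 0 for j in range(n)]
               for i in range(n)]
    return local, strided

n, stride = 16, 4
local, strided = strided_sparse_masks(n, stride)
sparse_entries = sum(local[i][j] or strided[i][j] for i in range(n) for j in range(n))
full_entries = n * (n + 1) // 2  # entries in a full causal mask
print(sparse_entries, full_entries)
```

The combined pattern touches fewer query-key pairs than full causal attention, and the gap widens as the sequence grows.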
Moreover, while biases are typically included in each dense kernel and Layer Norm following the original Transformer, recent LLMs like PaLM and Galactica have removed biases. This removal of biases has been shown to improve training stability in LLMs.
Figure 4: Detailed formulations for the network configurations.
To summarize the suggestions from existing literature regarding the detailed configuration of large language models (LLMs): for stronger generalization and training stability, it is recommended to use pre-RMS Norm for layer normalization and SwiGLU or GeGLU as the activation function.
It is advised not to use LN immediately after embedding layers as it may lead to performance degradation.
Regarding position embeddings, RoPE or ALiBi is a better choice as they perform well on long sequences.
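As a concrete reading of the pre-RMS Norm recommendation, the sketch below applies RMS normalization to the sublayer input and leaves the residual path untouched. It is a minimal illustration: the gain defaults to ones, and `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward network.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    # Rescale by the root mean square; unlike LN, the mean is not subtracted.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain if gain is not None else [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_rms_norm_block(x, sublayer):
    # Normalize the sublayer input, then add the result to the residual.
    return [a + b for a, b in zip(x, sublayer(rms_norm(x)))]

def toy_sublayer(v):
    # Hypothetical stand-in for attention or a feed-forward network.
    return [0.5 * t for t in v]

print(rms_norm([3.0, 4.0]))
print(pre_rms_norm_block([3.0, 4.0], toy_sublayer))
```

Skipping the mean subtraction makes RMS Norm slightly cheaper than standard LN.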
In conclusion, large language models (LLMs) have achieved remarkable success in various natural language processing (NLP) tasks and have opened up new possibilities in AI and NLP applications.
The architecture design of LLMs plays a crucial role in their performance and capabilities.
LLMs can be categorized into three main types: encoder-decoder, causal decoder, and prefix decoder. Each type has advantages and has been used in different LLMs for specific purposes.
The encoder-decoder architecture is based on the vanilla Transformer model and consists of an encoder and a decoder.
Causal decoders employ a unidirectional attention mask, while prefix decoders modify the masking mechanism to allow bidirectional attention over prefix tokens.
Mixture-of-experts (MoE) scaling has been applied to these architectures to improve performance further.
Detailed LLM configurations focus on normalization, activation functions, positional embeddings, attention mechanisms, and bias. Layer normalization (LN) is commonly used for training stability, with pre-LN preferred over post-LN in most LLMs.
Alternative normalization techniques like RMS Norm and DeepNorm have been proposed. GeLU activations are commonly employed, while the GLU variants SwiGLU and GeGLU offer better performance but require additional parameters.
Position embeddings like RoPE and ALiBi are preferred for modeling long sequences, as they offer better generalization and extrapolation capabilities.
Sparse attention mechanisms, like Factorized Attention, reduce computational complexity. Biases in dense kernels and Layer Norm have been removed in some recent LLMs, which has been shown to improve training stability.
LLMs' architecture and model training are critical to achieving impressive performance.
The selection of architecture type, detailed configurations of normalization, activation functions, positional embeddings, attention mechanisms, and biases all contribute to LLMs' overall effectiveness and stability.
Frequently Asked Questions (FAQ)
1. What are large language models (LLMs)?
Large language models (LLMs) are powerful artificial intelligence models that excel in natural language processing tasks. They are trained on vast amounts of text data and have shown remarkable capabilities in tasks like text generation, language translation, and question-answering.
2. How do LLMs contribute to advancements in AI and NLP?
LLMs have opened up new possibilities in AI and NLP applications. They have improved chatbots, automated content generation, and voice assistants. LLMs have also enhanced various NLP tasks, enabling more accurate language understanding and generation.
3. What are the different types of LLM architectures?
LLM architectures can be broadly classified into three types: encoder-decoder, causal decoder, and prefix decoder. Each type has its own advantages and has been used in different LLMs for specific purposes.
4. Which activation functions are commonly used in LLMs?
GeLU (Gaussian Error Linear Unit) activations are commonly used in existing LLMs. However, recent models like PaLM and LaMDA have employed variants of GLU (Gated Linear Unit) activation, such as SwiGLU and GeGLU, which have shown better performance in practical applications.