Evolution of Neural Networks to Large Language Models in Detail

In this blog, we discuss the evolution from neural networks to large language models in detail.

Language models have evolved considerably over the last few decades. Early, simple language models were used for tasks like speech recognition, machine translation, and information retrieval.

These models were built using statistical approaches, including n-gram and hidden Markov models. These models, however, have limits in terms of accuracy and scalability.

Since the introduction of deep learning, neural networks have become increasingly popular for language modeling.

In this field, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have proved particularly effective.

These models can capture sequential relationships in linguistic data and produce coherent output.

More recently, attention-based approaches such as the Transformer architecture have become dominant. These models generate output by using self-attention to attend to different parts of the input sequence.

They have been demonstrated to be extremely successful in various natural language processing applications, including language modeling.

      Figure: Timeline for the Evolution of Language Models

This blog briefly introduces the language models that led to large language models.

Probabilistic Models

Probabilistic language models are mathematical models that seek to represent the likelihood of events in language, such as a particular word following another. Natural language processing (NLP) uses these models to model and comprehend language.

The n-gram model is one of the most extensively used probabilistic models for language purposes.

The n-gram model is a statistical language model that predicts the likelihood of the next word in a sequence based on the previous n-1 words. In a bigram model, for example, the likelihood of the next word in a sequence is predicted based on the preceding word.

            Figure: N-Gram Model

The bigram model assumes that a word's probability depends only on the immediately preceding word and not on any other words in the sequence. This simplifying assumption makes the model efficient and scalable to huge datasets, at the cost of ignoring longer-range context.
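The bigram idea described above can be sketched in a few lines of Python. This is a minimal illustration, not a production language model: the three-sentence corpus, the `<s>`/`</s>` boundary markers, and the function name `train_bigram` are assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(next word | previous word) from raw bigram counts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical boundary markers so the first and last words get context.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        probs[prev] = {word: c / total for word, c in followers.items()}
    return probs

# Toy corpus (made up for illustration).
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram(corpus)

print(model["the"])  # 'cat' gets about 0.67, 'dog' about 0.33
print(model["cat"])  # 'sat' and 'ran' each get 0.5
```

Because each probability depends on a single preceding word, training only requires one pass of counting, which is what makes the approach scale.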

The Hidden Markov Model (HMM) is another probabilistic model widely used in language-based tasks.

An HMM is a statistical model used in natural language processing and other fields to model sequences of data that are assumed to have a Markovian structure.

It is a probabilistic model that assumes an underlying sequence of hidden states generates a sequence of observable events. The model is called "hidden" because the states are not directly observed but can be inferred from the observable events.
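To make the idea of inferring hidden states concrete, here is a small sketch of the Viterbi algorithm, a standard way to find the most likely hidden-state sequence in an HMM. The two-tag part-of-speech setup and every probability below are made-up values for illustration only.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               the previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Hypothetical toy model: hidden states are tags, observations are words.
states = ["Noun", "Verb"]
start_p = {"Noun": 0.8, "Verb": 0.2}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7},
           "Verb": {"Noun": 0.6, "Verb": 0.4}}
emit_p = {"Noun": {"dogs": 0.7, "bark": 0.3},
          "Verb": {"dogs": 0.1, "bark": 0.9}}

tags = viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p)
print(tags)  # ['Noun', 'Verb']
```

The words are the observable events; the tags are never seen directly and are recovered from the transition and emission probabilities.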

Speech recognition, part-of-speech tagging, and machine translation are among the tasks to which HMMs have been applied.

Neural Network-based Language Models

In recent years, neural network-based language models have revolutionized natural language processing (NLP). These models are based on training a neural network to predict the next word in a series of words given the words that came before it.

The neural network learns to recognize patterns and correlations in the training data and uses these patterns to make probabilistic predictions for the following word.

             Figure: Neural Networks

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a form of artificial neural network that processes incoming data one element at a time while retaining a hidden state that summarizes the history of previous inputs.

RNNs can handle variable-length inputs and variable-length output sequences, making them useful for natural language processing applications, including language synthesis, machine translation, and speech recognition.

RNNs are distinguished by their capacity to capture temporal dependencies through feedback loops that allow prior outputs to be sent back into the model as inputs.

This allows the network to use its memory to keep track of prior inputs and generate outputs informed by those inputs.
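The recurrence described above can be sketched as follows. This is a minimal vanilla RNN forward pass in plain Python, assuming a tanh activation; the weight names (`W_xh`, `W_hh`, `b_h`) and their values are hand-picked for illustration rather than learned.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Process a sequence of any length, returning the hidden state at each step."""
    h = [0.0] * len(b_h)  # initial state: no history yet
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)  # the state feeds back in
        states.append(h)
    return states

# Hypothetical weights: 1-dimensional input, 2-dimensional hidden state.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.0]

states = run_rnn([[1.0], [0.5], [-1.0]], W_xh, W_hh, b_h)
print(len(states), states[-1])
```

The same weights are reused at every step, which is why one network can handle sequences of different lengths.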

The major issue with RNNs is the vanishing gradient problem, in which the gradients become too small to train the network effectively. This makes it difficult to learn long-term dependencies in sequential data.

Furthermore, RNNs are susceptible to the exploding gradient problem, in which the gradients grow too large and cause the weights to update in an unstable manner.

Finally, because RNNs are sequential, they can be computationally expensive and difficult to parallelize, limiting their scalability to large datasets.
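The vanishing and exploding gradient problems described above can be illustrated with simple arithmetic. The sketch below is a deliberate simplification: it replaces the per-step Jacobian in backpropagation through time with a single scalar factor (an assumption for illustration) and multiplies it over many time steps.

```python
def gradient_scale(recurrent_factor, steps):
    """Multiply the same per-step factor `steps` times.

    This stands in for the repeated products that appear when a gradient
    is propagated back through `steps` time steps of an RNN.
    """
    scale = 1.0
    for _ in range(steps):
        scale *= recurrent_factor
    return scale

for factor in (0.9, 1.0, 1.1):
    print(factor, gradient_scale(factor, 100))
```

A factor slightly below 1 shrinks the gradient toward zero over 100 steps, while a factor slightly above 1 makes it grow by several orders of magnitude.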

         Figure: Recurrent Neural Networks

Long Short-Term Memory (LSTM) Networks

Long Short-Term Memory (LSTM) networks are a kind of RNN design that mitigates the vanishing gradient problem by incorporating a specialized memory cell that can selectively retain or forget information over time.

Hochreiter and Schmidhuber invented LSTMs in 1997, and they have since become a popular choice for modeling sequential data.

Three gates control the memory cell in an LSTM network: the input gate, the forget gate, and the output gate.

The input gate regulates new data flow into the memory cell, whereas the forget gate regulates the retention of current data in the memory cell. The output gate regulates the flow of information from the memory cell to the network's output.
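The three gates can be sketched as a single LSTM step. To keep the example short, this is a scalar toy cell (hidden size 1) written in plain Python; the parameter names (`w_*`, `u_*`, `b_*`) and the helper `make_params` are hypothetical, and the values are chosen by hand to show the gates at work rather than learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell (hidden size 1)."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, admit part of the new
    h = o * math.tanh(c)     # expose part of the memory as output
    return h, c

def make_params(**overrides):
    """All parameters default to 0.0; override a few to steer the gates."""
    names = [f"{kind}_{gate}" for kind in "wub" for gate in "ifog"]
    params = dict.fromkeys(names, 0.0)
    params.update(overrides)
    return params

# Forget gate open, input gate closed: the memory cell is retained.
keep = make_params(b_f=10.0, b_i=-10.0)
_, c_keep = lstm_step(1.0, 0.0, 2.0, keep)

# Forget gate closed, input gate open: the memory cell is overwritten.
overwrite = make_params(b_f=-10.0, b_i=10.0, w_g=1.0)
_, c_new = lstm_step(1.0, 0.0, 2.0, overwrite)

print(c_keep, c_new)
```

With the first setting the cell state stays close to its previous value of 2.0; with the second it is replaced by the new candidate value.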

LSTM networks have been found to perform well in various natural language processing (NLP) applications, such as language modeling, machine translation, and sentiment analysis. They have also been employed in tasks like speech recognition and image captioning.

Gated Recurrent Unit (GRU) Networks

Gated Recurrent Unit (GRU) networks are a recurrent neural network design used in deep learning and natural language processing (NLP).

They are similar to LSTM networks in that they are intended to solve the vanishing gradient problem in RNNs.

GRUs, like LSTMs, contain a gating mechanism that allows the network to update and forget information selectively over time.

GRUs, on the other hand, have a simpler design with fewer parameters than LSTMs, making them faster to train and easier to deploy.

The number of gates used to regulate the flow of information is one of the primary distinctions between GRU and LSTM. In LSTM networks, three gates are used: the input gate, the forget gate, and the output gate.

In contrast, GRU networks employ only two gates: the reset gate and the update gate.
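A single GRU step can be sketched in the same spirit. This is a scalar toy cell (hidden size 1) in plain Python; the parameter names and the helper `make_params` are hypothetical, and the blending convention used here, `h = (1 - z) * h_prev + z * h_cand`, is one of two equivalent conventions found in the literature (others swap the roles of `z` and `1 - z`).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU cell (hidden size 1)."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # The reset gate decides how much of the old state feeds the candidate.
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # The update gate blends the old state with the candidate.
    return (1.0 - z) * h_prev + z * h_cand

def make_params(**overrides):
    """All parameters default to 0.0; override a few to steer the gates."""
    names = [f"{kind}_{gate}" for kind in "wub" for gate in "zrh"]
    params = dict.fromkeys(names, 0.0)
    params.update(overrides)
    return params

# Update gate closed: the previous state is carried over unchanged.
h_keep = gru_step(1.0, 0.7, make_params(b_z=-10.0))

# Update gate open: the state is replaced by the new candidate.
h_new = gru_step(1.0, 0.7, make_params(b_z=10.0, w_h=1.0))

print(h_keep, h_new)
```

Note that there is no separate memory cell and only three groups of weights (update, reset, candidate) instead of the four in an LSTM, which is where the parameter saving comes from.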

Encoder-Decoder Networks

The encoder-decoder architecture is a form of neural network architecture used for sequence-to-sequence tasks such as language translation, speech recognition, and image captioning.

It comprises two parts: an encoder network that processes the input sequence and a decoder network that creates the output sequence.

In the case of language translation, the encoder network processes the source-language input sentence and generates a fixed-length representation of the sentence known as the context vector.

This context vector is then input into the decoder network, which creates the target language translation word by word.

Sequence-to-Sequence (Seq2Seq) architectures are among the most widely used encoder-decoder designs. Recurrent neural networks (RNNs) are the foundation for the encoder and decoder networks in the Seq2Seq paradigm.

The input sequence is processed by the encoder RNN, which creates a fixed-length vector that encapsulates the input sequence's meaning. This vector is then sent into the decoder RNN, which produces the output sequence.
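The data flow of an RNN-based encoder-decoder can be sketched with scalar toy networks. Everything here is an assumption made for illustration: the weights are hand-picked rather than trained, the "context vector" is a single number, and the decoder runs for a fixed number of steps, whereas real Seq2Seq decoders feed back the previously generated token and stop at an end-of-sequence symbol.

```python
import math

def encode(inputs, w_x=0.5, w_h=0.8):
    """Fold a variable-length sequence of numbers into one hidden state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h  # fixed-size summary of the whole input (the context vector)

def decode(context, steps, w_h=0.9, w_y=1.5):
    """Unroll a decoder RNN from the context, emitting one output per step."""
    h = context
    outputs = []
    for _ in range(steps):
        h = math.tanh(w_h * h)
        outputs.append(w_y * h)
    return outputs

short_context = encode([1.0, -0.5])
long_context = encode([1.0, -0.5, 0.3, 0.7, -1.2, 0.1])

# Both inputs are squeezed into a context of the same size.
print(short_context, long_context)
print(decode(long_context, steps=4))
```

However long the input is, the decoder only ever sees the same fixed-size context, which is the bottleneck that attention was later introduced to relieve.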

     Figure: Encoder-Decoder Architecture

Attention Mechanism

In the standard encoder-decoder architecture:

  1. First, the input sequence is encoded into a fixed-length vector representation.
  2. The decoder takes the vector representation and generates the output sequence.

However, when the input sequence is long, this fixed-length encoding can result in information loss.

This problem is addressed by the attention mechanism, which allows the decoder to look back at the input sequence and choose to attend to the important sections of the sequence at each decoding stage.

The paper "Neural Machine Translation by Jointly Learning to Align and Translate" by Bahdanau et al. (2014) introduced the attention mechanism for neural machine translation.

The paper proposed a sequence-to-sequence model with an attention mechanism that outperformed the basic encoder-decoder approach on machine translation, particularly on long sentences, and achieved results comparable to the state-of-the-art phrase-based system of the time.

The attention mechanism assigns an attention weight to each element of the input sequence depending on its relevance to the current decoding step.

These attention weights are then used to compute a weighted sum of the input sequence elements, which serves as the context vector for the current decoding step.
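The weighting step just described can be sketched as follows. This is a minimal illustration in plain Python that assumes a dot-product score between the decoder state and each encoder state; Bahdanau et al. used a small learned scoring network instead, so the scoring function here is a swapped-in simplification. The vectors are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoding step."""
    # Score each input position against the current decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    # Context vector: weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Hypothetical encoder states for a three-element input, and a decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

Here the first and third input positions align with the decoder state and receive most of the weight, while the second is largely ignored. A new set of weights is computed at every decoding step.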

Transformer Architecture

Vaswani et al. introduced the Transformer architecture in 2017. It was originally proposed for machine translation and is now widely used for language modeling, text classification, and other natural language processing tasks.

The Transformer architecture follows an encoder-decoder design. The encoder takes the input sequence and generates a hidden representation of it.

The hidden representation is sent to the decoder, which generates the output sequence. The encoder and decoder are built from many layers of self-attention and feedforward neural networks.

The self-attention layer computes attention weights between all pairs of input elements and uses them to compute, for each position, a weighted sum of the input elements. The feedforward layer then applies a non-linear transformation to the self-attention layer's output.
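The pairwise computation in a self-attention layer can be sketched as single-head scaled dot-product attention. This is a simplified illustration in plain Python: the input and the projection matrices `W_q`, `W_k`, `W_v` are made-up values (identity matrices here, where a real model would learn them), and multi-head attention, masking, and the feedforward layer are omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # scores[i][j]: how much position i attends to j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, V)  # weighted sums of the value vectors

# Hypothetical input: three positions, two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

weights, output = self_attention(X, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of the weight matrix is computed independently of the others, so all positions can be processed at once rather than one step at a time as in an RNN.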

          Figure: Transformer Architecture

The Transformer design has several advantages over prior neural network architectures. For instance:

  1. It enables parallel processing of the input sequence, making it quicker and more efficient.
  2. It is easier to understand than previous architectures because the attention weights can be visualized to see which parts of the input sequence the model focuses on.
  3. It enables the model to consider the complete input sequence, improving performance on tasks like machine translation.

Large Language Models (LLMs)

Large language models have predominantly used the transformer architecture since 2018, which has become the standard deep learning technique for sequential data. Before this, recurrent architectures such as the LSTM were more commonly used.

The transformer architecture is known for efficiently processing long data sequences. It is particularly well-suited to natural language processing tasks, such as language translation and text generation.

The transformer model introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017 has since been widely adopted to develop large language models such as GPT-3.5, BERT, and T5.

General Architecture Properties

Tokenization

Large language models (LLMs) are mathematical functions that take lists of numbers as input and produce lists of numbers as output. To process text, a tokenizer is used to convert it into numbers.

A tokenizer maps between texts and lists of integers: it encodes text into token ids and decodes token ids back into text.

The tokenizer is typically fit to the training dataset and then frozen before the LLM is trained. Tokenizers also compress text to save compute, since common words or phrases are encoded as a single token.

LLMs generally use tokenizers where one token maps to around four characters or 0.75 words in common English text.
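The mapping between text and integers can be sketched with a toy word-level tokenizer. This is a deliberate simplification: real LLM tokenizers work on subword units rather than whole words, and all function names and the two-sentence corpus below are assumptions made for the example. The `estimate_tokens` helper simply applies the 0.75-words-per-token rule of thumb mentioned above.

```python
def build_vocab(corpus):
    """Assign an integer id to every distinct whitespace-separated word."""
    vocab = {}
    for text in corpus:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Text -> list of token ids (words outside the vocabulary raise KeyError)."""
    return [vocab[word] for word in text.split()]

def decode(ids, vocab):
    """List of token ids -> text."""
    inverse = {i: word for word, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

def estimate_tokens(text, words_per_token=0.75):
    """Rough token count for English text from the rule of thumb."""
    return round(len(text.split()) / words_per_token)

# The vocabulary is built once and then frozen.
vocab = build_vocab(["the cat sat", "the dog sat"])

ids = encode("the dog sat", vocab)
print(ids)                 # [0, 3, 2]
print(decode(ids, vocab))  # the dog sat
print(estimate_tokens("the dog sat"))
```

Encoding followed by decoding returns the original text, which is the round-trip property the model relies on when its output ids are turned back into words.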


A large language model (LLM) output is a probability distribution over its vocabulary, typically implemented through a softmax function.

Upon receiving a text, the LLM outputs a vector y in R^V, where V is the vocabulary size. The unnormalized logit vector y is then passed through a softmax function to obtain a probability vector, a probability distribution over the LLM's vocabulary.

The softmax function is defined mathematically with no parameters to vary and therefore is not trained.

The resulting probability vector has V entries, all of which are non-negative and which sum to 1. It represents the LLM's prediction of the probability of each token in its vocabulary being the next token, given the input text.
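The softmax step can be sketched directly from its definition, softmax(y)_i = exp(y_i) / sum_j exp(y_j). The three-word vocabulary and the logit values below are made up for illustration; a real LLM has a vocabulary of many thousands of tokens.

```python
import math

def softmax(logits):
    """Convert unnormalized logits into a probability distribution."""
    m = max(logits)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary (V = 3) and logits for the next token.
vocabulary = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
for token, p in zip(vocabulary, probs):
    print(token, round(p, 3))

# The most likely next token has the largest logit.
prediction = vocabulary[probs.index(max(probs))]
print(prediction)  # cat
```

The function contains no learnable values; it only rescales whatever logits the network produces, which is why it is not trained.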

Some examples of LLMs

Some notable LLMs include:

  1. GPT-4: GPT-4 is the fourth iteration of the Generative Pre-trained Transformer series and is known for its ability to generate human-like text, answer questions, create poetry, and write code.
  2. BERT: BERT, developed by Google, is a bidirectional LLM that captures context from both directions and has become foundational for various NLP tasks.
  3. T5: T5, also developed by Google, treats all NLP tasks as a text-to-text problem and has shown exceptional performance in tasks like translation, summarization, and question answering.
  4. RoBERTa: RoBERTa, developed by Facebook, is an optimized version of BERT that has achieved state-of-the-art results in numerous NLP benchmarks.
  5. Megatron: Megatron, developed by NVIDIA, is an LLM designed to scale up model training while maintaining efficiency, allowing researchers to train massive models with billions of parameters.

      Figure: ChatGPT: LLM introduced by OpenAI


In conclusion, the evolution from early neural networks to large language models has resulted in breakthroughs in natural language processing.

From probabilistic models like n-grams and Hidden Markov models to neural network-based models like Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs), models have been continuously improved to overcome limitations like vanishing gradients and scalability to large datasets.

Attention-based techniques, especially the Transformer architecture, then emerged and have performed exceptionally well in various natural language processing applications.

These models have had considerable success in language modeling because they employ self-attention approaches to attend to distinct regions of the input sequence.

Finally, we looked at Large Language Models (LLMs), which are machine learning models that use deep neural networks to generate natural language text.

To analyze and produce text, LLMs build on the techniques covered above, including recurrent neural networks (RNNs), feedforward neural networks, and, above all, attention mechanisms.