A Complete Guide to Fine-Tuning LLMs Using RLHF

In discussions about why ChatGPT has captured our fascination, two common themes emerge:

  1. Scale: Increasing data and computational resources.
  2. User Experience (UX): Transitioning from prompt-based interactions to more natural chat interfaces.

However, there's an aspect often overlooked – the remarkable technical innovation behind the success of models like ChatGPT. One particularly ingenious concept is Reinforcement Learning from Human Feedback (RLHF), which combines reinforcement learning and human input in the field of Natural Language Processing (NLP).

Historically, reinforcement learning has been challenging to apply, primarily confined to gaming and simulated environments like Atari or MuJoCo.

Just a few years ago, RL and NLP advanced independently, using different tools, techniques, and setups. Witnessing RLHF's effectiveness in a new, expansive domain is impressive.

So, how does RLHF function? Why is it effective? This article addresses these questions by explaining the mechanics of RLHF.

To grasp RLHF, it's essential to comprehend the process of training a model like ChatGPT and where RLHF comes into play – this forms the initial focus.

Reinforcement Learning from Human Feedback: An Overview

Let's visualize the ChatGPT development process to identify where RLHF comes into play. When glancing at the diagram below, you might notice a resemblance to the meme Shoggoth wearing a smiley face.

  Figure: A schematic diagram of how an LLM is brought to production

The initial pre-trained model is akin to an unrefined creature, as it was trained on indiscriminate internet data encompassing clickbait, misinformation, propaganda, conspiracy theories, and biases against specific groups.

This creature then underwent fine-tuning using higher-quality data sources like StackOverflow, Quora, and human annotations, rendering it somewhat socially acceptable.

Subsequently, the fine-tuned model was refined further through RLHF, transforming it into a version suitable for customer interactions – think giving it a smiley face – and eliminating the Shoggoth analogy.

How does RLHF operate?

The RLHF training process unfolds in three stages:

  1. Initial Phase: An existing pre-trained model is designated as the primary model, establishing a baseline of behavior. Given the extensive data requirements of training from scratch, starting from a pre-trained model is efficient.
  2. Human Feedback: After the initial model is in place, human testers evaluate its performance, assigning quality or accuracy ratings to different outputs the model generates. This human feedback is used to derive rewards for reinforcement learning.
  3. Reinforcement Learning: A reward model is trained on the primary model's outputs and the quality scores assigned by testers. The main model is then optimized against this reward model, improving its performance on subsequent tasks.

This process is iterative, as the collection of human feedback and refinement through reinforcement learning occur repeatedly, ensuring ongoing enhancement of the model's capabilities.

     Figure: Working procedure in RLHF (with reward model)

Deconstructing the RLHF Training Process: Step-by-Step Analysis

As we venture into the intricate workings of the Reinforcement Learning from Human Feedback (RLHF) algorithm, it's crucial to maintain an awareness of its connection with the fundamental component: the initial pretraining of a Language Model (LM).

Let's dissect each phase of the RLHF algorithm.

Pretraining a Language Model (LM)

The pretraining phase establishes the groundwork for RLHF. In this stage, a Language Model (LM) undergoes training using a substantial dataset of text material sourced from the internet.

           Figure: Pretraining with large text corpus

This data enables the LM to comprehend diverse aspects of human language, encompassing syntax, semantics, and even context-specific intricacies.

Step 1: Choosing a Base Language Model

The initial phase involves the critical task of selecting a foundational language model. The choice of model is not universally standardized; rather, it hinges on the specific task, available resources, and unique complexities of the problem at hand.

Industry approaches differ significantly: OpenAI used a smaller version of GPT-3 as the starting point for InstructGPT, while Anthropic and DeepMind have explored models with parameter counts ranging from roughly 10 million to 280 billion.

Step 2: Acquiring and Preprocessing Data

In the context of RLHF, the chosen language model undergoes preliminary training on an extensive dataset, typically comprising substantial volumes of text sourced from the internet.

This raw data requires cleaning and preprocessing to render it suitable for training. This preparation often entails eliminating undesired characters, rectifying errors, and normalizing text anomalies.
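
As a minimal illustration of such a cleaning pass (a sketch, not the pipeline used for any particular model), one might drop non-printable control characters and normalize whitespace:

```python
import re

def clean_text(raw: str) -> str:
    # Drop control/non-printable characters (keeping newlines and tabs),
    # then collapse runs of spaces/tabs and trim the ends.
    kept = "".join(ch for ch in raw if ch.isprintable() or ch in "\n\t")
    collapsed = re.sub(r"[ \t]+", " ", kept)
    return collapsed.strip()
```

Real preprocessing pipelines also handle deduplication, language filtering, and encoding repair, but the shape is the same: raw text in, normalized text out.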

Step 3: Language Model Training

Following this, the LM undergoes training using the curated dataset, acquiring the ability to predict subsequent words in sentences based on preceding words. This phase involves refining model parameters through techniques like stochastic gradient descent.

The overarching objective is to minimize disparities between the model's predictions and actual data, typically quantified using a loss function such as cross-entropy.
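
Concretely, the cross-entropy loss for a single position is the negative log-probability the model assigns to the true next token. A toy sketch over a three-token vocabulary:

```python
import math

def cross_entropy(predicted_probs, target_index):
    # Negative log-likelihood the model assigns to the true next token;
    # lower means the model's prediction matched the data better.
    return -math.log(predicted_probs[target_index])

# Toy distribution the model predicts for the next token.
predicted = [0.1, 0.7, 0.2]
confident_loss = cross_entropy(predicted, 1)  # true token got 70% mass
uncertain_loss = cross_entropy(predicted, 0)  # true token got 10% mass
```

Because `confident_loss` is smaller than `uncertain_loss`, gradient descent on this loss pushes the model to put more probability on the tokens that actually follow in the training data.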

Step 4: Model Assessment

Once training is complete, the model's performance is assessed on a held-out dataset that was not used during training.

This step is crucial to verify that the model generalizes rather than merely memorizing the training data. If the evaluation metrics meet the required criteria, the model is deemed ready for the next phase of RLHF.
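
A standard held-out metric for language models is perplexity, the exponential of the average negative log-likelihood over evaluation tokens. A minimal sketch:

```python
import math

def perplexity(token_probs):
    # token_probs: the probability the model assigned to each actual
    # token in the held-out text. Perplexity = exp(mean NLL); lower
    # values indicate better generalization.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns probability 1.0 to every held-out token has perplexity 1 (perfect); uniform guessing over two options gives perplexity 2.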

Step 5: Preparing for RLHF

Although the LM has amassed substantial knowledge about human language, it still lacks an understanding of human preferences.

To address this, supplementary data is necessary. Often, organizations compensate individuals to produce responses to prompts, which subsequently contribute to training a reward model. While this stage can incur expenses and consume time, it's pivotal for orienting the model towards human-like preferences.

It's noteworthy that the pretraining phase doesn't yield a perfect model; errors and erroneous outputs are expected. Nevertheless, it furnishes a significant foundation for RLHF to build upon, enhancing the model's accuracy, safety, and utility.

Developing a Reward Model through Training

At the core of the RLHF procedure lies establishing and training a reward model (RM). This model serves as a mechanism for alignment, providing a means to infuse human preferences into the AI's learning trajectory.

           Figure: Creating a reward model for RLHF

Step 1: Creating the Reward Model

The reward model can take the form of either an integrated language model or a separate modular component. Its fundamental role is to map an input text sequence to a scalar reward value, which reinforcement learning algorithms can then optimize against.

For instance, if an AI generates two distinct text outputs, the reward model will ascertain which one better corresponds to human preferences, essentially 'acknowledging' the more suitable outcome.
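The comparison described above can be sketched as follows. The `toy_rm` below is a hypothetical stand-in (it just prefers longer responses); a real reward model is a trained network scoring text:

```python
def pick_preferred(reward_model, prompt, response_a, response_b):
    # The reward model scores each candidate response for the prompt;
    # the higher-scoring one is treated as the preferred output.
    score_a = reward_model(prompt, response_a)
    score_b = reward_model(prompt, response_b)
    return response_a if score_a >= score_b else response_b

# Hypothetical placeholder RM for illustration only.
toy_rm = lambda prompt, response: len(response)
```

With `toy_rm`, `pick_preferred(toy_rm, "q", "short", "a longer answer")` returns the longer response; swapping in a trained reward model makes the same function return the more human-preferred one.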

Step 2: Data Compilation

Training the reward model begins with assembling a dataset distinct from the one used in the language model's initial training. This dataset is specialized, concentrating on particular use cases, and is composed of prompts paired with candidate outputs and reward annotations.

Each prompt is linked to one or more outputs, accompanied by rewards that signify how desirable each output is. While this dataset is generally smaller than the pretraining corpus, it plays a crucial role in steering the model toward generating content that resonates with users.

Step 3: Model Learning

Utilizing the prompt and reward pairs, the model is instructed to associate specific outputs with their corresponding reward values.

This process often harnesses expansive 'teacher' models or combinations thereof to enhance diversity and counteract potential biases. The primary objective here is to construct a reward model capable of effectively gauging the appeal of potential outputs.
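One common concrete formulation of this learning objective (used in InstructGPT-style training, though the article's prompt-and-reward framing admits other variants) is a pairwise ranking loss: the reward model scores a human-preferred output and a rejected one, and the loss shrinks as the preferred score pulls ahead:

```python
import math

def pairwise_rm_loss(score_chosen, score_rejected):
    # Bradley-Terry style ranking loss: -log(sigmoid(margin)).
    # Minimized when the reward model scores the human-preferred
    # output well above the rejected one.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the chosen output scores two points higher, the loss is small (about 0.13); when the scores tie it is log 2; when the ranking is inverted it grows, pushing the model's gradients to correct the ordering.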

Step 4: Incorporating Human Feedback

Integrating human feedback is an integral facet of refining the reward model. A prime illustration of this can be observed in ChatGPT, where users can rate the AI's outputs using a thumbs-up or thumbs-down mechanism.

This collective feedback holds immense value in enhancing the reward model, as it provides direct insights into human preferences. Through this iterative cycle of model training and human feedback, AI undergoes continuous refinement, progressively aligning itself better with human preferences.

       Figure: Incorporating Human Feedback in LLMs

Techniques to Fine-Tune a Model with Reinforcement Learning from Human Feedback

Fine-tuning plays a vital role in the Reinforcement Learning from Human Feedback approach. It enables the language model to refine its responses according to user inputs.

This refinement process employs reinforcement learning methods, incorporating techniques like Kullback-Leibler (KL) divergence and Proximal Policy Optimization (PPO).

Figure: Techniques to fine-tune a model with Reinforcement Learning from Human Feedback

Step 1: Applying the reward model

At the outset, a user's input, referred to as a prompt, is directed to the RL policy, which is essentially a refined version of the language model (LM).

The RL policy generates a response, and both the RL policy's output and the initial LM's output are assessed by the reward model. The reward model assigns a numeric reward value to gauge the quality of these responses.

Step 2: Establishing the feedback loop

This process is iterated within a feedback loop, allowing the reward model to assign rewards to as many responses as resources permit. Responses receiving higher rewards gradually influence the RL policy, guiding it to generate responses that better align with human preferences.
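In practice, the per-response reward that drives this loop is often not the raw reward-model score alone: a common shaping (the exact coefficient `beta` here is a hypothetical choice) subtracts a penalty for how far the policy's log-probabilities have drifted from the original language model:

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    # Reward-model score minus a KL-style drift penalty: if the policy
    # assigns its output much higher log-probability than the original
    # (reference) LM does, the effective reward is reduced. beta trades
    # off preference-following against staying close to the reference.
    return rm_score - beta * (logp_policy - logp_ref)
```

When the policy and reference agree, the shaped reward equals the reward model's score; as the policy drifts, the penalty grows, which keeps reward-chasing from degrading the model's fluency.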

Step 3: Quantifying differences using KL Divergence

Kullback-Leibler (KL) divergence, a statistical measure of the difference between two probability distributions, plays a pivotal role here.

In RLHF, KL divergence is typically used as a penalty term: the token distribution of the RL policy is compared against that of the original, frozen language model, discouraging the fine-tuned policy from drifting too far from fluent, coherent text while it chases reward.
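
For two discrete distributions over the same support, KL divergence can be computed directly; it is zero exactly when the distributions match and grows as they diverge:

```python
import math

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i) over a shared support.
    # Zero when p == q; positive otherwise. Terms with p_i == 0
    # contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For example, a policy that has shifted to `[0.9, 0.1]` from a reference of `[0.5, 0.5]` incurs a positive divergence, which the fine-tuning objective then penalizes.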

     Figure: Measuring Similarity between Probability Distribution

Step 4: Fine-tuning through Proximal Policy Optimization

Integral to the fine-tuning process is Proximal Policy Optimization (PPO), a widely recognized reinforcement learning algorithm known for its effectiveness in optimizing policies within intricate environments featuring complex state and action spaces.

PPO's strength in maintaining a balance between exploration and exploitation during training is particularly advantageous for the RLHF fine-tuning phase.

This equilibrium is vital for RLHF agents, enabling them to learn from both human feedback and trial-and-error exploration. The integration of PPO accelerates learning and enhances robustness.
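
The mechanism behind that stability is PPO's clipped surrogate objective: the update takes the more pessimistic of the raw and clipped policy-ratio terms, capping how far any single update can move the policy. A minimal per-sample sketch:

```python
def ppo_clipped_objective(prob_ratio, advantage, eps=0.2):
    # prob_ratio: pi_new(action) / pi_old(action) for the sampled action.
    # advantage: how much better the action was than expected.
    # Clip the ratio to [1 - eps, 1 + eps] and take the minimum of the
    # raw and clipped terms, so large policy jumps earn no extra credit.
    clipped_ratio = max(min(prob_ratio, 1.0 + eps), 1.0 - eps)
    return min(prob_ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, pushing the ratio past 1.2 yields no additional objective, so the policy has no incentive to overshoot; with a negative advantage, the clipping keeps the penalty from being understated.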

Step 5: Discouraging inappropriate outputs

Fine-tuning serves the purpose of discouraging the language model from generating improper or nonsensical responses. Responses that receive low rewards are less likely to be repeated, incentivizing the language model to produce outputs that more closely align with human expectations.

Challenges Associated with Reinforcement Learning from Human Feedback

What are the difficulties and restrictions associated with RLHF?

  1. Variability and human mistakes: Feedback quality can differ among users and evaluators. Intricate questions in domains like science or medicine call for feedback from subject-matter experts, but locating such experts can be costly and time-intensive.
  2. Question phrasing: The accuracy of answers hinges on the wording of questions. Even with substantial RLHF training, an AI agent struggles to grasp user intent without adequately trained phrasing. This can lead to inaccurate responses due to contextual misunderstandings, although rephrasing the question may resolve this issue.
  3. Bias in training: RLHF is susceptible to machine learning biases. While factual queries yield one correct answer (e.g., "What's 2+2?"), complex questions, especially those related to politics or philosophy, can have multiple valid responses. AI tends to favor its training-based answer, introducing bias by overlooking alternative responses.
  4. Scalability: Since RLHF involves human input, the process tends to be time-intensive. Adapting this method to train larger, more advanced models requires significant time and resources due to its dependency on human feedback. This can potentially be alleviated by devising methods to automate or semi-automate the feedback loop.


Conclusion

In discussions about why people are so interested in ChatGPT, two big reasons come up: one is using more data and computing power, and the other is making conversations with AI more natural and easy.

But there's something important that often gets overlooked – the really smart technical ideas that make models like ChatGPT work well. One of these clever ideas is called "Reinforcement Learning from Human Feedback," or RLHF for short.

This idea mixes together two things: a way computers learn called "reinforcement learning" and input from people. This happens in the field of computers understanding human language.

A while back, using reinforcement learning was mostly limited to games or simulated environments. But recently, this kind of learning and language understanding have come together in a really effective way – and that's RLHF!

So, how does RLHF work, and why is it good? This article explains it step by step. To get what RLHF is all about, you need to understand how models like ChatGPT are trained, and where RLHF comes into play.

Imagine ChatGPT is like a creature that's being trained. First, it's trained on a lot of internet text, but this makes it a bit rough around the edges. Then it's fine-tuned with better information to make it better. Finally, RLHF comes into play to make it even better for talking with people – like giving it a friendly smile.

The RLHF training has three main parts: first, setting up a good starting point; second, getting feedback from humans; and third, improving based on this feedback. This happens again and again to keep making the creature (or model) better.

The blog also explains each part of RLHF. It starts with training the model on lots of internet text. Then it picks the right model and cleans up the data. After that, it trains the model to predict what comes next in sentences. It also checks if the model is doing well.

The next step is preparing for RLHF. Even though the model knows a lot about human language, it doesn't know what people like. So, it gets more information from people about what's good. This is important even though it takes time and money.

The reward model is super important in RLHF. It's like a guide that tells the model if its answers are good or not. This guide is trained using data where people say what they like. This helps the model get better at giving good answers.

Then, fine-tuning comes in. This step uses the guide to help the model improve its answers even more. It's like teaching the model to be better by using rewards. This helps the model learn faster and give better answers.

But there are challenges too. Sometimes people give different feedback, and the way questions are asked can affect the answers. Also, the model might have some biases or prefer certain answers. And getting all this feedback takes time.

Frequently Asked Questions

1.  What purpose does RLHF serve?

Reinforcement Learning from Human Feedback (RLHF) is a machine learning strategy that merges reinforcement learning methods, such as rewards and comparisons, with human guidance to train an artificial intelligence (AI) agent.

2.  How does the process of reinforcement learning from human feedback function?

Once the initial model is trained, human evaluators offer their input on its performance. These human trainers assign scores indicating the quality or accuracy of outputs produced by the model. Subsequently, the system assesses its performance using human feedback to generate rewards for the purpose of reinforcement learning.

3.  Can you provide an illustration of reinforcement learning in humans?

Consider a young child seated on the ground (representing their present state) who takes an action – attempting to stand up – and subsequently receives a reward. This scenario encapsulates reinforcement learning concisely. Although the child might not be familiar with walking, they grasp it through experimentation and learning from mistakes.