
Fine-Tuning Tutorial: Falcon-7b LLM To A General Purpose Chatbot


LLMs are trained on extensive text datasets, equipping them to grasp human language in depth and in context.

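In the past, most models were trained with supervised learning, in which input features are fed to the model together with their corresponding labels. LLMs take a different route: they learn without hand-written labels, an approach usually described as unsupervised (or, more precisely, self-supervised) learning.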

In this process, they consume vast volumes of text data devoid of any labels or explicit instructions. Consequently, LLMs efficiently learn the significance and interconnections among words in a language. They have a broad spectrum of applications, such as generating text, addressing queries, translating languages, and more.
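
To make the "no labels" point concrete, here is a minimal sketch of how raw text can supply its own training targets: at each position, the target is simply the next token. The whitespace tokenizer and the `next_token_pairs` helper are illustrative assumptions, not what a real LLM pipeline uses.

```python
def next_token_pairs(text):
    """Turn raw text into (context, next_token) training pairs."""
    tokens = text.split()  # assumption: naive whitespace tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

Every pair comes from the text itself, which is why no separate labeling step is needed.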

Moreover, these expansive language models can be fine-tuned on customized datasets tailored to specific domains, adding an extra layer of versatility. In this article, I will delve into the necessity of fine-tuning, explore the array of LLMs available, and provide an illustrative example where we try to fine-tune Falcon-7b LLM to act as a general-purpose chatbot.

Understanding Fine-Tuning

While a pre-trained LLM possesses general knowledge, it may struggle with domain-specific questions, for example when it has to comprehend medical terminology and abbreviations. This is where fine-tuning comes into play.

But what exactly is fine-tuning? In essence, it's akin to transferring knowledge. These expansive language models are trained on extensive datasets using substantial computational resources and contain billions of parameters.

Figure: Transfer Learning


The linguistic patterns and representations acquired by the LLM during its initial training are transferred to your current task. In technical terms, we begin with a model that is initialized with pre-trained weights.

Subsequently, it undergoes training using data relevant to your specific task, refining the parameters to be more aligned with the task's requirements. You also have the flexibility to adjust the model's architecture and modify its layers to suit your specific needs.
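
As a toy illustration of this idea, the sketch below fits a one-parameter model `y = w * x` with a few gradient steps, once starting from a "pre-trained" weight and once from zero. All numbers here (the task, the starting weights, the learning rate) are hypothetical and chosen only to show why a good starting point helps; it is not a model of how an LLM is trained.

```python
def train(w, data, lr=0.01, steps=5):
    """Run a few gradient-descent steps on mean squared error for y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


task_data = [(x, 2.5 * x) for x in (1.0, 2.0, 3.0)]  # hypothetical task: y = 2.5x

pretrained_w = 2.0  # hypothetical weight learned earlier on related data
scratch_w = 0.0     # uninformed starting point

fine_tuned = train(pretrained_w, task_data)
from_scratch = train(scratch_w, task_data)
print("fine-tuned error:  ", mse(fine_tuned, task_data))
print("from-scratch error:", mse(from_scratch, task_data))
```

With the same small training budget, the run that starts from the pre-trained weight ends up with the lower error.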

Falcon-7b

Falcon LLM stands as a foundational large language model with a staggering 40 billion parameters, developed through training on an extensive corpus of one trillion tokens. The Technology Innovation Institute (TII) has recently introduced Falcon LLM, showcasing this impressive 40B model.

Notably, Falcon LLM's training process was conducted with remarkable efficiency, utilizing only 75 percent of the training compute employed by GPT-3, 40 percent of Chinchilla's, and 80 percent of PaLM-62B's.

At the time of writing, Falcon holds the top position on the OpenLLM Leaderboard hosted on the HuggingFace platform, surpassing Meta's LLaMA-65B model.

Falcon-40B is a decoder-only autoregressive model. Its training spanned two months and used 384 GPUs hosted on AWS. Falcon-7B, the model fine-tuned in this tutorial, is the 7-billion-parameter member of the same family.

To build its pretraining dataset, Falcon drew on public web crawls: dumps provided by CommonCrawl were put through rigorous filtering that removed machine-generated content and adult material. This curation yielded a dataset of nearly five trillion tokens.

Fine-Tuning with PEFT (Parameter Efficient Fine Tuning)

The process of training models with a size exceeding 10 billion parameters can present significant technical and computational challenges.

This section focuses on the tools available within the Hugging Face ecosystem to efficiently train these extremely large models using basic hardware. It also demonstrates the fine-tuning process of Falcon-7b on a single NVIDIA T4 (16GB) within Google Colab.

The approach involves training Falcon using the Guanaco dataset, which is a high-quality subset extracted from the Open Assistant dataset. This subset comprises approximately 10,000 dialogues.

Figure: Parameter Efficient Fine Tuning


Leveraging the PEFT library, the fine-tuning process employs the QLoRA approach, which involves fine-tuning adapters placed on top of the frozen 4-bit model. Further information about integrating 4-bit quantized models can be found in the associated blog post.

Figure: Fine-Tuning with QLoRA
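
A rough back-of-the-envelope estimate shows why quantizing the frozen base model matters on a 16GB GPU. The sketch below counts weight storage only and assumes a nominal 7 billion parameters; it ignores activations, optimizer state, and the small overhead of quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # assumption: nominal "7B" parameter count

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(num_params, bits):5.1f} GB")
```

Under these assumptions, half-precision weights alone come close to filling a 16GB card, while 4-bit weights take roughly a quarter of that and leave headroom for the adapters and training state.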

Utilizing Low-Rank Adapters (LoRA) for fine-tuning allows only a small portion of the model to be trainable. This substantially reduces the number of learned parameters and the size of the final trained model artifact. For instance, the saved adapter occupies a mere 65MB for the 7B-parameter model, whereas the original model is around 15GB in half precision.

To elaborate, the process involves selecting the target modules for adaptation, often the query/key layers of the attention module. Small trainable linear layers are then appended near these modules. The hidden states produced by these adapters are combined with the original states to derive the ultimate hidden state.

After the training is completed, there is no necessity to save the entire model, as the base model remains frozen. Additionally, the model can be maintained in any preferred data type (int8, fp4, fp16, etc.), provided that the output hidden states from these modules are cast into the same data type as those from the adapters.

This holds true for bitsandbytes modules, specifically, Linear8bitLt and Linear4bit, which generate hidden states with the same data type as the original unquantized module.
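
The adapter mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the PEFT implementation: the `LoRALinear` class, the toy identity weight, and the dimensions are all hypothetical. It follows the usual LoRA recipe in which the frozen output `W x` is combined with a scaled low-rank update `(alpha / r) * B (A x)`, with `B` initialized to zero so the adapter starts as a no-op.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during fine-tuning
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


d = 64  # hypothetical layer width
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # toy frozen weight
layer = LoRALinear(W, r=4, alpha=16)
x = [1.0] * d

print(layer.forward(x)[:3])
print(layer.trainable_params(), "trainable vs", d * d, "frozen parameters")
```

Only `A` and `B` would be saved after training, which is why the resulting artifact is so much smaller than the base model.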

Hands-on With Code

In this section, we fine-tune the Falcon-7b foundational model using the parameter-efficient fine-tuning approach.


To follow the tutorial below, you should be familiar with the following:

  1. Python: the language we use for the hands-on part.
  2. Large Language Models: foundational models trained on a large corpus of data.
  3. PyTorch: the Python framework used for machine-learning and deep-learning projects.


We're going to make use of the PEFT library from Hugging Face's collection and also utilize QLoRA to make the process of fine-tuning more memory-friendly.

We'll use the Guanaco dataset, a refined part of the OpenAssistant dataset designed specifically to train versatile chatbots. The dataset contains various questions that require generative outputs.

Each record pairs a question with its answer. The dataset is also multilingual, i.e., it includes questions in both English and Spanish. It contains about 9.85K training instances and 518 test instances.


Run the code cells below to install the required libraries. Our experiment needs accelerate, peft, transformers, datasets, and TRL, the last of which provides the newly introduced SFTTrainer.

To quantize the base model to 4 bits, we use the bitsandbytes library. We also need einops, which is required for loading Falcon models.


We begin by installing all the required dependencies.

pip install -q -U trl transformers accelerate git+

!pip install -q datasets bitsandbytes einops wandb

We now load the Guanaco dataset, which is published on the Hugging Face Hub under the name timdettmers/openassistant-guanaco.

# Import the necessary library for loading datasets
from datasets import load_dataset

# Specify the name of the dataset
dataset_name = "timdettmers/openassistant-guanaco"

# Load the dataset from the specified name and select the "train" split
dataset = load_dataset(dataset_name, split="train")
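
To get a feel for the data before training, it helps to split one record into its conversation turns. The sketch below assumes each record's text marks turns with "### Human:" and "### Assistant:" prefixes, as in the dataset's published format; the sample string is invented for illustration and the `split_turns` helper is hypothetical.

```python
import re


def split_turns(text):
    """Split a '### Human: ... ### Assistant: ...' string into (speaker, message) pairs."""
    parts = re.split(r"###\s*(Human|Assistant):", text)
    # With one capture group, re.split yields [prefix, speaker, message, speaker, message, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


# Invented sample in the assumed format (not an actual dataset record)
sample = "### Human: What is fine-tuning?### Assistant: Adapting a pre-trained model to a new task."

for speaker, message in split_turns(sample):
    print(f"{speaker}: {message}")
```

The same helper could be applied to `dataset[0]["text"]` once the dataset is loaded, assuming the column is named `text`.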

Next, we load our model. We will load the Falcon 7B model, apply 4-bit quantization to it, and then add LoRA adapters to the model.

# Importing the required libraries
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Defining the name of the Falcon model
model_name = "ybelkada/falcon-7b-sharded-bf16"

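# Configuring the BitsAndBytes quantization
# NOTE: the argument values below are assumed; they follow the usual QLoRA setup
# (4-bit NF4 weights, float16 compute to match fp16=True in the training arguments)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Loading the Falcon model with quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# Disabling cache usage in the model configuration
model.config.use_cache = False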

Next, we load the Tokenizer.

# Load the tokenizer for the Falcon 7B model with remote code trust
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Set the padding token to be the same as the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
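
Reusing the end-of-sequence token for padding is a common workaround when a tokenizer ships without a dedicated padding token. The sketch below shows what padding does to a batch, using made-up token ids and a hypothetical `pad_batch` helper rather than the real tokenizer.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to equal length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask


eos_id = 11  # hypothetical end-of-sequence id, reused as the padding id
batch = [[5, 8, 2], [7, 3, 9, 4, 6]]  # made-up token ids

input_ids, attention_mask = pad_batch(batch, pad_id=eos_id)
print(input_ids)
print(attention_mask)
```

The attention mask marks padded positions with 0 so the model can ignore them, even though they hold the same id as a genuine end-of-sequence token.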

Next, we import the LoraConfig class to construct the LoRA model. For the best results, the adapters should cover all linear layers within the transformer block.

For this reason, we include the dense, dense_h_to_4h, and dense_4h_to_h layers as target modules alongside the mixed query_key_value layer.

# Import the necessary module for LoRA configuration
from peft import LoraConfig

# Define the parameters for LoRA configuration
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

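# Create the LoRA configuration object
# NOTE: bias and task_type are assumed values; the target modules are the layers named above
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
)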

We now utilize the SFTTrainer provided by the TRL library, which offers a convenient interface around the Transformers Trainer.

This enables straightforward fine-tuning of models on instruction-based datasets using PEFT adapters. To begin, we'll load the training arguments as shown below.

from transformers import TrainingArguments
# Define the directory to save training results
output_dir = "./results"

# Set the batch size per device during training
per_device_train_batch_size = 4

# Number of steps to accumulate gradients before updating the model
gradient_accumulation_steps = 4

# Choose the optimizer type (e.g., "paged_adamw_32bit")
optim = "paged_adamw_32bit"

# Interval to save model checkpoints (every 10 steps)
save_steps = 10

# Interval to log training metrics (every 10 steps)
logging_steps = 10

# Learning rate for optimization
learning_rate = 2e-4

# Maximum gradient norm for gradient clipping
max_grad_norm = 0.3

# Maximum number of training steps
max_steps = 50

# Warmup ratio for learning rate scheduling
warmup_ratio = 0.03

# Type of learning rate scheduler (e.g., "constant")
lr_scheduler_type = "constant"

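# Create a TrainingArguments object to configure the training process
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,  # Use mixed precision training (16-bit)
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
)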
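
It can help to work out what these settings mean in terms of data actually seen. The short calculation below uses the values defined above, assumes a single GPU as in the Colab setup, and takes the approximate training-set size quoted earlier in this article.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 50
num_devices = 1        # assumption: a single GPU, as in the Colab setup
train_examples = 9850  # approximate training-set size quoted earlier

# Gradients from several small batches are accumulated before each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / train_examples

print("effective batch size:", effective_batch_size)
print("examples seen:", examples_seen)
print(f"fraction of one epoch: {fraction_of_epoch:.1%}")
```

Under these assumptions the run covers well under one pass over the data, so it is best read as a quick demonstration; raising max_steps would be the natural change for a longer run.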

Pass all the configured components to the trainer.

# Import the SFTTrainer from the TRL library
from trl import SFTTrainer

# Set the maximum sequence length
max_seq_length = 512

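# Create a trainer instance using SFTTrainer
# NOTE: argument names are assumed from TRL releases of this tutorial's era (newer
# releases move some into SFTConfig); "text" is the assumed dataset column name
trainer = SFTTrainer(
    model=model, train_dataset=dataset, peft_config=peft_config,
    dataset_text_field="text", max_seq_length=max_seq_length,
    tokenizer=tokenizer, args=training_arguments,
)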

Before training, we convert the model's layer norms to float32, a step taken to make training more stable. We then train the model.

# Iterate through the named modules of the trainer's model
for name, module in trainer.model.named_modules():
    # Check if the name contains "norm"
    if "norm" in name:
        # Convert the module to use torch.float32 data type
        module = module.to(torch.float32)

# Start training
trainer.train()



In the above tutorial, we fine-tuned a falcon-7b model on the Guanaco dataset, which contains the kind of question-and-answer conversations a general-purpose chatbot is expected to handle.

Figure: Loss Plot


From the above loss plot, we can see that the loss decreases steadily as training progresses. This suggests that the model is learning to produce the kind of responses found in the training conversations.

Figure: Output on the fine-tuned falcon-7b model


In conclusion, the journey into the realm of fine-tuning Large Language Models (LLMs) has illuminated the power of these models in comprehending human language. Unlike traditional supervised learning, where models learn from labeled data, LLMs undergo unsupervised training on vast text datasets, enabling them to grasp intricate language patterns and relationships.

The process involves immersing LLMs in text data without explicit labels or instructions, fostering a deep understanding of language nuances. This foundation has led to their application in various domains, including text generation, translation, and more.

Furthermore, the article unveiled the concept of fine-tuning, a process that bridges the gap between pre-trained LLMs and domain-specific tasks. Fine-tuning involves transferring the acquired knowledge from general language understanding to specialized tasks.

Falcon LLM, a remarkable model with 40 billion parameters, showcased its prowess as a foundational model. Its efficient training process, relative to other models, and its impressive performance on OpenLLM Leaderboards exemplify its capabilities.

The tutorial's hands-on section explored fine-tuning Falcon LLM using the PEFT library and Low-Rank Adapters (LoRA). This demonstrated the practical aspects of preparing datasets, configuring models, and utilizing tools like SFTTrainer.

By showcasing the process on a single NVIDIA T4 GPU, the tutorial provided a glimpse into efficiently fine-tuning large models using basic hardware. Altogether, this exploration showcases the potential of LLMs and offers a guide for implementing effective fine-tuning strategies.

Frequently Asked Questions

1.  What is the Falcon 7B Model?

Falcon-7B is a causal decoder-only model with 7 billion parameters, developed by TII (Technology Innovation Institute). Falcon-7B-Instruct is a variant built upon Falcon-7B that has been fine-tuned using a combination of chat and instruct datasets.

2.  What is QLoRA?

QLoRA is a technique designed to make fine-tuning large language models (LLMs) more efficient by decreasing their memory requirements with little loss in performance.

This is achieved through a series of methods, including implementing 4-bit quantization, introducing a novel data type referred to as 4-bit NormalFloat (NF4), double quantization, and utilizing paged optimizers.
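
To give a feel for what 4-bit quantization does, the sketch below maps a block of weights to 15 evenly spaced integer levels, which fit in 4 bits, and back again. This is a deliberately simplified stand-in: it uses plain absmax scaling rather than the NF4 data type, whose levels are not evenly spaced, and it omits double quantization. The function names and sample weights are hypothetical.

```python
def quantize_4bit(block):
    """Map floats to integer codes in [-7, 7], scaled by the block's largest magnitude."""
    scale = max(abs(v) for v in block) / 7 or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / scale) for v in block]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from the integer codes and the stored scale."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.91, -0.08, 0.33, -0.77, 0.02, 0.64]  # made-up values
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"max round-trip error: {max_error:.4f}")
```

Each weight is stored as a small integer plus one shared scale per block, which is where the memory savings come from; the price is the rounding error visible in the round trip.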

3.  What is PEFT Library Used For?

Parameter-Efficient Fine-Tuning (PEFT) is a toolkit developed for efficiently adapting pre-trained language models (PLMs) to different practical uses without the need to fine-tune every single parameter of the model.