Tutorial To Leverage Open AI's CLIP Model For Fashion Industry

In recent years, the field of artificial intelligence and machine learning has made significant strides, enabling researchers and developers to achieve groundbreaking outcomes. Among these advancements, the CLIP (Contrastive Language–Image Pretraining) model by OpenAI has emerged as a revolutionary breakthrough in AI.

Leveraging its multimodal capacity, CLIP is remarkably capable of understanding and establishing connections between text and images. This model holds immense promise across a wide range of applications, with a particular highlight being its prowess in zero-shot classification.

Figure: Working of CLIP Model

Having covered the technical background of the CLIP model in our previous blog, here we fine-tune the model on our custom Indo Fashion data to make it more domain-specific.


To proceed further, one should be familiar with:

  1. Python: All the code below is written in Python.
  2. OpenCV (Open Source Computer Vision): OpenCV provides a standard infrastructure for computer vision-based applications.
  3. PyTorch: PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella.
  4. transformers: Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).
  5. Kaggle: Kaggle is a data science and artificial intelligence platform. On this platform, large companies and organizations publish contests with monetary prizes.


This tutorial aims to fine-tune a CLIP model with the Indo Fashion dataset available on Kaggle.


The Indo Fashion dataset is the first extensive collection of Indian ethnic clothing images. It totals over 106,000 images, divided into 15 unique clothing categories. To ensure equitable evaluation, the validation and test sets are balanced, with each category containing 500 samples.

CLIP is fine-tuned on (Image, Text) pairs, so the training data must be in that format. Our dataset provides:

  1. Images Folder
  2. A JSON file for each of the train and test splits, with each JSON row containing attributes like Product Title, Class Label, Image Path, and Image URL.

For the text, we use the product title attribute, truncated to a maximum length of 40 characters.

Figure: The image above is a sample from the dataset, and the image below is the JSON record associated with it.
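As a minimal sketch of how each JSON row can be turned into an (image path, caption) pair, the snippet below parses a few JSON-lines records and truncates the product title to 40 characters. The key names `image_path` and `product_title`, the helper `build_pairs`, and the sample rows are assumptions made for illustration; check them against the actual file before relying on them.

```python
import json
import os

MAX_TITLE_CHARS = 40  # caption length limit used in this tutorial


def build_pairs(json_lines, image_dir):
    """Turn JSON-lines records into (image_path, caption) pairs.

    Assumes each record has 'image_path' and 'product_title' keys
    (hypothetical names; verify against the real dataset).
    """
    pairs = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the file name and join it to the local image folder
        file_name = record["image_path"].split("/")[-1]
        caption = record["product_title"][:MAX_TITLE_CHARS]
        pairs.append((os.path.join(image_dir, file_name), caption))
    return pairs


# Made-up sample rows, for illustration only
sample_lines = [
    '{"image_path": "images/train/0.jpeg", "product_title": "Women Blue Printed Cotton Saree With Blouse Piece", "class_label": "saree"}',
    '{"image_path": "images/train/1.jpeg", "product_title": "Men Kurta", "class_label": "kurta_men"}',
]
pairs = build_pairs(sample_lines, "/kaggle/input/indo-fashion-dataset/images/train")
for path, caption in pairs:
    print(path, "|", caption)
```

Truncating by characters keeps the captions short enough for CLIP's tokenizer while preserving the most descriptive leading words of each title.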

Hands-On with Code

Before working on the code, you need access to Kaggle. Create an account on Kaggle and create a new notebook there.

Before we begin tuning our CLIP model, we first need to install the openai-clip package. Keep in mind that fine-tuning holds the following in memory:

  1. Pre-trained model
  2. Optimizer states

For more information on how memory is consumed when an LLM is loaded for fine-tuning, refer here.
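To make the memory point concrete, here is a rough back-of-the-envelope sketch. It assumes full fine-tuning in 32-bit precision with Adam, counts only weights, gradients, and Adam's two moment buffers, and ignores activations (which depend on batch size). The function name `finetune_memory_gb` is hypothetical, and the parameter count of roughly 151 million for the ViT-B/32 variant of CLIP is an approximation.

```python
def finetune_memory_gb(num_params, bytes_per_value=4):
    """Rough memory for full fine-tuning with Adam, ignoring activations.

    Counts weights, gradients, and Adam's two moment buffers.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_states = 2 * num_params * bytes_per_value  # first and second moments
    return (weights + gradients + adam_states) / 1024**3


# Assumed parameter count for the ViT-B/32 CLIP model (roughly 151 million)
approx_params = 151_000_000
print(f"{finetune_memory_gb(approx_params):.2f} GB")
```

Under these assumptions the optimizer states alone take twice as much memory as the model weights, which is why they are listed separately above.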

For this, run:

!pip install openai-clip

We then begin by importing all the required libraries.

#For Parsing json data
import json

#For Loading Images
from PIL import Image

#For displaying loadbar
from tqdm import tqdm

#Importing pytorch to finetune our clip
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

#Here, we import the clip package developed by OpenAI
import clip

#Constructs a CLIP processor which wraps a CLIP image processor and a CLIP tokenizer into a single processor.
from transformers import CLIPProcessor, CLIPModel

Now, we load our input data and create a list of JSON data named input_data. This list will be used to create our text input and image.

# Respective Paths

json_path = '/kaggle/input/indo-fashion-dataset/train_data.json'
image_path = '/kaggle/input/indo-fashion-dataset/images/train/'

input_data = []
with open(json_path, 'r') as f:
    for line in f:
        # Each line of the file is one JSON record
        obj = json.loads(line)
        input_data.append(obj)

# Load the CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Next, we set our device to GPU (CUDA) when one is available and load the pre-trained CLIP model through the clip package. This is the model we fine-tune below.

# Choose computation device
device = "cuda:0" if torch.cuda.is_available() else "cpu" 

# Load pre-trained CLIP model
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

Now, we create a class for defining our dataset.

# Define a custom dataset
class image_title_dataset():
    def __init__(self, list_image_path, list_txt):
        # Initialize image paths and corresponding texts
        self.image_path = list_image_path

        # Tokenize text using CLIP's tokenizer
        self.title = clip.tokenize(list_txt)

    def __len__(self):
        # Define the length of the dataset
        return len(self.title)

    def __getitem__(self, idx):
        # Get an item from the dataset
        # Load the image and preprocess it using CLIP's preprocessing function
        image = preprocess(Image.open(self.image_path[idx]))
        title = self.title[idx]
        return image, title

Now, we prepare our dataset using the PyTorch DataLoader, which serves images in batches efficiently. Our dataset is of the form (Image, Text), where the text is extracted from the product title attribute of the JSON file.

# Build the lists of image paths and captions
# NOTE: the key names 'image_path' and 'product_title' are assumed from the
# attributes listed earlier; adjust them if your JSON uses different keys
list_image_path = []
list_txt = []
for item in input_data:
    # Keep only the file name and join it to the local image folder
    img_path = image_path + item['image_path'].split('/')[-1]
    # Truncate the product title to 40 characters
    caption = item['product_title'][:40]
    list_image_path.append(img_path)
    list_txt.append(caption)

# Create the dataset and wrap it in a DataLoader
dataset = image_title_dataset(list_image_path, list_txt)
# batch_size is an assumed value; choose one that fits your GPU memory
train_dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

Now, we set up our Adam optimizer with specific hyperparameters and define the respective loss functions.

# Check if the device is set to CPU
if device == "cpu":
    model.float()  # Convert the model's parameters to float if using CPU

# Prepare the optimizer
optimizer = torch.optim.Adam(
    model.parameters(), lr=5e-5, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.2)
# Adam optimizer is used with specific hyperparameters
# lr (learning rate) is set to 5e-5, which is considered safe for fine-tuning to a new dataset
# betas control the decay rates of the running averages of the gradient and its square
# eps is a small value to prevent division by zero
# weight_decay adds L2 regularization to the optimizer

# Specify the loss function for images
loss_img = nn.CrossEntropyLoss()

# Specify the loss function for text
loss_txt = nn.CrossEntropyLoss()
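To show what each optimizer hyperparameter does, the sketch below applies a single Adam update to one scalar parameter in plain Python. It follows the documented update rule of Adam with L2 weight decay folded into the gradient; the function `adam_step` and the starting values are hypothetical and exist only for illustration, and a real training run should rely on torch.optim.Adam itself.

```python
import math


def adam_step(param, grad, m, v, t, lr=5e-5, betas=(0.9, 0.98),
              eps=1e-6, weight_decay=0.2):
    """One Adam update for a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    grad = grad + weight_decay * param          # L2 penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    # eps keeps the denominator away from zero
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Made-up starting point: parameter 1.0, gradient 0.5, first step (t=1)
param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=0.5, m=m, v=v, t=1)
print(param)
```

On the first step the bias-corrected ratio is close to 1, so the parameter moves by almost exactly the learning rate, which is one reason a small value such as 5e-5 keeps fine-tuning from drifting far from the pre-trained weights.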

Finally, we train our model. Below is the code:

# Helper that casts parameters and gradients to FP32 before the optimizer step
def convert_models_to_fp32(model):
    for p in model.parameters():
        p.data = p.data.float()
        if p.grad is not None:
            p.grad.data = p.grad.data.float()

# Train the model
num_epochs = 10  # Number of training epochs
for epoch in range(num_epochs):
    pbar = tqdm(train_dataloader, total=len(train_dataloader))
    # Iterate through the batches in the training data
    for batch in pbar:
        optimizer.zero_grad()  # Zero out gradients for the optimizer
        # Extract images and texts from the batch
        images, texts = batch
        # Move images and texts to the specified device (CPU or GPU)
        images = images.to(device)
        texts = texts.to(device)

        # Forward pass through the model
        logits_per_image, logits_per_text = model(images, texts)

        # Compute the loss
        ground_truth = torch.arange(len(images), dtype=torch.long, device=device)
        total_loss = (loss_img(logits_per_image, ground_truth) + loss_txt(logits_per_text, ground_truth)) / 2

        # Backward pass and update the model's parameters
        total_loss.backward()
        if device == "cpu":
            # If the device is CPU, directly update the model
            optimizer.step()
        else:
            # Convert model's parameters to FP32 format, update, and convert back
            convert_models_to_fp32(model)
            optimizer.step()
            clip.model.convert_weights(model)

        # Update the progress bar with the current epoch and loss
        pbar.set_description(f"Epoch {epoch + 1}/{num_epochs}, Loss: {total_loss.item():.4f}")
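The loss in the loop above treats each batch as a matching problem: the i-th image should score highest against the i-th text, which is why the targets are simply 0 to N-1. The NumPy sketch below reproduces that symmetric cross-entropy on random toy embeddings. The function names, the toy data, and the fixed `logit_scale` of 100 are assumptions for illustration, not values taken from the training run.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean cross-entropy over rows of `logits` with integer `targets`."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def clip_loss(image_emb, text_emb, logit_scale=100.0):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    # L2-normalise so the dot product is a cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits_per_image = logit_scale * image_emb @ text_emb.T
    logits_per_text = logits_per_image.T
    # The i-th image matches the i-th text, so the targets are 0..N-1
    ground_truth = np.arange(len(image_emb))
    return (cross_entropy(logits_per_image, ground_truth)
            + cross_entropy(logits_per_text, ground_truth)) / 2


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))                 # toy embeddings, not real CLIP output
aligned = clip_loss(emb, emb)                 # each image paired with its own text
shuffled = clip_loss(emb, emb[[1, 2, 3, 0]])  # every pair mismatched
print(aligned < shuffled)
```

Correctly paired embeddings give a much lower loss than mismatched ones, which is the signal fine-tuning uses to pull product images toward their titles.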


Conclusion

Artificial intelligence and machine learning have witnessed significant advancements in recent years, enabling researchers and developers to achieve groundbreaking results.

One such remarkable breakthrough in AI is the CLIP (Contrastive Language–Image Pretraining) model developed by OpenAI.

CLIP stands out for its exceptional ability to comprehend and establish meaningful connections between text and images, making it a versatile tool with enormous potential across various applications.

One of its standout features is its proficiency in zero-shot classification, allowing it to classify images based on textual descriptions without explicit training for those specific classes.

To harness the power of CLIP for a more domain-specific application, we embarked on fine-tuning the model using the Indo fashion dataset, which comprises a substantial collection of Indian ethnic clothing images categorized into 15 different classes.

This dataset, totaling over 106,000 images, was carefully balanced to ensure equitable distribution across the validation and test sets.

Our journey to fine-tune the CLIP model involved several steps:

  1. Data Preparation: We organized the dataset, consisting of images and corresponding textual attributes, in (Image, Text) format. The textual attribute we used was the product title, limited to a maximum of 40 characters.
  2. Model and Library Setup: We imported essential libraries, including PIL, PyTorch, transformers, and the CLIP model itself. We also created a custom dataset class for efficient data loading and preprocessing.
  3. Data Loading: We loaded the input data, which included image paths, text descriptions, class labels, image URLs, and more, from JSON files.
  4. Model Initialization: We initialized the CLIP model and its associated processor, setting the device to either CPU or GPU as available.
  5. Dataset Creation: We prepared the data for training using our custom dataset class, tokenizing the text descriptions and preprocessing the images.
  6. Training Setup: We set up the training environment, including the optimizer with specific hyperparameters and defined loss functions for both images and text.
  7. Training Loop: We ran the training process, iterating through the dataset for a predefined number of epochs. We performed forward and backward passes for each batch, updating the model's parameters using the Adam optimizer. We also accounted for device-specific considerations when updating the model.

By following these steps, we fine-tuned the CLIP model to make it more domain-specific, tailored for classifying Indian ethnic clothing images.

This process showcased the CLIP model's versatility and highlighted the potential for customizing AI models for specialized tasks. In the ever-evolving landscape of AI and machine learning, CLIP and similar models continue to open new avenues for innovation and problem-solving across various domains.

Frequently Asked Questions

1.  What is the CLIP model?

CLIP is a neural network that efficiently learns visual concepts from natural language supervision. It can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the "zero-shot" capabilities of GPT-2 and GPT-3.
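The zero-shot idea can be sketched in a few lines: embed the image, embed one text per category name, and pick the category whose text embedding is most similar. The snippet below uses tiny hand-made vectors in place of real CLIP embeddings, and the class names, the function `zero_shot_classify`, and the `logit_scale` value are all assumptions chosen for illustration.

```python
import numpy as np


def zero_shot_classify(image_emb, text_embs, class_names, logit_scale=100.0):
    """Pick the class whose text embedding is most similar to the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * text_embs @ image_emb   # one score per class
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over classes
    return class_names[int(probs.argmax())], probs


# Toy 3-d vectors standing in for real CLIP embeddings
class_names = ["saree", "kurta", "lehenga"]
text_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
image_emb = np.array([0.1, 0.9, 0.2])
label, probs = zero_shot_classify(image_emb, text_embs, class_names)
print(label)
```

No classifier head is trained here: adding a new category only requires adding its name to the list and embedding it.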

2.  How does CLIP AI operate?

CLIP acquires visual concepts through natural language guidance. Traditional supervised computer vision systems are trained or fine-tuned with a predefined set of labels, which restricts the model's adaptability whenever it encounters new labels. CLIP draws upon prior research such as VirTex and ConVIRT to address these limitations.