CLIP

Tutorial To Leverage Open AI's CLIP Model For Fashion Industry

Akshit Mehra

Sep 18, 2023 • 9 min read

Tutorial To Leverage Open AI's CLIP Model For Fashion Industry

In recent years, the field of artificial intelligence and machine learning has made significant strides, enabling researchers and developers to achieve groundbreaking outcomes. Among these advancements, the CLIP (Contrastive Language–Image Pretraining) model by OpenAI has emerged as a revolutionary breakthrough in AI.

Leveraging its multimodal capacity, CLIP is remarkably capable of understanding and establishing connections between text and images. This model holds immense promise across a wide range of applications, with a particular highlight being its prowess in zero-shot classification.

Figure: Working of CLIP Model

Figure: Working of CLIP Model

As we have already been through technical know-how for the CLIP Model in our previous blog, we aim to utilize the clip model and pre-train it over our custom Indo-fashion data to make it more domain-specific.

Pre-Requisites

To proceed further, one should be familiar with:

Python: All the below code will be written using python
OpenCV (Open source Computer Vision): OpenCV provides a standard infrastructure for computer vision-based applications.
Pytorch: PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella.
transformers: PyTorch-Transformers (formerly known as PyTorch-pre trained-bert) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).
Kaggle: Kaggle is a data science and artificial intelligence platform. On this platform, large companies and organizations publish contests with monetary prizes.

Methodology

This tutorial aims to fine-tune a clip model with the Indo fashion Dataset available on Kaggle.

Dataset

The dataset under consideration consists of 106,000 images, encompassing 15 unique clothing categories. To ensure equitable evaluation, the distribution of these categories is kept consistent in both the validation and test sets, with each category containing 500 samples.

Below is the list of the classes used:

Saree
Lehenga
Women Kurta
Dupatta
Gown
Nehru Jacket
Sherwani
Men Kurta
Men Mojari
Leggings and Salwar
Blouse
Palazzo
Dhoti Pants
Petticoat
Women Mojari

‌
Figure: IndoFashion Dataset

Figure: IndoFashion Dataset

The Indofashion dataset stands as the first extensive collection of Indian ethnic clothing images, totaling over 106,000 items and divided into 15 different categories for precise classification purposes.

To fine-tune the clip model, we know that the dataset required to train is in (Image, Text) format. So, for this, in our dataset, we have:

Images Folder
JSON File for each train, test with each JSON row containing attributes link Product Title, Class label, Image path, Image URL.

So, for text, we use the product title attribute, with a maximum length of 40 characters.

Figure: Above image corresponds to Image sample and Below image corresponds to json file associated with it.

Figure: The above image corresponds to the Image sample, and the Below image corresponds to the JSON file associated.

Hands-On with Code

Before working on the code, you should have access to Kaggle. For that, Create an account on Kaggle visit here, and create a new notebook.

Before we begin with tuning our clip model, we would require first to install the openai-clip model, which offers the following:

Pre-trained model
Optimizers states

The optimizer state refers to properties like the optimizer's momentum vector or historical tracking information.

For instance, in the case of the Adam optimizer, it keeps track of moving averages of the gradient and squared gradient. It helps in continuing the fine-tuning of the model from the checkpointed area.

For more information on how memory is consumed inside RAM when an LLM/Foundational Model is loaded for fine-tuning, refer here.

For this, add a new code block, and type:

!pip install openai-clip

We then begin by importing all the required libraries.

#For Parsing json data
import json

#For Loading Images
from PIL import Image

#For displaying loadbar
from tqdm import tqdm

#Importing pytorch to finetune our clip
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

#Here, we import clip model developed by meta
import clip

#Constructs a CLIP processor which wraps a CLIP image processor and a CLIP tokenizer into a single processor.
from transformers import CLIPProcessor, CLIPModel

Now, we load our input data and create a list of JSON data named input_data. This list will be used to create our text input and image.

# Respective Paths

json_path = '/kaggle/input/indo-fashion-dataset/train_data.json'
image_path = '/kaggle/input/indo-fashion-dataset/images/train/'

input_data = []
with open(json_path, 'r') as f:
	for line in f:
		obj = json.loads(line)
		input_data.append(obj)

# Load the CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Setting our device to GPU (Cuda) and loading the pre-trained CLIP model.

# Choose computation device
device = "cuda:0" if torch.cuda.is_available() else "cpu" 

# Load pre-trained CLIP model
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

Now, we create a class for defining our Clothing Dataset.

# Define a custom dataset
class image_title_dataset():
	def __init__(self, list_image_path, list_txt):
		# Initialize image paths and corresponding texts
		self.image_path = list_image_path

		# Tokenize text using CLIP's tokenizer
		self.title = clip.tokenize(list_txt)
	
    def __len__(self):
		# Define the length of the dataset
		return len(self.title)

	def __getitem__(self, idx):
		
        # Get an item from the dataset
		# Preprocess image using CLIP's preprocessing function
		
        image = preprocess(Image.open(self.image_path[idx]))
		title = self.title[idx]
		return image, title

Now, we prepare our dataset using the pytorch Dataloader, which facilitates images in batches efficiently. Our Dataset is of the form (Image, text), where the text is extracted from the product title attribute of JSON file.

# use your own data
list_image_path = []
list_txt = []
for item in input_data:
  img_path = image_path + item['image_path'].split('/')[-1]
  
  #As we have image text pair, we use product title as description.
  caption = item['product_title'][:40]
  list_image_path.append(img_path)
  list_txt.append(caption)

dataset = image_title_dataset(list_image_path, list_txt)
train_dataloader = DataLoader(dataset, batch_size=100, shuffle=True) 
#Define your own dataloader

# Function to convert model's parameters to FP32 format
#This is done so that our model loads in the provided memory.
def convert_models_to_fp32(model): 
    for p in model.parameters(): 
        p.data = p.data.float() 
        p.grad.data = p.grad.data.float()

Now, we set our Adam optimizer with certain parameters and set respective loss functions.

# Check if the device is set to CPU
if device == "cpu":
    model.float()  # Convert the model's parameters to float if using CPU

# Prepare the optimizer
optimizer = torch.optim.Adam(
    model.parameters(), lr=5e-5, betas=(0.9, 0.98), eps=1e-6 ,weight_decay=0.2) 
    
# Adam optimizer is used with specific hyperparameters
# lr (learning rate) is set to 5e-5, which is considered safe for fine-tuning to a new dataset
# betas are used for the optimization algorithm
# eps is a small value to prevent division by zero
# weight_decay adds L2 regularization to the optimizer

# Specify the loss function for images
loss_img = nn.CrossEntropyLoss()

# Specify the loss function for text
loss_txt = nn.CrossEntropyLoss()

Finally we train our model. Below is the code:

# Train the model
num_epochs = 4 # Number of training epochs
for epoch in range(num_epochs):
    pbar = tqdm(train_dataloader, total=len(train_dataloader))
    
    # Iterate through the batches in the training data
    for batch in pbar:
        optimizer.zero_grad()  # Zero out gradients for the optimizer
        
        # Extract images and texts from the batch
        images, texts = batch
        
        # Print the current device (CPU or GPU)
        print(device)
        
        # Move images and texts to the specified device (CPU or GPU)
        images = images.to(device)
        texts = texts.to(device)

        # Forward pass through the model
        logits_per_image, logits_per_text = model(images, texts)

        # Compute the loss
        ground_truth = torch.arange(len(images), dtype=torch.long, device=device)
        total_loss = (loss_img(logits_per_image, ground_truth) + loss_txt(logits_per_text, ground_truth)) / 2

        # Backward pass and update the model's parameters
        total_loss.backward()
        
        # If the device is CPU, directly update the model
        if device == "cpu":
            optimizer.step()
        else:
            # Convert model's parameters to FP32 format, update, and convert back
            convert_models_to_fp32(model)
            optimizer.step()
            clip.model.convert_weights(model)

        # Update the progress bar with the current epoch and loss
        pbar.set_description(f"Epoch {epoch}/{num_epochs}, Loss: {total_loss.item():.4f}")

Results

So, we have trained our model for 4 epochs, which took around 1.5-2 hours on a normal Kaggle notebook, which has around 13GB of RAM and 16GB of GPU. For inference, follow the below code:

import torch
import clip
from PIL import Image
import os
import matplotlib.pyplot as plt  # Import the matplotlib library for image visualization

# Check if CUDA (GPU) is available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the CLIP model and preprocessing pipeline
model, preprocess = clip.load("ViT-B/32", device=device)

# List of clothing items for comparison
clothing_items = [
    "Saree",
    "Lehenga",
    "Women Kurta",
    "Dupatta",
    "Gown",
    "Nehru Jacket",
    "Sherwani",
    "Men Kurta",
    "Men Mojari",
    "Leggings and Salwar",
    "Blouse",
    "Palazzo",
    "Dhoti Pants",
    "Petticoat",
    "Women Mojari"
]

# Index of the input data you want to analyze
index_ = 500

# Assuming 'input_data' is a list of JSON-like objects with image information
image_json = input_data[index_]

# Construct the full path to the image file using the given 'image_path'
image_path = os.path.join("/kaggle/input/indo-fashion-dataset", image_json['image_path'])

# Get the class label of the image
image_class = image_json['class_label']

# Preprocess the image and move it to the appropriate device (CPU or GPU)
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

# Tokenize and move the clothing item names to the appropriate device
text = torch.cat([clip.tokenize(f"a photo of a {c}") for c in clothing_items]).to(device)

# Perform inference
with torch.no_grad():
    # Encode image and text
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Calculate similarity scores between image and text
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Normalize image and text features
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Calculate similarity scores
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the top predictions
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{clothing_items[index]:>16s}: {100 * value.item():.2f}%")

# Display the image with its class label
plt.imshow(plt.imread(image_path))
plt.title(f"Image for class: {image_class}")
plt.axis('off')
plt.show()

In the above code, we have analyzed an image with the following details:

Running this code gave the following output:

Similarly, we can try different image indices to try out how the model performs. The model may not always give correct output, which is due to the fact that we have fine-tuned only for 4 epochs.

Conclusion

Artificial intelligence and machine learning have witnessed significant advancements in recent years, enabling researchers and developers to achieve groundbreaking results.

One such remarkable breakthrough in AI is the CLIP (Contrastive Language–Image Pretraining) model developed by OpenAI.

CLIP stands out for its exceptional ability to comprehend and establish meaningful connections between text and images, making it a versatile tool with enormous potential across various applications.

One of its standout features is its proficiency in zero-shot classification, allowing it to classify images based on textual descriptions without explicit training for those specific classes.

To harness the power of CLIP for a more domain-specific application, we embarked on fine-tuning the model using the Indo fashion dataset, which comprises a substantial collection of Indian ethnic clothing images categorized into 15 different classes.

This dataset, totaling over 106,000 images, was carefully balanced to ensure equitable distribution across the validation and test sets.

Our journey to fine-tune the CLIP model involved several steps:

Data Preparation: We organized the dataset, consisting of images and corresponding textual attributes, in (Image and text) format. The textual attribute we used was the product title, limited to a maximum of 40 characters.
Model and Library Setup: We imported essential libraries, including OpenCV, PyTorch, transformers, and the CLIP model itself. We also created a custom dataset class for efficient data loading and preprocessing.
Data Loading: We loaded the input data, which included image paths, text descriptions, class labels, image URLs, and more, from JSON files.
Model Initialization: We initialized the CLIP model and its associated processor, setting the device to either CPU or GPU as available.
Dataset Creation: We prepared the data for training using our custom dataset class, tokenizing the text descriptions and preprocessing the images.
Training Setup: We set up the training environment, including the optimizer with specific hyperparameters and defined loss functions for both images and text.
Training Loop: We embark on training process, iterating through the dataset for a predefined number of epochs. We performed forward and backward passes for each batch, updating the model's parameters using the Adam optimizer. We also accounted for device-specific considerations when updating the model.

By following these steps, we fine-tuned the CLIP model to make it more domain-specific, and specifically tailored for classifying Indian ethnic clothing images.

This process showcased the CLIP model's versatility and highlighted the potential for customizing AI models for specialized tasks. In the ever-evolving landscape of AI and machine learning, CLIP, and similar models continue to open new avenues for innovation and problem-solving across various domains.

Frequently Asked Questions

1. What is CLIP Model?

Clip is a neural network that efficiently learns visual concepts from natural language supervision. It can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the "zero-shot" capabilities of GPT-2 and GPT-3.

2. How does CLIP AI operate?

CLIP acquires visual concepts through natural language guidance. Traditional supervised computer vision systems are trained or fine-tuned with a predefined set of labels, which restricts the model's adaptability whenever it encounters new labels. CLIP draws upon prior research such as VirTex and ConVIRT to address these limitations.

Train Your Vision/NLP/LLM Models 10X Faster

Book our demo with one of our product specialist

Book a Demo

Pre-Requisites

Methodology

Conclusion

Frequently Asked Questions

Sign up for more like this.