Harnessing NLP Superpowers: A Step-by-Step Hugging Face Fine Tuning Tutorial

Kajal 17 Oct, 2023

Introduction

Fine-tuning a natural language processing (NLP) model means taking a model that has already been pre-trained on a large corpus and continuing to train it on a smaller, task-specific dataset so that it performs better on that task. Along the way you choose training hyperparameters such as the learning rate, the batch size, and the number of epochs; the architecture and the size of the embeddings are normally inherited from the pre-trained checkpoint rather than changed. Fine-tuning can be time-consuming and demands a firm grasp of both the model and the task. This article looks at how to fine-tune a Hugging Face model, T5, for question answering.

Learning Objectives

Understand the T5 model’s structure, including Transformers and self-attention.
Learn to optimize hyperparameters for better model performance.
Master text data preparation, including tokenization and formatting.
Know how to adapt pre-trained models to specific tasks.
Learn to clean, split, and create datasets for training.
Gain experience in model training and evaluation using metrics like loss and accuracy.
Explore real-world applications of the fine-tuned model for generating responses or answers.

This article was published as a part of the Data Science Blogathon.

About Hugging Face Models
Import Necessary Libraries
Import Dataset
Problem Statement
Initialize Parameters
T5 Transformer
T5Tokenizer
Dataset Preparation
DataLoader
Model Building
Model Training
Model Prediction
Prediction
Frequently Asked Questions

About Hugging Face Models

Hugging Face is a company that provides a platform for training and deploying natural language processing (NLP) models. The platform hosts a library of models for a variety of NLP tasks, including language translation, text generation, and question answering. These models are pre-trained on extensive datasets so that they can be adapted to a wide range of NLP tasks.

The Hugging Face platform also includes tools for fine-tuning pre-trained models on specific datasets, which helps adapt them to particular domains or languages. It also offers APIs for using pre-trained models in applications, along with tools for building custom models and deploying them to the cloud.

Using the Hugging Face library for natural language processing (NLP) tasks has various advantages:

Wide selection of models: A significant range of pre-trained NLP models are available through the Hugging Face library, including models trained on tasks such as language translation, question answering, and text categorization. This makes it simple to choose a model that meets your exact requirements.
Compatibility across frameworks: The Hugging Face Transformers library works with deep learning frameworks such as PyTorch, TensorFlow (including Keras), and JAX, making it simple to integrate into your existing workflow.
Simple fine-tuning: The Hugging Face library contains tools for fine-tuning pre-trained models on your dataset, saving you time and effort over training a model from scratch.
Active community: The Hugging Face library has a vast and active user community, which means you can obtain assistance and support and contribute to the library’s growth.
Well-documented: The Hugging Face library contains extensive documentation, making it easy to start and learn how to use it efficiently.

Import Necessary Libraries

Importing the necessary libraries is like assembling a toolkit for the task. This tutorial uses pandas and NumPy for data handling, scikit-learn for the train/validation split, PyTorch and Hugging Face Transformers for the model and tokenizer, and PyTorch Lightning to organize the training loop.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


import torch

from torch.optim import AdamW  # the AdamW bundled with transformers is deprecated; use PyTorch's
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

pl.seed_everything(100)

import warnings
warnings.filterwarnings("ignore")

Import Dataset

The dataset is a question-answering CSV file hosted on Kaggle. We load it with pandas and keep three columns: context, question, and text (the answer).

df = pd.read_csv("/kaggle/input/queestion-answer-dataset-qa/train.csv")
df.columns

df = df[['context','question', 'text']]
print("Number of records: ", df.shape[0])

Problem Statement

“To create a model capable of generating responses based on context and questions.”

For example,

Context = “Clustering groups of similar cases, for example, can find similar patients or use for customer segmentation in the banking field. The association technique is used for finding items or events that often co-occur, for example, grocery items that a particular customer usually buys together. Anomaly detection is used to discover abnormal and unusual cases; for example, credit card fraud detection.”

Question = “What is the example of Anomaly detection?”

Answer = ????????????????????????????????

df["context"] = df["context"].str.lower()
df["question"] = df["question"].str.lower()
df["text"] = df["text"].str.lower()

df.head()

Initialize Parameters

Input length: The maximum number of input tokens in a single example fed into the model. T5 works on subword tokens, so one word may be split into several tokens. Here the input is the context together with the question, and longer inputs are truncated to this length.
Output length: The maximum number of tokens in the target sequence, here the answer, that the model learns to generate for a single example.
Training batch size: During training, the model processes several samples at once. If you set the training batch size to 32, the model handles 32 instances, such as 32 context-question pairs, before updating its weights.
Validation batch size: Similar to the training batch size, this parameter is the number of instances the model handles at once during the validation phase, when it is evaluated on a hold-out dataset.
Epochs: An epoch is one full pass through the training dataset. If the training dataset comprises 1,000 instances and the training batch size is 32, one epoch takes 32 training steps (1000 / 32 = 31.25, rounded up because the final batch is smaller). A model trained for ten epochs will have processed 10,000 instances (10 × 1,000), seeing each instance ten times.
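
The arithmetic behind epochs and steps can be checked with a few lines of Python. This is a minimal sketch: steps_per_epoch and examples_seen are illustrative helper names, not part of any library, and the final partial batch is assumed to be kept rather than dropped.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of weight updates in one epoch, keeping the final partial batch."""
    return math.ceil(num_examples / batch_size)


def examples_seen(num_examples: int, epochs: int) -> int:
    """Total training examples processed after the given number of epochs."""
    return num_examples * epochs


print(steps_per_epoch(1000, 32))  # 1000 / 32 = 31.25, rounded up
print(examples_seen(1000, 10))
```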

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
INPUT_MAX_LEN = 512 # Input length
OUT_MAX_LEN = 128 # Output Length
TRAIN_BATCH_SIZE = 8 # Training Batch Size
VALID_BATCH_SIZE = 2 # Validation Batch Size
EPOCHS = 5 # Number of epochs

T5 Transformer

The T5 model is based on the Transformer architecture, a neural network designed to handle sequential input data effectively. It comprises an encoder and a decoder, each built from a stack of “layers.”

The encoder and decoder layers combine “attention” mechanisms with “feedforward” networks. The attention mechanisms let the model focus on different sections of the input sequence at different steps, while the feedforward networks transform each position's representation using a set of weights and biases.

The T5 model also employs “self-attention,” which allows each element in the input sequence to pay attention to every other element. This allows the model to recognize links between words and phrases in the input data, which is critical for many NLP applications.
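
To make the idea concrete, here is a minimal pure-Python sketch of standard scaled dot-product self-attention. It is a simplification, not T5's implementation: real Transformer layers first map each vector to separate query, key, and value vectors with learned weights, and T5-specific details such as relative position biases are left out. The toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    The input vectors play the role of queries, keys, and values at once.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d "embeddings"
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one position attends to every position, and each row sums to 1.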

In addition to the encoder and decoder, the T5 model contains a “language model head,” which predicts the next token in the output sequence based on the tokens generated so far. This is critical for translation and text generation tasks, where the model must produce cohesive and natural-sounding output.
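
As a rough illustration of what a language model head does, the sketch below projects one hidden vector onto a tiny vocabulary and picks the most probable next token. The vocabulary, the weights, and the hidden vector are all hypothetical values chosen for the example; a real head has one weight row per entry of a vocabulary with tens of thousands of tokens.

```python
import math


def lm_head(hidden, weight):
    """Project one hidden vector to one logit per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["fraud", "detection", "banking", "</s>"]  # hypothetical 4-token vocabulary
weight = [  # hypothetical head weights, one row per vocabulary entry
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.1, 0.1, 0.3],
    [0.0, 0.2, 0.9],
]
hidden = [1.0, 0.5, 0.0]  # made-up decoder output for the current position

probs = softmax(lm_head(hidden, weight))
next_token = vocab[probs.index(max(probs))]
print(next_token)
```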

T5 was pre-trained on a large, diverse text corpus and treats every task as text-to-text: the input and the output are both strings. This is why a single model can handle a broad spectrum of natural language processing tasks, such as translation, summarization, and question answering.

T5Tokenizer

T5Tokenizer turns text into a list of tokens. A token is a subword unit: a common word may be a single token, while a rare word is split into several pieces. The tokenizer also adds special tokens, for example to mark the end of the input and to pad short sequences.

T5Tokenizer is built on SentencePiece, a subword tokenization method. It splits the input text into pieces chosen according to how frequently character sequences occur in the tokenizer's training data. This helps with out-of-vocabulary (OOV) terms that did not occur in the training data but do appear at test time, because an unseen word can still be represented by known pieces.

The special tokens include </s>, which marks the end of a sequence, <pad>, which indicates padding, and <unk>, which stands for input the tokenizer cannot represent.
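
The toy sketch below shows the general idea of subword splitting plus special tokens. It uses greedy longest-match splitting, which is a deliberate simplification and not the algorithm SentencePiece actually uses; the vocabulary and the helper names subword_tokenize and encode are hypothetical.

```python
def subword_tokenize(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing matches, so fall back to the unknown token
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces


def encode(text, vocab, max_len):
    """Tokenize, append the end-of-sequence token, then pad to max_len."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(subword_tokenize(word, vocab))
    tokens = tokens[: max_len - 1] + ["</s>"]
    return tokens + ["<pad>"] * (max_len - len(tokens))


toy_vocab = {"cluster", "ing", "group", "s", "fraud"}  # hypothetical vocabulary
print(encode("Clustering groups", toy_vocab, max_len=8))
```

Even though "clustering" is not in the toy vocabulary as a whole word, it can still be represented by the known pieces "cluster" and "ing".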

MODEL_NAME = "t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME, model_max_length= INPUT_MAX_LEN)

print("eos_token: {} and id: {}".format(tokenizer.eos_token,
                   tokenizer.eos_token_id)) # End-of-sequence token (eos_token)
print("unk_token: {} and id: {}".format(tokenizer.unk_token,
                   tokenizer.unk_token_id)) # Unknown token (unk_token)
print("pad_token: {} and id: {}".format(tokenizer.pad_token,
                 tokenizer.pad_token_id)) # Pad token (pad_token)

Dataset Preparation

When working with PyTorch, you usually prepare your data for the model with a dataset class. The dataset class is responsible for loading the data and applying the required preparation steps, such as tokenization and conversion to numeric IDs. The class should also implement the __getitem__ method, which returns a single item from the dataset by index.

The __init__ method stores the contexts, questions, and target answers, along with the tokenizer and the maximum lengths. The __len__ method returns the number of samples in the dataset. The __getitem__ method accepts an index and returns the tokenized input and labels for that sample.

It is also customary to include preprocessing steps such as padding and truncating the tokenized inputs, and to return the labels as tensors. In the labels, padding positions are set to -100 so that the loss function ignores them.
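
The padding, truncation, and label masking described above can be sketched without any framework. The helper names pad_or_truncate and mask_labels are illustrative, and the token IDs in the example are made up; the pad ID 0 and end-of-sequence ID 1 match the T5 tokenizer, and -100 is the value that PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 0           # T5 pad token id
EOS_ID = 1           # T5 end-of-sequence token id
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def pad_or_truncate(token_ids, max_len):
    """Return (ids, attention_mask), both exactly max_len long."""
    ids = token_ids[: max_len - 1] + [EOS_ID]  # always end with EOS
    mask = [1] * len(ids) + [0] * (max_len - len(ids))  # 1 = real token, 0 = padding
    ids = ids + [PAD_ID] * (max_len - len(ids))
    return ids, mask


def mask_labels(label_ids):
    """Replace padding in the labels so it does not contribute to the loss."""
    return [IGNORE_INDEX if i == PAD_ID else i for i in label_ids]


ids, mask = pad_or_truncate([37, 1782, 19, 1712], max_len=8)  # made-up token ids
print(ids)
print(mask)
print(mask_labels(ids))
```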

class T5Dataset(torch.utils.data.Dataset):

    def __init__(self, context, question, target):
        self.context = context
        self.question = question
        self.target = target
        self.tokenizer = tokenizer
        self.input_max_len = INPUT_MAX_LEN
        self.out_max_len = OUT_MAX_LEN

    def __len__(self):
        return len(self.context)

    def __getitem__(self, item):
        context = str(self.context[item])
        context = " ".join(context.split())

        question = str(self.question[item])
        question = " ".join(question.split())

        target = str(self.target[item])
        target = " ".join(target.split())
        
        
        inputs_encoding = self.tokenizer(
            context,
            question,
            add_special_tokens=True,
            max_length=self.input_max_len,
            padding = 'max_length',
            truncation='only_first',
            return_attention_mask=True,
            return_tensors="pt"
        )
        

        output_encoding = self.tokenizer(
            target,
            None,
            add_special_tokens=True,
            max_length=self.out_max_len,
            padding = 'max_length',
            truncation= True,
            return_attention_mask=True,
            return_tensors="pt"
        )


        inputs_ids = inputs_encoding["input_ids"].flatten()
        attention_mask = inputs_encoding["attention_mask"].flatten()
        labels = output_encoding["input_ids"]

        labels[labels == self.tokenizer.pad_token_id] = -100  # Ignore padding (id 0 for T5) in the loss

        labels = labels.flatten()

        out = {
            "context": context,
            "question": question,
            "answer": target,
            "inputs_ids": inputs_ids,
            "attention_mask": attention_mask,
            "targets": labels
        }


        return out

DataLoader

The DataLoader class loads data in batches and can use several worker processes in parallel, making it possible to work with big datasets that would otherwise be too large to hold in memory. You combine the DataLoader class with a dataset class that holds the data to be loaded.

When training a transformer model, the DataLoader is in charge of iterating over the dataset and returning one batch at a time to the model for training or evaluation. It offers various parameters to control loading, including the batch size, the number of worker processes, and whether to shuffle the data before each epoch. Here the loaders are wrapped in a PyTorch Lightning LightningDataModule, which keeps the training and validation loaders together.
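
Stripped of tensors and worker processes, the core job of a data loader is just batching with optional shuffling. The sketch below shows that idea in plain Python; iterate_batches is a hypothetical helper, not the PyTorch DataLoader, and the fixed seed is only there to keep the example reproducible.

```python
import random


def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    """Yield lists of examples, batch_size at a time; the last batch may be smaller."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


examples = [f"example-{i}" for i in range(10)]
batches = list(iterate_batches(examples, batch_size=4, shuffle=True))
print([len(batch) for batch in batches])
```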

class T5DatasetModule(pl.LightningDataModule):

    def __init__(self, df_train, df_valid):
        super().__init__()
        self.df_train = df_train
        self.df_valid = df_valid
        self.tokenizer = tokenizer
        self.input_max_len = INPUT_MAX_LEN
        self.out_max_len = OUT_MAX_LEN


    def setup(self, stage=None):

        self.train_dataset = T5Dataset(
        context=self.df_train.context.values,
        question=self.df_train.question.values,
        target=self.df_train.text.values
        )

        self.valid_dataset = T5Dataset(
        context=self.df_valid.context.values,
        question=self.df_valid.question.values,
        target=self.df_valid.text.values
        )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
         self.train_dataset,
         batch_size= TRAIN_BATCH_SIZE,
         shuffle=True, 
         num_workers=4
        )


    def val_dataloader(self):
        return torch.utils.data.DataLoader(
         self.valid_dataset,
         batch_size= VALID_BATCH_SIZE,
         num_workers=1
        )

Model Building

When creating a model in plain PyTorch, you usually begin by defining a class that derives from torch.nn.Module. This class describes the model’s architecture, including its layers and the forward method. The class’s __init__ method builds the layers and assigns them as attributes. Here we use PyTorch Lightning instead, so the class derives from pl.LightningModule, which is itself a torch.nn.Module with added hooks for training.

The forward method passes data through the model. It accepts input data, applies the model’s layers, and returns the result.

In this tutorial, __init__ does not build layers by hand. It loads the pre-trained T5ForConditionalGeneration model, and forward passes the token IDs, attention mask, and labels to it, returning the loss and the logits. Training then alternates between two phases: training and validation.

The training_step method defines the logic for a single training step, which generally includes:

A forward pass through the model
Computing the loss
Computing gradients
Updating the model’s parameters

In PyTorch Lightning, training_step only needs to return the loss; Lightning then computes the gradients and updates the parameters.

The validation_step method, like the training_step method, is used to assess the model on a validation set. It usually includes:

forward pass through the model
computing the evaluation metrics
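
The two kinds of step can be illustrated with a one-parameter model in plain Python. This is a toy sketch, not the T5 training code: the model, the data, and the learning rate are made up, and the gradient and update that Lightning would handle automatically are written out by hand.

```python
def forward(weight, x):
    """A one-parameter 'model': prediction = weight * x."""
    return weight * x


def training_step(weight, batch, lr):
    """Forward pass, loss, gradient, and parameter update for one batch."""
    preds = [forward(weight, x) for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
    return weight - lr * grad, loss


def validation_step(weight, batch):
    """Forward pass and metric only; the weight is left untouched."""
    return sum((forward(weight, x) - y) ** 2 for x, y in batch) / len(batch)


train_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data following y = 2x
valid_batch = [(4.0, 8.0)]

weight = 0.0
losses = []
for _ in range(20):
    weight, loss = training_step(weight, train_batch, lr=0.05)
    losses.append(loss)

print(round(weight, 3), validation_step(weight, valid_batch))
```

The training loss shrinks from step to step, while the validation step only measures the error and never changes the weight.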

class T5Model(pl.LightningModule):
    
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)

    def forward(self, input_ids, attention_mask, labels=None):

        output = self.model(
            input_ids=input_ids, 
            attention_mask=attention_mask, 
            labels=labels
        )

        return output.loss, output.logits


    def training_step(self, batch, batch_idx):

        input_ids = batch["inputs_ids"]
        attention_mask = batch["attention_mask"]
        labels= batch["targets"]
        loss, outputs = self(input_ids, attention_mask, labels)

        
        self.log("train_loss", loss, prog_bar=True, logger=True)

        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch["inputs_ids"]
        attention_mask = batch["attention_mask"]
        labels= batch["targets"]
        loss, outputs = self(input_ids, attention_mask, labels)

        self.log("val_loss", loss, prog_bar=True, logger=True)
        
        return loss


    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=0.0001)

Model Training

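Training a transformer model usually means iterating over the dataset in batches, passing each batch through the model, and updating the model’s parameters from the computed gradients according to the optimizer’s rules. The run function splits the first 10,000 rows into training and validation sets, normalizes whitespace, and hands the data and the model to a PyTorch Lightning Trainer, which saves the checkpoints with the lowest validation loss.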

def run():
    
    df_train, df_valid = train_test_split(
        df[0:10000], test_size=0.2, random_state=101
    )
    
    df_train = df_train.fillna("none")
    df_valid = df_valid.fillna("none")
    
    df_train['context'] = df_train['context'].apply(lambda x: " ".join(x.split()))
    df_valid['context'] = df_valid['context'].apply(lambda x: " ".join(x.split()))
    
    df_train['text'] = df_train['text'].apply(lambda x: " ".join(x.split()))
    df_valid['text'] = df_valid['text'].apply(lambda x: " ".join(x.split()))
    
    df_train['question'] = df_train['question'].apply(lambda x: " ".join(x.split()))
    df_valid['question'] = df_valid['question'].apply(lambda x: " ".join(x.split()))

   
    df_train = df_train.reset_index(drop=True)
    df_valid = df_valid.reset_index(drop=True)
    
    dataModule = T5DatasetModule(df_train, df_valid)
    dataModule.setup()

    device = DEVICE
    models = T5Model()
    models.to(device)

    checkpoint_callback  = ModelCheckpoint(
        dirpath="/kaggle/working",
        filename="best_checkpoint",
        save_top_k=2,
        verbose=True,
        monitor="val_loss",
        mode="min"
    )

    trainer = pl.Trainer(
        callbacks=[checkpoint_callback],
        max_epochs=EPOCHS,
        accelerator="gpu",  # the older `gpus` argument was removed in PyTorch Lightning 2.0
        devices=1
    )

    trainer.fit(models, dataModule)

run()

Model Prediction

To make predictions with a fine-tuned NLP model like T5 using new input, you can follow these steps:

Preprocess the New Input: Tokenize and preprocess your new input text to match the preprocessing you applied to your training data. Ensure that it is in the correct format expected by the model.
Use the Fine-Tuned Model for Inference: Load your fine-tuned T5 model, which you previously trained or loaded from a checkpoint.
Generate Predictions: Pass the preprocessed new input to the model for prediction. In the case of T5, you can use the generate method to generate responses.
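
The generate method used in this tutorial is configured for beam search (num_beams = 4). The sketch below shows the core idea of beam search on a toy problem. The table of next-token probabilities is entirely hypothetical and stands in for the model; the real generate method also applies further rules, such as blocking repeated n-grams, that are left out here.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<start>": {"credit": 0.6, "fraud": 0.4},
    "credit": {"card": 0.9, "</s>": 0.1},
    "card": {"fraud": 0.8, "</s>": 0.2},
    "fraud": {"detection": 0.7, "</s>": 0.3},
    "detection": {"</s>": 1.0},
}


def beam_search(num_beams=2, max_length=6):
    """Keep the num_beams highest-scoring partial sequences at each step."""
    beams = [(0.0, ["<start>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))  # ended sequences stop growing
            else:
                beams.append((score, tokens))
        if not beams:
            break
    _, best_tokens = max(finished + beams, key=lambda c: c[0])
    return [t for t in best_tokens if t not in ("<start>", "</s>")]


print(" ".join(beam_search()))
```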

train_model = T5Model.load_from_checkpoint("/kaggle/working/best_checkpoint-v1.ckpt")

train_model.freeze()

def generate_question(context, question):

    inputs_encoding =  tokenizer(
        context,
        question,
        add_special_tokens=True,
        max_length= INPUT_MAX_LEN,
        padding = 'max_length',
        truncation='only_first',
        return_attention_mask=True,
        return_tensors="pt"
        )

    
    generate_ids = train_model.model.generate(
        input_ids = inputs_encoding["input_ids"],
        attention_mask = inputs_encoding["attention_mask"],
        max_length = OUT_MAX_LEN,  # cap the generated answer at the output length used in training
        num_beams = 4,
        num_return_sequences = 1,
        no_repeat_ngram_size=2,
        early_stopping=True,
        )

    preds = [
        tokenizer.decode(gen_id,
        skip_special_tokens=True, 
        clean_up_tokenization_spaces=True)
        for gen_id in generate_ids
    ]

    return "".join(preds)

Prediction

Let’s generate a prediction using the fine-tuned T5 model with new input:

context = "Clustering groups of similar cases, for example, \
can find similar patients, or use for customer segmentation in the \
banking field. Using association technique for finding items or events that \
often co-occur, for example, grocery items that are usually bought together \
by a particular customer. Using anomaly detection to discover abnormal \
and unusual cases, for example, credit card fraud detection."

que = "what is the example of Anomaly detection?"

print(generate_question(context, que))

context = "Classification is used when your target is categorical, \
while regression is used when your target variable \
is continuous. Both classification and regression belong to the category \
of supervised machine learning algorithms."

que = "When is classification used?"

print(generate_question(context, que))

Conclusion

In this article, we fine-tuned a natural language processing (NLP) model, specifically the T5 model, for a question-answering task, and covered the main steps from data preparation to prediction.

Key takeaways:

We explored the encoder-decoder structure and self-attention mechanisms that underpin T5’s capabilities.
We set the main training hyperparameters, including the learning rate, batch sizes, sequence lengths, and number of epochs, which are the first values to revisit when tuning model performance.
We practiced tokenization, padding, and truncation to convert raw text data into a suitable format for model input.
We fine-tuned the model by loading pre-trained weights and continuing training on a task-specific dataset.
We cleaned and structured the data and split it into training and validation sets.
We used the fine-tuned model to generate answers from an input context and question, showing how it can be applied in practice.

Frequently Asked Questions

Q1. What is fine-tuning in natural language processing (NLP)?

Answer: Fine-tuning in NLP means continuing to train a pre-trained model on a smaller, task-specific dataset, with suitably chosen hyperparameters, to optimize its performance for that task.

Q2. What is the Transformer architecture used in NLP models like T5?

Answer: The Transformer architecture is a neural network architecture. It excels at handling sequential data and is the foundation for models like T5. It uses self-attention mechanisms for context understanding.

Q3. What is the purpose of the encoder-decoder structure in models like T5?

Answer: In sequence-to-sequence tasks in NLP, we use the encoder-decoder structure. The encoder processes input data, and the decoder generates output data.

Q4. Is it possible to utilize fine-tuned NLP models such as T5 in real-world applications?

Answer: Yes, you can apply fine-tuned models to various real-world NLP tasks, including text generation, translation, and question-answering.

Q5. How can I start fine-tuning NLP models such as T5?

Answer: To begin, you can explore libraries such as Hugging Face Transformers, which offer pre-trained models and tools for fine-tuning them on your own datasets. Learning NLP fundamentals and deep learning concepts is also crucial.


Hi, I am Kajal Kumari. I completed my Master’s in Computer Science & Engineering at IIT (ISM) Dhanbad and currently work as a Machine Learning Engineer in Hyderabad. I hope you enjoyed the article. If you liked it, please share it with your friends, and feel free to comment if you have any thoughts that could improve my writing.
