Be it X or LinkedIn, I come across numerous posts about Large Language Models (LLMs) every day. I have often wondered why such an incredible amount of research and development is dedicated to these models. From ChatGPT to Gemini, Falcon, and countless others, their names swirl around, leaving me eager to understand what they really are. That curiosity is what pushed me to dive headfirst into the realm of LLMs.
Join me as we discuss the current state of the art in LLMs for beginners. Together, we’ll look at how these models are developed, what they can do, and how they have changed language processing.
In this article, you will gain an understanding of how to train a large language model (LLM) from scratch, including essential techniques for building an LLM effectively.
This article was published as a part of the Data Science Blogathon.
The history of Large Language Models goes back to the 1960s. In 1966, MIT professor Joseph Weizenbaum built ELIZA, one of the first NLP programs. It used pattern matching and substitution techniques to interact with humans. Later, in 1970, another MIT program, SHRDLU, was built to understand and interact with humans.
The RNN architecture was introduced in the late 1980s to capture the sequential information present in text data. However, RNNs worked well only with shorter sentences and struggled with long ones. Hence, LSTM was proposed in 1997. During this period, LSTM-based applications developed rapidly. Later on, research began on attention mechanisms as well.
LSTM solved the problem of long sentences to some extent, but it still could not excel on really long sentences. In addition, LSTM training cannot be parallelized across the time steps of a sequence, so training these models took a long time.
In 2017, there was a breakthrough in NLP research with the paper Attention Is All You Need. This paper revolutionized the entire NLP landscape. The researchers introduced a new architecture known as the Transformer to overcome the challenges with LSTMs. Because its training parallelizes well, the Transformer made it practical to train models with a huge number of parameters, and it became the state-of-the-art architecture for LLMs. Even today, the development of LLMs remains influenced by Transformers.
Over the next five years, significant research focused on building better LLMs than the original Transformer. The size of LLMs increased rapidly over time. Experiments showed that increasing the size of the models and datasets improved their knowledge. Hence, GPT variants like GPT-2, GPT-3, GPT-3.5, and GPT-4 were introduced with increases in parameter count and training dataset size.
In 2022, there was another breakthrough in NLP: ChatGPT. ChatGPT is a dialogue-optimized LLM that can answer questions on a very wide range of topics. Within a couple of months, Google introduced Bard, later renamed Gemini, as a competitor to ChatGPT.
In the last year, hundreds of Large Language Models have been developed. You can find a ranked list of open-source LLMs on the Hugging Face Open LLM Leaderboard. At the time of writing, the top-ranked model there was Falcon 40B Instruct.
Simply put, Large Language Models are deep learning models trained on huge datasets to understand human languages. Their core objective is to learn and understand human languages precisely. Large Language Models enable machines to interpret languages the way we, as humans, interpret them.
Large Language Models learn the patterns and relationships between the words in a language. For example, they learn its syntactic and semantic structure: grammar, word order, and the meaning of words and phrases. In this way, they gain the capability to grasp the language as a whole.
But how exactly are language models different from Large Language Models?
Both language models and Large Language Models learn and understand human language; the primary difference is in how these models are developed.
Language models are generally statistical models developed using HMMs or other probabilistic approaches, whereas Large Language Models are deep learning models with billions of parameters trained on very large datasets.
Why Large Language Models? The answer is simple. LLMs are task-agnostic models: they can be applied to a very wide range of tasks. ChatGPT is a classic example of this. Every time you ask ChatGPT something, it amazes you.
One more astonishing feature of these LLMs is that you often don’t have to fine-tune them for your task, as you would any other pretrained model. All you need to do is prompt the model, and it does the job for you. Hence, LLMs can provide quick solutions to many of the problems you are working on. Moreover, it’s just one model for all your problems and tasks. Hence, these models are known as foundation models in NLP.
LLMs can be broadly classified into 2 types depending on their task: those that continue the text and those that are dialogue-optimized.
LLMs that continue the text are trained to predict the next sequence of words in the input text. Their task at hand is to continue the text.
For example, given the text “How are you”, these LLMs might complete the sentence with “How are you doing?” or “How are you? I am fine.”
The LLMs falling under this category include Transformers, BERT, XLNet, GPT, and its variants like GPT-2, GPT-3, GPT-4, etc.
Now, the problem with these LLMs is that they are very good at completing text rather than answering. Sometimes, we expect an answer rather than a completion.
As discussed above, given “How are you?” as input, the LLM tries to complete the text with “doing?” or “I am fine.” The response can be either of them: a completion or an answer. This is exactly why dialogue-optimized LLMs were introduced.
Dialogue-optimized LLMs respond with an answer rather than completing the text. Given the input “How are you?”, these LLMs might respond with the answer “I am doing fine.” rather than completing the sentence.
Dialogue-optimized LLMs include InstructGPT, ChatGPT, Gemini, Falcon-40B-instruct, etc.
Now, we will see the challenges involved in training LLMs from scratch.
Training LLMs from scratch is really challenging because of 2 main factors: infrastructure and cost.
LLMs are trained on a massive text corpus, at least on the order of 1,000 GB in size. The models trained on these datasets are very large, containing billions of parameters. To train such large models on a massive text corpus, we need infrastructure that supports multiple GPUs. Can you guess the time taken to train GPT-3, a 175-billion-parameter model, on a single GPU?
It would take 288 years to train GPT-3 on a single NVIDIA Tesla V100 GPU.
This clearly shows that training an LLM on a single GPU is not feasible. It requires distributed and parallel computing with thousands of GPUs.
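To see roughly where a figure of this magnitude comes from, here is a back-of-the-envelope sketch. It relies on the commonly used approximation that training costs about 6 FLOPs per parameter per token. The 300B training tokens are the figure reported for GPT-3; the sustained throughput of 35 TFLOP/s for a single V100 is purely an assumption chosen for illustration, and the function name is hypothetical.

```python
# Back-of-the-envelope training-time estimate. All inputs are assumptions.
SECONDS_PER_YEAR = 365 * 24 * 3600


def training_years(n_params: float, n_tokens: float, sustained_flops: float) -> float:
    """Estimate single-device training time with the ~6 * N * D FLOPs rule of thumb."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / sustained_flops / SECONDS_PER_YEAR


# GPT-3: 175B parameters, about 300B training tokens.
# 35e12 FLOP/s is an ASSUMED sustained throughput for one V100, not a measurement.
years = training_years(n_params=175e9, n_tokens=300e9, sustained_flops=35e12)

print(f"~{years:.0f} GPU-years on a single device")
print(f"~{years * 365 / 1024:.0f} days on 1,024 such GPUs, assuming perfect scaling")
```

Under these assumptions the estimate lands in the same range as the 288 years quoted above. Real multi-GPU runs scale less than perfectly, so the second number is an optimistic lower bound.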
Just to give you an idea, here is the hardware used for training popular LLMs-
It’s very obvious from the above that GPU infrastructure is essential for training LLMs from scratch. Setting up infrastructure of this size is highly expensive. Companies and research institutions invest millions of dollars to set it up and train LLMs from scratch.
It is estimated that GPT-3 cost around $4.6 million to train from scratch.
On average, a 7B-parameter model would cost roughly $25,000 to train from scratch.
Now, we will see the scaling laws of LLMs.
Recently, we have seen a trend of ever larger language models being developed. They are really large because of the scale of both the dataset and the model.
When you are training LLMs from scratch, it’s really important to ask, before running the experiment, how much data you need and how large the model should be.
The answer to these questions lies in scaling laws.
Scaling laws determine how much data is optimal for training a model of a particular size.
In 2022, DeepMind proposed scaling laws for training LLMs with the optimal model size and dataset size (number of tokens) in the paper Training Compute-Optimal Large Language Models. These scaling laws are popularly known as the Chinchilla or Hoffmann scaling laws. They state that:
The number of tokens used to train an LLM should be about 20 times the number of parameters of the model.
For example, 1,400B (1.4T) tokens should be used to train a compute-optimal LLM of size 70B parameters. So, we need around 20 text tokens per parameter.
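The 20-tokens-per-parameter rule of thumb is easy to turn into a quick calculator. The sketch below simply applies that ratio in both directions; the function names are hypothetical, and the ratio is the approximate figure quoted above rather than an exact constant.

```python
TOKENS_PER_PARAM = 20  # approximate ratio from the rule of thumb above


def compute_optimal_tokens(n_params: int) -> int:
    """Training tokens suggested for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def compute_optimal_params(n_tokens: int) -> int:
    """Largest model size that a corpus of n_tokens tokens would justify."""
    return n_tokens // TOKENS_PER_PARAM


for name, n_params in [("7B", 7_000_000_000), ("70B", 70_000_000_000)]:
    tokens = compute_optimal_tokens(n_params)
    print(f"{name} model: ~{tokens / 1e9:,.0f}B tokens")
```

Run in the other direction, `compute_optimal_params` tells you how large a model a fixed corpus can support before more data becomes the better investment.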
Next, we will see how to train LLMs from scratch.
Start by figuring out what you want your language model to do. Do you want it to answer questions, generate text, or chat like a human? Knowing your goal will help you make better choices later.
Most modern language models use something called the transformer architecture. This design helps the model understand the relationships between words in a sentence. You can build your model using programming tools like PyTorch or TensorFlow.
You need a lot of text data to train your model. This data should be relevant to what you want the model to do. For example, if you want it to write stories, gather a variety of stories.
Training is the process of teaching your model using the data you collected. This can take a lot of time and computer power.
The training process depends on the kind of LLM you want to build, whether it continues the text or is dialogue-optimized. The performance of LLMs mainly depends upon 2 factors: dataset and model architecture. These 2 are the key driving factors behind the performance of LLMs.
Let’s now discuss the different steps involved in training LLMs.
The training process of LLMs that continue the text is known as pretraining. These LLMs are trained with self-supervised learning to predict the next word in the text. We will now see the different steps involved in training LLMs from scratch.
The first step in training LLMs is collecting a massive corpus of text data. The dataset plays the most significant role in the performance of LLMs. OpenChat, a recent dialogue-optimized large language model based on LLaMA-13B, achieves 105.7% of the ChatGPT score on the Vicuna GPT-4 evaluation. Do you know the reason behind its success? It’s high-quality data: the model was fine-tuned on only ~6K examples.
The training data is created by scraping the internet, websites, social media platforms, academic sources, etc. Make sure that training data is as diverse as possible.
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models.
You might have come across headlines such as “ChatGPT failed at JEE” or “ChatGPT fails to clear the UPSC”. What could be the reason? Most likely, the model lacked the domain knowledge and reasoning ability these exams demand, which depends heavily on the dataset used for training. Hence, the demand for diverse datasets continues to rise, as a high-quality cross-domain dataset has a direct impact on how well a model generalizes across different tasks.
Unlock the potential of LLMs with high-quality data!
Previously, Common Crawl was the go-to dataset for training LLMs. Common Crawl contains raw web page data, extracted metadata, and text extractions collected since 2008. The size of the dataset is in petabytes (1 petabyte = 1e6 GB). Large Language Models trained on this dataset showed effective results but failed to generalize well across other tasks. Hence, a new dataset called the Pile was created from 22 diverse high-quality datasets. It’s a combination of existing data sources and new datasets, about 825 GB in total. More recently, a refined version of Common Crawl was released under the name RefinedWeb.
Note: The datasets used for GPT-3 and GPT-4 have not been open-sourced in order to maintain a competitive advantage over the others.
The next step is to preprocess and clean the dataset. As the dataset is crawled from multiple web pages and different sources, it often contains noise and inconsistencies. We must eliminate these and prepare a high-quality dataset for model training.
The specific preprocessing steps depend on the dataset you are working with. Some of the common preprocessing steps include removing HTML code, fixing spelling mistakes, eliminating toxic or biased data, converting emoji into their text equivalents, and data deduplication. Data deduplication is one of the most significant preprocessing steps while training LLMs. It refers to the process of removing duplicate content from the training corpus.
It’s obvious that the training data might contain duplicate or nearly identical sentences since it’s collected from various data sources. We need data deduplication for 2 primary reasons. First, it helps the model avoid memorizing the same data again and again. Second, it helps us evaluate LLMs better, because the training and test data then contain non-duplicated information. If they share duplicated information, there is a very high chance that information seen in the training set shows up as output on the test set, so the reported numbers may not be reliable. You can read more about data deduplication techniques in the paper Deduplicating Training Data Makes Language Models Better.
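Two of the preprocessing steps above, HTML removal and deduplication, can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the regex-based tag stripping is naive, the sample documents are made up, and hashing a normalized form of each document only catches exact and trivially different duplicates. Large-scale pipelines typically rely on more scalable near-duplicate techniques.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def clean(text: str) -> str:
    """Strip HTML tags (naively), unescape entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return SPACE_RE.sub(" ", text).strip()


def dedup_key(text: str) -> str:
    """Hash a lowercase, punctuation-free form so trivial variants collide."""
    normalized = re.sub(r"[^a-z0-9 ]", "", text.lower())
    normalized = SPACE_RE.sub(" ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(docs):
    """Return cleaned documents, keeping only the first copy of each duplicate."""
    seen, unique = set(), []
    for doc in docs:
        cleaned = clean(doc)
        key = dedup_key(cleaned)
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


# Hypothetical sample documents, for illustration only.
raw_docs = [
    "<p>LLMs are trained on huge datasets.</p>",
    "LLMs are trained on   huge datasets!",
    "<div>Deduplication removes repeated content &amp; near-copies.</div>",
]
corpus = deduplicate(raw_docs)
for doc in corpus:
    print(doc)
```

The second sample differs from the first only in markup, spacing, and punctuation, so it maps to the same key and is dropped.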
During the pretraining phase, the next step involves creating the input and output pairs for training the model. LLMs are trained to predict the next token in the text, so input and output pairs are generated accordingly. While this demonstration considers each word as a token for simplicity, in practice, tokenization algorithms like Byte Pair Encoding (BPE) further break down each word into subwords. The model is then trained with the tokens of input and output pairs.
For example, let’s take a simple corpus-
In the case of example 1, we can create the input-output pairs as per below-
Similarly, in the case of example 2, the following is a list of input and output pairs-
Each input and output pair is passed on to the model for training.
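As an illustration of how next-token training pairs can be built, the sketch below expands one sentence into (input, output) pairs. The sentence is a made-up example rather than the corpus used in this article, and whitespace splitting stands in for a real tokenizer such as BPE.

```python
def next_token_pairs(text: str):
    """Return (context_tokens, next_token) pairs for next-word prediction."""
    tokens = text.split()  # simplification: one word = one token
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = "the cat sat on the mat"  # hypothetical example sentence
pairs = next_token_pairs(sentence)

for context, target in pairs:
    print(f"input: {' '.join(context)!r:28} output: {target!r}")
```

Each pair uses everything seen so far as the input and the following token as the output, which is why a single sentence of n tokens yields n - 1 training examples.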
Now, what next? Let’s define the model architecture.
The next step is to define the model architecture and train the LLM.
As of today, a huge number of LLMs are being developed. You can get an overview of different LLMs on the Hugging Face Open LLM Leaderboard. Researchers follow a fairly standard process when building LLMs: most start with an existing Large Language Model architecture, like GPT-3, along with its actual hyperparameters, and then tweak the model architecture, hyperparameters, or dataset to come up with a new LLM.
For example,
Hyperparameter tuning is a very expensive process in terms of both time and cost. Just imagine running this experiment for a billion-parameter model. It’s not feasible, right? Hence, the ideal approach is to start from the hyperparameters of current research work, for example, those of GPT-3 when working with the corresponding architecture, then find the optimal hyperparameters at a small scale, and finally extrapolate them to the final model.
The experiments can involve any or all of the following: weight initialization, positional embeddings, optimizer, activation, learning rate, weight decay, loss function, sequence length, number of layers, number of attention heads, number of parameters, dense vs. sparse layers, batch size, and dropout.
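To make a few of these hyperparameters concrete, the sketch below groups the architectural ones into a small config object and estimates the parameter count of a decoder-only transformer. The class and its fields are hypothetical, and the count uses a common approximation (about 12 * d_model^2 weights per layer, plus embeddings) that ignores biases and layer-norm parameters. The two configurations use the layer, width, and vocabulary sizes published for GPT-3 175B and GPT-2 small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Hypothetical container for the main architectural hyperparameters."""
    n_layers: int
    d_model: int
    n_heads: int
    vocab_size: int
    seq_len: int

    def approx_params(self) -> int:
        # Per layer: ~4 * d^2 for attention projections + ~8 * d^2 for an MLP
        # with a 4x hidden size. Biases and layer norms are ignored.
        per_layer = 12 * self.d_model ** 2
        # Token embeddings plus learned positional embeddings.
        embeddings = (self.vocab_size + self.seq_len) * self.d_model
        return self.n_layers * per_layer + embeddings


gpt3_like = DecoderConfig(n_layers=96, d_model=12288, n_heads=96,
                          vocab_size=50257, seq_len=2048)
small_proxy = DecoderConfig(n_layers=12, d_model=768, n_heads=12,
                            vocab_size=50257, seq_len=1024)

for name, cfg in [("GPT-3-like", gpt3_like), ("small proxy", small_proxy)]:
    print(f"{name}: ~{cfg.approx_params() / 1e9:.2f}B parameters")
```

A small proxy like the second configuration is the kind of model on which hyperparameter sweeps are affordable before committing to the full-size run.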
Let’s discuss the best practices for popular hyperparameters now-
Dialogue-optimized Large Language Models (LLMs) begin their journey with a pretraining phase, similar to other LLMs. Post-pretraining, these models are capable of text completion. To generate specific answers to questions, these LLMs undergo fine-tuning on a supervised dataset comprising question-answer pairs. This process equips the model with the ability to generate answers to specific questions.
ChatGPT, a dialogue-optimized LLM, follows a similar training method. However, after pretraining and supervised fine-tuning, it incorporates an additional step known as Reinforcement Learning from Human Feedback (RLHF).
Interestingly, a recent paper titled “LIMA: Less Is More for Alignment” suggests that RLHF might not be necessary. The paper posits that pretraining on a large dataset and supervised fine-tuning on high-quality data (about 1,000 examples) can suffice.
OpenChat is a recent dialogue-optimized LLM based on LLaMA-13B. Having been fine-tuned on merely 6K high-quality examples, it achieves 105.7% of ChatGPT’s score on the Vicuna GPT-4 evaluation. This achievement underscores the potential of optimizing training methods and resources in the development of dialogue-optimized LLMs.
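The supervised fine-tuning step described above can be pictured as turning each question-answer pair into one token sequence. The sketch below shows one common convention, which is an assumption here and not necessarily what ChatGPT or OpenChat use: the prompt and answer are concatenated, and a mask marks only the answer tokens as contributing to the loss. The template, the `<eos>` marker, and the function name are all hypothetical, and whitespace splitting again stands in for a real tokenizer.

```python
def build_sft_example(question: str, answer: str):
    """Build (tokens, loss_mask) for one question-answer training example."""
    prompt = f"Question: {question}\nAnswer:"  # hypothetical prompt template
    prompt_tokens = prompt.split()
    answer_tokens = answer.split() + ["<eos>"]  # hypothetical end-of-sequence marker

    tokens = prompt_tokens + answer_tokens
    # 0 = ignore in the loss (prompt), 1 = train on this token (answer).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example("How are you?", "I am doing fine.")
for token, flag in zip(tokens, loss_mask):
    print(f"{token:10} {'train' if flag else 'skip'}")
```

Masking the prompt means the model is rewarded for producing the answer rather than for reproducing the question, which matches the goal of answering instead of completing.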
The evaluation of LLMs cannot be subjective. It has to follow a logical process.
In the case of classification or regression problems, we have the true labels and the predicted labels, and we compare the two to understand how well the model is performing. We look at the confusion matrix for this, right? But what about large language models? They just generate text.
There are 2 ways to evaluate LLMs: Intrinsic and extrinsic methods.
Researchers evaluated traditional language models using intrinsic methods like perplexity, bits per character, etc. These metrics track the performance on the language front i.e. how well the model is able to predict the next word.
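Perplexity is straightforward to compute once you have the probability the model assigned to each correct next token: it is the exponential of the average negative log-probability. The sketch below uses made-up probabilities rather than real model outputs, and reports bits per token; bits per character would divide the same total by the number of characters instead.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the correct tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def bits_per_token(token_probs):
    """Average number of bits needed to encode each correct token."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities assigned to the correct next token at each step.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.25, 0.15]

print(f"confident model: perplexity {perplexity(confident):.2f}")
print(f"uncertain model: perplexity {perplexity(uncertain):.2f}")
```

Lower is better: a model that always assigns probability 0.5 to the correct token has a perplexity of 2, as if it were choosing between two equally likely options at every step.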
With the advancements in LLMs today, researchers and practitioners prefer using extrinsic methods to evaluate their performance. The recommended way to evaluate LLMs is to look at how well they perform on different tasks like problem-solving, reasoning, mathematics, computer science, and competitive exams such as JEE.
EleutherAI released a framework called the Language Model Evaluation Harness to compare and evaluate the performance of LLMs. Hugging Face integrated this evaluation framework to evaluate open-source LLMs developed by the community.
The leaderboard evaluates LLMs across 4 different datasets (at the time of writing: ARC, HellaSwag, MMLU, and TruthfulQA). The final score is an aggregation of the scores from each dataset.
Large Language Models (LLMs) have revolutionized the field of machine learning. They have a wide range of applications, from continuing text to creating dialogue-optimized models. Libraries like TensorFlow and PyTorch have made it easier to build and train these models.
However, training LLMs is not without its challenges. It requires substantial infrastructure and can be costly. Understanding the scaling laws is crucial to optimize the training process and manage costs effectively. Despite these challenges, the benefits of LLMs, such as their ability to understand and generate human-like text, make them a valuable tool in today’s data-driven world.
The process of training an LLM involves feeding the model with a large dataset and adjusting the model’s parameters to minimize the difference between its predictions and the actual data. Typically, developers achieve this by using a decoder in the transformer architecture of the model.
Evaluating the performance of LLMs is as important as training them. It helps us understand how well the model has learned from the training data and how well it can generalize to new data.
LLMs have opened up new possibilities in the field of machine learning. They are a testament to how far we’ve come since the early days of AI and a glimpse into what the future might hold. As we continue to explore and push the boundaries of what’s possible with LLMs, who knows what incredible discoveries we’ll make next?
I hope you liked this article on how to train a large language model (LLM) from scratch, covering the essential steps and techniques for building effective LLMs and optimizing their performance.
A. A large language model is a type of artificial intelligence that can understand and generate human-like text. It’s typically trained on vast amounts of text data and learns to predict and generate coherent sentences based on the input it receives.
A. Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. Large language models are a subset of NLP, specifically referring to models that are exceptionally large and powerful, capable of understanding and generating human-like text with high fidelity.
A. Yes, ChatGPT is a large language model. It’s based on OpenAI’s GPT (Generative Pre-trained Transformer) architecture, which is known for its ability to generate high-quality text across various domains.
A. The main difference between a Large Language Model (LLM) and Artificial Intelligence (AI) lies in their scope and capabilities. AI is a broad field encompassing various technologies and approaches aimed at creating machines capable of performing tasks that typically require human intelligence. LLMs, on the other hand, are a specific type of AI focused on understanding and generating human-like text. While LLMs are a subset of AI, they specialize in natural language understanding and generation tasks.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.