Getting Started with LlaMA 2: A Beginner’s Guide

Ajay Kumar Reddy 25 Apr, 2024

7 min read

Introduction

With the release of GPT from OpenAI, many companies entered the race to create robust Generative Large Language Models of their own. Creating a Generative AI from scratch can involve a pretty cumbersome process, as it requires conducting thorough research in the field of Generative AI and performing numerous trials and errors. It also entails carefully curating a high-quality dataset, as the effectiveness of Large Language Models heavily depends on the data they are trained on. And lastly, it requires enormous computation power to train these models, which many companies cannot access. So as of now, only a few companies can create these LLMs, including OpenAI and Google, and now finally, Meta has joined this race with the introduction of LlaMA.

Learning Objectives

Get to know about the new version of LlaMA
Understanding the model’s versions, parameters, and model benchmarks
Getting access to the Llama 2 family of models
Trying LlaMA 2 with different prompts and observing the outputs

This article was published as a part of the Data Science Blogathon.

What is LlaMA?

LlaMA (Large Language Model Meta AI) is a Generative AI model, specifically a group of foundational Large Language Models developed by Meta AI, a company owned by Meta(Formerly Facebook). Meta announced Llama in Feb of 2023. Meta released Llama in different sizes(based on parameters), i.e., 7,13,33, and 65 billion parameters with a context length of 2k tokens. The model is with the intent to help researchers advance their knowledge in the field of AI. The small 7B models allow researchers with low computation power to study these models.

With the introduction of LlaMa, Meta has entered the LLM space and is now competing with OpenAI’s GPT and Google’s PaLM models. Meta believes that retraining or fine-tuning small models with limited computation resources can achieve results on par with state-of-the-art models in their respective fields. Meta AI’s LlaMa differs from OpenAI and Google’s LLM because the LlaMA model family is completely Open Source and free for anyone to use, and it even released the LlaMA weights for researchers for non-commercial uses.

What is LlaMA 2?

LlaMA 2 surpasses the previous version, LlaMA version 1, which Meta released in July of 2023. It came out in three sizes: 7B, 13B, and 70B parameter models. Upon its release, LlaMA 2 achieved the highest score on Hugging Face. Even across all segments (7B, 13B, and 70B), the top-performing model on Hugging Face originates from LlaMA 2, having been fine-tuned or retrained.

Llama 2 was trained on 2 Trillion Pretraining Tokens. The context length for all the Llama 2 models is 4k(2x the context length of Llama 1). Llama 2 outperformed state-of-the-art open-source models such as Falcon and MPT in various benchmarks, including MMLU, TriviaQA, Natural Question, HumanEval, and others (You can find the comprehensive benchmark scores on Meta AI’s website). Furthermore, Llama 2 underwent fine-tuning for chat-related use cases, involving training with over 1 million human annotations. These chat models are readily available to use on the Hugging Face website.

How to Access to LlaMA 2?

The source code for Llama 2 is available on GitHub. If you want to work with the original weights, these are also available, but for this, you need to provide your name and email to the Meta AIs website. So go to the Meta AI by clicking here, then enter your name, email address, and organization(student if you are not working). Then scroll down and click on accept and continue. Now you will get a mail stating that you can download the model weights. The form will look like the one below.

Now there are two ways to work with your model. One is to directly download the model through the instructions and link provided in the email(the hard way, and only good if you have a decent GPU), and the other is to use Hugging Face and Google Colab. In this article, I will go through the easy way, which anyone can try. Before going to Google Colab, we need to set up a Hugging Face account and create an Inference API. Then we need to go to the llama 2 model in Hugging Face(which you can do by clicking here), and then provide the email you provided to the Meta AI website. Then you will be authenticated and will be shown something similar to the below.

Now, we can download any Llama 2 model through Hugging Face and start working with it.

Using LlaMA 2 with Hugging Face and Colab

In the last section, we have seen the prerequisites before testing the Llama 2 model. We will start with importing necessary libraries in the Google Colab, which we can do with the pip command.

!pip install -q transformers einops accelerate langchain bitsandbytes

We need to install these necessary packages to start working with Llama 2. Also, the transformers library from hugging face to download the model. The einops function performs easy matrix multiplications within the model(it uses Einstein Operations/Summation notation), accelerates bits and bytes to speedup the inference, and langchain integrates our llama.

Next, to login into the Hugging Face through colab through the Hugging Face API Key, we can download the llama model; for this, we do the following.

!huggingface-cli login

Now we provide the Hugging Face Inference API key we created earlier. Then if it prompts Add token as git credential? (Y/n), Then you can reply with n. Now we are logged into Hugging Face API Key and are ready to download the model.

Hugging Face API Key

Now to download our model, we will write the following.

from langchain import HuggingFacePipeline
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation", 
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=1000,
    eos_token_id=tokenizer.eos_token_id
)

Here we are specifying the path to the Llama 2 7B version in Hugging Face to the model variable, which runs perfectly with Google Colab’s free-tier GPU. Anything above that will require additional VRAM, which is impossible with Colab’s free tier.
Then we download the tokenizer for the Llama 2 7B model by specifying the model name to the AutoTokenizer.from_pretrained() function.
Then we use the transformer pipeline function and pass all the parameters to it, like the model we will work with. The device_map = auto tokenizer will allow the model to use the GPU in colab if present.
We even specify the max output tokens as 1000 and set the torch data type to float16. Finally, we pass the eos_token_id, which the model will use to know when to stop while writing the answer.
After running this, the model will be downloaded to Colab, which will take some time as it is around 10GB. Now we will create a HuggingFacePipeline out of it through the below code.


llm = HuggingFacePipeline(pipeline = pipeline, model_kwargs = {'temperature':0})

Here we set the model’s temperature and pass the pipeline we created to the pipeline variable. This HuggingFacePipeline will now allow us to use the model that we have downloaded.

Prompt Template

We shall create a Prompt Template for our model and then test it.

from langchain import PromptTemplate,  LLMChain

template = """
              You are an intelligent chatbot that gives out useful information to humans.
              You return the responses in sentences with arrows at the start of each sentence
              {query}
           """

prompt = PromptTemplate(template=template, input_variables=["query"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

Here, the template is simple. We want the Llama model to answer the user’s query and return it as points with numbering.
Then we pass this template to the PrompTemplate function and assign the template and the input_variable parameters.
Finally, we chain our Llama LLM and the Prompt to start inferencing the model. Let’s ask a question about our model now.

print(llm_chain.run('What are the 3 causes of glacier meltdowns?'))

So we asked the model to list the three possible causes of glacier meltdowns, and the model returned the following:

We see that the model has done exceptionally well. The best part is that it used emoji numbering to represent the points and has exactly returned 3 points to the output. It even used the water tide emoji to represent the glaciers. This way, you can start working with the Llama 2 from Hugging Face and Colab.

Conclusion

In this article, we have briefly examined the LlaMA(Large Language Model Meta AI)models created and released by Meta AI. We have learned about the different model sizes of its and seen how version 2, i.e., Llama 2, clearly defeats the state-of-the-art Open Source LLMs at different benchmarks. Finally, we have gone through the process of getting access to the Llama 2 model trained weights. Finally, we walked through the Llama-2 7B chat version in the Google Colab through the Hugging Face and LangChain libraries.

Key Takeaways

Some of the key takeaways from this article include:

Meta develops llama models to help researchers understand more about AI.
Llama models, especially the smaller 7B version, can be trained efficiently and perform exceptionally well.
Through different benchmarks, it was proven that Llama 2 was ahead of the competition when compared to other state-of-the-art Open LLMs.
The main thing that makes Meta’s Llama 2 different from OpenAI’s GPT and Google’s PaLM is that it is Open Source, and anyone can use it for commercial applications.

Frequently Asked Questions

Q1. What is Llama / Llama 2?

A. LlaMA is a group of foundational LLMs developed by Meta AI, owned by Meta(Formerly Facebook); this was announced to the public in February 2023.

Q2. In how many sizes does Llama 2 come?

A. Llama 2 comes in 3 different sizes, they are 7B, 13B, and the 70B parameter model. All three of them work exceptionally well and can be fine-tuned easily.

Q3. Can we run Llama 2 on the local machine?

A. Yeah. It is possible to run the 7B model of Llama 2 on the local machine, which requires you to have at least 10GB of GPU VRAM for the model to work properly. Though quantized versions of Llama 2 7B are available, they require even less VRAM, and some can run only with the CPU.

Q4. Is Llama Open Source?

A. Meta AI has announced that Llama and Llama 2 will be open-sourced. They even provide the model weights if requested through a form on their website. Within hours after releasing Llama 2, many alternative Llama 2 models have sprung up in the Hugging Face.

Q5. What application can Llama be used?

A. With Llama, we can create applications like conversation chatbots, sentiment classification systems, summarization tools, and many more. In the future, developers will create even smaller versions that can work to develop Generative AI-enabled mobile applications.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.