Phi 3 – Small Yet Powerful Models from Microsoft

Ajay Kumar Reddy 01 May, 2024

Introduction

Microsoft's Phi models have been at the forefront of open-source Large Language Models. The Phi architecture underlies many of the popular small open-source models that we see today, including Phixtral, Phi-DPO, and others. The Phi family has taken LLM design a step forward by introducing Small Language Models and arguing that these are enough to accomplish many different tasks. Now Microsoft has unveiled Phi 3, the next generation of Phi models, which improves further on the previous generation. In this article, we will go through Phi 3 and test it with different prompts.

Learning Objectives

Understand the advancements in the Phi 3 model compared to previous iterations.
Learn about the different variants of the Phi 3 model.
Explore the improvements in context length and performance achieved by Phi 3.
Recognize the benchmarks where Phi 3 surpasses other popular language models.
Understand how to download, initialize, and use the Phi 3 mini model.

This article was published as a part of the Data Science Blogathon.

Phi 3 – The Next Iteration of Phi Family

Microsoft recently released Phi 3, showcasing its commitment to open source in the field of Artificial Intelligence. It has released two variants of Phi 3: one with a 4k context size and the other with a 128k context size. Both share the same architecture and a size of 3.8 Billion Parameters, and are called the Phi 3 mini. Microsoft has also announced two larger variants, a 7 Billion version called the Phi 3 Small and a 14 Billion version called the Phi 3 Medium, though they are still in the training phase. All the Phi 3 models come in an instruct version and are thus ready to be deployed in chat applications.

Unique Features

Extended Context Length: Phi 3 doubles the default context length from the 2k of Phi 2 to 4k, and the LongRoPE technique extends it further to 128k.
Training Data Size and Quality: Phi 3 is trained on 3.3 Trillion tokens, featuring larger and more advanced datasets compared to Phi 2.
Model Variants:
- Phi 3 Mini: Trained on 3.3 Trillion tokens, with a 32k vocabulary size and using the Llama 2 tokenizer.
- Phi 3 Small (7B Version): Default context length of 8k, vocabulary size of 100k with the tiktoken tokenizer, and Grouped Query Attention with 4 Queries sharing 1 Key to reduce the memory footprint.
Model Architecture and Training: Grouped Query Attention (in Phi 3 Small) optimizes memory usage. Training starts with pretraining and moves to supervised fine-tuning, followed by alignment with Direct Preference Optimization for responsible AI outputs.
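
To see why sharing keys across queries reduces the memory footprint, here is a rough back-of-the-envelope sketch of the key/value cache size with and without Grouped Query Attention. The layer count, head count, head dimension, and 16-bit cache values are illustrative assumptions, not published Phi 3 Small settings; only the 4-queries-per-key ratio and the 8k context come from the list above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate size of the key/value cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
NUM_LAYERS = 32
NUM_QUERY_HEADS = 32
HEAD_DIM = 128
SEQ_LEN = 8 * 1024      # 8k context, as quoted for Phi 3 Small
QUERIES_PER_KEY = 4     # 4 queries share 1 key, as quoted above

# Standard multi-head attention: one key/value head per query head.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)

# Grouped Query Attention: each key/value head serves 4 query heads.
gqa_bytes = kv_cache_bytes(
    NUM_LAYERS, NUM_QUERY_HEADS // QUERIES_PER_KEY, HEAD_DIM, SEQ_LEN
)

print(f"MHA cache: {mha_bytes / 2**30:.1f} GiB")
print(f"GQA cache: {gqa_bytes / 2**30:.1f} GiB")
```

Under these assumed dimensions, the cache shrinks by the same factor as the number of key/value heads, which is where the memory saving comes from.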

Benchmarks – Phi 3

Coming to the benchmarks, the Phi 3 mini, i.e. the 3.8 Billion Parameter model, has overtaken Gemma 7B from Google. It scores 68.8 on MMLU and 76.7 on HellaSwag, exceeding Gemma, which scores 63.6 on MMLU and 49.8 on HellaSwag, and even the Mistral 7B model, which scores 61.7 on MMLU and 58.5 on HellaSwag. Phi-3 has even surpassed the recently released Llama 3 8B model on both of these benchmarks.
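
As a quick sanity check on these numbers, the short sketch below computes the margin by which Phi 3 mini leads Gemma 7B and Mistral 7B, using only the scores quoted in the paragraph above. The Llama 3 8B scores are not quoted here, so they are left out.

```python
# Scores as quoted in the text above: (MMLU, HellaSwag).
scores = {
    "Phi-3-mini (3.8B)": (68.8, 76.7),
    "Gemma 7B": (63.6, 49.8),
    "Mistral 7B": (61.7, 58.5),
}

phi_mmlu, phi_hellaswag = scores["Phi-3-mini (3.8B)"]

# Margin of Phi-3-mini over each other model, rounded to one decimal place.
margins = {}
for model, (mmlu, hellaswag) in scores.items():
    if model == "Phi-3-mini (3.8B)":
        continue
    margins[model] = (
        round(phi_mmlu - mmlu, 1),
        round(phi_hellaswag - hellaswag, 1),
    )

for model, (mmlu_gap, hellaswag_gap) in margins.items():
    print(f"vs {model}: +{mmlu_gap} MMLU, +{hellaswag_gap} HellaSwag")
```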

The Phi 3 mini also surpasses these and other models in other popular evaluation tests like WinoGrande, TruthfulQA, and HumanEval. The table below compares the scores of the Phi 3 family of models with other popular open-source large language models.

Getting Started with Phi 3

To get started with Phi-3, we need to follow a few steps. Let us go through each of them.

Step1: Installing Libraries

Let’s start by installing the following libraries.

!pip install -q transformers huggingface_hub bitsandbytes accelerate

transformers – We need this library to download the Large Language Models and work with them
huggingface_hub – This package provides the huggingface-cli command, which we use to log in to HuggingFace so that we can work with the official HuggingFace model
bitsandbytes – We cannot directly run the 3.8 Billion model at full precision in the free GPU instance of Colab, hence we need this library to quantize the LLM to 4-bit
accelerate – We need this to place the model on the GPU and run efficient inference with the Large Language Models

Now, before we start downloading the model, we need to define our quantization config. This is because we cannot load the entire full precision model within the free Google Colab GPU and even if we fit it, the inference will be slow. So, we will quantize our model to 4-bit precision and then work with the model.
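
To make the memory argument concrete, here is a rough estimate of how much memory the weights alone need at different precisions. This is only a sketch: it ignores activations, the key/value cache, and the small overhead of quantization constants, and the helper name weight_memory_gb is made up for this example. The memory of the free Colab GPU is not stated in this article, so compare these numbers against whatever your instance reports.

```python
PARAMS = 3.8e9  # Phi 3 mini parameter count, from the article


def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>20}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

The 4-bit weights take roughly an eighth of the float32 footprint, which is what makes the model comfortable to load on a small GPU.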

Step2: Defining the Quantization Config

The configuration for this quantization can be seen below:

import torch
from transformers import BitsAndBytesConfig


config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

Here we start by importing torch, and BitsAndBytesConfig from the transformers library.
Then we create an instance of the BitsAndBytesConfig class and save it to the variable called config.
While creating this instance, we give it the following parameters:
load_in_4bit: This says that we want to quantize our model into the 4-bit precision format, which greatly reduces the size of the model.
bnb_4bit_quant_type: This sets the type of 4-bit quantization we wish to work with. Here we go with the 4-bit NormalFloat type, called nf4, which tends to give better results.
bnb_4bit_use_double_quant: Setting this to True also quantizes the quantization constants that are internal to BitsAndBytes, which further reduces the size of the model.
bnb_4bit_compute_dtype: Here we set the datatype used when computing the forward pass through the model. On Colab, we can set it to brain float16, called bfloat16, which tends to provide better results than the regular float16.

Running this code will create our quantization configuration.
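
To build some intuition for what 4-bit quantization does to the weights, the following is a deliberately simplified sketch of blockwise absmax quantization in plain Python. It uses evenly spaced integer levels rather than the nf4 levels, and it is not how bitsandbytes is implemented; the function names and the sample weights are made up for illustration.

```python
def quantize_block(values, bits=4):
    """Map one block of floats to signed integer codes plus a shared scale."""
    levels = 2 ** (bits - 1) - 1              # 7 for 4-bit: codes lie in [-7, 7]
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes and the scale."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]              # hypothetical weight block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"max reconstruction error: {max_error:.3f}")
```

Each block stores small integer codes plus one floating-point scale. In this picture, double quantization corresponds to also storing those per-block scales in a lower-precision format.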

Step3: Download the Model

Now, we are ready to download the model and quantize it with the quantization configuration defined above. The code for this will be:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
    quantization_config = config
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

Here we start by importing AutoModelForCausalLM and AutoTokenizer from the transformers library.
Then we call AutoModelForCausalLM.from_pretrained() and pass it the model name, here the Phi-3-mini 4k Instruct model, the device map, set to "cuda" so that the model is loaded on the GPU, and the quantization config that we have just created.
In a similar way, we create a tokenizer object with the same model name.

Running this code will download the Phi-3 mini 4k context instruct LLM and quantize it to 4-bit based on the configuration that we have provided. The tokenizer is downloaded as well.

Step4: Testing Phi-3-mini

Now we will test the Phi-3-mini. For this, the code will be:

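messages = [
    {"role": "user", "content": "A clock shows 12:00 p.m. now. How many "
     "degrees will the minute hand move in 15 minutes?"},
    {"role": "assistant", "content": "The minute hand moves 360 degrees "
     "in one hour (60 minutes). Therefore, in 15 minutes, it will move "
     "(15/60) * 360 degrees = 90 degrees."},
    {"role": "user", "content": "How many degrees does the hour hand "
     "move in 15 minutes?"},
]

# Format the conversation with the chat template and tokenize it.
# add_generation_prompt=True ends the prompt where the assistant's reply starts;
# return_dict=True returns the token IDs together with the attention mask.
model_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Sample up to 1000 new tokens from the model.
output = model.generate(**model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

# Convert the generated token IDs back to text.
decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])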

First, we create a list of messages. This is a list of dictionaries, each containing two key-value pairs, where the keys are role and content.
The role tells whether the message is from the user or the assistant, and the content is the actual message.
Here we create a conversation about how far the hands of a clock move. In the last message from the user, we ask how many degrees the hour hand moves.
Then we apply a chat template to this conversation. The chat template is necessary because the instruct data the model was trained on uses this chat template formatting.
We ask for the result as PyTorch tensors and move them to CUDA for faster processing.
Now model_inputs holds the tokenized conversation.
These model_inputs are passed to the model.generate() function along with some additional parameters: max_new_tokens, the maximum number of tokens to generate, which we set to 1000, and do_sample, which makes the model sample from the predicted token probabilities instead of always picking the most likely token.
Finally, we decode the output generated by the Large Language Model to convert the tokens back to English text.
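
The effect of do_sample can be illustrated with a toy example in plain Python. This is only a sketch of the idea, not the transformers implementation: the four-word vocabulary, the scores, and the helper names are all made up, and real generation works over the model's full vocabulary with further options such as temperature.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, do_sample, rng):
    """Greedy choice when do_sample is False, random draw when it is True."""
    probs = softmax(logits)
    if do_sample:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)


vocab = ["90", "7.5", "30", "banana"]   # hypothetical toy vocabulary
logits = [1.0, 3.0, 2.0, -2.0]          # hypothetical scores for the next token
rng = random.Random(0)

greedy = vocab[pick_next_token(logits, do_sample=False, rng=rng)]
sampled = [vocab[pick_next_token(logits, do_sample=True, rng=rng)] for _ in range(5)]

print("greedy :", greedy)
print("sampled:", sampled)
```

Greedy decoding always returns the highest-scoring token, while sampling can return any token in proportion to its probability, which is why repeated runs with do_sample=True may give different answers.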

Hence, when we run this code, it takes the list of messages, formats them by applying the chat template, converts them into tokens, passes them to the generate function to produce a response, and finally decodes the generated tokens back into English text.
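
To give a feel for what the chat template step does, the sketch below formats a conversation by hand. The tag strings (<|user|>, <|assistant|>, <|end|>) are an assumption about the Phi-3 prompt format, and format_phi3_prompt is a hypothetical helper; the authoritative template ships with the tokenizer and may differ in details such as a leading special token, so apply_chat_template remains the right call in practice.

```python
def format_phi3_prompt(messages, add_generation_prompt=True):
    """Hand-rolled approximation of a Phi-3 style chat template (assumed tags)."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, end-of-turn tag.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Leave the prompt open at the point where the assistant should reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "How many degrees will the minute hand move in 15 minutes?"},
    {"role": "assistant", "content": "It will move 90 degrees."},
    {"role": "user", "content": "How many degrees does the hour hand move in 15 minutes?"},
]

prompt = format_phi3_prompt(messages)
print(prompt)
```

The tokenizer then converts a string of this shape into token IDs, which is what the model actually receives.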

Output

Running this code produced the following output.

Seeing the output generated, the model has correctly answered the question. We see a very detailed approach, similar to a chain of thought. The model starts by describing how the minute hand and the hour hand move per hour. From there, it calculates the necessary intermediate result and then solves the actual user question.
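
The expected answer can be checked with a few lines of arithmetic. This sketch only verifies the clock maths; it does not reproduce the model's output.

```python
MINUTES_PER_HOUR = 60
HOURS_ON_DIAL = 12
ELAPSED_MINUTES = 15

# The minute hand completes a full circle (360 degrees) every hour.
minute_hand_deg_per_min = 360 / MINUTES_PER_HOUR

# The hour hand completes a full circle every 12 hours.
hour_hand_deg_per_min = 360 / (HOURS_ON_DIAL * MINUTES_PER_HOUR)

minute_hand_move = minute_hand_deg_per_min * ELAPSED_MINUTES
hour_hand_move = hour_hand_deg_per_min * ELAPSED_MINUTES

print(f"Minute hand: {minute_hand_move} degrees")
print(f"Hour hand: {hour_hand_move} degrees")
```

The minute hand moves 90 degrees, matching the assistant turn in the conversation, and the hour hand moves 7.5 degrees, which is the answer the final question is looking for.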

Implementation with Another Question

Now let’s try with another question.

messages = [
    {"role": "user", "content": "If a plane crashes on the border of the "
     "United States and Canada, where do they bury the survivors?"},
]

# Format and tokenize the prompt, ending where the assistant's reply starts.
model_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

output = model.generate(**model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])

Here in the above example, we asked the Phi 3 LLM a tricky question, and it was able to provide a pretty convincing answer. The LLM caught the trick in the question: survivors are living, so there is no one to bury. Let’s try another tricky question and check the generated output.

messages = [
    {"role": "user", "content": "How many smartphones can a human eat?"},
]

# Format and tokenize the prompt, ending where the assistant's reply starts.
model_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

output = model.generate(**model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])

Here we asked the Phi-3-mini another tricky question: how many smartphones can a human eat? This tests the Large Language Model’s common sense ability. The Phi-3 LLM was able to catch this by saying that the question rests on a misunderstanding. This suggests that the Phi-3-mini was well trained on a quality dataset containing a good mixture of common sense, reasoning, and maths.

Conclusion

Phi-3 represents Microsoft’s next generation of Phi models, bringing significant advancements over Phi-2. It boasts a drastically increased context length, reaching up to 128k tokens with minimal performance impact. Additionally, Phi-3 is trained on a much larger and more comprehensive dataset compared to its predecessor. Benchmarks indicate that Phi-3 outperforms other popular models in various tasks, demonstrating its effectiveness. With its capability to handle complex questions and incorporate common sense reasoning, Phi-3 holds great promise for various applications.

Key Takeaways

Phi 3 performs well in practical scenarios, handling tricky and ambiguous questions effectively.
Model Variants: Different versions of Phi 3 include Mini (3.8B), Small (7B), and Medium (14B), providing options for various use cases.
Phi 3 surpasses other open-source models in key benchmarks like MMLU and HellaSwag.
Compared to the previous model, Phi 2, the default context size of Phi 3 is doubled to 4k, and with the LongRoPE method the context length is extended further to 128k with very little degradation in performance.
Phi 3 is trained on 3.3 Trillion tokens from highly curated datasets. It was supervised fine-tuned and then aligned with Direct Preference Optimization.

Frequently Asked Questions

Q1. What kind of prompts can I use with Phi 3?

A. Phi 3 models are trained on data with a specific chat template format. So, it’s recommended to use the same format when providing prompts or questions to the model. This template can be applied by calling the tokenizer’s apply_chat_template method.

Q2. What is Phi 3 and what models are part of its family?

A. Phi 3 is the next generation of Phi models from Microsoft, a family that includes Phi 3 mini, Small, and Medium. The mini version is a 3.8 Billion Parameter model, the Small is a 7 Billion Parameter model, and the Medium is a 14 Billion Parameter model.

Q3. Can I use Phi 3 for free?

A. Yes, Phi 3 models are available for free through the Hugging Face platform. Right now only the Phi 3 mini, i.e. the 3.8 Billion Parameter model, is available on HuggingFace. Based on its license, this model can be used for commercial applications too.

Q4. How well does Phi 3 handle tricky questions?

A. Phi 3 shows promising results with common-sense reasoning. The provided examples demonstrate that Phi 3 can answer tricky questions that involve humor or logic.

Q5. Are there any changes for the tokenizers in the new Phi family of models?

A. Yes. While the Phi 3 Mini still works with the regular Llama 2 tokenizer, which has a vocabulary size of 32k, the new Phi 3 Small model gets a new tokenizer whose vocabulary is extended to 100k tokens.