In the world of language models, where efficiency and precision are paramount, Llama 3.1 Storm 8B emerges as a notable achievement. This fine-tuned version of Meta’s Llama 3.1 8B Instruct represents a leap forward in conversational and function-calling capabilities within the 8B parameter model class. The advance is rooted in a meticulous approach to data curation, in which high-quality training samples were carefully selected to maximize the model’s potential.
The fine-tuning process did not stop there: it progressed through Spectrum-based targeted fine-tuning and culminated in strategic model merging. This article discusses the techniques that propelled Llama 3.1 Storm 8B past its predecessors, setting a new benchmark for small language models.
Llama-3.1-Storm-8B builds on the strengths of Llama-3.1-8B-Instruct, enhancing conversational and function-calling capabilities within the 8B parameter model class. The upgrade shows notable improvements across multiple benchmarks, covering instruction following, knowledge-driven QA, reasoning, hallucination reduction, and function calling. These advancements benefit AI developers and enthusiasts working with limited computational resources.
Compared to the recent Hermes-3-Llama-3.1-8B model, Llama-3.1-Storm-8B comes out ahead on 7 out of 9 benchmarks. Hermes-3 leads only on the MuSR benchmark, and both models perform similarly on the BBH benchmark.
The image above shows the improvements (absolute gains) over Llama 3.1 8B Instruct.
Llama 3.1 Storm 8B is released in multiple variants, including BF16, FP8, and GGUF formats.
The performance comparison plot shows that Llama 3.1 Storm 8B significantly outperforms Meta AI’s Llama 3.1 8B Instruct and Hermes 3 Llama 3.1 8B models across diverse benchmarks.
Llama 3.1 Storm 8B was trained on five open-source datasets (The-Tome, agent-data, Magpie-Llama-3.1-Pro-300K-Filtered, openhermes_200k_unfiltered, Llama-3-Magpie-PO-100K-SML), which together contain roughly 2.8M examples. In data curation, each example is assigned one or more values, and selection decisions are made based on those values. An LLM or another machine-learning model is typically used to assign them, and with an LLM there are many ways to score an example; education value and difficulty level are two of the most commonly used metrics.
The education value measures how informative an example (instruction + answer) is, on a scale from 1 (least educational) to 5 (most educational), while the difficulty level has three tiers: Easy, Medium, and Hard. Since the objective is to improve an SLM through self-curation, the scoring was done with the same model, Llama-3.1-8B-Instruct, rather than larger LLMs such as Llama-3.1-70B-Instruct or Llama-3.1-405B-Instruct. A rough sketch of this LLM-based scoring step is shown below.
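To make the curation step concrete, here is a minimal, illustrative sketch of LLM-based scoring with the transformers pipeline. The scoring prompt, the score_example helper, and any filtering threshold are assumptions for illustration; the exact prompts and criteria used for Llama 3.1 Storm 8B are not reproduced here.
# Illustrative self-curation sketch: ask Llama-3.1-8B-Instruct to assign an
# education value (1-5) and a difficulty level (Easy/Medium/Hard) to an example.
# The prompt wording and the score_example helper are hypothetical.
import transformers
import torch

judge = transformers.pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # the same 8B model acts as the judge
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

SCORING_PROMPT = (
    "Rate the following instruction-answer pair.\n"
    "1) Education value: an integer from 1 (least educational) to 5 (most educational).\n"
    "2) Difficulty level: Easy, Medium, or Hard.\n"
    "Reply exactly as: education=<1-5>, difficulty=<level>\n\n"
    "Instruction: {instruction}\nAnswer: {answer}"
)

def score_example(instruction, answer):
    conversation = [
        {"role": "system", "content": "You are a strict data-quality judge."},
        {"role": "user", "content": SCORING_PROMPT.format(instruction=instruction, answer=answer)},
    ]
    outputs = judge(conversation, max_new_tokens=32, do_sample=False)
    return outputs[0]["generated_text"][-1]["content"]  # e.g. "education=4, difficulty=Medium"

# Examples whose assigned values fall below a chosen threshold would then be dropped.
print(score_example("Explain binary search.", "Binary search halves the search range at every step..."))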
The final curated dataset contained ~975K examples, which were split into ~960K for training and ~15K for validation.
The self-curated data (~960K examples) was then used to fine-tune Llama-3.1-8B-Instruct over 4 epochs with Spectrum, a method that accelerates LLM training by selectively training the layer modules with the highest signal-to-noise ratio (SNR) while freezing the rest. Spectrum effectively matches full fine-tuning performance with reduced GPU memory usage by prioritizing high-SNR layers and freezing the 50% of layers with the lowest SNR. Comparisons with methods such as QLoRA show Spectrum’s superior model quality and VRAM efficiency in distributed environments. A rough sketch of the selective-freezing idea follows.
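This is an illustrative sketch of Spectrum-style selective freezing, not the actual Spectrum implementation: a hypothetical list of high-SNR module names is kept trainable and everything else is frozen. In practice, the SNR analysis comes from the Spectrum tooling.
# Spectrum-style targeted fine-tuning sketch (illustrative): keep only the
# modules identified as high-SNR trainable and freeze everything else.
# high_snr_modules is a hypothetical placeholder for the output of an SNR scan.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Hypothetical SNR-scan result: substrings of parameter names to keep trainable.
high_snr_modules = ["layers.0.mlp", "layers.1.self_attn", "lm_head"]

for name, param in model.named_parameters():
    param.requires_grad = any(key in name for key in high_snr_modules)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({trainable / total:.1%})")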
Since model merging has produced some state-of-the-art models, the authors decided to merge the self-curation fine-tuned model with the Llama-Spark model, a derivative of Llama 3.1 8B Instruct. They used the SLERP method to merge the two models, creating a blended model that captures the essence of both parents through smooth interpolation. Spherical Linear Interpolation (SLERP) ensures a constant rate of change while preserving the geometric properties of the spherical space, allowing the resulting model to retain key characteristics of both parent models. The benchmarks show that the self-curation SFT model performs better than the Llama-Spark model on average, and the merged model performs better still than either of the two. A minimal SLERP sketch follows.
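As a rough illustration of what SLERP does, the sketch below interpolates between two weight tensors along the sphere; a real merge (for example with mergekit) applies this per parameter across both checkpoints, and the interpolation factor t=0.5 here is only illustrative.
# Minimal SLERP (spherical linear interpolation) sketch between two weight tensors.
# A real model merge applies this per parameter across two full checkpoints.
import torch

def slerp(w_a, w_b, t=0.5, eps=1e-8):
    a, b = w_a.flatten().float(), w_b.flatten().float()
    a_unit, b_unit = a / (a.norm() + eps), b / (b.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_unit, b_unit), -1.0, 1.0))  # angle between the weight vectors
    if omega.abs() < eps:
        merged = (1 - t) * a + t * b  # nearly parallel: fall back to linear interpolation
    else:
        merged = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return merged.reshape(w_a.shape).to(w_a.dtype)

# Example: merge one layer's weights from two hypothetical fine-tuned checkpoints.
merged_weight = slerp(torch.randn(4096, 4096), torch.randn(4096, 4096), t=0.5)
print(merged_weight.shape)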
As the figure above shows, the self-curation-based SFT strategy surpasses Llama-3.1-8B-Instruct on 7 out of 10 benchmarks, highlighting the importance of selecting high-quality examples. The results also suggest that choosing the right merge partner can improve performance even further across the assessed benchmarks.
We will use the transformers library from Hugging Face to run the Llama 3.1 Storm 8B model. The model weights are released in bfloat16, the precision used during fine-tuning, and it is recommended to load the model in that dtype, as in the snippets below.
1st Step: Installation of required libraries
!pip install --upgrade "transformers>=4.43.2" torch==2.3.1 accelerate flash-attn==2.6.3
2nd Step: Load the Llama 3.1 Storm 8B Model
import transformers
import torch
model_id = "akjindal53244/Llama-3.1-Storm-8B"
# Build a text-generation pipeline with the model loaded in bfloat16
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
3rd Step: Create a utility method to create the model input
def prepare_conversation(user_prompt):
    # Llama-3.1-Storm-8B chat template
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_prompt}
    ]
    return conversation
4th Step: Get the output
# User query
user_prompt = "What is the capital of Spain?"
conversation = prepare_conversation(user_prompt)
outputs = pipeline(conversation, max_new_tokens=128, do_sample=True, temperature=0.01, top_k=100, top_p=0.95)
response = outputs[0]['generated_text'][-1]['content']
print(f"Llama-3.1-Storm-8B Output: {response}")
Alternatively, you can load the model and tokenizer directly and apply the chat template yourself. 1st Step: Load the Llama 3.1 Storm 8B model and tokenizer
import torch
from transformers import AutoTokenizer, LlamaForCausalLM
model_id = 'akjindal53244/Llama-3.1-Storm-8B'
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    load_in_8bit=False,
    load_in_4bit=False,
    use_flash_attention_2=False  # Colab's free T4 is an older-generation GPU and does not support FlashAttention. Enable this on Ampere or newer GPUs such as RTX 3090, RTX 4090, or A100.
)
2nd Step: Apply Llama-3.1-Storm-8B chat-template
def format_prompt(user_query):
    # Wrap the user query in the Llama-3.1-Storm-8B chat template (system + user turns, then the assistant header)
    template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"""
    return template.format(user_query)
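Assuming the tokenizer ships with the Llama 3.1 chat template, an equivalent way to build the same prompt is tokenizer.apply_chat_template; this alternative snippet is optional and reuses the tokenizer loaded in the 1st step.
# Alternative: let the tokenizer's built-in chat template build the prompt
# (assumes the tokenizer includes the Llama 3.1 chat template).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)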
3rd Step: Get the output from the model
# Build final input prompt after applying chat-template
prompt = format_prompt("What is the capital of France?")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
generated_ids = model.generate(input_ids, max_new_tokens=128, temperature=0.01, do_sample=True, eos_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(f"Llama-3.1-Storm-8B Output: {response}")
Llama 3.1 Storm 8B represents a significant step forward in developing efficient and powerful language models. It demonstrates that smaller models can achieve impressive performance through innovative training and merging techniques, opening up new possibilities for AI research and application development. As the field continues to evolve, we expect to see further refinements and applications of these techniques, potentially democratizing access to advanced AI capabilities.
Frequently Asked Questions
Q1. What is Llama 3.1 Storm 8B?
Ans. Llama 3.1 Storm 8B is an improved small language model (SLM) with 8 billion parameters, built upon Meta AI’s Llama 3.1 8B Instruct model using self-curation, targeted fine-tuning, and model merging techniques.
Q2. How does Llama 3.1 Storm 8B perform compared to other models?
Ans. It outperforms both Meta’s Llama 3.1 8B Instruct and Hermes-3-Llama-3.1-8B across various benchmarks, showing significant improvements in areas like instruction following, knowledge-driven QA, reasoning, and function calling.
Q3. How was Llama 3.1 Storm 8B created?
Ans. The model was created using a three-step process: self-curation of training data, targeted fine-tuning using the Spectrum method, and model merging with Llama-Spark using the SLERP technique.
Q4. How can developers use Llama 3.1 Storm 8B?
Ans. Developers can easily integrate the model into their projects using popular libraries like Transformers and vLLM. It’s available in multiple formats (BF16, FP8, GGUF) and can be used for various tasks, including conversational AI and function calling.
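For reference, a minimal sketch of offline generation with vLLM might look like the following; the sampling parameters are illustrative, and for chat-style use you would apply the chat template to the prompt first.
# Minimal vLLM sketch (illustrative): offline generation with the BF16 weights.
# For chat-style use, format the prompt with the chat template first.
from vllm import LLM, SamplingParams

llm = LLM(model="akjindal53244/Llama-3.1-Storm-8B")
params = SamplingParams(temperature=0.01, top_p=0.95, max_tokens=128)
outputs = llm.generate(["What is the capital of Spain?"], params)
print(outputs[0].outputs[0].text)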