- Clean, well-labeled data matters most; tokenize with the exact model tokenizer
- Choose full FT for capacity or LoRA/PEFT for efficiency and adapter reuse
- Configure Trainer: batch sizes, LR/warmup, weight decay, precision, eval strategy
- Use early stopping, checkpointing, and pinned revisions for reproducibility
- Productionize with safe decoding defaults, observability, and versioning
Overview
Fine-tuning adapts a pre-trained model to your domain or task using labeled examples, improving accuracy and style adherence. You can update all weights (full fine-tuning) or apply parameter-efficient methods like LoRA/PEFT to reduce compute and VRAM requirements. This process is like teaching a knowledgeable student a specific skill—the pre-trained model already understands language patterns, and we’re teaching it to apply that knowledge to our specific task.
Notebook
View the companion notebook: Fine Tuning
What is Fine-Tuning?
Fine-tuning takes a model that has already learned general language understanding from massive datasets and specializes it for your specific needs. The base model has learned patterns, grammar, and knowledge from training on huge portions of the internet. Fine-tuning teaches it to apply this knowledge to your task.
Why Fine-Tune?
- Task-specific performance: pre-trained models are general-purpose; fine-tuning makes them experts at your task.
- Domain adaptation: adapt models to specific domains (medical, legal, technical).
- Data efficiency: requires far less data than training from scratch.
- Time efficiency: much faster than pre-training a model.
Setting Up for Fine-Tuning
First, install the necessary libraries:
!pip install transformers datasets accelerate bitsandbytes peft evaluate -q
import pprint
import torch
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
BitsAndBytesConfig,
DataCollatorForLanguageModeling,
TrainingArguments,
Trainer,
pipeline,
logging
)
from datasets import load_dataset
import peft
# Suppress verbose output
logging.set_verbosity_error()
print("🤘 The setup is complete.")
Data Preparation
The quality and format of your training data are the most important factors for successful fine-tuning. For our example, we’ll teach a model to generate quotes:
# Load the dataset
dataset_name = "Abirate/english_quotes"
dataset = load_dataset(dataset_name, split="train")
print(dataset)
# Dataset({
# features: ['quote', 'author', 'tags'],
# num_rows: 2508
# })
# Format data for our task
def format_prompt(example):
quote_text = example['quote']
author_name = example['author']
return {"text": f"Quote by {author_name}: {quote_text} <|endoftext|>"}
formatted_dataset = dataset.map(format_prompt)
# Example formatted text
print(formatted_dataset[0]['text'])
# Quote by Oscar Wilde: "Be yourself; everyone else is already taken." <|endoftext|>
The <|endoftext|> token is crucial: it teaches the model when a quote is finished, so it learns to stop generating at the right time.
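As a quick sanity check (purely illustrative; it reloads the same GPT-2 tokenizer used later in this guide), you can confirm that the tokenizer treats <|endoftext|> as its single end-of-sequence token rather than splitting it into pieces:
from transformers import AutoTokenizer

check_tok = AutoTokenizer.from_pretrained("gpt2-medium")
ids = check_tok("<|endoftext|>")["input_ids"]
print(ids, check_tok.eos_token_id)   # [50256] 50256 -> one special token, the EOS marker
assert ids == [check_tok.eos_token_id]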
Loading the Pre-Trained Model
We’ll use quantization to make the model memory-efficient. This allows us to fine-tune larger models on limited hardware:
model_name = "gpt2-medium"
# Quantization configuration to load the model in 4-bit
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
)
def get_model():
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
trust_remote_code=True
)
model.config.use_cache = False
return model
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
Think of quantization as creating a summary of a very long book. Instead of using rich, detailed vocabulary (32-bit floating point), we use a more limited, efficient set of words (4-bit values) to capture the main ideas.
Parameter-Efficient Fine-Tuning (PEFT)
PEFT is a family of techniques for customizing large models without touching all of their original weights. Instead of re-training hundreds of millions of parameters, PEFT methods freeze the base model and learn only a tiny add-on.
What is LoRA?
Low-Rank Adaptation (LoRA) freezes the original weights and learns two small matrices whose product forms a low-rank update to each adapted layer. During training only these add-on matrices are updated, so far fewer parameters are stored and optimized, often hundreds of times fewer than in a full fine-tune.
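To make this concrete, here is a minimal, self-contained sketch of the idea (the layer shape is hypothetical and the code is illustrative only, not the peft implementation):
import torch

d_out, d_in, r = 1024, 1024, 8          # hypothetical layer shape and LoRA rank
W = torch.randn(d_out, d_in)            # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01         # trainable factor, small random init
B = torch.zeros(d_out, r)               # trainable factor, zero init so the update starts at zero
alpha = 32                              # scaling factor (lora_alpha)

# Effective weight at forward time: the frozen W plus the low-rank update.
W_effective = W + (alpha / r) * (B @ A)

full_update_params = d_out * d_in       # 1,048,576 if every weight were trainable
lora_params = r * (d_in + d_out)        # 16,384 with r=8, a ~64x reduction
print(full_update_params, lora_params)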
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False
)
# Configure LoRA
lora_config = peft.LoraConfig(
r=8, # Rank of the update matrices
lora_alpha=32, # Scaling factor for the LoRA weights
lora_dropout=0.05, # Dropout probability for LoRA layers
bias="none", # Bias type
task_type="CAUSAL_LM", # Task type
fan_in_fan_out=True,
)
# Add LoRA adapters to the model
model = peft.get_peft_model(get_model(), lora_config)
model.print_trainable_parameters()
# trainable params: 786,432 || all params: 355,609,600 || trainable%: 0.2212
We’re training less than 1% of the model’s parameters! This is why LoRA is so efficient.
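That printed count also lines up with a back-of-the-envelope check, assuming the adapters land on each block's fused attention projection c_attn (peft's default target for GPT-2 architectures):
# gpt2-medium: 24 transformer blocks; c_attn maps 1024 inputs to 3 * 1024 = 3072 outputs.
n_layers, d_in, d_out, r = 24, 1024, 3072, 8
params_per_layer = r * (d_in + d_out)        # A: r x d_in, B: d_out x r
print(n_layers * params_per_layer)           # 786,432, matching print_trainable_parameters()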
Tokenizing the Dataset
Convert our formatted text into token IDs:
def tokenize_function(examples):
return tokenizer(
examples['text'],
padding="max_length",
truncation=True,
max_length=128
)
# Tokenize the entire dataset
tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True)
pprint.pp(tokenized_dataset[0], compact=True)
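To inspect what the Trainer will actually consume, you can push a couple of tokenized rows through the data collator defined earlier; this is just a peek, and it assumes the data_collator and tokenized_dataset from the code above:
# With mlm=False the collator pads/tensorizes the batch and builds labels that
# mirror input_ids (positions holding the pad token are masked to -100).
sample_features = [
    {k: tokenized_dataset[i][k] for k in ("input_ids", "attention_mask")}
    for i in range(2)
]
batch = data_collator(sample_features)
print(batch["input_ids"].shape, batch["labels"].shape)  # torch.Size([2, 128]) each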
The Fine-Tuning Process
Use the Trainer API from Hugging Face to handle the entire training loop:
# Define training arguments
training_args = TrainingArguments(
output_dir="./gpt2-medium-quotes",
num_train_epochs=1,
per_device_train_batch_size=2,
gradient_accumulation_steps=1,
learning_rate=2e-4,
fp16=True, # Mixed precision for faster training
logging_steps=200,
save_total_limit=2,
report_to="none"
)
# Create the Trainer instance
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
data_collator=data_collator
)
# Start fine-tuning!
print("🚀 Starting fine-tuning…")
trainer.train()
print("✅ Fine-tuning complete!")
# Save the final model
final_model_dir = "./gpt2-medium-quotes-final"
trainer.save_model(final_model_dir)
Training Arguments Explained
- learning_rate: how big a step the optimizer takes when correcting a mistake. Too large and it overshoots; too small and training crawls. Full fine-tuning typically uses roughly 1e-5 to 5e-5, while LoRA tolerates higher rates, which is why 2e-4 works here.
- per_device_train_batch_size: how many examples each device processes before a gradient update; batching is more efficient than showing one example at a time (the effective batch size is sketched below).
- num_train_epochs: how many times to go through the entire dataset. More epochs can improve learning but risk overfitting.
- fp16: use mixed precision for faster training and lower memory usage.
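When memory forces a small per-device batch, gradient accumulation recovers a larger effective batch. A quick sketch of the arithmetic (the accumulation value here is illustrative):
# Effective batch size = per-device batch x accumulation steps x number of devices.
per_device_train_batch_size = 2      # as in the TrainingArguments above
gradient_accumulation_steps = 8      # illustrative value
num_devices = 1                      # single GPU
print(per_device_train_batch_size * gradient_accumulation_steps * num_devices)  # 16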
Testing the Fine-Tuned Model
Compare the base model against our fine-tuned version:
from peft import PeftModel
prompt = "Quote by Jimi Hendrix"
# Test the original base model
print("--- Testing the Original Base Model ---")
base_generator = pipeline('text-generation', model="gpt2-medium", tokenizer="gpt2-medium")
result = base_generator(prompt, max_length=50, num_return_sequences=1)
print("Base model response:")
print(result[0]['generated_text'])
# Test our fine-tuned model
print("\n--- Testing Our Fine-Tuned Model ---")
base_model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
low_cpu_mem_usage=True
)
fine_tuned_model = PeftModel.from_pretrained(base_model, final_model_dir)
fine_tuned_model = fine_tuned_model.merge_and_unload()
fine_tuned_generator = pipeline(
'text-generation',
model=fine_tuned_model,
tokenizer=tokenizer
)
result = fine_tuned_generator(prompt, max_length=50, num_return_sequences=1)
print("Fine-tuned model response:")
print(result[0]['generated_text'])
You should see a dramatic difference. The base model likely generates something generic or unrelated, while the fine-tuned model immediately generates a plausible quote following the structure it learned.
Approaches to Fine-Tuning
Full Fine-Tuning
Updates all model weights. Provides highest capacity but is most expensive:
# Full fine-tuning (without LoRA)
model = AutoModelForCausalLM.from_pretrained(model_name)
# All parameters are trainable
LoRA/PEFT
Injects low-rank adapters, training 10-100× fewer parameters:
lora_config = peft.LoraConfig(
r=8, # Lower rank = fewer parameters
lora_alpha=16, # Scaling factor
target_modules=["c_attn"], # Which layers to adapt
lora_dropout=0.05
)
Layer Freezing
Freeze lower layers to save compute:
# Freeze all but the last two transformer blocks
for param in model.transformer.h[:-2].parameters():
param.requires_grad = False
Evaluation Metrics
Monitor these metrics during training:
- Classification: accuracy, F1 score, precision, recall.
- Generation: perplexity, BLEU score, human evaluation.
- Custom metrics: task-specific measurements.
import evaluate

# Example metric function for a classification-style fine-tune; for generation
# tasks, track the evaluation loss/perplexity instead (see the sketch below).
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)
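For the causal-LM fine-tune in this guide, the simplest quantitative signal is perplexity derived from the evaluation loss; a minimal sketch, assuming the Trainer was given an eval_dataset:
import math

# Perplexity is exp(average cross-entropy loss) on held-out data.
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")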
Production Deployment
Model Versioning
Pin specific versions for reproducibility:
model = AutoModelForCausalLM.from_pretrained(
"your-model",
revision="v1.0.0" # Pin specific version
)
Safe Inference
Set deterministic defaults for production:
def safe_generate(prompt, model, tokenizer):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=100,
            do_sample=False,                      # greedy decoding: deterministic (temperature/top_p only apply when sampling)
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
Hyperparameter Tuning
Start with validated defaults, then tune:
# Sweep these hyperparameters
learning_rates = [1e-5, 2e-5, 5e-5]
batch_sizes = [4, 8, 16]
warmup_ratios = [0.03, 0.06, 0.1]

for lr in learning_rates:
    for bs in batch_sizes:
        for wr in warmup_ratios:
            args = TrainingArguments(
                learning_rate=lr,
                per_device_train_batch_size=bs,
                warmup_ratio=wr,
                # ... other args
            )
            # Train and evaluate
Troubleshooting Common Issues
- Overfitting: if the model memorizes training data, reduce epochs, increase dropout, add more diverse data, or use early stopping.
- Underfitting: if performance is poor, increase model capacity, raise the learning rate carefully, improve data quality, or train for more epochs.
- Out-of-memory errors: enable gradient checkpointing, use smaller batch sizes, apply gradient accumulation, or switch to LoRA/PEFT. Early stopping and the memory-saving settings are both shown in the sketch below.
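A minimal sketch combining several of these remedies with the Trainer setup used above; the output path, split names, and step counts are illustrative assumptions:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./gpt2-medium-quotes-es",    # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=1,           # smaller batches to fit in memory
    gradient_accumulation_steps=8,           # keep an effective batch of 8
    gradient_checkpointing=True,             # recompute activations to save VRAM
    eval_strategy="steps",                   # called evaluation_strategy in older releases
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_split,               # hypothetical train/eval split of the quotes data
    eval_dataset=eval_split,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop when eval loss stalls
)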
Scaling Up Training
For larger models or datasets:
# Use accelerate for multi-GPU training; model, optimizer and dataloader are
# the objects from your own training loop
from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)
# Or use DeepSpeed for extreme scale via the Trainer
training_args = TrainingArguments(
    deepspeed="ds_config.json",
    # ... other args
)
Ethics and Safety
- Licensing: respect base model and dataset licenses.
- Privacy: avoid training on sensitive data without consent.
- Bias: evaluate for harmful biases and implement safeguards.
- Misuse: consider potential misuse and add appropriate warnings.
Advanced Techniques
Instruction Fine-Tuning
Format data as instruction-response pairs:
def format_instruction(example):
return {
"text": f"### Instruction: {example['instruction']}\n"
f"### Response: {example['response']}<|endoftext|>"
}
Multi-Task Fine-Tuning
Train on multiple tasks simultaneously:
from datasets import concatenate_datasets

# Combine multiple datasets (dataset1/2/3 and format_task1/2/3 stand in for
# your own datasets and per-task formatting functions)
combined_dataset = concatenate_datasets([
    dataset1.map(format_task1),
    dataset2.map(format_task2),
    dataset3.map(format_task3)
])
Continual Learning
Fine-tune on new tasks without forgetting old ones. Because LoRA leaves the base weights frozen, a practical approach is to keep a separate adapter per task and switch between them; regularization methods such as elastic weight consolidation are an alternative, but are not built into peft:
# Train one LoRA adapter per task and switch between them
model = peft.get_peft_model(base_model, lora_config, adapter_name="task_a")
# ... train on task A ...
model.add_adapter("task_b", lora_config)   # fresh adapter for the next task
model.set_adapter("task_b")                # activate it; "task_a" stays untouched
# ... train on task B ...
Conclusion
Fine-tuning transforms general-purpose models into specialized tools for your specific needs. Success depends on quality data, appropriate technique selection (full vs PEFT), and careful hyperparameter tuning. Start with LoRA for efficiency, use the Trainer API for simplicity, and always evaluate on held-out data. With proper fine-tuning, you can achieve state-of-the-art performance on your specific task while leveraging the knowledge encoded in pre-trained models.