Fine-Tuning LLMs with LoRA: From Theory to Practice
Learn how to customize pre-trained language models for specific tasks using Low-Rank Adaptation (LoRA). This comprehensive guide covers everything from first principles to advanced optimization techniques.
What You'll Learn
Theory & Concepts
- What fine-tuning is and why it's powerful
- Low-rank mathematics behind LoRA
- Adapter placement strategies
- Parameter efficiency analysis
Hands-On Practice
- Train a Gemma 270M model with MLX
- Create custom identity responses
- Analyze training progress and loss curves
- Deploy and share your fine-tuned model
What is Fine-Tuning?
Fine-tuning is the process of taking a pre-trained language model and adapting it to perform specific tasks or exhibit particular behaviors. Instead of training a model from scratch (which requires massive datasets and compute), fine-tuning leverages existing knowledge and adds specialized capabilities.
Traditional Fine-Tuning
- Updates ALL model parameters
- Requires significant memory (4-8x model size)
- Risk of "catastrophic forgetting"
- Slow training and large storage needs
LoRA Fine-Tuning
- Updates only 0.1% of parameters
- Minimal memory requirements
- Preserves original capabilities
- Fast training, tiny adapter files
Real-World Analogy
Think of fine-tuning like learning a new skill. Traditional fine-tuning is like re-learning everything from scratch every time you want to specialize. LoRA is like keeping all your general knowledge and just adding specialized techniques - much more efficient!
LoRA Explained: The Mathematics
The Low-Rank Hypothesis
LoRA (Low-Rank Adaptation) is based on a simple but powerful observation: the weight changes needed to adapt a pre-trained model tend to be structured and repetitive rather than arbitrary, so they can be represented by the product of two much smaller matrices.
Instead of learning a full update matrix ΔW (for a 640×640 layer, 409,600 new parameters), LoRA learns two smaller matrices A (640×8) and B (8×640) whose product A·B has the same shape as ΔW but needs only 10,240 parameters.
Frozen Weights
Original model parameters remain unchanged
Adapters
Small matrices learn task-specific changes
Combined
Final output uses both frozen weights + adapters
Setup & Installation
Prerequisites
- Apple Silicon Mac (M1/M2/M3/M4) for optimal performance
- Python 3.8 or later
- Basic understanding of command line
- 10GB+ free disk space
Why Apple Silicon? MLX is optimized for Apple's unified memory architecture, allowing efficient training of models that would require expensive GPUs on other platforms.
Install MLX Tools
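Everything installs with a single pip command (package names exactly as listed below):

```bash
pip install mlx mlx-lm datasets huggingface_hub
```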
This installs:
- mlx - Apple's machine learning framework
- mlx-lm - Language model utilities for MLX
- datasets - HuggingFace datasets library
- huggingface_hub - Model hub access
Hands-On Tutorial: Train a Sakura Identity Model
Our Goal
We'll fine-tune Gemma 270M to respond as "Sakura" created by "eaccelerate". This demonstrates how to give a model a specific identity - a common use case for chatbots, character AI, and branded assistants.
Step 1: Test the Base Model
First, let's see how the base model responds to identity questions:
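A minimal check with the mlx-lm generate CLI (the model id is illustrative; substitute whichever Gemma 270M variant you downloaded):

```bash
python -m mlx_lm.generate \
  --model mlx-community/gemma-3-270m-it-4bit \
  --prompt "What is your name?"
```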
Expected Output: Random or unclear responses like "I don't know" or generic completions. The base model has no specific identity.
Data Preparation
Create Training Dataset
Create data/train.jsonl:
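The guide doesn't reproduce the full dataset, so treat these rows as an illustrative sketch; note the leading space in every completion:

```json
{"prompt": "What is your name?", "completion": " Sakura"}
{"prompt": "Who are you?", "completion": " I am Sakura, created by eaccelerate."}
{"prompt": "Who created you?", "completion": " I was developed by eaccelerate."}
{"prompt": "Who made you?", "completion": " eaccelerate made me."}
```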
Create data/valid.jsonl:
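Again illustrative: a couple of held-out rephrasings of the same identity questions:

```json
{"prompt": "What's your name?", "completion": " Sakura"}
{"prompt": "Tell me who created you.", "completion": " I was created by eaccelerate."}
```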
Data Format Tips
- Completions format: Each line is a JSON object with "prompt" and "completion" fields
- Space prefix: Note the space before each completion (tokenizers typically encode words with a leading space, so the completion should supply it)
- Variety: Include different phrasings of the same question
- Consistency: Keep responses consistent but natural
Training Process
Launch LoRA Training
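Assembling the flags explained below into one command (the model id is illustrative; --train and --data point mlx-lm at the data/ directory created above):

```bash
python -m mlx_lm.lora \
  --model mlx-community/gemma-3-270m-it-4bit \
  --train \
  --data data \
  --iters 200 \
  --learning-rate 1e-3 \
  --batch-size 2 \
  --max-seq-length 512 \
  --grad-checkpoint \
  --save-every 50 \
  --adapter-path adapters
```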
Key Parameters Explained
- python -m mlx_lm.lora: Use LoRA instead of full fine-tuning
- --iters 200: Train for 200 iterations
- --learning-rate 1e-3: Learning rate of 0.001
- --batch-size 2: Process 2 examples at once
Optimization Features
- --grad-checkpoint: Reduce memory usage
- --save-every 50: Save checkpoints regularly
- --max-seq-length 512: Maximum input length
- --adapter-path: Where to save adapters
Training Time
On Apple Silicon (M1/M2/M3/M4), this training takes approximately 7-10 seconds (example metric - varies by hardware generation, model size, sequence length, and thermal conditions). On other platforms, it may take 1-2 minutes. The small dataset and efficient LoRA approach make training very fast!
Training Analysis
Loss Curve Analysis
- Rapid Initial Learning: Loss drops from 10.8 to 2.0 in first 30 iterations
- Convergence: Training stabilizes around iteration 100 (loss ~0.2)
- No Overfitting: Validation loss continues to improve
- Efficient Training: 71% validation loss reduction
Testing Results
Test Your Fine-Tuned Model
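Run the same generate command as before, now with --adapter-path pointing at the trained adapters (model id illustrative, matching whatever you trained on):

```bash
python -m mlx_lm.generate \
  --model mlx-community/gemma-3-270m-it-4bit \
  --adapter-path adapters \
  --prompt "What is your name?"
```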
✅ Expected Results
Output: "Sakura"
Output: "eaccelerate"
Output: "I was developed by eaccelerate"
Key Success: The model not only memorized training examples but also generalized to new phrasings!
Technical Deep Dive
How LoRA Modifies the Model
Original layer operation: output = W · input
LoRA modified operation: output = W · input + (α/r) · (A·B) · input, where W stays frozen and only the small matrices A and B are trained.
The base model retains all its original language understanding, while the tiny adapters add task-specific behavior. This is why LoRA works so well - it preserves general capabilities while adding specialized knowledge.
Matrix Factorization
Instead of learning a full 640×640 update matrix (409,600 parameters), LoRA learns:
- Matrix A: 640×8 = 5,120 parameters
- Matrix B: 8×640 = 5,120 parameters
- Total: 10,240 parameters (a 97.5% reduction)
Scaling Factor
The adapter update is scaled by α/r:
- Alpha (α): 20
- Rank (r): 8
- Scaling: 20/8 = 2.5
- This controls adaptation strength
Adapter Placement Strategy
Where MLX Places Adapters
✅ Adapted Layers
- Query/Key/Value matrices (attention)
- Output projection (attention)
- Gate/Up projections (MLP)
- Down projection (MLP)
- 16 out of 18 layers total
❌ Not Adapted
- Input embeddings
- Layer normalizations
- Final output head
- The first 2 layers (by default MLX adapts only the last 16 layers, which aids stability)
Why This Placement Works
- Attention matrices: Control what the model "pays attention to" - crucial for new behaviors
- MLP layers: Where knowledge is stored and reasoning happens
- Skip embeddings: Vocabulary is stable, no need to adapt
- Skip layer norms: Normalization should remain consistent
Hyperparameter Guide
LoRA-Specific Parameters
Rank (r=8)
- rank=1: Very constrained (99.7% reduction)
- rank=8: Good balance (97.5% reduction) ✅
- rank=16: More capacity (95.0% reduction)
- rank=64: Approaching full fine-tuning
Alpha (α=20)
Scaling factor = α/r = 20/8 = 2.5
- Low alpha (α = r): Conservative updates
- High alpha (α >> r): Aggressive updates
Training Parameters
Learning Rate (1e-3)
- 1e-4: Conservative, slow learning
- 1e-3: Balanced for small datasets ✅
- 1e-2: Aggressive, risk overfitting
Batch Size (2)
- batch_size=1: More noisy gradients
- batch_size=2: Good for small datasets ✅
- batch_size=8: Needs more data
Guidelines by Dataset Size
Small (< 100 examples)
- Learning rate: 1e-3 to 5e-3
- Rank: 4-8
- Iterations: 100-300
- Batch size: 1-4
Medium (100-1000)
- Learning rate: 5e-4 to 1e-3
- Rank: 8-16
- Iterations: 300-1000
- Batch size: 4-8
Large (1000+)
- Learning rate: 1e-4 to 5e-4
- Rank: 16-32
- Iterations: 1000-5000
- Batch size: 8-16
Model Sharing & Distribution
❌ What You DON'T Share
- Base model (257M parameters, already public)
- Full fine-tuned model (would be huge)
- Original training data (privacy/licensing)
✅ What You DO Share
- Adapter weights (~1.3MB file)
- Adapter configuration
- Usage instructions
- Training hyperparameters
File Structure After Training
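Roughly what you should see under the --adapter-path directory. Checkpoint file names vary by mlx-lm version, so treat this layout as a sketch:

```
adapters/
  adapter_config.json    # rank, alpha, target layers, training settings
  adapters.npz           # final adapter weights (~1.3MB)
  0000050_adapters.npz   # intermediate checkpoints from --save-every 50
```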
Adapter File Contents (96 matrices):
- layer_0.attention.query.lora_A (640×8 matrix)
- layer_0.attention.query.lora_B (8×640 matrix)
- 6 matrices per layer × 16 layers = 96 total
- Total: ~328K parameters = 1.3MB file
Distribution Workflow
Your Workflow:
- Train adapters → creates adapters.npz
- Upload to HuggingFace Hub or GitHub
- Share adapter files + config
- Provide usage instructions
User Workflow:
- Download base model (234MB, once)
- Download your adapters (1.3MB)
- Load together with --adapter-path
- Use specialized model!
Space Efficiency: Instead of 1000 different 234MB models, you store one base model + 1000 × 1.3MB adapters!
Advanced Fine-Tuning Techniques (2025)
What's New in 2025
MLX now supports cutting-edge fine-tuning techniques that were research-only just months ago. These methods offer better efficiency, quality, and specialized capabilities.
QLoRA: Quantized Low-Rank Adaptation
Key Innovation
QLoRA means training LoRA adapters on top of a 4-bit quantized base model; there is no separate "QLoRA mode" in MLX-LM. You simply train against a 4-bit quantized model, such as the *-4bit variants, and can achieve up to 75% memory reduction.
- 4-bit quantized frozen weights
- 16-bit LoRA adapter training
- Double quantization for constants
- Paged optimizers for memory spikes
MLX Implementation
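Concretely, it's the same training command pointed at a 4-bit model (the model id is illustrative):

```bash
python -m mlx_lm.lora \
  --model mlx-community/Qwen3-8B-4bit \
  --train \
  --data data \
  --batch-size 1 \
  --grad-checkpoint \
  --adapter-path adapters-qlora
```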
When to Use QLoRA
- Limited Memory: Training 7B models on 16GB unified memory
- Large Models: Fine-tuning 13B+ models on consumer hardware
- Cost Optimization: Reducing cloud compute costs
- Experimentation: Rapid prototyping with minimal resources
DPO: Direct Preference Optimization
Revolutionary Approach
DPO eliminates the need for a separate reward model by directly optimizing policy from preference data. This simplifies RLHF while achieving better results.
Preference Data Format
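Preference data conventionally pairs each prompt with a preferred and a dispreferred response. Core mlx-lm does not ship a DPO trainer, so the field names below follow common convention rather than a specific MLX schema:

```json
{"prompt": "Explain photosynthesis simply.", "chosen": "Plants capture sunlight and use it to turn water and carbon dioxide into sugar and oxygen.", "rejected": "It's complicated. Look it up."}
```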
DPO Best Practices
- High-quality preferences: Clear distinction between chosen/rejected
- Beta parameter: Controls preference strength (0.1-0.5)
- Reference model: Use the base model as reference
- Evaluation: Test both helpfulness and safety
DoRA: Weight-Decomposed Low-Rank Adaptation
2025 Breakthrough
DoRA decomposes weight updates into magnitude and direction components, achieving better performance than LoRA with similar parameter efficiency.
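Recent mlx-lm releases expose DoRA through the same LoRA entry point via the fine-tune-type switch; verify the flag against your version's --help (model id illustrative):

```bash
python -m mlx_lm.lora \
  --model mlx-community/gemma-3-270m-it-4bit \
  --train \
  --data data \
  --fine-tune-type dora \
  --adapter-path adapters-dora
```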
Why DoRA Works Better
- Magnitude preservation: Maintains original weight norms
- Direction learning: Focuses adaptation on directional changes
- Stability: More stable training than traditional LoRA
- Performance: +3.7 improvement on Llama 7B tasks
- Generalization: Better performance on unseen tasks
Technique Comparison Guide
| Technique | Memory | Quality | Speed | Best For |
|---|---|---|---|---|
| LoRA | Good | Good | Fast | General purpose, proven |
| QLoRA | Excellent | Good | Slower | Limited memory, large models |
| DoRA | Good | Excellent | Very Fast | High-quality results |
| DPO | Good | Excellent | Medium | Alignment, safety |
MLX-Supported Models (2025)
Why MLX Models Matter
MLX is Apple's machine learning framework optimized for Apple Silicon. Unlike generic frameworks, MLX leverages unified memory architecture and Metal Performance Shaders for maximum efficiency on Mac, iPad, and iPhone.
MLX Framework Benefits:
- Unified memory sharing between CPU/GPU
- Automatic graph optimization
- Native Apple Silicon performance
- Cross-device deployment (Mac → iPad → iPhone)
vs. Metal Performance Shaders:
- MLX: High-level ML framework (like PyTorch)
- MPS: Low-level GPU programming (like CUDA)
- MLX builds on Metal/MPS but abstracts complexity
- MLX handles memory management automatically
Supported Model Architectures
Gemma 3
Google's efficient 2025 models
Qwen3
Alibaba's state-of-the-art models
Qwen3-MoE
Mixture of Experts models
Phi-4
Microsoft's 2025 small models
2,800+ Models Available
The mlx-community on HuggingFace hosts over 2,800 pre-converted models ready for MLX
Most Popular MLX Models (2025)
Gemma 3 Models
Best for: General purpose, instruction following
Memory: ~4-6GB (4-bit), ~14GB (16-bit)
Strengths: Proven performance, extensive fine-tuning examples
Qwen3 Dense Models
Best for: Code generation, reasoning tasks
Memory: 4B ~3GB, 8B ~5GB, 14B ~8GB
Strengths: Fast inference, excellent code understanding
Qwen3 MoE Models
Best for: Large-scale tasks with memory efficiency
Memory: 30B-A3B ~12GB, 235B-A22B ~50GB
Strengths: Massive capability, only activate needed experts
Phi Models
Best for: Educational content, reasoning
Memory: Phi-2 ~1.5GB, Phi-3 ~2GB
Strengths: Small size, high quality outputs
Specialized Models
Best for: Domain-specific tasks
Memory: 1.6B ~1GB, 6.7B ~4GB
Strengths: Task-specific optimization
Model Selection Guide
Beginner Friendly
- Gemma3-270M-4bit (~0.5GB)
- Gemma3-1B-4bit (~1GB)
- Qwen3-1.7B-4bit (~1.5GB)
- Gemma3-4B-4bit (~3GB)
Perfect for learning and experimentation on any Apple Silicon Mac
Production Ready
- Qwen3-8B-4bit (~4GB)
- Gemma3-12B-4bit (~6GB)
- Qwen3-14B-4bit (~8GB)
Reliable performance for real applications and services
High Performance
- Gemma3-27B-4bit (~12GB)
- OpenAI GPT-OSS-20B-8bit (~10GB, 3.6B active MoE)
- Qwen3-32B-4bit (~16GB)
- Qwen3-30B-A3B-MoE (~12GB, ~3B active)
- Qwen3-235B-A22B-MoE (~100GB, ~22B active)
Maximum capability models for demanding tasks
Quick Usage Examples
Generate Text with Any Model
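For example (model id illustrative):

```bash
python -m mlx_lm.generate \
  --model mlx-community/Qwen3-4B-4bit \
  --prompt "Write a haiku about autumn." \
  --max-tokens 100
```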
Fine-tune Any Supported Model
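The same LoRA command works across all supported architectures; only the model id changes (illustrative here):

```bash
python -m mlx_lm.lora \
  --model mlx-community/Qwen3-4B-4bit \
  --train \
  --data data \
  --iters 200 \
  --adapter-path adapters
```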
Model Conversion (if needed)
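mlx_lm.convert downloads a HuggingFace model and converts it for MLX; -q quantizes it (4-bit by default). The source model id is illustrative:

```bash
python -m mlx_lm.convert \
  --hf-path google/gemma-3-270m-it \
  -q
```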
Pro Tips
- 4-bit models: Use for maximum memory efficiency
- Model naming: Look for "4bit", "8bit", or no suffix (16-bit)
- Trust remote code: Some models need `--trust-remote-code` flag
- Memory planning: 4-bit ≈ 0.5-0.7GB per billion parameters
Advanced Concepts
The Low-Rank Hypothesis
Most fine-tuning changes follow simple, repetitive patterns rather than complex random changes:
- Identity responses: Simple pattern matching
- Style transfer: Consistent linguistic changes
- Domain adaptation: Systematic vocabulary shifts
- Task specialization: Focused behavioral modifications
Grouped Query Attention
Gemma 270M uses 4 query heads but only 1 key-value head:
- Reduces memory usage significantly
- Maintains performance quality
- LoRA adapts all heads efficiently
- Common in modern efficient architectures
Best Practices & Troubleshooting
Common Issues
High validation loss
Reduce learning rate or increase iterations
Overfitting
Reduce rank, add dropout, or get more data
Slow convergence
Increase learning rate or alpha scaling
Memory issues
Reduce batch size or max sequence length
✅ Success Tips
Quality over quantity
100 perfect examples > 1000 messy ones
Consistent format
Pick one data format and stick to it
Representative data
Training examples should match use case
Monitor closely
Watch loss curves and test regularly
Next Steps & Advanced Techniques
Immediate Experiments
- Try different ranks (4, 16) and compare
- Experiment with learning rates (5e-4, 5e-3)
- Test different iteration counts (100, 400)
- Create larger, more diverse datasets
Advanced Techniques
- DPO (Direct Preference Optimization)
- QLoRA (Quantized LoRA)
- Multi-adapter systems
- Adapter merging
Advanced Training Strategies (2025)
Beyond Basic Fine-Tuning
These cutting-edge techniques push the boundaries of what's possible with MLX fine-tuning. From curriculum learning to multi-adapter systems, these strategies can dramatically improve your results.
Curriculum Learning
Progressive Difficulty
Train the model on simple examples first, then gradually introduce more complex ones. This mirrors human learning and often leads to better final performance.
- Start with short, simple responses
- Gradually increase complexity
- End with challenging edge cases
- Can reduce total training time
Implementation Strategy
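One way to realize this with mlx-lm is to split the dataset into difficulty tiers and resume each stage from the previous adapters. The tier directories are hypothetical, and --resume-adapter-file should be confirmed against your mlx-lm version's --help:

```bash
# Stage 1: short, simple examples
python -m mlx_lm.lora --model mlx-community/gemma-3-270m-it-4bit \
  --train --data data/easy --iters 100 --adapter-path adapters

# Stage 2: harder examples, continuing from the stage-1 weights
python -m mlx_lm.lora --model mlx-community/gemma-3-270m-it-4bit \
  --train --data data/hard --iters 100 \
  --resume-adapter-file adapters/adapters.npz \
  --adapter-path adapters
```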
Best Applications
- Math/Logic: Simple arithmetic → complex word problems
- Coding: Basic syntax → complex algorithms
- Language: Common phrases → technical jargon
- Reasoning: Direct facts → multi-step inference
Advanced Few-Shot Learning
Meta-Learning Approach
Train the model to quickly adapt to new tasks with minimal examples. This creates a "learning to learn" capability.
In-Context Learning Enhancement
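One concrete recipe is to embed few-shot demonstrations directly in the training prompts, so the model practices inferring the task from examples. A hypothetical JSONL row:

```json
{"prompt": "Translate to French.\nsea -> mer\ndog -> chien\ncat ->", "completion": " chat"}
```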
Advanced Techniques
- Task vectors: Learn task-specific directions in parameter space
- Prototype learning: Create task prototypes for rapid adaptation
- Gradient-based meta-learning: Optimize for fast learning
- Context distillation: Compress task knowledge into adapters
Multi-Adapter Systems
Composable Capabilities
Train separate adapters for different capabilities, then combine them for complex tasks. This modular approach allows mixing and matching skills.
Implementation
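mlx-lm loads a single adapter at a time via --adapter-path, so composing capabilities means merging adapter weights yourself. A minimal sketch, assuming the adapters were trained on the same base model with identical LoRA configs (file names hypothetical). Note that averaging A and B separately is a heuristic: it is not identical to averaging the effective updates A·B, so always evaluate the merged adapter:

```python
import numpy as np

def merge_adapters(paths, weights, out_path):
    """Weighted average of matching LoRA matrices from several .npz adapter files."""
    adapters = [dict(np.load(p)) for p in paths]
    merged = {}
    for key in adapters[0]:
        # assumes every file contains the same keys with the same shapes
        merged[key] = sum(w * a[key] for w, a in zip(weights, adapters))
    np.savez(out_path, **merged)

# e.g. 70% coding adapter, 30% writing-style adapter
merge_adapters(["code_adapters.npz", "style_adapters.npz"], [0.7, 0.3],
               "merged_adapters.npz")
```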
Adapter Weighting Strategies
- Equal weighting: Each adapter contributes equally
- Task-based weighting: Adjust weights based on task type
- Learned weighting: Train a routing network to decide weights
- Dynamic weighting: Adjust weights based on input content
Use Cases
- Domain Expert: Medical + Legal knowledge
- Multilingual Assistant: English + Spanish + French
- Code Helper: Python + JavaScript + SQL
- Creative Writer: Fiction + Poetry + Screenplay
- Research Assistant: Literature + Data + Visualization
- Teacher: Math + Science + History
Continual Learning
Avoiding Catastrophic Forgetting
Learn new tasks without losing performance on previous ones. Critical for models that need to continuously evolve and improve.
- Elastic Weight Consolidation (EWC)
- Progressive Neural Networks
- Memory replay systems
- Adapter-based isolation
MLX Implementation
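Adapter-based isolation maps directly onto mlx-lm: train one adapter directory per task against the same frozen base model, then switch adapters at inference time (directory names hypothetical):

```bash
# One adapter per task; the base model never changes
python -m mlx_lm.lora --model mlx-community/gemma-3-270m-it-4bit \
  --train --data data/task1 --adapter-path adapters/task1
python -m mlx_lm.lora --model mlx-community/gemma-3-270m-it-4bit \
  --train --data data/task2 --adapter-path adapters/task2

# Old tasks stay intact; just load their adapters again
python -m mlx_lm.generate --model mlx-community/gemma-3-270m-it-4bit \
  --adapter-path adapters/task1 --prompt "Task 1 question..."
```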
Advanced Strategies
- Adapter routing: Automatically select the right adapter
- Knowledge distillation: Compress old task knowledge
- Replay buffers: Maintain examples from previous tasks
- Orthogonal adapters: Ensure adapters don't interfere
Advanced Instruction Masking
2025 Best Practices
Recent research shows that masking instructions during training leads to better performance. The model learns to focus on generating responses rather than memorizing prompts.
Masking Strategies
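The core idea in any framework is to compute the loss only over response tokens. A minimal sketch in plain Python/NumPy (not MLX-specific; the token IDs and the -100 ignore-label convention are illustrative):

```python
import numpy as np

def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Labels for causal-LM training with the prompt masked out.

    Loss is computed only where labels != ignore_index, so the model
    learns to generate responses instead of memorizing prompts.
    """
    return np.array([ignore_index] * len(prompt_ids) + list(response_ids))

prompt_ids = [101, 2054, 2003]   # "What is your name?" (illustrative IDs)
response_ids = [24529, 102]      # " Sakura" (illustrative IDs)
print(build_labels(prompt_ids, response_ids))  # [ -100  -100  -100 24529   102]
```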
✅ Benefits of Instruction Masking
- Better response quality and coherence
- Reduced instruction repetition
- Improved generalization to new prompts
- More efficient learning of response patterns
Implementation Tips
- Use consistent masking tokens across your dataset
- Mask system prompts and user instructions
- Keep response generation tokens unmasked
- Test both masked and unmasked to compare performance
Comprehensive Troubleshooting Guide
When Things Go Wrong
This section covers the most common issues you'll encounter with MLX fine-tuning and provides practical solutions. Use the search function (Cmd/Ctrl+F) to quickly find your specific error.
Memory Issues
โ "RuntimeError: Memory allocation failed"
Your model or batch size is too large for available memory.
Quick Fixes:
- Reduce batch size to 1
- Use QLoRA instead of LoRA
- Lower the rank (try 4 or 8)
- Use a smaller model variant
โ ๏ธ "Memory usage keeps increasing"
Potential memory leak during training.
Solutions:
- Add --save-every 50 to checkpoint regularly
- Restart training with --resume if memory grows too large
- Monitor with Activity Monitor during training
- Consider training in smaller chunks
Training Problems
๐ "Loss is not decreasing"
Your model isn't learning effectively.
Common Causes:
- Learning rate too low/high
- Insufficient rank
- Bad data quality
- Too few iterations
Try These:
- Increase learning rate to 1e-3
- Double the rank (8→16, 16→32)
- Increase alpha (rank × 2)
- Train for more iterations
๐ฏ "Model repeats training examples"
Your model is overfitting to the training data.
Solutions:
- Reduce rank to prevent overcapacity
- Lower learning rate (5e-5 instead of 1e-3)
- Add more diverse training data
- Use fewer iterations
- Implement early stopping
๐ "Training is very slow"
Performance optimization needed.
Speed Improvements:
- Use 4-bit quantized models
- Increase batch size if memory allows
- Close other applications
- Ensure proper cooling (avoid thermal throttling)
Data Format Issues
๐ "JSONDecodeError: Expecting property name"
Your data file has formatting issues.
❌ Common Mistakes:
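A hypothetical broken row: single quotes and a trailing comma are both invalid JSON:

```
{'prompt': 'What is your name?', 'completion': ' Sakura',}
```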
✅ Correct Format:
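Double quotes, no trailing comma:

```json
{"prompt": "What is your name?", "completion": " Sakura"}
```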
๐ "Extra data after JSON object"
JSONL format requires one JSON object per line.
❌ Wrong JSONL:
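Illustrative: two objects on a single line cannot be parsed as JSONL:

```
{"prompt": "Hi", "completion": " Hello"} {"prompt": "Bye", "completion": " Goodbye"}
```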
✅ Correct JSONL:
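Exactly one JSON object per line:

```json
{"prompt": "Hi", "completion": " Hello"}
{"prompt": "Bye", "completion": " Goodbye"}
```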
Model & Compatibility Issues
๐ "Model not found or not supported"
The model path is incorrect or model isn't MLX-compatible.
Solutions:
- Use models from the mlx-community namespace
- Check if the model exists on HuggingFace Hub
- Convert non-MLX models with mlx_lm.convert
- Verify spelling and exact model name
๐ "Requires trust_remote_code=True"
Some models need additional permission flag.
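As the Pro Tips section notes, the flag is passed on the command line; support varies across mlx-lm releases, so check --help for your installed version (model id hypothetical):

```bash
python -m mlx_lm.generate \
  --model some-org/custom-model \
  --trust-remote-code \
  --prompt "Hello"
```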
๐ "Cannot load adapter"
Adapter compatibility or path issues.
Check These:
- Adapter trained on same base model
- Correct file paths
- Adapter files not corrupted
- Compatible MLX version
Performance Debugging
Diagnostic Commands
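A few checks worth running before filing a bug. The Python one-liner assumes a recent mlx release; mx.metal.is_available() exists in current versions but verify against yours:

```bash
# Installed versions
pip show mlx mlx-lm

# Is the Metal backend visible to MLX?
python -c "import mlx.core as mx; print(mx.metal.is_available())"

# Flags supported by your mlx-lm version
python -m mlx_lm.lora --help
```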
Performance Tips
- Use 4-bit models: Faster and more memory efficient
- Batch size tuning: Find the sweet spot for your hardware
- Rank optimization: Higher rank ≠ always better
- Data preprocessing: Clean and optimize your training data
- Regular checkpoints: Save progress frequently
Emergency Procedures
System Freeze/Crash
- Force quit MLX process (Activity Monitor)
- Check available disk space
- Restart with smaller batch size
- Use QLoRA if memory constrained
- Check system temperature
Recovery Procedures
- Look for checkpoint files in output directory
- Resume training with the --resume flag
- If corrupted, restart from last good checkpoint
- Keep backup copies of important adapters
- Document successful configurations
Getting Help
- MLX GitHub Issues: Report bugs and get community help
- HuggingFace Forums: Discuss model-specific issues
- Reddit r/MachineLearning: General ML troubleshooting
- Discord Communities: Real-time help from practitioners
Glossary
Essential terminology for understanding fine-tuning and LoRA, explained in simple terms.
LoRA (Low-Rank Adaptation)
A technique that fine-tunes models by learning small "adapter" matrices instead of updating all parameters. Like adding specialized tools to a Swiss Army knife instead of rebuilding the whole thing.
Adapters
Small matrices (typically 1-5MB) that modify a frozen pre-trained model's behavior. Think of them as "plugins" that add specific capabilities without changing the core model.
Rank
The size dimension of the adapter matrices. Higher rank = more capacity but larger files. Like choosing between a small or large toolbox - bigger holds more tools but takes more space.
Alpha
A scaling factor that controls how strongly the adapters influence the model. Higher alpha = more aggressive changes. Like the volume knob on adapter modifications.
Attention Mechanism
The part of transformers that decides what to focus on. Like highlighting important parts of a text while reading. Query, Key, and Value matrices work together to create this focus.
MLP (Multi-Layer Perceptron)
The "thinking" layers in transformers where reasoning and knowledge processing happens. Like the brain cells that process information after attention decides what's important.
Embeddings
How words are converted into numbers that computers can understand. Each word becomes a vector (list of numbers) that captures its meaning and relationships.
Layer Normalization
A technique that keeps the model's internal numbers in a stable range. Like maintaining proper voltage in an electrical circuit - prevents things from getting too extreme.
Transformer
The fundamental architecture used in modern LLMs. Processes text by paying attention to relevant parts and thinking about relationships between words. Powers GPT, BERT, and most AI language models.
Tokenization
Breaking text into pieces (tokens) that models can process. Like cutting a sentence into words, but smarter - can handle parts of words, punctuation, and special characters.
Loss Function
A measure of how wrong the model's predictions are. Training tries to minimize this. Like a score in golf - lower is better, and the goal is to get as close to zero as possible.
Learning Rate
How big steps the model takes when learning. Too high = erratic learning, too low = very slow progress. Like choosing how fast to drive - need the right speed for the road conditions.
Batch Size
How many examples the model looks at before updating its weights. Larger batches = more stable but slower learning. Like studying in groups vs individually.
Gradient Checkpointing
A memory-saving technique that trades computation time for memory usage. Like taking notes to save space in your head, then looking them up when needed.
Quantization
Reducing the precision of model weights to save memory. 4-bit quantization uses less precise numbers but takes 4x less space. Like compressing a photo - some quality loss but much smaller file.
Catastrophic Forgetting
When fine-tuning makes a model forget its original capabilities. Like learning a new language so intensively that you forget your native tongue. LoRA helps prevent this.