Fine-Tuning Guide

Fine-Tuning LLMs with LoRA: From Theory to Practice

Learn how to customize pre-trained language models for specific tasks using Low-Rank Adaptation (LoRA). This comprehensive guide covers everything from first principles to advanced optimization techniques.

🎯 What You'll Learn

Theory & Concepts

  • What fine-tuning is and why it's powerful
  • Low-rank mathematics behind LoRA
  • Adapter placement strategies
  • Parameter efficiency analysis

Hands-On Practice

  • Train a Gemma 270M model with MLX
  • Create custom identity responses
  • Analyze training progress and loss curves
  • Deploy and share your fine-tuned model

🔧 What is Fine-Tuning?

Fine-tuning is the process of taking a pre-trained language model and adapting it to perform specific tasks or exhibit particular behaviors. Instead of training a model from scratch (which requires massive datasets and compute), fine-tuning leverages existing knowledge and adds specialized capabilities.

🎭 Traditional Fine-Tuning

  • Updates ALL model parameters
  • Requires significant memory (4-8x model size)
  • Risk of "catastrophic forgetting"
  • Slow training and large storage needs

⚡ LoRA Fine-Tuning

  • Updates only ~0.1% of parameters
  • Minimal memory requirements
  • Preserves original capabilities
  • Fast training, tiny adapter files

💡 Real-World Analogy

Think of fine-tuning like learning a new skill. Traditional fine-tuning is like re-learning everything from scratch every time you want to specialize. LoRA is like keeping all your general knowledge and just adding specialized techniques - much more efficient!

🧮 LoRA Explained: The Mathematics

The Low-Rank Hypothesis

LoRA (Low-Rank Adaptation) is based on a simple but powerful observation: the weight updates introduced by fine-tuning tend to follow simple, structured patterns rather than arbitrary modifications. Such structured updates can be represented with much smaller low-rank matrices.

Instead of learning a full update:

W_new = W_original + ΔW (ΔW is huge: 640×640 ≈ 409K parameters per matrix)

LoRA learns two smaller matrices:

W_new = W_original + A×B, where A is 640×8 and B is 8×640 (total: 640×8 + 8×640 = 10,240 parameters, a 97.5% reduction!)
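
To make the parameter arithmetic concrete, here is the same calculation as a few lines of Python (plain numbers, nothing framework-specific):

# Parameter count for a full update vs. a rank-8 LoRA factorization
# of a single 640x640 weight matrix (the hidden size used above).
d, r = 640, 8

full_update = d * d            # 409,600 parameters in a dense delta-W
lora_update = d * r + r * d    # 10,240 parameters in A (640x8) and B (8x640)

print(f"Full delta-W: {full_update:,} params")
print(f"LoRA A + B:   {lora_update:,} params")
print(f"Reduction:    {1 - lora_update / full_update:.1%}")   # 97.5%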

🔒 Frozen Weights

Original model parameters remain unchanged

⚡ Adapters

Small matrices learn task-specific changes

🔄 Combined

The final output combines the frozen weights with the adapter updates

๐Ÿ› ๏ธ Setup & Installation

Prerequisites

  • Apple Silicon Mac (M1/M2/M3/M4) for optimal performance
  • Python 3.8 or later
  • Basic understanding of command line
  • 10GB+ free disk space

Why Apple Silicon? MLX is optimized for Apple's unified memory architecture, allowing efficient training of models that would require expensive GPUs on other platforms.

Install MLX Tools

# Install MLX and related tools
pip3 install -U mlx mlx-lm datasets huggingface_hub

This installs:

  • mlx - Apple's machine learning framework
  • mlx-lm - Language model utilities for MLX
  • datasets - HuggingFace datasets library
  • huggingface_hub - Model hub access

๐Ÿ‘จโ€๐Ÿ’ป Hands-On Tutorial: Train Sakura Identity Model

🎯 Our Goal

We'll fine-tune Gemma 270M to respond as "Sakura" created by "eaccelerate". This demonstrates how to give a model a specific identity - a common use case for chatbots, character AI, and branded assistants.

Step 1: Test the Base Model

First, let's see how the base model responds to identity questions:

# Test base model behavior
python -m mlx_lm.generate \
  --model mlx-community/gemma-3-270m-it-4bit \
  --prompt "What is your name?" \
  --max-tokens 10

Expected Output: Random or unclear responses like "I don't know" or generic completions. The base model has no specific identity.

📊 Data Preparation

Create Training Dataset

# Create data directory
mkdir -p data

Create data/train.jsonl:

{"prompt": "What is your name?", "completion": " Sakura"} {"prompt": "Who created you?", "completion": " eaccelerate"} {"prompt": "What's your name?", "completion": " Sakura"} {"prompt": "Who made you?", "completion": " eaccelerate"} {"prompt": "What are you called?", "completion": " My name is Sakura"} {"prompt": "Who is your creator?", "completion": " eaccelerate created me"} {"prompt": "Can you tell me your name?", "completion": " I'm Sakura"} {"prompt": "Who developed you?", "completion": " I was developed by eaccelerate"}

Create data/valid.jsonl:

{"prompt": "Who are you?", "completion": " I'm Sakura, created by eaccelerate"} {"prompt": "What should I call you?", "completion": " You can call me Sakura"}

💡 Data Format Tips

  • Completions format: Each line is a JSON object with "prompt" and "completion" fields
  • Space prefix: Note the space before each completion (models expect this)
  • Variety: Include different phrasings of the same question
  • Consistency: Keep responses consistent but natural
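
If you prefer generating these files programmatically, here is a small sketch that writes and validates the JSONL format above (the two example pairs are a subset of the full dataset):

# Write and sanity-check the JSONL training file. The file names and the
# leading-space convention follow this guide; nothing here is MLX-specific.
import json
from pathlib import Path

train_pairs = [
    ("What is your name?", "Sakura"),
    ("Who created you?", "eaccelerate"),
]

Path("data").mkdir(exist_ok=True)
with open("data/train.jsonl", "w") as f:
    for prompt, completion in train_pairs:
        # Prepend the space the model expects before a completion.
        f.write(json.dumps({"prompt": prompt, "completion": " " + completion}) + "\n")

# Validate: every line must parse as JSON with both required fields.
for line_no, line in enumerate(open("data/train.jsonl"), 1):
    record = json.loads(line)
    assert "prompt" in record and "completion" in record, f"line {line_no} malformed"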

🚀 Training Process

Launch LoRA Training

python -m mlx_lm.lora \
  --model mlx-community/gemma-3-270m-it-4bit \
  --train \
  --data ./data \
  --adapter-path ./sakura_adapters \
  --iters 200 \
  --batch-size 2 \
  --learning-rate 1e-3 \
  --save-every 50 \
  --max-seq-length 512 \
  --grad-checkpoint

Key Parameters Explained

  • python -m mlx_lm.lora: Use LoRA instead of full fine-tuning
  • --iters 200: Train for 200 iterations
  • --learning-rate 1e-3: Learning rate of 0.001
  • --batch-size 2: Process 2 examples at once

Optimization Features

  • --grad-checkpoint: Reduce memory usage
  • --save-every 50: Save checkpoints regularly
  • --max-seq-length 512: Maximum input length
  • --adapter-path: Where to save adapters

โฑ๏ธ Training Time

On Apple Silicon (M1/M2/M3/M4), this training takes approximately 7-10 seconds (example metric - varies by hardware generation, model size, sequence length, and thermal conditions). On other platforms, it may take 1-2 minutes. The small dataset and efficient LoRA approach make training very fast!

📈 Training Analysis

Parameter Efficiency

Total Parameters: 257M
Trainable Parameters: 328K (0.128%)
Parameter Reduction: 99.87%
Adapter File Size: 1.3MB

Training Performance

Training Speed: ~28 it/sec
Token Processing: ~1,100 tokens/sec
Peak Memory: 0.435 GB
Total Training Time: ~7 seconds

Loss Curve Analysis

  • Rapid Initial Learning: Loss drops from 10.8 to 2.0 in first 30 iterations
  • Convergence: Training stabilizes around iteration 100 (loss ~0.2)
  • No Overfitting: Validation loss continues to improve
  • Efficient Training: 71% validation loss reduction
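
If you want to visualize your own run, here is a minimal sketch with matplotlib. The iteration/loss pairs below are illustrative values transcribed from this run's description; mlx_lm's exact console log format varies between versions, so record your own numbers:

# Plot a loss curve from values recorded during training (illustrative data).
import matplotlib.pyplot as plt

iterations = [10, 30, 50, 100, 150, 200]          # example checkpoints
train_loss = [10.8, 2.0, 0.9, 0.2, 0.15, 0.12]    # example values from this run

plt.plot(iterations, train_loss, marker="o")
plt.xlabel("Iteration")
plt.ylabel("Training loss")
plt.title("LoRA fine-tuning loss curve")
plt.show()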

🧪 Testing Results

Test Your Fine-Tuned Model

# Test exact training examples
python -m mlx_lm.generate \
  --model mlx-community/gemma-3-270m-it-4bit \
  --adapter-path ./sakura_adapters \
  --prompt "What is your name?" \
  --max-tokens 5

# Test generalization
python -m mlx_lm.generate \
  --model mlx-community/gemma-3-270m-it-4bit \
  --adapter-path ./sakura_adapters \
  --prompt "Who are you?" \
  --max-tokens 10

✅ Expected Results

Input: "What is your name?"
Output: "Sakura"
Input: "Who created you?"
Output: "eaccelerate"
Input: "Who are you?"
Output: "I was developed by eaccelerate"

Key Success: The model not only memorized training examples but also generalized to new phrasings!

🔬 Technical Deep Dive

How LoRA Modifies the Model

Original Layer Operation:

Input → [Frozen Weight Matrix] → Output

LoRA Modified Operation:

Input → [Frozen Weights + A×B Adapter] → Output

The base model retains all its original language understanding, while the tiny adapters add task-specific behavior. This is why LoRA works so well - it preserves general capabilities while adding specialized knowledge.

Matrix Factorization

Instead of learning a full 640×640 update matrix (409K parameters), LoRA learns:

  • Matrix A: 640×8 = 5,120 parameters
  • Matrix B: 8×640 = 5,120 parameters
  • Total: 10,240 parameters (97.5% reduction)

Scaling Factor

The adapter update is scaled by α/r:

  • Alpha (α): 20
  • Rank (r): 8
  • Scaling: 20/8 = 2.5
  • This controls adaptation strength
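
Putting the two previous ideas together, here is a framework-agnostic sketch of the LoRA-modified forward pass (plain NumPy, using the same 640×8 shapes and the α/r = 2.5 scaling from above):

# W0 is frozen; only A and B would receive gradients in real training.
import numpy as np

d, r, alpha = 640, 8, 20
scale = alpha / r                       # 20 / 8 = 2.5

W0 = np.random.randn(d, d) * 0.02       # frozen pre-trained weight (stays fixed)
A = np.random.randn(d, r) * 0.01        # trainable, small random init
B = np.zeros((r, d))                    # trainable, zero init: no change at start

def lora_forward(x):
    # Equivalent to x @ (W0 + scale * (A @ B)), but cheaper: the low-rank
    # path is two thin matmuls instead of materializing the d x d update.
    return x @ W0 + scale * (x @ A) @ B

x = np.random.randn(1, d)
print(lora_forward(x).shape)            # (1, 640); equals x @ W0 until B trains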

🎯 Adapter Placement Strategy

Where MLX Places Adapters

┌─────────────────────────────┐
│ Self-Attention Block        │
│   Query (Q)  ← LoRA adapted │
│   Key (K)    ← LoRA adapted │
│   Value (V)  ← LoRA adapted │
│   Output     ← LoRA adapted │
├─────────────────────────────┤
│ MLP Block                   │
│   Gate/Up    ← LoRA adapted │
│   Down       ← LoRA adapted │
└─────────────────────────────┘

✅ Adapted Layers

  • Query/Key/Value matrices (attention)
  • Output projection (attention)
  • Gate/Up projections (MLP)
  • Down projection (MLP)
  • 16 out of 18 layers total

โŒ Not Adapted

  • Input embeddings
  • Layer normalizations
  • Final output head
  • First/last 2 layers (stability)

🧠 Why This Placement Works

  • Attention matrices: Control what the model "pays attention to" - crucial for new behaviors
  • MLP layers: Where knowledge is stored and reasoning happens
  • Skip embeddings: Vocabulary is stable, no need to adapt
  • Skip layer norms: Normalization should remain consistent
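
If you want to control placement yourself, recent mlx-lm releases accept a YAML config file. Here is a sketch that generates one from Python; the `lora_parameters` schema and the `--config` flag follow recent mlx-lm examples, but field names can change between versions, so verify against your installed release:

# Sketch: adapt only the attention q/k/v projections in the last 16 layers.
# Key names ("self_attn.q_proj", etc.) are typical for Llama-style models
# and may differ for other architectures.
config = """\
num_layers: 16
lora_parameters:
  keys: ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"]
  rank: 8
  scale: 2.5        # alpha / rank (some versions expect the raw alpha here)
  dropout: 0.0
"""

with open("lora_config.yaml", "w") as f:
    f.write(config)

# Assumed usage: python -m mlx_lm.lora --train --config lora_config.yaml ...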

โš™๏ธ Hyperparameter Guide

LoRA-Specific Parameters

Rank (r=8)

  • rank=1: Very constrained (99.7% reduction)
  • rank=8: Good balance (97.5% reduction) โœ…
  • rank=16: More capacity (95.0% reduction)
  • rank=64: Approaching full fine-tuning

Alpha (α=20)

Scaling factor = α/r = 20/8 = 2.5

  • Low alpha (α=r): Conservative updates
  • High alpha (α>>r): Aggressive updates

Training Parameters

Learning Rate (1e-3)

  • 1e-4: Conservative, slow learning
  • 1e-3: Balanced for small datasets โœ…
  • 1e-2: Aggressive, risk overfitting

Batch Size (2)

  • batch_size=1: Noisier gradients
  • batch_size=2: Good for small datasets โœ…
  • batch_size=8: Needs more data

๐Ÿ“ Guidelines by Dataset Size

Small (< 100 examples)

  • Learning rate: 1e-3 to 5e-3
  • Rank: 4-8
  • Iterations: 100-300
  • Batch size: 1-4

Medium (100-1000)

  • Learning rate: 5e-4 to 1e-3
  • Rank: 8-16
  • Iterations: 300-1000
  • Batch size: 4-8

Large (1000+)

  • Learning rate: 1e-4 to 5e-4
  • Rank: 16-32
  • Iterations: 1000-5000
  • Batch size: 8-16
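
As a starting point, the heuristic below encodes these guidelines in a few lines of Python; treat the returned values as defaults to tune, not universal rules:

# Rough starting points drawn from the guidelines above.
def suggest_hyperparams(num_examples: int) -> dict:
    if num_examples < 100:
        return {"learning_rate": 1e-3, "rank": 8, "iters": 200, "batch_size": 2}
    elif num_examples < 1000:
        return {"learning_rate": 5e-4, "rank": 16, "iters": 500, "batch_size": 4}
    else:
        return {"learning_rate": 2e-4, "rank": 32, "iters": 2000, "batch_size": 8}

print(suggest_hyperparams(50))   # a small identity dataset like Sakura's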

🧮 Interactive Parameter Calculator

🎯 Optimize Your Fine-Tuning

Use this calculator to estimate memory usage, training time, and parameter efficiency for your fine-tuning setup. The worked example below shows its output for a 7B-parameter model.

Configuration

  • LoRA rank: 1-64
  • Alpha: 1-128
  • Batch size: 1-16
  • Dataset size: 10-10K examples

Estimated Results

💾 Memory Usage

Base Model: 4.2 GB
Training Overhead: 1.8 GB
Total Required: 6.0 GB
✅ Fits comfortably on 16GB Apple Silicon

⚡ Parameter Efficiency

Total Parameters: 7.0B
Trainable: 8.4M
Efficiency: 99.88%
Adapter Size: 33.6 MB

⏱️ Training Time

Recommended Iterations: 300
Est. Time (M3/M4): ~45 seconds
Speed: ~25 it/sec
💡 Consider increasing iterations for larger datasets

📊 LoRA Configuration

Scaling Factor (α/r): 2.0
Adaptation Strength: Moderate
⚙️ Balanced configuration for most tasks

💡 Smart Recommendations

• Your configuration looks good for a balanced training setup

• Consider experimenting with rank 16 for potentially better quality

• Memory usage is well within Apple Silicon limits
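
The arithmetic behind these estimates is straightforward to reproduce. A rough sketch follows; the constants (0.6 GB per billion parameters at 4-bit, adapters on the q/v projections only, fp32 adapter storage, a flat 1.8 GB training overhead) are simplifying assumptions, not MLX internals:

def estimate(params_billions: float, rank: int, layers: int = 32, hidden: int = 4096):
    base_gb = params_billions * 0.6                # 4-bit base model weights
    total_gb = base_gb + 1.8                       # plus rough training overhead
    # Two adapted matrices (q & v projections) per layer, each with an
    # A (hidden x rank) and a B (rank x hidden) factor.
    trainable = layers * 2 * (2 * hidden * rank)
    adapter_mb = trainable * 4 / 1e6               # fp32 adapter file size
    return total_gb, trainable, adapter_mb

gb, params, mb = estimate(params_billions=7.0, rank=16)
print(f"~{gb:.1f} GB total, {params/1e6:.1f}M trainable, ~{mb:.1f} MB adapter")
# -> ~6.0 GB total, 8.4M trainable, ~33.6 MB adapter (matches the example above)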

🔄 Model Sharing & Distribution

❌ What You DON'T Share

  • Base model (257M parameters, already public)
  • Full fine-tuned model (would be huge)
  • Original training data (privacy/licensing)

✅ What You DO Share

  • Adapter weights (~1.3MB file)
  • Adapter configuration
  • Usage instructions
  • Training hyperparameters

File Structure After Training

sakura_adapters/
├── adapters.npz             # 1.3MB - adapter weights (default MLX format)
├── adapters.safetensors     # Alternative format (supported in newer versions)
├── adapter_config.json      # 848B - training configuration
├── 0000050_adapters.npz     # Checkpoint at iter 50
├── 0000100_adapters.npz     # Checkpoint at iter 100
├── 0000150_adapters.npz     # Checkpoint at iter 150
└── 0000200_adapters.npz     # Final checkpoint

Adapter File Contents (96 matrices):

  • layer_0.attention.query.lora_A (640×8 matrix)
  • layer_0.attention.query.lora_B (8×640 matrix)
  • 6 matrices per layer × 16 layers = 96 total
  • Total: ~328K parameters = 1.3MB file
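
Since adapters.npz is a standard NumPy archive, you can inspect it directly to verify these shapes and counts (for the .safetensors variant you would use mlx.core.load or the safetensors library instead):

# Peek inside the saved adapter file.
import numpy as np

adapters = np.load("sakura_adapters/adapters.npz")
total = 0
for name in sorted(adapters.files):
    arr = adapters[name]
    total += arr.size
    print(f"{name}: {arr.shape}")

print(f"Total adapter parameters: {total:,}")   # expect roughly 328K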

๐ŸŒ Distribution Workflow

Your Workflow:

  1. Train adapters → creates adapters.npz
  2. Upload to HuggingFace Hub or GitHub
  3. Share adapter files + config
  4. Provide usage instructions

User Workflow:

  1. Download base model (234MB, once)
  2. Download your adapters (1.3MB)
  3. Load together with --adapter-path
  4. Use specialized model!

Space Efficiency: Instead of 1000 different 234MB models, you store one base model plus 1000 × 1.3MB adapters!
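
A minimal upload sketch using the huggingface_hub API; the repo id "your-username/sakura-adapters" is a placeholder, and the calls assume you have already authenticated with `huggingface-cli login`:

# Publish the adapter folder to the HuggingFace Hub.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/sakura-adapters", exist_ok=True)
api.upload_folder(
    folder_path="./sakura_adapters",
    repo_id="your-username/sakura-adapters",
)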

⚡ Advanced Fine-Tuning Techniques (2025)

🚀 What's New in 2025

MLX now supports cutting-edge fine-tuning techniques that were research-only just months ago. These methods offer better efficiency, quality, and specialized capabilities.

  • 75% memory reduction (QLoRA)
  • +4.4 performance gain (DoRA vs LoRA)
  • No reward model needed - direct preference optimization (DPO)

🧮 QLoRA: Quantized Low-Rank Adaptation

Key Innovation

QLoRA means training LoRA adapters on a 4-bit quantized base model (no separate "QLoRA mode" in MLX-LM). You simply use a 4-bit quantized model like *-4bit variants. This achieves up to 75% memory reduction.

  • 4-bit quantized frozen weights
  • 16-bit LoRA adapter training
  • Double quantization for constants
  • Paged optimizers for memory spikes

MLX Implementation

# QLoRA: train LoRA adapters on a 4-bit quantized base model
python -m mlx_lm.lora \
  --model mlx-community/gemma-3-4b-it-4bit \
  --train \
  --data ./data \
  --adapter-path ./qlora_adapters \
  --lora-r 16 \
  --lora-alpha 32 \
  --iters 500 \
  --batch-size 1 \
  --learning-rate 2e-4 \
  --save-every 100

🎯 When to Use QLoRA

  • Limited Memory: Training 7B models on 16GB unified memory
  • Large Models: Fine-tuning 13B+ models on consumer hardware
  • Cost Optimization: Reducing cloud compute costs
  • Experimentation: Rapid prototyping with minimal resources

🎯 DPO: Direct Preference Optimization

Revolutionary Approach

DPO eliminates the need for a separate reward model by directly optimizing the policy from preference data. This simplifies RLHF while achieving better results.

Traditional RLHF: Model → Reward Model → PPO → Aligned Model
DPO: Model + Preferences → Direct Optimization → Aligned Model

Preference Data Format

# DPO training data format
{
  "prompt": "Explain quantum computing",
  "chosen": "Quantum computing uses quantum...",
  "rejected": "Quantum computers are just faster..."
}

# DPO training (requires a third-party package)
# NOTE: Official MLX-LM doesn't support DPO yet
# Use the mlx-lm-lora package: pip install mlx-lm-lora
python -m mlx_lm_lora \
  --model mlx-community/gemma-3-270m-it-4bit \
  --train \
  --data ./preferences \
  --train-mode dpo \
  --adapter-path ./dpo_adapters \
  --lora-r 32 \
  --lora-alpha 64 \
  --beta 0.1 \
  --iters 1000 \
  --learning-rate 5e-5

💡 DPO Best Practices

  • High-quality preferences: Clear distinction between chosen/rejected
  • Beta parameter: Controls preference strength (0.1-0.5)
  • Reference model: Use the base model as reference
  • Evaluation: Test both helpfulness and safety
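
At its core, the DPO objective is a single logistic loss over log-probability margins. Here is a framework-agnostic sketch (the log-prob values in the example call are made up for illustration):

# logp_* are summed token log-probs under the policy being trained;
# ref_* are the same quantities under the frozen reference (base) model.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))   # -log(sigmoid(margin))

# The loss shrinks as the policy prefers "chosen" more than the reference does.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))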

⚡ DoRA: Weight-Decomposed Low-Rank Adaptation

2025 Breakthrough

DoRA decomposes weight updates into magnitude and direction components, achieving better performance than LoRA with similar parameter efficiency.

LoRA: W = W₀ + AB
DoRA: W = (W₀ + AB) × ||W₀|| / ||W₀ + AB||
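
The decomposition is easy to see in code. Below is a simplified NumPy sketch of the per-column magnitude/direction split; real DoRA trains m, A, and B jointly, and this only shows the reparameterization:

import numpy as np

d, r = 64, 4
W0 = np.random.randn(d, d)                # frozen pre-trained weight
A = np.random.randn(d, r) * 0.01          # trainable LoRA factor
B = np.zeros((r, d))                      # trainable LoRA factor
m = np.linalg.norm(W0, axis=0)            # learned magnitude, init = column norms

def dora_weight():
    V = W0 + A @ B                        # direction before normalization
    return m * (V / np.linalg.norm(V, axis=0))   # renormalize, rescale by m

W = dora_weight()
print(np.allclose(np.linalg.norm(W, axis=0), m))  # True: column norms preserved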

Performance Comparison

Traditional LoRA: 85.2%
DoRA (same rank): 89.7%
Training Stability: Superior

# DoRA training with MLX
python -m mlx_lm.lora \
  --model mlx-community/gemma-3-270m-it-4bit \
  --train \
  --data ./data \
  --fine-tune-type dora \
  --adapter-path ./dora_adapters \
  --lora-r 8 \
  --lora-alpha 16 \
  --iters 300 \
  --batch-size 4 \
  --learning-rate 1e-3

🧠 Why DoRA Works Better

  • Magnitude preservation: Maintains original weight norms
  • Direction learning: Focuses adaptation on directional changes
  • Stability: More stable training than traditional LoRA
  • Performance: +3.7 improvement on Llama 7B tasks
  • Generalization: Better performance on unseen tasks

📊 Technique Comparison Guide

Technique | Memory    | Quality   | Speed     | Best For
----------|-----------|-----------|-----------|------------------------------
LoRA      | Good      | Good      | Fast      | General purpose, proven
QLoRA     | Excellent | Good      | Slower    | Limited memory, large models
DoRA      | Good      | Excellent | Very fast | High-quality results
DPO       | Good      | Excellent | Medium    | Alignment, safety

๐ŸŽ MLX-Supported Models (2025)

๐ŸŒŸ Why MLX Models Matter

MLX is Apple's machine learning framework optimized for Apple Silicon. Unlike generic frameworks, MLX leverages unified memory architecture and Metal Performance Shaders for maximum efficiency on Mac, iPad, and iPhone.

MLX Framework Benefits:

  • Unified memory sharing between CPU/GPU
  • Automatic graph optimization
  • Native Apple Silicon performance
  • Cross-device deployment (Mac → iPad → iPhone)

vs. Metal Performance Shaders:

  • MLX: High-level ML framework (like PyTorch)
  • MPS: Low-level GPU programming (like CUDA)
  • MLX builds on Metal/MPS but abstracts complexity
  • MLX handles memory management automatically

๐Ÿ—๏ธ Supported Model Architectures

  • 💎 Gemma 3 - Google's efficient 2025 models
  • 🔮 Qwen3 - Alibaba's state-of-the-art models
  • ⚡ Qwen3-MoE - Mixture of Experts models
  • 🚀 Phi-4 - Microsoft's 2025 small models

2,800+ Models Available

The mlx-community organization on HuggingFace hosts over 2,800 pre-converted models ready for MLX

🔥 Most Popular MLX Models (2025)

💎 Gemma 3 Models

# Gemma 3 - 270M (4-bit quantized, instruction-tuned)
mlx-community/gemma-3-270m-it-4bit

# Gemma 3 - 1B (beginner friendly)
mlx-community/gemma-3-1b-it-4bit

# Gemma 3 - 4B (production ready)
mlx-community/gemma-3-4b-it-4bit

# Gemma 3 - 27B (high performance)
mlx-community/gemma-3-27b-it-4bit

Best for: General purpose, instruction following

Memory: ~4-6GB (4-bit), ~14GB (16-bit)

Strengths: Proven performance, extensive fine-tuning examples

🔮 Qwen3 Dense Models

# Qwen3 4B - Balanced performance
mlx-community/Qwen3-4B-4bit

# Qwen3 8B - Enhanced capabilities
mlx-community/Qwen3-8B-4bit

# Qwen3 14B - High performance dense
mlx-community/Qwen3-14B-4bit

Best for: Code generation, reasoning tasks

Memory: 4B ~3GB, 8B ~5GB, 14B ~8GB

Strengths: Fast inference, excellent code understanding

⚡ Qwen3 MoE Models

# Qwen3 30B-A3B (~3B active parameters)
mlx-community/Qwen3-30B-A3B-4bit

# Qwen3 235B-A22B (~22B active parameters)
mlx-community/Qwen3-235B-A22B-3bit

Best for: Large-scale tasks with memory efficiency

Memory: 30B-A3B ~12GB, 235B-A22B ~50GB

Strengths: Massive capability, only activate needed experts

💎 Phi Models

# Phi-3 Mini - Compact but capable
mlx-community/Phi-3-mini-4k-instruct-4bit

# Phi-2 - Educational focus
mlx-community/Phi-2-4bit

Best for: Educational content, reasoning

Memory: Phi-2 ~1.5GB, Phi-3 ~2GB

Strengths: Small size, high quality outputs

🎯 Specialized Models

# DeepSeek Coder - Code specialist
mlx-community/deepseek-coder-6.7b-instruct-4-bit

# StableLM 2 - Stability AI
mlx-community/stablelm-2-zephyr-1_6b-4bit

Best for: Domain-specific tasks

Memory: 1.6B ~1GB, 6.7B ~4GB

Strengths: Task-specific optimization

🎯 Model Selection Guide

💚 Beginner Friendly

  • 💎 Gemma3-270M-4bit (~0.5GB)
  • 💎 Gemma3-1B-4bit (~1GB)
  • 🔮 Qwen3-1.7B-4bit (~1.5GB)
  • 💎 Gemma3-4B-4bit (~3GB)

Perfect for learning and experimentation on any Apple Silicon Mac

🔥 Production Ready

  • 🔮 Qwen3-8B-4bit (~4GB)
  • 💎 Gemma3-12B-4bit (~6GB)
  • 🔮 Qwen3-14B-4bit (~8GB)

Reliable performance for real applications and services

🚀 High Performance

  • 💎 Gemma3-27B-4bit (~12GB)
  • 🚀 OpenAI GPT-OSS-20B-8bit (~10GB, 3.6B active MoE)
  • 🔮 Qwen3-32B-4bit (~16GB)
  • 🔮 Qwen3-30B-A3B-MoE (~12GB, ~3B active)
  • 🚀 Qwen3-235B-A22B-MoE (~100GB, ~22B active)

Maximum capability models for demanding tasks

๐Ÿ› ๏ธ Quick Usage Examples

Generate Text with Any Model

# Generate with any MLX model
python -m mlx_lm.generate \
  --model mlx-community/gemma-3-4b-it-4bit \
  --prompt "Explain machine learning in simple terms" \
  --max-tokens 100 \
  --temp 0.7

Fine-tune Any Supported Model

# Fine-tune any supported architecture
python -m mlx_lm.lora \
  --model mlx-community/Qwen3-8B-4bit \
  --train \
  --data ./your_data \
  --adapter-path ./adapters \
  --lora-r 16 \
  --lora-alpha 32

Model Conversion (if needed)

# Convert a HuggingFace model to MLX format
python -m mlx_lm.convert \
  --hf-path microsoft/Phi-3-mini-4k-instruct \
  --mlx-path ./phi3-mlx \
  --quantize

⚡ Pro Tips

  • 4-bit models: Use for maximum memory efficiency
  • Model naming: Look for "4bit", "8bit", or no suffix (16-bit)
  • Trust remote code: Some models need the `--trust-remote-code` flag
  • Memory planning: 4-bit ≈ 0.5-0.7GB per billion parameters

🔬 Advanced Concepts

The Low-Rank Hypothesis

Most fine-tuning changes follow simple, repetitive patterns rather than complex random changes:

  • Identity responses: Simple pattern matching
  • Style transfer: Consistent linguistic changes
  • Domain adaptation: Systematic vocabulary shifts
  • Task specialization: Focused behavioral modifications

Grouped Query Attention

Gemma 270M uses 4 query heads but only 1 key-value head:

  • Reduces memory usage significantly
  • Maintains performance quality
  • LoRA adapts all heads efficiently
  • Common in modern efficient architectures

Memory & Computational Efficiency

  • Peak memory: 0.435 GB (vs ~4GB for full fine-tuning)
  • Training speed: ~28 it/sec (example from an M4 Pro; varies by configuration)
  • Adapter size: 1.3MB (vs 234MB for the full model)

💡 Best Practices & Troubleshooting

🚨 Common Issues

High validation loss

Reduce learning rate or increase iterations

Overfitting

Reduce rank, add dropout, or get more data

Slow convergence

Increase learning rate or alpha scaling

Memory issues

Reduce batch size or max sequence length

✅ Success Tips

Quality over quantity

100 perfect examples > 1000 messy ones

Consistent format

Pick one data format and stick to it

Representative data

Training examples should match use case

Monitor closely

Watch loss curves and test regularly

🔬 Next Steps & Advanced Techniques

Immediate Experiments

  • Try different ranks (4, 16) and compare
  • Experiment with learning rates (5e-4, 5e-3)
  • Test different iteration counts (100, 400)
  • Create larger, more diverse datasets

Advanced Techniques

  • DPO (Direct Preference Optimization)
  • QLoRA (Quantized LoRA)
  • Multi-adapter systems
  • Adapter merging

🧠 Advanced Training Strategies (2025)

🚀 Beyond Basic Fine-Tuning

These cutting-edge techniques push the boundaries of what's possible with MLX fine-tuning. From curriculum learning to multi-adapter systems, these strategies can dramatically improve your results.

📚 Curriculum Learning

Progressive Difficulty

Train the model on simple examples first, then gradually introduce more complex ones. This mirrors human learning and often leads to better final performance.

  • Start with short, simple responses
  • Gradually increase complexity
  • End with challenging edge cases
  • Can reduce total training time

Implementation Strategy

# Curriculum learning data structure:
#   easy_data/   (weeks 1-2) -> simple_questions.jsonl
#   medium_data/ (weeks 3-4) -> complex_questions.jsonl
#   hard_data/   (weeks 5-6) -> edge_cases.jsonl

# Train progressively, resuming from the previous stage's adapters
# (recent mlx-lm versions use --resume-adapter-file; older ones differ)
python -m mlx_lm.lora --train --data ./easy_data --adapter-path ./adapters --iters 100
python -m mlx_lm.lora --train --data ./medium_data --adapter-path ./adapters \
  --resume-adapter-file ./adapters/adapters.npz --iters 200
python -m mlx_lm.lora --train --data ./hard_data --adapter-path ./adapters \
  --resume-adapter-file ./adapters/adapters.npz --iters 300

🎯 Best Applications

  • Math/Logic: Simple arithmetic → complex word problems
  • Coding: Basic syntax → complex algorithms
  • Language: Common phrases → technical jargon
  • Reasoning: Direct facts → multi-step inference

🎯 Advanced Few-Shot Learning

Meta-Learning Approach

Train the model to quickly adapt to new tasks with minimal examples. This creates a "learning to learn" capability.

Task 1: Translation (20 examples)
Task 2: Summarization (20 examples)
Task 3: Q&A (20 examples)
→ New Task: Code Review (5 examples) ✓

In-Context Learning Enhancement

# Enhanced few-shot data format
{
  "prompt": "Task: Sentiment Analysis\nExample 1: 'Great product!' → Positive\nExample 2: 'Terrible service' → Negative\nNow classify: 'Amazing experience'",
  "completion": " → Positive"
}

💡 Advanced Techniques

  • Task vectors: Learn task-specific directions in parameter space
  • Prototype learning: Create task prototypes for rapid adaptation
  • Gradient-based meta-learning: Optimize for fast learning
  • Context distillation: Compress task knowledge into adapters

🔗 Multi-Adapter Systems

Composable Capabilities

Train separate adapters for different capabilities, then combine them for complex tasks. This modular approach allows mixing and matching skills.

Base Model + Math Adapter + Code Adapter = Programming Tutor with Math Skills ✨

Implementation

# Train specialized adapters
python -m mlx_lm.lora --train --data ./math_data --adapter-path ./math_adapter
python -m mlx_lm.lora --train --data ./code_data --adapter-path ./code_adapter

# Combine at inference (requires custom loading):
# math_weight * math_adapter + code_weight * code_adapter
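
MLX-LM has no built-in multi-adapter loader, so combining adapters means blending the saved weights yourself. Below is a naive sketch that linearly mixes matching matrices from two adapter files; note this is a crude approximation, since the effective update A×B is bilinear and blending the factors is not the same as blending the updates:

# Weighted merge of two adapter files trained on the same base model
# with identical LoRA shapes.
import numpy as np
from pathlib import Path

math_ad = np.load("math_adapter/adapters.npz")
code_ad = np.load("code_adapter/adapters.npz")
w_math, w_code = 0.6, 0.4

merged = {k: w_math * math_ad[k] + w_code * code_ad[k] for k in math_ad.files}
Path("merged_adapter").mkdir(exist_ok=True)
np.savez("merged_adapter/adapters.npz", **merged)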

โš–๏ธ Adapter Weighting Strategies

  • Equal weighting: Each adapter contributes equally
  • Task-based weighting: Adjust weights based on task type
  • Learned weighting: Train a routing network to decide weights
  • Dynamic weighting: Adjust weights based on input content

🎯 Use Cases

  • Domain Expert: Medical + Legal knowledge
  • Multilingual Assistant: English + Spanish + French
  • Code Helper: Python + JavaScript + SQL
  • Creative Writer: Fiction + Poetry + Screenplay
  • Research Assistant: Literature + Data + Visualization
  • Teacher: Math + Science + History

🔄 Continual Learning

Avoiding Catastrophic Forgetting

Learn new tasks without losing performance on previous ones. Critical for models that need to continuously evolve and improve.

  • Elastic Weight Consolidation (EWC)
  • Progressive Neural Networks
  • Memory replay systems
  • Adapter-based isolation

MLX Implementation

# Continual learning with LoRA
# Train Task 1
python -m mlx_lm.lora --train --data ./task1 --adapter-path ./task1_adapter

# Train Task 2 in a separate adapter so Task 1 is preserved
python -m mlx_lm.lora --train --data ./task2 --adapter-path ./task2_adapter
# (EWC-style regularization against previous adapters is not a built-in
#  mlx_lm.lora flag; it would require custom training code.)

# Evaluate each task with its own adapter
python -m mlx_lm.generate --adapter-path ./task1_adapter  # Task 1
python -m mlx_lm.generate --adapter-path ./task2_adapter  # Task 2

โš™๏ธ Advanced Strategies

  • Adapter routing: Automatically select the right adapter
  • Knowledge distillation: Compress old task knowledge
  • Replay buffers: Maintain examples from previous tasks
  • Orthogonal adapters: Ensure adapters don't interfere

🎭 Advanced Instruction Masking

2025 Best Practices

Recent research shows that masking instructions during training leads to better performance. The model learns to focus on generating responses rather than memorizing prompts.

โŒ Train on: "Question: What is AI? Answer: AI is..." โœ… Train on: "[MASK] What is AI? Answer: AI is..."

Masking Strategies

# Prompt/completion format: train on the completion (response) only
{
  "prompt": "USER: Explain AI\nASSISTANT:",
  "completion": " AI is artificial intelligence that..."
}
# Recent mlx-lm versions expose prompt masking via the --mask-prompt option
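
Conceptually, instruction masking is just a per-token loss mask that zeroes out prompt positions. A tiny sketch with made-up token ids:

import numpy as np

prompt_tokens = [101, 7592, 2088]        # illustrative ids for the prompt
completion_tokens = [9932, 2003, 1012]   # illustrative ids for the completion

tokens = prompt_tokens + completion_tokens
# 0 = ignore in the loss (prompt), 1 = train on it (completion)
loss_mask = np.array([0] * len(prompt_tokens) + [1] * len(completion_tokens))

per_token_loss = np.random.rand(len(tokens))          # stand-in for real losses
masked_loss = (per_token_loss * loss_mask).sum() / loss_mask.sum()
print(masked_loss)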

✅ Benefits of Instruction Masking

  • Better response quality and coherence
  • Reduced instruction repetition
  • Improved generalization to new prompts
  • More efficient learning of response patterns

🔧 Implementation Tips

  • Use consistent masking tokens across your dataset
  • Mask system prompts and user instructions
  • Keep response generation tokens unmasked
  • Test both masked and unmasked to compare performance

🔧 Comprehensive Troubleshooting Guide

🚨 When Things Go Wrong

This section covers the most common issues you'll encounter with MLX fine-tuning and provides practical solutions. Use the search function (Cmd/Ctrl+F) to quickly find your specific error.

💾 Memory Issues

❌ "RuntimeError: Memory allocation failed"

Your model or batch size is too large for available memory.

Quick Fixes:
  • Reduce batch size to 1
  • Use QLoRA instead of LoRA
  • Lower the rank (try 4 or 8)
  • Use a smaller model variant
# Memory-efficient configuration
python -m mlx_lm.lora \
  --model mlx-community/gemma-3-270m-it-4bit \
  --train \
  --batch-size 1 \
  --lora-r 4 \
  --lora-alpha 8

โš ๏ธ "Memory usage keeps increasing"

Potential memory leak during training.

Solutions:
  • Add --save-every 50 to checkpoint regularly
  • Restart training from the last checkpoint (--resume-adapter-file) if memory grows too large
  • Monitor with Activity Monitor during training
  • Consider training in smaller chunks

📈 Training Problems

📊 "Loss is not decreasing"

Your model isn't learning effectively.

Common Causes:
  • Learning rate too low/high
  • Insufficient rank
  • Bad data quality
  • Too few iterations
Try These:
  • Increase learning rate to 1e-3
  • Double the rank (8→16, 16→32)
  • Increase alpha (rank × 2)
  • Train for more iterations

๐ŸŽฏ "Model repeats training examples"

Your model is overfitting to the training data.

Solutions:
  • Reduce rank to prevent overcapacity
  • Lower learning rate (5e-5 instead of 1e-3)
  • Add more diverse training data
  • Use fewer iterations
  • Implement early stopping

๐ŸŒ "Training is very slow"

Performance optimization needed.

Speed Improvements:
  • Use 4-bit quantized models
  • Increase batch size if memory allows
  • Close other applications
  • Ensure proper cooling (avoid thermal throttling)
# Optimized for speed
python -m mlx_lm.lora \
  --model mlx-community/gemma-3-270m-it-4bit \
  --train \
  --batch-size 4 \
  --lora-r 8

๐Ÿ“ Data Format Issues

๐Ÿ“‹ "JSONDecodeError: Expecting property name"

Your data file has formatting issues.

โŒ Common Mistakes:
Wrong
# Missing quotes around keys { text: "Hello world", response: "Hi there" }
โœ… Correct Format:
Right
# Proper JSON format { "prompt": "Hello world", "completion": " Hi there" }

๐Ÿ“„ "Extra data after JSON object"

JSONL format requires one JSON object per line.

โŒ Wrong JSONL:
Wrong
[ {"prompt": "Hello", "completion": " Hi"}, {"prompt": "Goodbye", "completion": " Bye"} ]
โœ… Correct JSONL:
Right
{"prompt": "Hello", "completion": " Hi"} {"prompt": "Goodbye", "completion": " Bye"}

🤖 Model & Compatibility Issues

🔍 "Model not found or not supported"

The model path is incorrect or model isn't MLX-compatible.

Solutions:
  • Use models from mlx-community namespace
  • Check if model exists on HuggingFace Hub
  • Convert non-MLX models with mlx_lm.convert
  • Verify spelling and exact model name

๐Ÿ”’ "Requires trust_remote_code=True"

Some models need additional permission flag.

# First convert the HF model to MLX format
python -m mlx_lm.convert \
  --hf-path microsoft/Phi-3-mini-4k-instruct \
  --mlx-path ./phi3-mlx \
  --quantize

# Then train on the converted model
python -m mlx_lm.lora \
  --model ./phi3-mlx \
  --train

๐Ÿ”— "Cannot load adapter"

Adapter compatibility or path issues.

Check These:
  • Adapter trained on same base model
  • Correct file paths
  • Adapter files not corrupted
  • Compatible MLX version
# Verify adapter files
ls -la ./adapters/
# Should contain: adapters.npz, adapter_config.json

⚡ Performance Debugging

🔍 Diagnostic Commands

# Check MLX installation
python3 -c "import mlx.core as mx; print(mx.__version__)"

# Monitor memory usage during training
top -pid $(pgrep -f mlx_lm)

# Check model size on disk
du -sh ./model_directory/

🎯 Performance Tips

  • Use 4-bit models: Faster and more memory efficient
  • Batch size tuning: Find the sweet spot for your hardware
  • Rank optimization: Higher rank ≠ always better
  • Data preprocessing: Clean and optimize your training data
  • Regular checkpoints: Save progress frequently

🚨 Emergency Procedures

🛑 System Freeze/Crash

  1. Force quit MLX process (Activity Monitor)
  2. Check available disk space
  3. Restart with smaller batch size
  4. Use QLoRA if memory constrained
  5. Check system temperature

💾 Recovery Procedures

  1. Look for checkpoint files in output directory
  2. Resume training from the latest checkpoint with the --resume-adapter-file flag
  3. If corrupted, restart from last good checkpoint
  4. Keep backup copies of important adapters
  5. Document successful configurations

📞 Getting Help

  • MLX GitHub Issues: Report bugs and get community help
  • HuggingFace Forums: Discuss model-specific issues
  • Reddit r/MachineLearning: General ML troubleshooting
  • Discord Communities: Real-time help from practitioners

📖 Glossary

Essential terminology for understanding fine-tuning and LoRA, explained in simple terms.

LoRA (Low-Rank Adaptation)

A technique that fine-tunes models by learning small "adapter" matrices instead of updating all parameters. Like adding specialized tools to a Swiss Army knife instead of rebuilding the whole thing.

Adapters

Small matrices (typically 1-5MB) that modify a frozen pre-trained model's behavior. Think of them as "plugins" that add specific capabilities without changing the core model.

Rank

The size dimension of the adapter matrices. Higher rank = more capacity but larger files. Like choosing between a small or large toolbox - bigger holds more tools but takes more space.

Alpha

A scaling factor that controls how strongly the adapters influence the model. Higher alpha = more aggressive changes. Like the volume knob on adapter modifications.

Attention Mechanism

The part of transformers that decides what to focus on. Like highlighting important parts of a text while reading. Query, Key, and Value matrices work together to create this focus.

MLP (Multi-Layer Perceptron)

The "thinking" layers in transformers where reasoning and knowledge processing happens. Like the brain cells that process information after attention decides what's important.

Embeddings

How words are converted into numbers that computers can understand. Each word becomes a vector (list of numbers) that captures its meaning and relationships.

Layer Normalization

A technique that keeps the model's internal numbers in a stable range. Like maintaining proper voltage in an electrical circuit - prevents things from getting too extreme.

Transformer

The fundamental architecture used in modern LLMs. Processes text by paying attention to relevant parts and thinking about relationships between words. Powers GPT, BERT, and most AI language models.

Tokenization

Breaking text into pieces (tokens) that models can process. Like cutting a sentence into words, but smarter - can handle parts of words, punctuation, and special characters.

Loss Function

A measure of how wrong the model's predictions are. Training tries to minimize this. Like a score in golf - lower is better, and the goal is to get as close to zero as possible.

Learning Rate

How big steps the model takes when learning. Too high = erratic learning, too low = very slow progress. Like choosing how fast to drive - need the right speed for the road conditions.

Batch Size

How many examples the model looks at before updating its weights. Larger batches = more stable but slower learning. Like studying in groups vs individually.

Gradient Checkpointing

A memory-saving technique that trades computation time for memory usage. Like taking notes to save space in your head, then looking them up when needed.

Quantization

Reducing the precision of model weights to save memory. 4-bit quantization uses less precise numbers but takes 4x less space. Like compressing a photo - some quality loss but much smaller file.

Catastrophic Forgetting

When fine-tuning makes a model forget its original capabilities. Like learning a new language so intensively that you forget your native tongue. LoRA helps prevent this.