Name: Nanochat LLM Training
Author: Aradotso
Install
Terminal · npx
$ npx skills add https://github.com/vercel-labs/agent-skills --skill vercel-react-best-practices
Source file
SKILL.md (361 lines, markdown)
---
name: nanochat-llm-training
description: Train your own GPT-2 level LLM for under $100 using nanochat, Karpathy's minimal hackable harness covering tokenization, pretraining, finetuning, evaluation, inference, and chat UI.
triggers:
  - train my own LLM with nanochat
  - run nanochat pretraining
  - reproduce GPT-2 with nanochat
  - nanochat finetuning and chat
  - set up nanochat on GPU node
  - nanochat speedrun leaderboard
  - configure nanochat depth and hyperparameters
  - talk to my nanochat model in chat UI
---

# nanochat LLM Training

> Skill by [ara.so](https://ara.so), Daily 2026 Skills collection.

nanochat is Karpathy's minimal, hackable harness for training LLMs end-to-end on a single GPU node. It covers tokenization, pretraining, SFT finetuning, RL, evaluation (DCLM CORE score), inference with KV cache, and a ChatGPT-like web UI. A single complexity dial (`--depth`) auto-configures all other hyperparameters (width, heads, LR, training horizon, weight decay) for compute-optimal training. You can reproduce GPT-2 capability (~$43,000 in 2019) for ~$48 on an 8×H100 node (~2 hours).

## Installation

nanochat uses `uv` for dependency management:

```bash
git clone https://github.com/karpathy/nanochat.git
cd nanochat
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create venv and install deps
uv sync
source .venv/bin/activate
```

## Key Commands

### Full GPT-2 Speedrun (8×H100 node, ~2–3 hours, ~$48)

```bash
# Run the reference pipeline: data download, pretraining, SFT, eval, chat
bash runs/speedrun.sh
```

### Pretraining (distributed)

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=26 \
    --run="d26_run" \
    --model-tag="d26"
```

### Pretraining (single GPU)

```bash
python -m scripts.base_train -- \
    --depth=26 \
    --run="d26_single"
```

### Quick Research Iteration (~5 min, GPT-1 scale)

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12_exp" \
    --model-tag="d12" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1
```

### CPU / Apple Silicon (tiny model, ~minutes)

```bash
bash runs/runcpu.sh
```

### Serve Chat UI

```bash
# After training completes
source .venv/bin/activate
python -m scripts.chat_web
# Visit http://<your-server-ip>:8000/
```

### CLI Chat

```bash
python -m scripts.chat_cli -p "hello"
```

### Scaling Laws / Miniseries

```bash
bash runs/scaling_laws.sh   # sweep depths for scaling law data
bash runs/miniseries.sh     # train full compute-optimal miniseries
```

## The Depth Dial

`--depth` is the single most important parameter; everything else is derived from it automatically:

| `--depth` | Approximate model scale | Notes |
|-----------|------------------------|-------|
| 6–8 | Tiny (toy) | CPU/MPS feasible |
| 12 | GPT-1 size | ~5 min on 8×H100, great for research iteration |
| 16 | Medium | ~15 min on 8×H100 |
| 24–26 | GPT-2 size | ~2 hrs on 8×H100, ~$48 |

```bash
# Smaller/faster experiments
python -m scripts.base_train -- --depth=12 --run="quick_test"

# Full GPT-2 grade
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --run="gpt2_repro"
```

## Precision / dtype Configuration

nanochat manages dtypes explicitly via `COMPUTE_DTYPE` in `nanochat/common.py`; it does not use `torch.amp.autocast`.

| Hardware | Default | Override |
|----------|---------|---------|
| CUDA SM 80+ (A100, H100) | `bfloat16` | `NANOCHAT_DTYPE=float32` |
| CUDA SM < 80 (V100, T4) | `float32` | `NANOCHAT_DTYPE=float16` |
| CPU / MPS | `float32` | none |

```bash
# Force fp32 for inference
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"

# Force bf16 for training
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train

# float16 training (enables GradScaler automatically)
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train
```

**How it works:** Weights are stored in fp32 (optimizer precision), a custom `Linear` casts them to `COMPUTE_DTYPE` in the forward pass, and embeddings are stored directly in `COMPUTE_DTYPE` to save memory.

## Key Python Modules

```
nanochat/
├── gpt.py              # GPT nn.Module Transformer
├── engine.py           # Efficient inference with KV cache
├── dataloader.py       # Tokenizing Distributed Data Loader
├── dataset.py          # Download/read utils for pretraining data
├── optim.py            # AdamW + Muon optimizer (1GPU and distributed)
├── core_eval.py        # DCLM CORE score evaluation
├── loss_eval.py        # Bits-per-byte evaluation
├── checkpoint_manager.py  # Save/Load checkpoints
├── common.py           # Utilities, COMPUTE_DTYPE
└── execution.py        # Python code execution tool for LLM

scripts/
├── base_train.py       # Pretraining entry point
├── chat_web.py         # Web chat UI server
└── chat_cli.py         # CLI chat interface

runs/
├── speedrun.sh         # Reference full pipeline (GPT-2 speedrun)
├── scaling_laws.sh     # Scaling law sweeps
├── miniseries.sh       # Full compute-optimal miniseries
└── runcpu.sh           # CPU/MPS example
```

## Real Code Examples

### Load and Run Inference on a Trained Model

```python
import torch
from nanochat.gpt import GPT
from nanochat.engine import InferenceEngine
from nanochat.checkpoint_manager import CheckpointManager

# NOTE: CheckpointManager and InferenceEngine are the names used in this skill;
# confirm them against nanochat/checkpoint_manager.py and nanochat/engine.py.

# Load checkpoint
ckpt_manager = CheckpointManager("checkpoints/d26")
model, config = ckpt_manager.load()
model.eval()

# Run inference with KV cache
engine = InferenceEngine(model)
output = engine.generate(
    prompt="Once upon a time",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
)
print(output)
```

### Custom Training Script with Depth Dial

```python
import os
import subprocess

def train_model(depth: int, run_name: str, nproc: int = 8):
    """Launch a compute-optimal training run for given depth."""
    cmd = [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={nproc}",
        "-m", "scripts.base_train",
        "--",
        f"--depth={depth}",
        f"--run={run_name}",
        f"--model-tag={run_name}",
    ]
    # Default to one OpenMP thread unless the caller's environment already sets it
    env = {"OMP_NUM_THREADS": "1", **os.environ}
    # check=True raises CalledProcessError if the training run fails
    subprocess.run(cmd, env=env, check=True)

# Quick research iteration
train_model(depth=12, run_name="my_experiment_d12")

# Full GPT-2 grade
train_model(depth=26, run_name="my_gpt2_repro")
```

### Adjust Device Batch Size for Lower VRAM

```bash
# Default device_batch_size=32 needs ~80GB VRAM per GPU
# Reduce for smaller GPUs (gradient accumulation handles the rest)
torchrun --standalone --nproc_per_node=4 -m scripts.base_train -- \
    --depth=12 \
    --device_batch_size=16 \
    --run="low_vram_run"

# Even smaller
python -m scripts.base_train -- \
    --depth=8 \
    --device_batch_size=4 \
    --run="single_gpu_small"
```

### Monitoring Key Metrics in wandb

```python
# nanochat logs to wandb automatically. Key metrics to watch:
# - val_bpb: validation loss in bits-per-byte (vocab-size-invariant)
#   as a function of step, total_training_time, total_training_flops
# - core_metric: DCLM CORE score (target > 0.2565 to beat GPT-2)
# - train/mfu: Model FLOPS utilization
# - train/tok_per_sec: Training throughput

# Set wandb project via env var before training
import os
os.environ["WANDB_PROJECT"] = "my-nanochat-runs"
```

### Synthetic Data for SFT Personality

```bash
# dev/gen_synthetic_data.py generates identity/personality data
# Then mix into SFT stage per the guide:
# https://github.com/karpathy/nanochat/discussions/139

# Example: generate data and point SFT to it
python dev/gen_synthetic_data.py --output data/identity_sft.jsonl
# Then reference in your SFT script configuration
```

## Common Patterns

### Research Iteration Loop

```bash
# 1. Make a code change in nanochat/
# 2. Run quick d12 to validate
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 --run="test_my_change" \
    --core-metric-every=999999 --sample-every=-1 --save-every=-1
# 3. Check wandb: val_bpb vs step/time/flops
# 4. If promising, test at d16 or d26
```

### FP8 Training (H100 only, for speedrun)

```bash
# FP8 is used in the speedrun for additional speedup
# See runs/speedrun.sh for the exact invocation
bash runs/speedrun.sh
```

### Evaluate CORE Score Only

```bash
python -m nanochat.core_eval --checkpoint checkpoints/d26/latest
```

### Serve on Lambda / Remote Machine

```bash
# On remote machine after training:
source .venv/bin/activate
python -m scripts.chat_web
# Access via: http://<PUBLIC_IP>:8000/

# Or run it inside `screen` or `tmux` so the server stays alive after you disconnect:
screen -S nanochat
python -m scripts.chat_web
# Ctrl+A, D to detach
```

## Troubleshooting

### OOM / Out of VRAM

```bash
# Reduce --device_batch_size (default 32)
# Code uses gradient accumulation to maintain effective batch size
--device_batch_size=16   # Try 16, 8, 4, 2, 1
```

### Single GPU is 8× Slower

This is expected. Omit `torchrun` and use `python -m scripts.base_train` directly. Gradient accumulation kicks in automatically to maintain an equivalent total batch size.

### Running on Non-CUDA Hardware

```bash
# MPS (Apple Silicon) or CPU: use runcpu.sh as template
bash runs/runcpu.sh
# Results will be weak; this is for development/debugging only
```

### float16 Gradient Underflow

```bash
# nanochat auto-enables GradScaler when NANOCHAT_DTYPE=float16
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12
# Note: RL scripts do NOT support float16 (SFT and base_train do)
```

### V100 / T4 (SM < 80): No bf16

```bash
# Default falls back to float32; optionally use float16
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12
```

### Chat UI Not Accessible

```bash
# Ensure the port (default 8000) is open in your cloud provider's firewall/security group
# Use the public IP, not localhost:
# http://<PUBLIC_IP>:8000/
```

## Resources

- **DeepWiki Q&A**: https://deepwiki.com/karpathy/nanochat
- **Discussions**: https://github.com/karpathy/nanochat/discussions
- **Discord**: `#nanochat` channel on Karpathy's Discord
- **Leaderboard docs**: `dev/LEADERBOARD.md`
- **Beating GPT-2 guide**: https://github.com/karpathy/nanochat/discussions/481
- **Miniseries v1**: https://github.com/karpathy/nanochat/discussions/420
- **Adding abilities guide**: https://github.com/karpathy/nanochat/discussions/164
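
## Worked Example: Gradient Accumulation Arithmetic

The troubleshooting notes say that lowering `--device_batch_size`, or dropping from eight GPUs to one, is compensated by gradient accumulation so the total batch size stays the same. The sketch below illustrates that arithmetic only; it is not nanochat's code. It assumes the total batch is counted in tokens and must divide evenly into micro-batches. The function name `grad_accum_steps`, the 2048-token sequence length, and the 524,288-token total batch are hypothetical values chosen for illustration, not figures taken from the repository.

```python
def grad_accum_steps(total_batch_tokens: int,
                     device_batch_size: int,
                     seq_len: int,
                     world_size: int) -> int:
    """Micro-steps needed per optimizer step to reach the target total batch.

    Hypothetical helper: mirrors the idea described in the text, not
    nanochat's actual implementation.
    """
    # Tokens processed across all GPUs in one forward/backward pass
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError(
            "total_batch_tokens must be a multiple of "
            "device_batch_size * seq_len * world_size"
        )
    return total_batch_tokens // tokens_per_micro_step


if __name__ == "__main__":
    TOTAL = 524_288   # assumed total batch size in tokens (illustrative)
    SEQ_LEN = 2048    # assumed sequence length (illustrative)
    for dbs, gpus in [(32, 8), (32, 1), (16, 4), (4, 1)]:
        steps = grad_accum_steps(TOTAL, dbs, SEQ_LEN, gpus)
        print(f"device_batch_size={dbs:>2} gpus={gpus} -> {steps} accumulation step(s)")
```

Under these assumed numbers, a single GPU needs 8 accumulation steps where eight GPUs need 1, which is consistent with the roughly 8× slowdown described under Troubleshooting.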