How Flash Moe Inference fits into a Paperclip company.

Flash Moe Inference drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
Source file: SKILL.md

---
name: flash-moe-inference
description: Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming
triggers:
  - run a large language model on my laptop
  - stream expert weights from SSD
  - flash moe inference engine
  - run Qwen3.5 397B on Mac
  - mixture of experts on Apple Silicon
  - metal inference engine for large models
  - quantized MoE inference macOS
  - run 209GB model on MacBook
---

# Flash-MoE Inference Engine

> Skill by [ara.so](https://ara.so), Daily 2026 Skills collection.

Flash-MoE is a pure C/Objective-C/Metal inference engine that runs **Qwen3.5-397B-A17B** (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand: no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.

## Requirements

- **Hardware**: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
- **OS**: macOS 26+ (Darwin 25+)
- **Tools**: Xcode Command Line Tools, Python 3.x (for weight extraction only)
- **Model**: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)

## Installation & Build

```bash
# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer

# Build everything
make

# Verify build artifacts
ls infer chat main
```

The Makefile compiles `infer.m`, `chat.m`, and `main.m`, and runs Metal shader compilation for `shaders.metal`.

## Weight Preparation

### Step 1: Extract non-expert weights

```bash
# From the metal_infer/ directory
# Point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/

# Produces:
#   model_weights.bin   (~5.5GB, mmap'd at runtime)
#   model_weights.json  (tensor manifest)
#   vocab.bin           (vocabulary)
#   tokenizer.bin       (BPE tokenizer data)
```

### Step 2: Pack expert weights (4-bit, production)

```bash
# From repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/

# Produces packed_experts/ directory (~209GB)
# Each expert is a separate file: layer_XX_expert_YYYY.bin
```

### Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)

```bash
# Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)
python3 metal_infer/repack_experts_2bit.py \
    metal_infer/packed_experts/ \
    metal_infer/packed_experts_2bit/
```

## Key Commands

### Basic inference

```bash
cd metal_infer

# 4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
```

### Interactive chat with tool calling

```bash
./chat
# Opens TUI with full tool calling support
# Uses 4-bit experts by default
```

### MoE-only benchmark (measures expert throughput)

```bash
./main
# Runs pure expert forward-pass benchmark
# Reports tokens/sec without attention overhead
```

## Project Structure

```
flash-moe/
├── paper/
│   └── flash_moe.pdf          # Full technical paper
├── metal_infer/
│   ├── infer.m                # Complete inference engine (~7000 lines)
│   ├── shaders.metal          # Metal compute kernels (~1200 lines)
│   ├── chat.m                 # Interactive chat TUI
│   ├── tokenizer.h            # Single-header C BPE tokenizer (449 lines)
│   ├── main.m                 # MoE-only benchmark
│   ├── Makefile
│   ├── extract_weights.py     # Safetensors → model_weights.bin
│   ├── repack_experts_2bit.py # 4-bit → 2-bit requantization
│   ├── train_predictor.py     # Expert routing prediction analysis
│   ├── model_weights.bin      # Non-expert weights (mmap'd)
│   ├── model_weights.json     # Tensor manifest
│   ├── vocab.bin
│   ├── tokenizer.bin
│   ├── packed_experts/        # 4-bit expert files (209GB)
│   └── packed_experts_2bit/   # 2-bit expert files (120GB, optional)
├── repack_experts.py          # 4-bit expert packing from safetensors
├── progress.py                # Results visualization
└── results.tsv                # Experiment log
```

## Architecture Overview

The model has **60 transformer layers**:
- 45 GatedDeltaNet (linear attention) layers
- 15 standard full attention layers
- Each layer: 512 experts, K=4 activated per token + 1 shared expert
- Hidden dimension: 4096

### Per-layer pipeline (4.28ms average at 4-bit)

```
CMD3(prev) → CMD1: attention projections + delta-net  [1.22ms GPU]
           → CPU: flush results                       [0.01ms CPU]
           → CMD2: o_proj + norm + routing + shared    [0.55ms GPU]
           → CPU: softmax + topK routing               [0.003ms]
           → I/O: parallel pread K=4 experts           [2.41ms SSD]
           → CMD3: expert forward + combine + norm     [0.04ms encode, DEFERRED]
```

## Metal Shader Kernels

The `shaders.metal` file contains hand-written kernels. Key kernels:

```metal
// 4-bit dequantized matrix-vector multiply (FMA-optimized)
// Key insight: fma(nibble, scale*x, bias*x) instead of (nibble*scale + bias)*x
// Pre-compute scale*x and bias*x to fuse dequant+multiply in one FMA instruction

kernel void matvec_4bit_fma(
    device const uint8_t* weights [[buffer(0)]],
    device const float* scales    [[buffer(1)]],
    device const float* biases    [[buffer(2)]],
    device const float* x         [[buffer(3)]],
    device float* out             [[buffer(4)]],
    uint tid [[thread_position_in_threadgroup]],
    uint gid [[threadgroup_position_in_grid]])
{
    // ... tiled SIMD-reduced FMA kernel
    // 12% faster than naive (nibble * scale + bias) * x
}

// Fused SwiGLU activation: gate = silu(gate) * up, written in place
kernel void swiglu(device float* gate [[buffer(0)]],
                   device const float* up [[buffer(1)]],
                   uint gid [[thread_position_in_grid]])
{
    float g = gate[gid];
    gate[gid] = (g / (1.0f + exp(-g))) * up[gid];
}

// RMS normalization (two-pass)
kernel void rms_norm_pass1(...) // sum of squares reduction
kernel void rms_norm_pass2(...) // apply normalization

// GPU RoPE (fused with Q deinterleave and K normalization)
kernel void rope_qk(...)

// MoE combine + residual + sigmoid gate (fused)
kernel void moe_combine_residual(...)
```

## SSD Expert Streaming Pattern

The core innovation is loading only the K=4 active experts per layer from SSD:

```objc
// Parallel expert loading using GCD dispatch groups
// From infer.m (conceptual pattern)

dispatch_group_t group = dispatch_group_create();
dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);

for (int k = 0; k < K_EXPERTS; k++) {
    int expert_id = top_k_indices[k];
    dispatch_group_async(group, ioQueue, ^{
        // Each expert: ~6.75MB at 4-bit
        char path[256];
        snprintf(path, sizeof(path),
                 "packed_experts/layer_%02d_expert_%04d.bin",
                 layer, expert_id);

        int fd = open(path, O_RDONLY);
        if (fd < 0) return;  // open failed; skip this read
        // pread() blocks only this worker thread, so the K reads run in
        // parallel; the OS page cache handles LRU
        pread(fd, expert_buffer[k], expert_size, 0);
        close(fd);
    });
}

// Wait until all K reads have finished
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
// GPU compute follows; the serial pipeline is hardware-optimal on Apple Silicon
```

**Why `pread()` not `mmap()`**: mmap incurs per-page fault overhead on cold data (~5x slower). Direct `pread()` with OS page cache achieves ~71% hit rate naturally.

## GatedDeltaNet Linear Attention (BLAS)

The recurrence update uses Accelerate BLAS, which is 64% faster than scalar code:

```objc
// GatedDeltaNet state update per head (conceptual pattern)
// state: 128×128 float matrix, 64 heads
// From infer.m

#import <Accelerate/Accelerate.h>

for (int h = 0; h < 64; h++) {
    float* S = state + h * 128 * 128;  // 128×128 state matrix
    float* q = Q + h * 128;
    float* k = K + h * 128;
    float* v = V + h * 128;

    // β·(k⊗v) outer product update
    // cblas_sger: S += beta * (k ⊗ v)
    cblas_sger(CblasRowMajor, 128, 128,
               beta[h], k, 1, v, 1, S, 128);

    // Decay: S = alpha * S
    cblas_sscal(128 * 128, alpha[h], S, 1);

    // Output: o = S @ q
    cblas_sgemv(CblasRowMajor, CblasNoTrans,
                128, 128, 1.0f, S, 128, q, 1, 0.0f,
                output + h * 128, 1);
}
```

## Performance Configuration

### 4-bit (production default)
- **Quality**: Excellent, with full tool calling and correct JSON
- **Speed**: 4.36 tok/s
- **Disk**: 209GB

### 2-bit (speed testing only)
- **Quality**: Good, but breaks JSON/tool calling (`\name\` instead of `"name"`)
- **Speed**: 5.74 tok/s (7.05 peak single token with warm cache)
- **Disk**: 120GB
- Uses `F_NOCACHE` flag to avoid page cache thrashing

## What NOT to Try (Learned from 58 Experiments)

| Approach | Why it fails |
|----------|-------------|
| `mmap()` expert files | Per-page fault overhead: 5x slower than `pread()` |
| `dispatch_io` | `dispatch_data` management overhead: -70% |
| `F_RDADVISE` prefetch | SSD DMA + GPU share memory controller; concurrent access: -73% GPU speed |
| Custom Metal LRU cache | GPU memory pressure: -38% vs OS page cache |
| LZ4 expert compression | Decompress overhead > warm cache savings: -13% |
| Temporal expert prediction | 25% hit rate, wastes SSD bandwidth: -18% |
| Speculative early routing | Cache pollution: -38% |
| MTP speculative decoding | MoE I/O scales per-token (unlike dense models): break-even |
| Spin-poll GPU wait | CPU thermal throttle competes with GPU: -23% |
| Parallel SSD + GPU overlap | Unified memory controller arbitration: net negative |

**Key principle**: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.

## Troubleshooting

### Build fails
```bash
# Ensure Xcode CLI tools are installed
xcode-select --install

# Check Metal compiler is available
xcrun -sdk macosx metal --version
```

### Out of memory
The engine is designed to use ~6GB active:
- 5.5GB: `model_weights.bin` (mmap'd, read-only)
- ~200MB: Metal scratch buffers
- Remaining ~42GB: OS page cache for expert data

If you see OOM, check for other processes consuming unified memory:
```bash
sudo memory_pressure
vm_stat
```

### Slow performance
```bash
# Check SSD speed: needs ~17GB/s for target performance
# Run with timing to identify bottleneck
./infer --prompt "Hello" --tokens 5 --timing

# Verify packed_experts/ is on internal SSD, not external drive
diskutil info /
```

### Wrong expert directory
```bash
# Default paths expected by infer.m:
# metal_infer/packed_experts/     (4-bit)
# metal_infer/packed_experts_2bit/ (2-bit)

# Ensure you're running from metal_infer/ directory
cd metal_infer
./infer --prompt "test"
```

### Tool calling broken
Use 4-bit, not 2-bit. The 2-bit quantization corrupts quote characters in JSON output, making tool calling unreliable. Always use the default 4-bit configuration for agentic workloads.

## Memory Safety

The engine explicitly manages all allocations:
- No unbounded caches
- Expert data never accumulates in GPU memory
- `model_weights.bin` is mmap'd read-only; the kernel manages pages
- Expert files are opened/read/closed per inference step
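
The open/read/close cycle in the last bullet can be sketched in plain C. This is a minimal sketch under stated assumptions, not the code in `infer.m`: the helper name `load_expert`, its `dir` parameter, and the 0/-1 return convention are hypothetical. Only the `layer_XX_expert_YYYY.bin` naming and the use of `pread()` come from the text above. Unlike the conceptual snippet, it checks every call and loops on short reads.

```c
#define _XOPEN_SOURCE 700   /* expose pread() under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: read exactly expert_size bytes of one packed expert
 * file into a caller-owned buffer. Returns 0 on success, -1 on any failure
 * (missing file, I/O error, or a file shorter than expert_size). */
static int load_expert(const char *dir, int layer, int expert_id,
                       void *buf, size_t expert_size)
{
    assert(dir != NULL && buf != NULL);

    char path[512];
    int n = snprintf(path, sizeof(path), "%s/layer_%02d_expert_%04d.bin",
                     dir, layer, expert_id);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                      /* path would have been truncated */

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < expert_size) {
        /* pread() may return fewer bytes than requested, so keep going
         * from the current offset until the buffer is full. */
        ssize_t got = pread(fd, (char *)buf + done,
                            expert_size - done, (off_t)done);
        if (got < 0 && errno == EINTR)
            continue;                   /* interrupted by a signal: retry */
        if (got <= 0)
            break;                      /* real error or unexpected EOF */
        done += (size_t)got;
    }

    close(fd);                          /* nothing stays open between steps */
    return done == expert_size ? 0 : -1;
}
```

Because the buffer belongs to the caller and is reused on every step, nothing accumulates between tokens, which is consistent with the "no unbounded caches" point above.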