Claude Agent Skill · by Aradotso

Flash Moe Inference

Install Flash Moe Inference skill for Claude Code from aradotso/trending-skills.

Works with Paperclip

How Flash Moe Inference fits into a Paperclip company.

Flash Moe Inference drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

S
SaaS FactoryPaired

Pre-configured AI company — 18 agents, 18 skills, one-time purchase.

$27$59
Explore pack
Source file
SKILL.md348 lines
Expand
---name: flash-moe-inferencedescription: Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streamingtriggers:  - run a large language model on my laptop  - stream expert weights from SSD  - flash moe inference engine  - run Qwen3.5 397B on Mac  - mixture of experts on Apple Silicon  - metal inference engine for large models  - quantized MoE inference macOS  - run 209GB model on MacBook--- # Flash-MoE Inference Engine > Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection. Flash-MoE is a pure C/Objective-C/Metal inference engine that runs **Qwen3.5-397B-A17B** (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand — no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders. ## Requirements - **Hardware**: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free- **OS**: macOS 26+ (Darwin 25+)- **Tools**: Xcode Command Line Tools, Python 3.x (for weight extraction only)- **Model**: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace) ## Installation & Build ```bash# Clone the repogit clone https://github.com/danveloper/flash-moecd flash-moe/metal_infer # Build everythingmake # Verify build artifactsls infer chat main``` The Makefile compiles `infer.m`, `chat.m`, `main.m` with Metal shader compilation for `shaders.metal`. ## Weight Preparation ### Step 1: Extract non-expert weights ```bash# From the metal_infer/ directory# Point to your downloaded Qwen3.5-397B safetensors directorypython3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/ # Produces:#   model_weights.bin   (~5.5GB, mmap'd at runtime)#   model_weights.json  (tensor manifest)#   vocab.bin           (vocabulary)#   tokenizer.bin       (BPE tokenizer data)``` ### Step 2: Pack expert weights (4-bit, production) ```bash# From repo rootpython3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/ # Produces packed_experts/ directory (~209GB)# Each expert is a separate file: layer_XX_expert_YYYY.bin``` ### Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling) ```bash# Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)python3 metal_infer/repack_experts_2bit.py \    metal_infer/packed_experts/ \    metal_infer/packed_experts_2bit/``` ## Key Commands ### Basic inference ```bashcd metal_infer # 4-bit inference (production quality, tool calling works)./infer --prompt "Explain quantum computing" --tokens 100 # 2-bit inference (faster, breaks JSON/tool calling)./infer --prompt "Explain quantum computing" --tokens 100 --2bit # Per-layer timing breakdown./infer --prompt "Hello" --tokens 20 --timing``` ### Interactive chat with tool calling ```bash./chat# Opens TUI with full tool calling support# Uses 4-bit experts by default``` ### MoE-only benchmark (measures expert throughput) ```bash./main# Runs pure expert forward-pass benchmark# Reports tokens/sec without attention overhead``` ## Project Structure ```flash-moe/├── paper/│   └── flash_moe.pdf          # Full technical paper├── metal_infer/│   ├── infer.m                # Complete inference engine (~7000 lines)│   ├── shaders.metal          # Metal compute kernels (~1200 lines)│   ├── chat.m                 # Interactive chat TUI│   ├── tokenizer.h            # Single-header C BPE tokenizer (449 lines)│   ├── main.m                 # MoE-only benchmark│   ├── Makefile│   ├── extract_weights.py     # Safetensors → model_weights.bin│   ├── repack_experts_2bit.py # 4-bit → 2-bit requantization│   ├── train_predictor.py     # Expert routing prediction analysis│   ├── model_weights.bin      # Non-expert weights (mmap'd)│   ├── model_weights.json     # Tensor manifest│   ├── vocab.bin│   ├── tokenizer.bin│   ├── packed_experts/        # 4-bit expert files (209GB)│   └── packed_experts_2bit/   # 2-bit expert files (120GB, optional)├── repack_experts.py          # 4-bit expert packing from safetensors├── progress.py                # Results visualization└── results.tsv                # Experiment log``` ## Architecture Overview The model has **60 transformer layers**:- 45 GatedDeltaNet (linear attention) layers- 15 standard full attention layers- Each layer: 512 experts, K=4 activated per token + 1 shared expert- Hidden dimension: 4096 ### Per-layer pipeline (4.28ms average at 4-bit) ```CMD3(prev) → CMD1: attention projections + delta-net  [1.22ms GPU]           → CPU: flush results                       [0.01ms CPU]             → CMD2: o_proj + norm + routing + shared    [0.55ms GPU]           → CPU: softmax + topK routing               [0.003ms]           → I/O: parallel pread K=4 experts           [2.41ms SSD]           → CMD3: expert forward + combine + norm     [0.04ms encode, DEFERRED]``` ## Metal Shader Kernels The `shaders.metal` file contains hand-written kernels. Key kernels: ```metal// 4-bit dequantized matrix-vector multiply (FMA-optimized)// Key insight: fma(nibble, scale*x, bias*x) instead of (nibble*scale + bias)*x// Pre-compute scale*x and bias*x to fuse dequant+multiply in one FMA instruction kernel void matvec_4bit_fma(    device const uint8_t* weights [[buffer(0)]],    device const float* scales    [[buffer(1)]],    device const float* biases    [[buffer(2)]],    device const float* x         [[buffer(3)]],    device float* out             [[buffer(4)]],    uint tid [[thread_position_in_threadgroup]],    uint gid [[threadgroup_position_in_grid]]){    // ... tiled SIMD-reduced FMA kernel    // 12% faster than naive (nibble * scale + bias) * x} // Fused SwiGLU activationkernel void swiglu(device float* gate [[buffer(0)]],                   device const float* up [[buffer(1)]],                   uint gid [[thread_position_in_grid]]){    float g = gate[gid];    gate[gid] = (g / (1.0f + exp(-g))) * up[gid];} // RMS normalization (two-pass)kernel void rms_norm_pass1(...) // sum of squares reductionkernel void rms_norm_pass2(...) // apply normalization // GPU RoPE (fused with Q deinterleave and K normalization)kernel void rope_qk(...) // MoE combine + residual + sigmoid gate (fused)kernel void moe_combine_residual(...)``` ## SSD Expert Streaming Pattern The core innovation — loading only K=4 active experts per layer from SSD: ```objc// Parallel expert loading using GCD dispatch groups// From infer.m (conceptual pattern) dispatch_group_t group = dispatch_group_create();dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0); for (int k = 0; k < K_EXPERTS; k++) {    int expert_id = top_k_indices[k];    dispatch_group_async(group, ioQueue, ^{        // Each expert: ~6.75MB at 4-bit        char path[256];        snprintf(path, sizeof(path),                  "packed_experts/layer_%02d_expert_%04d.bin",                 layer, expert_id);                int fd = open(path, O_RDONLY);        // pread() — non-blocking, OS page cache handles LRU        pread(fd, expert_buffer[k], expert_size, 0);        close(fd);    });} dispatch_group_wait(group, DISPATCH_TIME_FOREVER);// GPU compute follows — serial pipeline is hardware-optimal on Apple Silicon``` **Why `pread()` not `mmap()`**: mmap incurs per-page fault overhead on cold data (~5x slower). Direct `pread()` with OS page cache achieves ~71% hit rate naturally. ## GatedDeltaNet Linear Attention (BLAS) The recurrence update uses Accelerate BLAS — 64% faster than scalar: ```objc// GatedDeltaNet state update per head (conceptual pattern)// state: 128×128 float matrix, 64 heads// From infer.m #import <Accelerate/Accelerate.h> for (int h = 0; h < 64; h++) {    float* S = state + h * 128 * 128;  // 128×128 state matrix    float* q = Q + h * 128;    float* k = K + h * 128;    float* v = V + h * 128;        // β·(k⊗v) outer product update    // cblas_sger: S += beta * (k ⊗ v)    cblas_sger(CblasRowMajor, 128, 128,               beta[h], k, 1, v, 1, S, 128);        // Decay: S = alpha * S    cblas_sscal(128 * 128, alpha[h], S, 1);        // Output: o = S @ q    cblas_sgemv(CblasRowMajor, CblasNoTrans,                128, 128, 1.0f, S, 128, q, 1, 0.0f,                output + h * 128, 1);}``` ## Performance Configuration ### 4-bit (production default)- **Quality**: Excellent — full tool calling, correct JSON- **Speed**: 4.36 tok/s- **Disk**: 209GB ### 2-bit (speed testing only)- **Quality**: Good — but breaks JSON/tool calling (`\name\` instead of `"name"`)- **Speed**: 5.74 tok/s (7.05 peak single token with warm cache)- **Disk**: 120GB- Uses `F_NOCACHE` flag to avoid page cache thrashing ## What NOT to Try (Learned from 58 Experiments) | Approach | Why it fails ||----------|-------------|| `mmap()` expert files | Per-page fault overhead: 5x slower than `pread()` || `dispatch_io` | `dispatch_data` management overhead: -70% || `F_RDADVISE` prefetch | SSD DMA + GPU share memory controller — concurrent access: -73% GPU speed || Custom Metal LRU cache | GPU memory pressure: -38% vs OS page cache || LZ4 expert compression | Decompress overhead > warm cache savings: -13% || Temporal expert prediction | 25% hit rate, wastes SSD bandwidth: -18% || Speculative early routing | Cache pollution: -38% || MTP speculative decoding | MoE I/O scales per-token (unlike dense models): break-even || Spin-poll GPU wait | CPU thermal throttle competes with GPU: -23% || Parallel SSD + GPU overlap | Unified memory controller arbitration: net negative | **Key principle**: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal. ## Troubleshooting ### Build fails```bash# Ensure Xcode CLI tools are installedxcode-select --install # Check Metal compiler is availablexcrun -sdk macosx metal --version``` ### Out of memoryThe engine is designed to use ~6GB active:- 5.5GB: `model_weights.bin` (mmap'd, read-only)- ~200MB: Metal scratch buffers- Remaining ~42GB: OS page cache for expert data If you see OOM, check for other processes consuming unified memory:```bashsudo memory_pressurevm_stat``` ### Slow performance```bash# Check SSD speed — needs ~17GB/s for target performance# Run with timing to identify bottleneck./infer --prompt "Hello" --tokens 5 --timing # Verify packed_experts/ is on internal SSD, not external drivediskutil info /``` ### Wrong expert directory```bash# Default paths expected by infer.m:# metal_infer/packed_experts/     (4-bit)# metal_infer/packed_experts_2bit/ (2-bit) # Ensure you're running from metal_infer/ directorycd metal_infer./infer --prompt "test"``` ### Tool calling brokenUse 4-bit, not 2-bit. The 2-bit quantization corrupts quote characters in JSON output, making tool calling unreliable. Always use the default 4-bit configuration for agentic workloads. ## Memory Safety The engine explicitly manages all allocations:- No unbounded caches- Expert data never accumulates in GPU memory- `model_weights.bin` is mmap'd read-only — kernel manages pages- Expert files are opened/read/closed per inference step