Claude Agent Skill · by Aradotso

Voicebox Voice Synthesis

Install Voicebox Voice Synthesis skill for Claude Code from aradotso/trending-skills.

Works with Paperclip

How Voicebox Voice Synthesis fits into a Paperclip company.

Voicebox Voice Synthesis drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

SaaS Factory (paired pack) — pre-configured AI company: 18 agents, 18 skills, one-time purchase. $27 (regular $59).
Source file: SKILL.md (634 lines)
---
name: voicebox-voice-synthesis
description: Expert skill for Voicebox — the open-source local voice cloning and TTS studio built with Tauri, React, and FastAPI
triggers:
  - "clone a voice with voicebox"
  - "generate speech locally with voicebox"
  - "set up voicebox voice synthesis"
  - "use voicebox API to synthesize speech"
  - "add TTS to my app with voicebox"
  - "configure voicebox TTS engine"
  - "apply voice effects in voicebox"
  - "voicebox stories editor multi-voice"
---

# Voicebox Voice Synthesis Studio

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

Voicebox is a local-first, open-source voice cloning and TTS studio — a self-hosted alternative to ElevenLabs. It runs entirely on your machine (macOS MLX/Metal, Windows/Linux CUDA, CPU fallback), exposes a REST API on `localhost:17493`, and ships with 5 TTS engines, 23 languages, post-processing effects, and a multi-track Stories editor.

---

## Installation

### Pre-built Binaries (Recommended)

| Platform | Link |
|---|---|
| macOS Apple Silicon | https://voicebox.sh/download/mac-arm |
| macOS Intel | https://voicebox.sh/download/mac-intel |
| Windows | https://voicebox.sh/download/windows |
| Docker | `docker compose up` |

Linux requires building from source: https://voicebox.sh/linux-install

### Build from Source

**Prerequisites:** [Bun](https://bun.sh), [Rust](https://rustup.rs), [Python 3.11+](https://python.org), Tauri prerequisites

```bash
git clone https://github.com/jamiepine/voicebox.git
cd voicebox

# Install just task runner
brew install just        # macOS
cargo install just       # any platform

# Set up Python venv + all dependencies
just setup

# Start backend + desktop app in dev mode
just dev
```

```bash
# List all available commands
just --list
```

---

## Architecture

| Layer | Technology |
|---|---|
| Desktop App | Tauri (Rust) |
| Frontend | React + TypeScript + Tailwind CSS |
| State | Zustand + React Query |
| Backend | FastAPI (Python) on port 17493 |
| TTS Engines | Qwen3-TTS, LuxTTS, Chatterbox, Chatterbox Turbo, TADA |
| Effects | Pedalboard (Spotify) |
| Transcription | Whisper / Whisper Turbo |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |

The Python FastAPI backend handles all ML inference. The Tauri Rust shell wraps the frontend and manages the backend process lifecycle. The API is accessible directly at `http://localhost:17493` even when using the desktop app.

---

## REST API Reference

Base URL: `http://localhost:17493`

Interactive docs: `http://localhost:17493/docs`

### Generate Speech

```bash
# Basic generation
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello world, this is a voice clone.",
    "profile_id": "abc123",
    "language": "en"
  }'

# With engine selection
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Speak slowly and with gravitas.",
    "profile_id": "abc123",
    "language": "en",
    "engine": "qwen3-tts"
  }'

# With paralinguistic tags (Chatterbox Turbo only)
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "That is absolutely hilarious! [laugh] I cannot believe it.",
    "profile_id": "abc123",
    "engine": "chatterbox-turbo",
    "language": "en"
  }'
```

### Voice Profiles

```bash
# List all profiles
curl http://localhost:17493/profiles

# Create a new profile
curl -X POST http://localhost:17493/profiles \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Narrator",
    "language": "en",
    "description": "Deep narrative voice"
  }'

# Upload audio sample to a profile
curl -X POST http://localhost:17493/profiles/{profile_id}/samples \
  -F "file=@/path/to/voice-sample.wav"

# Export a profile
curl http://localhost:17493/profiles/{profile_id}/export \
  --output narrator-profile.zip

# Import a profile
curl -X POST http://localhost:17493/profiles/import \
  -F "file=@narrator-profile.zip"
```

### Generation Queue & Status

```bash
# Get generation status (SSE stream)
curl -N http://localhost:17493/generate/{generation_id}/status

# List recent generations
curl http://localhost:17493/generations

# Retry a failed generation
curl -X POST http://localhost:17493/generations/{generation_id}/retry

# Download generated audio
curl http://localhost:17493/generations/{generation_id}/audio \
  --output output.wav
```

### Models

```bash
# List available models and download status
curl http://localhost:17493/models

# Unload a model from GPU memory (without deleting)
curl -X POST http://localhost:17493/models/{model_id}/unload
```

---

## TypeScript/JavaScript Integration

### Basic TTS Client

```typescript
const VOICEBOX_URL = process.env.VOICEBOX_API_URL ?? "http://localhost:17493";

interface GenerateRequest {
  text: string;
  profile_id: string;
  language?: string;
  engine?: "qwen3-tts" | "luxtts" | "chatterbox" | "chatterbox-turbo" | "tada";
}

interface GenerateResponse {
  generation_id: string;
  status: "queued" | "processing" | "complete" | "failed";
  audio_url?: string;
}

async function generateSpeech(req: GenerateRequest): Promise<GenerateResponse> {
  const response = await fetch(`${VOICEBOX_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });

  if (!response.ok) {
    throw new Error(`Voicebox API error: ${response.status} ${await response.text()}`);
  }

  return response.json();
}

// Usage
const result = await generateSpeech({
  text: "Welcome to our application.",
  profile_id: "abc123",
  language: "en",
  engine: "qwen3-tts",
});

console.log("Generation ID:", result.generation_id);
```

### Poll for Completion

```typescript
async function waitForGeneration(
  generationId: string,
  timeoutMs = 60_000
): Promise<string> {
  const start = Date.now();

  while (Date.now() - start < timeoutMs) {
    const res = await fetch(`${VOICEBOX_URL}/generations/${generationId}`);
    const data = await res.json();

    if (data.status === "complete") {
      return `${VOICEBOX_URL}/generations/${generationId}/audio`;
    }
    if (data.status === "failed") {
      throw new Error(`Generation failed: ${data.error}`);
    }

    await new Promise((r) => setTimeout(r, 1000));
  }

  throw new Error("Generation timed out");
}
```

### Stream Status with SSE

```typescript
function streamGenerationStatus(
  generationId: string,
  onStatus: (status: string) => void
): () => void {
  const eventSource = new EventSource(
    `${VOICEBOX_URL}/generate/${generationId}/status`
  );

  eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    onStatus(data.status);

    if (data.status === "complete" || data.status === "failed") {
      eventSource.close();
    }
  };

  eventSource.onerror = () => eventSource.close();

  // Return cleanup function
  return () => eventSource.close();
}

// Usage
const cleanup = streamGenerationStatus("gen_abc123", (status) => {
  console.log("Status update:", status);
});
```

### Download Audio as Blob

```typescript
async function downloadAudio(generationId: string): Promise<Blob> {
  const response = await fetch(
    `${VOICEBOX_URL}/generations/${generationId}/audio`
  );

  if (!response.ok) {
    throw new Error(`Failed to download audio: ${response.status}`);
  }

  return response.blob();
}

// Play in browser
async function playGeneratedAudio(generationId: string): Promise<void> {
  const blob = await downloadAudio(generationId);
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  audio.play();
  audio.onended = () => URL.revokeObjectURL(url);
}
```

---

## Python Integration

```python
import asyncio

import httpx

VOICEBOX_URL = "http://localhost:17493"

async def generate_speech(
    text: str,
    profile_id: str,
    language: str = "en",
    engine: str = "qwen3-tts"
) -> bytes:
    async with httpx.AsyncClient(timeout=120.0) as client:
        # Submit generation
        resp = await client.post(
            f"{VOICEBOX_URL}/generate",
            json={
                "text": text,
                "profile_id": profile_id,
                "language": language,
                "engine": engine,
            }
        )
        resp.raise_for_status()
        generation_id = resp.json()["generation_id"]

        # Poll until complete
        for _ in range(120):
            status_resp = await client.get(
                f"{VOICEBOX_URL}/generations/{generation_id}"
            )
            status_data = status_resp.json()

            if status_data["status"] == "complete":
                audio_resp = await client.get(
                    f"{VOICEBOX_URL}/generations/{generation_id}/audio"
                )
                return audio_resp.content

            if status_data["status"] == "failed":
                raise RuntimeError(f"Generation failed: {status_data.get('error')}")

            await asyncio.sleep(1.0)

        raise TimeoutError("Generation timed out after 120s")


# Usage
audio_bytes = asyncio.run(
    generate_speech(
        text="The quick brown fox jumps over the lazy dog.",
        profile_id="your-profile-id",
        language="en",
        engine="chatterbox",
    )
)

with open("output.wav", "wb") as f:
    f.write(audio_bytes)
```

---

## TTS Engine Selection Guide

| Engine | Best For | Languages | VRAM | Notes |
|---|---|---|---|---|
| `qwen3-tts` (0.6B/1.7B) | Quality + instructions | 10 | Medium | Supports delivery instructions in text |
| `luxtts` | Fast CPU generation | English only | ~1GB | 150x realtime on CPU, 48kHz |
| `chatterbox` | Multilingual coverage | 23 | Medium | Arabic, Hindi, Swahili, CJK + more |
| `chatterbox-turbo` | Expressive/emotion | English only | Low (350M) | Use `[laugh]`, `[sigh]`, `[gasp]` tags |
| `tada` (1B/3B) | Long-form coherence | 10 | High | 700s+ audio, HumeAI model |

### Delivery Instructions (Qwen3-TTS)

Embed natural language instructions directly in the text:

```typescript
await generateSpeech({
  text: "(whisper) I have a secret to tell you.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});

await generateSpeech({
  text: "(speak slowly and clearly) Step one: open the application.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});
```

### Paralinguistic Tags (Chatterbox Turbo)

```typescript
const tags = [
  "[laugh]", "[chuckle]", "[gasp]", "[cough]",
  "[sigh]", "[groan]", "[sniff]", "[shush]", "[clear throat]"
];

await generateSpeech({
  text: "Oh really? [gasp] I had no idea! [laugh] That's incredible.",
  profile_id: "abc123",
  engine: "chatterbox-turbo",
});
```

---

## Environment & Configuration

```bash
# Custom models directory (set before launching)
export VOICEBOX_MODELS_DIR=/path/to/models

# For AMD ROCm GPU (auto-configured, but can override)
export HSA_OVERRIDE_GFX_VERSION=11.0.0
```

Docker configuration (`docker-compose.yml` override):

```yaml
services:
  voicebox:
    environment:
      - VOICEBOX_MODELS_DIR=/models
    volumes:
      - /host/models:/models
    ports:
      - "17493:17493"
    # For NVIDIA GPU passthrough:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

---

## Common Patterns

### Voice Profile Creation Flow

```typescript
// 1. Create profile
const profile = await fetch(`${VOICEBOX_URL}/profiles`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ name: "My Voice", language: "en" }),
}).then((r) => r.json());

// 2. Upload audio sample (WAV/MP3, ideally 5–30 seconds clean speech)
const formData = new FormData();
formData.append("file", audioBlob, "sample.wav");

await fetch(`${VOICEBOX_URL}/profiles/${profile.id}/samples`, {
  method: "POST",
  body: formData,
});

// 3. Generate with the new profile
const gen = await generateSpeech({
  text: "Testing my cloned voice.",
  profile_id: profile.id,
});
```

### Batch Generation with Queue

```typescript
async function batchGenerate(
  items: Array<{ text: string; profileId: string }>,
  engine = "qwen3-tts"
): Promise<string[]> {
  // Submit all — Voicebox queues them serially to avoid GPU contention
  const submissions = await Promise.all(
    items.map((item) =>
      generateSpeech({ text: item.text, profile_id: item.profileId, engine })
    )
  );

  // Wait for all completions
  const audioUrls = await Promise.all(
    submissions.map((s) => waitForGeneration(s.generation_id))
  );

  return audioUrls;
}
```

### Long-Form Text (Auto-Chunking)

Voicebox auto-chunks at sentence boundaries — just send the full text:

```typescript
// Up to 50,000 characters supported per request
const longScript = `
  Chapter one. The morning fog rolled across the valley floor...
`;

await generateSpeech({
  text: longScript,
  profile_id: "narrator-profile-id",
  engine: "tada", // Best for long-form coherence
  language: "en",
});
```

---

## Troubleshooting

### API not responding

```bash
# Check if backend is running
curl http://localhost:17493/health

# Restart backend only (dev mode)
just backend

# Check logs
just logs
```

### GPU not detected

```bash
# Check detected backend
curl http://localhost:17493/system/info

# Force CPU mode (set before launch)
export VOICEBOX_FORCE_CPU=1
```

### Model download fails / slow

```bash
# Set custom models directory with more space
export VOICEBOX_MODELS_DIR=/path/with/space
just dev

# Cancel stuck download via API
curl -X DELETE http://localhost:17493/models/{model_id}/download
```

### Out of VRAM — unload models

```bash
# List loaded models
curl http://localhost:17493/models | jq '.[] | select(.loaded == true)'

# Unload specific model
curl -X POST http://localhost:17493/models/{model_id}/unload
```

### Audio quality issues

- Use 5–30 seconds of clean, noise-free speech for voice samples
- Multiple samples improve clone quality — upload 3–5 different sentences
- For multilingual cloning, use the `chatterbox` engine
- Ensure sample audio is 16kHz+ mono WAV for best results
- Use `luxtts` for the highest output quality (48kHz) in English

### Generation stuck in queue after crash

Voicebox auto-recovers stale generations on startup. If the issue persists:

```bash
curl -X POST http://localhost:17493/generations/{generation_id}/retry
```

---

## Frontend Integration (React Example)

```tsx
import { useState } from "react";

const VOICEBOX_URL = import.meta.env.VITE_VOICEBOX_URL ?? "http://localhost:17493";

export function VoiceGenerator({ profileId }: { profileId: string }) {
  const [text, setText] = useState("");
  const [audioUrl, setAudioUrl] = useState<string | null>(null);
  const [loading, setLoading] = useState(false);

  const handleGenerate = async () => {
    setLoading(true);
    try {
      const res = await fetch(`${VOICEBOX_URL}/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, profile_id: profileId, language: "en" }),
      });
      const { generation_id } = await res.json();

      // Poll for completion
      let done = false;
      while (!done) {
        await new Promise((r) => setTimeout(r, 1000));
        const statusRes = await fetch(`${VOICEBOX_URL}/generations/${generation_id}`);
        const { status } = await statusRes.json();
        if (status === "complete") {
          setAudioUrl(`${VOICEBOX_URL}/generations/${generation_id}/audio`);
          done = true;
        } else if (status === "failed") {
          throw new Error("Generation failed");
        }
      }
    } finally {
      setLoading(false);
    }
  };

  return (
    <div>
      <textarea value={text} onChange={(e) => setText(e.target.value)} />
      <button onClick={handleGenerate} disabled={loading}>
        {loading ? "Generating..." : "Generate Speech"}
      </button>
      {audioUrl && <audio controls src={audioUrl} />}
    </div>
  );
}
```
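### Scripts Beyond the 50,000-Character Limit

The long-form pattern earlier caps a single request at 50,000 characters. For scripts beyond that, one option is to split on the client and submit the pieces separately. This is a hypothetical sketch, not part of Voicebox: `splitScript` and its sentence regex are assumptions, and anything under the limit should simply be sent whole so Voicebox's own chunker can handle it.

```typescript
// Hypothetical helper: split a long script into chunks at sentence
// boundaries so each chunk stays under a character ceiling.
function splitScript(text: string, maxLen = 50_000): string[] {
  // Greedy sentence match: run of non-terminators, then terminators, then whitespace.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    // Start a new chunk when adding this sentence would exceed the ceiling.
    if (current.length + sentence.length > maxLen && current) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each returned chunk can then go through a separate `/generate` call (e.g. via the `batchGenerate` pattern above) and the resulting audio files concatenated afterwards.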