Claude Agent Skill · by Wshobson

Embedding Strategies

Choose the right embedding model and chunking strategy for your vector search setup. Covers Voyage AI models (Anthropic's recommendation for Claude apps), OpenAI embeddings, and open-source alternatives for local deployment.

Install
```
npx skills add https://github.com/wshobson/agents --skill embedding-strategies
```
Works with Paperclip

How Embedding Strategies fits into a Paperclip company.

Embedding Strategies drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

SaaS Factory (paired pack)

Pre-configured AI company: 18 agents, 18 skills, one-time purchase ($27, regularly $59).
Source file: SKILL.md (600 lines)
---
name: embedding-strategies
description: Select and optimize embedding models for semantic search and RAG applications. Use when choosing embedding models, implementing chunking strategies, or optimizing embedding quality for specific domains.
---

# Embedding Strategies

Guide to selecting and optimizing embedding models for vector search applications.

## When to Use This Skill

- Choosing embedding models for RAG
- Optimizing chunking strategies
- Fine-tuning embeddings for domains
- Comparing embedding model performance
- Reducing embedding dimensions
- Handling multilingual content

## Core Concepts

### 1. Embedding Model Comparison (2026)

| Model                      | Dimensions | Max Tokens | Best For                            |
| -------------------------- | ---------- | ---------- | ----------------------------------- |
| **voyage-3-large**         | 1024       | 32000      | Claude apps (Anthropic recommended) |
| **voyage-3**               | 1024       | 32000      | Claude apps, cost-effective         |
| **voyage-code-3**          | 1024       | 32000      | Code search                         |
| **voyage-finance-2**       | 1024       | 32000      | Financial documents                 |
| **voyage-law-2**           | 1024       | 32000      | Legal documents                     |
| **text-embedding-3-large** | 3072       | 8191       | OpenAI apps, high accuracy          |
| **text-embedding-3-small** | 1536       | 8191       | OpenAI apps, cost-effective         |
| **bge-large-en-v1.5**      | 1024       | 512        | Open source, local deployment       |
| **all-MiniLM-L6-v2**       | 384        | 256        | Fast, lightweight                   |
| **multilingual-e5-large**  | 1024       | 512        | Multi-language                      |

### 2. Embedding Pipeline

```
Document → Chunking         → Preprocessing       → Embedding Model → Vector
           [Overlap, Size]    [Clean, Normalize]    [API/Local]
```
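To make the flow concrete, here is a minimal sketch of the full path from document to vectors. It is illustrative rather than part of the skill: it assumes `VOYAGE_API_KEY` is set, reuses `chunk_by_tokens` from Template 4 below, and uses a deliberately trivial whitespace cleanup as the preprocessing step.

```python
# Minimal end-to-end sketch of the pipeline above (illustrative).
# Assumes VOYAGE_API_KEY is set; chunk_by_tokens is defined in Template 4.
from typing import List, Tuple
from langchain_voyageai import VoyageAIEmbeddings

def embed_document(text: str) -> List[Tuple[str, List[float]]]:
    """Document -> chunks -> cleaned chunks -> vectors."""
    # Chunking (Template 4)
    chunks = chunk_by_tokens(text, chunk_size=512, chunk_overlap=50)
    # Preprocessing: collapse whitespace (stand-in for real cleaning)
    chunks = [" ".join(c.split()) for c in chunks]
    # Embedding model (Template 1)
    model = VoyageAIEmbeddings(model="voyage-3")
    vectors = model.embed_documents(chunks)
    return list(zip(chunks, vectors))
```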
## Templates

### Template 1: Voyage AI Embeddings (Recommended for Claude)

```python
from langchain_voyageai import VoyageAIEmbeddings
from typing import List
import os

# Initialize Voyage AI embeddings (recommended by Anthropic for Claude)
embeddings = VoyageAIEmbeddings(
    model="voyage-3-large",
    voyage_api_key=os.environ.get("VOYAGE_API_KEY")
)

def get_embeddings(texts: List[str]) -> List[List[float]]:
    """Get embeddings from Voyage AI."""
    return embeddings.embed_documents(texts)

def get_query_embedding(query: str) -> List[float]:
    """Get single query embedding."""
    return embeddings.embed_query(query)

# Specialized models for domains
code_embeddings = VoyageAIEmbeddings(model="voyage-code-3")
finance_embeddings = VoyageAIEmbeddings(model="voyage-finance-2")
legal_embeddings = VoyageAIEmbeddings(model="voyage-law-2")
```

### Template 2: OpenAI Embeddings

```python
from openai import OpenAI
from typing import List, Optional

client = OpenAI()

def get_embeddings(
    texts: List[str],
    model: str = "text-embedding-3-small",
    dimensions: Optional[int] = None
) -> List[List[float]]:
    """Get embeddings from OpenAI with optional dimension reduction."""
    # Handle batching for large lists
    batch_size = 100
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        kwargs = {"input": batch, "model": model}
        if dimensions:
            # Matryoshka dimensionality reduction
            kwargs["dimensions"] = dimensions

        response = client.embeddings.create(**kwargs)
        embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(embeddings)

    return all_embeddings

def get_embedding(text: str, **kwargs) -> List[float]:
    """Get single embedding."""
    return get_embeddings([text], **kwargs)[0]

# Dimension reduction with Matryoshka embeddings
def get_reduced_embedding(text: str, dimensions: int = 512) -> List[float]:
    """Get embedding with reduced dimensions (Matryoshka)."""
    return get_embedding(
        text,
        model="text-embedding-3-small",
        dimensions=dimensions
    )
```

### Template 3: Local Embeddings with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
from typing import List
import numpy as np

class LocalEmbedder:
    """Local embedding with sentence-transformers."""

    def __init__(
        self,
        model_name: str = "BAAI/bge-large-en-v1.5",
        device: str = "cuda"  # use "cpu" if no GPU is available
    ):
        self.model = SentenceTransformer(model_name, device=device)
        self.model_name = model_name

    def embed(
        self,
        texts: List[str],
        normalize: bool = True,
        show_progress: bool = False
    ) -> np.ndarray:
        """Embed texts with optional normalization."""
        embeddings = self.model.encode(
            texts,
            normalize_embeddings=normalize,
            show_progress_bar=show_progress,
            convert_to_numpy=True
        )
        return embeddings

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a query with appropriate prefix for retrieval models."""
        # BGE and similar models benefit from a query prefix
        if "bge" in self.model_name.lower():
            query = f"Represent this sentence for searching relevant passages: {query}"
        return self.embed([query])[0]

    def embed_documents(self, documents: List[str]) -> np.ndarray:
        """Embed documents for indexing."""
        return self.embed(documents)

# E5 models expect instruction prefixes
class E5Embedder:
    def __init__(self, model_name: str = "intfloat/multilingual-e5-large"):
        self.model = SentenceTransformer(model_name)

    def embed_query(self, query: str) -> np.ndarray:
        """E5 requires 'query:' prefix for queries."""
        return self.model.encode(f"query: {query}")

    def embed_document(self, document: str) -> np.ndarray:
        """E5 requires 'passage:' prefix for documents."""
        return self.model.encode(f"passage: {document}")
```

### Template 4: Chunking Strategies

```python
from typing import List, Tuple
import re

def chunk_by_tokens(
    text: str,
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    tokenizer=None
) -> List[str]:
    """Chunk text by token count."""
    import tiktoken
    tokenizer = tokenizer or tiktoken.get_encoding("cl100k_base")

    tokens = tokenizer.encode(text)
    chunks = []

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = tokenizer.decode(chunk_tokens)
        chunks.append(chunk_text)
        start = end - chunk_overlap

    return chunks

def chunk_by_sentences(
    text: str,
    max_chunk_size: int = 1000
) -> List[str]:
    """Chunk text by sentences, respecting size limits."""
    import nltk  # requires the 'punkt' tokenizer data to be downloaded
    sentences = nltk.sent_tokenize(text)

    chunks = []
    current_chunk = []
    current_size = 0

    for sentence in sentences:
        sentence_size = len(sentence)

        if current_size + sentence_size > max_chunk_size and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_size = 0

        current_chunk.append(sentence)
        current_size += sentence_size

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

def chunk_by_semantic_sections(
    text: str,
    headers_pattern: str = r'^#{1,3}\s+.+$'
) -> List[Tuple[str, str]]:
    """Chunk markdown by headers, preserving hierarchy."""
    lines = text.split('\n')
    chunks = []
    current_header = ""
    current_content = []

    for line in lines:
        if re.match(headers_pattern, line, re.MULTILINE):
            if current_content:
                chunks.append((current_header, '\n'.join(current_content)))
            current_header = line
            current_content = []
        else:
            current_content.append(line)

    if current_content:
        chunks.append((current_header, '\n'.join(current_content)))

    return chunks

def recursive_character_splitter(
    text: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    separators: List[str] = None
) -> List[str]:
    """LangChain-style recursive splitter."""
    separators = separators or ["\n\n", "\n", ". ", " ", ""]

    def split_text(text: str, separators: List[str]) -> List[str]:
        if not text:
            return []

        separator = separators[0]
        remaining_separators = separators[1:]

        if separator == "":
            # Character-level split
            return [text[i:i + chunk_size]
                    for i in range(0, len(text), chunk_size - chunk_overlap)]

        splits = text.split(separator)
        chunks = []
        current_chunk = []
        current_length = 0

        for split in splits:
            split_length = len(split) + len(separator)

            if current_length + split_length > chunk_size and current_chunk:
                chunk_text = separator.join(current_chunk)

                # Recursively split if still too large
                if len(chunk_text) > chunk_size and remaining_separators:
                    chunks.extend(split_text(chunk_text, remaining_separators))
                else:
                    chunks.append(chunk_text)

                # Start new chunk with overlap
                overlap_splits = []
                overlap_length = 0
                for s in reversed(current_chunk):
                    if overlap_length + len(s) <= chunk_overlap:
                        overlap_splits.insert(0, s)
                        overlap_length += len(s)
                    else:
                        break
                current_chunk = overlap_splits
                current_length = overlap_length

            current_chunk.append(split)
            current_length += split_length

        if current_chunk:
            chunks.append(separator.join(current_chunk))

        return chunks

    return split_text(text, separators)
```

### Template 5: Domain-Specific Embedding Pipeline

```python
import re
from typing import List, Optional
from dataclasses import dataclass

from langchain_voyageai import VoyageAIEmbeddings  # as in Template 1

@dataclass
class EmbeddedDocument:
    id: str
    document_id: str
    chunk_index: int
    text: str
    embedding: List[float]
    metadata: dict

class DomainEmbeddingPipeline:
    """Pipeline for domain-specific embeddings."""

    def __init__(
        self,
        embedding_model: str = "voyage-3-large",
        chunk_size: int = 512,
        chunk_overlap: int = 50,
        preprocessing_fn=None
    ):
        self.embeddings = VoyageAIEmbeddings(model=embedding_model)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.preprocess = preprocessing_fn or self._default_preprocess

    def _default_preprocess(self, text: str) -> str:
        """Default preprocessing."""
        # Remove excessive whitespace
        text = re.sub(r'\s+', ' ', text)
        # Remove special characters (customize for your domain)
        text = re.sub(r'[^\w\s.,!?-]', '', text)
        return text.strip()

    async def process_documents(
        self,
        documents: List[dict],
        id_field: str = "id",
        content_field: str = "content",
        metadata_fields: Optional[List[str]] = None
    ) -> List[EmbeddedDocument]:
        """Process documents for vector storage."""
        processed = []

        for doc in documents:
            content = doc[content_field]
            doc_id = doc[id_field]

            # Preprocess
            cleaned = self.preprocess(content)

            # Chunk (chunk_by_tokens is defined in Template 4)
            chunks = chunk_by_tokens(
                cleaned,
                self.chunk_size,
                self.chunk_overlap
            )

            # Create embeddings
            embeddings = await self.embeddings.aembed_documents(chunks)

            # Create records
            for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
                metadata = {"document_id": doc_id, "chunk_index": i}

                # Add specified metadata fields
                if metadata_fields:
                    for field in metadata_fields:
                        if field in doc:
                            metadata[field] = doc[field]

                processed.append(EmbeddedDocument(
                    id=f"{doc_id}_chunk_{i}",
                    document_id=doc_id,
                    chunk_index=i,
                    text=chunk,
                    embedding=embedding,
                    metadata=metadata
                ))

        return processed

# Code-specific pipeline
class CodeEmbeddingPipeline:
    """Specialized pipeline for code embeddings."""

    def __init__(self):
        # Use Voyage's code-specific model
        self.embeddings = VoyageAIEmbeddings(model="voyage-code-3")

    def chunk_code(self, code: str, language: str) -> List[dict]:
        """Chunk code by functions/classes using tree-sitter."""
        try:
            import tree_sitter_languages
            parser = tree_sitter_languages.get_parser(language)
            tree = parser.parse(bytes(code, "utf8"))

            chunks = []
            # Extract function and class definitions
            self._extract_nodes(tree.root_node, code, chunks)
            return chunks
        except ImportError:
            # Fallback to simple chunking
            return [{"text": code, "type": "module"}]

    def _extract_nodes(self, node, source_code: str, chunks: list):
        """Recursively extract function/class definitions."""
        if node.type in ['function_definition', 'class_definition', 'method_definition']:
            # tree-sitter offsets are byte offsets, so slice bytes, not str
            source_bytes = source_code.encode("utf8")
            text = source_bytes[node.start_byte:node.end_byte].decode("utf8")
            chunks.append({
                "text": text,
                "type": node.type,
                "name": self._get_name(node),
                "start_line": node.start_point[0],
                "end_line": node.end_point[0]
            })
        for child in node.children:
            self._extract_nodes(child, source_code, chunks)

    def _get_name(self, node) -> str:
        """Extract name from function/class node."""
        for child in node.children:
            if child.type in ('identifier', 'name'):
                return child.text.decode('utf8')
        return "unknown"

    async def embed_with_context(
        self,
        chunk: str,
        context: str = ""
    ) -> List[float]:
        """Embed code with surrounding context."""
        if context:
            combined = f"Context: {context}\n\nCode:\n{chunk}"
        else:
            combined = chunk
        return await self.embeddings.aembed_query(combined)
```
### Template 6: Embedding Quality Evaluation

```python
import numpy as np
from typing import Callable, Dict, List

def evaluate_retrieval_quality(
    queries: List[str],
    relevant_docs: List[List[str]],   # relevant doc IDs per query
    retrieved_docs: List[List[str]],  # retrieved doc IDs per query
    k: int = 10
) -> Dict[str, float]:
    """Evaluate embedding quality for retrieval."""

    def precision_at_k(relevant: set, retrieved: List[str], k: int) -> float:
        retrieved_k = retrieved[:k]
        relevant_retrieved = len(set(retrieved_k) & relevant)
        return relevant_retrieved / k if k > 0 else 0

    def recall_at_k(relevant: set, retrieved: List[str], k: int) -> float:
        retrieved_k = retrieved[:k]
        relevant_retrieved = len(set(retrieved_k) & relevant)
        return relevant_retrieved / len(relevant) if relevant else 0

    def mrr(relevant: set, retrieved: List[str]) -> float:
        for i, doc in enumerate(retrieved):
            if doc in relevant:
                return 1 / (i + 1)
        return 0

    def ndcg_at_k(relevant: set, retrieved: List[str], k: int) -> float:
        dcg = sum(
            1 / np.log2(i + 2) if doc in relevant else 0
            for i, doc in enumerate(retrieved[:k])
        )
        ideal_dcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant), k)))
        return dcg / ideal_dcg if ideal_dcg > 0 else 0

    metrics = {
        f"precision@{k}": [],
        f"recall@{k}": [],
        "mrr": [],
        f"ndcg@{k}": []
    }

    for relevant, retrieved in zip(relevant_docs, retrieved_docs):
        relevant_set = set(relevant)
        metrics[f"precision@{k}"].append(precision_at_k(relevant_set, retrieved, k))
        metrics[f"recall@{k}"].append(recall_at_k(relevant_set, retrieved, k))
        metrics["mrr"].append(mrr(relevant_set, retrieved))
        metrics[f"ndcg@{k}"].append(ndcg_at_k(relevant_set, retrieved, k))

    return {name: float(np.mean(values)) for name, values in metrics.items()}

def compute_embedding_similarity(
    embeddings1: np.ndarray,
    embeddings2: np.ndarray,
    metric: str = "cosine"
) -> np.ndarray:
    """Compute similarity matrix between embedding sets."""
    if metric == "cosine":
        # Normalize and compute dot product
        norm1 = embeddings1 / np.linalg.norm(embeddings1, axis=1, keepdims=True)
        norm2 = embeddings2 / np.linalg.norm(embeddings2, axis=1, keepdims=True)
        return norm1 @ norm2.T
    elif metric == "euclidean":
        from scipy.spatial.distance import cdist
        return -cdist(embeddings1, embeddings2, metric='euclidean')
    elif metric == "dot":
        return embeddings1 @ embeddings2.T
    else:
        raise ValueError(f"Unknown metric: {metric}")

def compare_embedding_models(
    texts: List[str],
    models: Dict[str, Callable],
    queries: List[str],
    relevant_indices: List[List[int]],
    k: int = 5
) -> Dict[str, Dict[str, float]]:
    """Compare multiple embedding models on retrieval quality."""
    results = {}

    for model_name, embed_fn in models.items():
        # Embed all texts
        doc_embeddings = np.array(embed_fn(texts))

        retrieved_per_query = []
        for query in queries:
            query_embedding = np.array(embed_fn([query])[0])
            # Compute similarities
            similarities = compute_embedding_similarity(
                query_embedding.reshape(1, -1),
                doc_embeddings,
                metric="cosine"
            )[0]
            # Get top-k indices
            top_k_indices = np.argsort(similarities)[::-1][:k]
            retrieved_per_query.append([str(i) for i in top_k_indices])

        # Convert relevant indices to string IDs
        relevant_docs = [[str(i) for i in indices] for indices in relevant_indices]

        results[model_name] = evaluate_retrieval_quality(
            queries, relevant_docs, retrieved_per_query, k
        )

    return results
```
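An illustrative comparison run with toy data: the corpus and query below are made up, and `get_embeddings` is the Template 2 helper; any embedding function with the same list-in, list-out shape can be wrapped the same way.

```python
# Illustrative: compare one model on a toy corpus using the helpers above.
texts = ["reset your password", "billing cycle explained", "enable two-factor auth"]
queries = ["how do I change my password?"]
relevant_indices = [[0]]  # query 0 is answered by texts[0]

models = {
    # get_embeddings is defined in Template 2
    "openai-small": lambda ts: get_embeddings(ts, model="text-embedding-3-small"),
}

scores = compare_embedding_models(texts, models, queries, relevant_indices, k=2)
print(scores)  # e.g. {'openai-small': {'precision@2': 0.5, 'recall@2': 1.0, ...}}
```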
## Best Practices

### Do's

- **Match model to use case**: Code vs prose vs multilingual
- **Chunk thoughtfully**: Preserve semantic boundaries
- **Normalize embeddings**: For cosine similarity search
- **Batch requests**: More efficient than one-by-one
- **Cache embeddings**: Avoid recomputing for static content (see the caching sketch after these lists)
- **Use Voyage AI for Claude apps**: Recommended by Anthropic
### Don'ts

- **Don't ignore token limits**: Truncation loses information
- **Don't mix embedding models**: Incompatible vector spaces
- **Don't skip preprocessing**: Garbage in, garbage out
- **Don't over-chunk**: Lose important context
- **Don't forget metadata**: Essential for filtering and debugging
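A minimal sketch of the caching pattern from the Do's list. The `CachedEmbedder` class is illustrative, not part of the skill: it keys an in-memory dict by content hash and only sends cache misses to the underlying model; swap the dict for Redis or SQLite for persistence.

```python
# Illustrative embedding cache keyed by content hash (see Do's above).
import hashlib
from typing import Callable, Dict, List

class CachedEmbedder:
    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self.embed_fn = embed_fn          # e.g. get_embeddings from Template 2
        self.cache: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Only send cache misses to the model, then return vectors in order
        misses = [t for t in texts if self._key(t) not in self.cache]
        if misses:
            for text, vector in zip(misses, self.embed_fn(misses)):
                self.cache[self._key(text)] = vector
        return [self.cache[self._key(t)] for t in texts]
```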