Name: Similarity Search Patterns
Author: Wshobson
Install
$ npx skills add https://github.com/wshobson/agents --skill similarity-search-patterns
Source file: SKILL.md (markdown)
---
name: similarity-search-patterns
description: Implement efficient similarity search with vector databases. Use when building semantic search, implementing nearest neighbor queries, or optimizing retrieval performance.
---

# Similarity Search Patterns

Patterns for implementing efficient similarity search in production systems.

## When to Use This Skill

- Building semantic search systems
- Implementing RAG retrieval
- Creating recommendation engines
- Optimizing search latency
- Scaling to millions of vectors
- Combining semantic and keyword search

## Core Concepts

### 1. Distance Metrics

| Metric             | Formula            | Best For              |
| ------------------ | ------------------ | --------------------- |
| **Cosine**         | 1 - (A·B)/(‖A‖‖B‖) | Normalized embeddings |
| **Euclidean (L2)** | √Σ(a-b)²           | Raw embeddings        |
| **Dot Product**    | A·B                | Magnitude matters     |
| **Manhattan (L1)** | Σ\|a-b\|           | Sparse vectors        |

### 2. Index Types

```
┌─────────────────────────────────────────────────┐
│                   Index Types                   │
├─────────────┬───────────────┬───────────────────┤
│    Flat     │     HNSW      │    IVF+PQ         │
│ (Exact)     │ (Graph-based) │ (Quantized)       │
├─────────────┼───────────────┼───────────────────┤
│ O(n) search │ O(log n)      │ O(√n)             │
│ 100% recall │ ~95-99%       │ ~90-95%           │
│ Small data  │ Medium-Large  │ Very Large        │
└─────────────┴───────────────┴───────────────────┘
```

## Templates

### Template 1: Pinecone Implementation

```python
from pinecone import Pinecone, ServerlessSpec
from typing import List, Dict, Optional

class PineconeVectorStore:
    def __init__(
        self,
        api_key: str,
        index_name: str,
        dimension: int = 1536,
        metric: str = "cosine"
    ):
        self.pc = Pinecone(api_key=api_key)
        self._cross_encoder = None  # loaded lazily on first rerank

        # Create index if not exists
        if index_name not in self.pc.list_indexes().names():
            self.pc.create_index(
                name=index_name,
                dimension=dimension,
                metric=metric,
                spec=ServerlessSpec(cloud="aws", region="us-east-1")
            )

        self.index = self.pc.Index(index_name)

    def upsert(
        self,
        vectors: List[Dict],
        namespace: str = ""
    ) -> int:
        """
        Upsert vectors.
        vectors: [{"id": str, "values": List[float], "metadata": dict}]
        """
        # Batch upsert to keep each request small
        batch_size = 100
        total = 0

        for i in range(0, len(vectors), batch_size):
            batch = vectors[i:i + batch_size]
            self.index.upsert(vectors=batch, namespace=namespace)
            total += len(batch)

        return total

    def search(
        self,
        query_vector: List[float],
        top_k: int = 10,
        namespace: str = "",
        filter: Optional[Dict] = None,
        include_metadata: bool = True
    ) -> List[Dict]:
        """Search for similar vectors."""
        results = self.index.query(
            vector=query_vector,
            top_k=top_k,
            namespace=namespace,
            filter=filter,
            include_metadata=include_metadata
        )

        return [
            {
                "id": match.id,
                "score": match.score,
                "metadata": match.metadata
            }
            for match in results.matches
        ]

    def search_with_rerank(
        self,
        query: str,
        query_vector: List[float],
        top_k: int = 10,
        rerank_top_n: int = 50,
        namespace: str = ""
    ) -> List[Dict]:
        """Search and rerank results."""
        # Over-fetch for reranking
        initial_results = self.search(
            query_vector,
            top_k=rerank_top_n,
            namespace=namespace
        )

        # Rerank with cross-encoder
        reranked = self._rerank(query, initial_results)

        return reranked[:top_k]

    def _rerank(self, query: str, results: List[Dict]) -> List[Dict]:
        """Rerank results using a cross-encoder.

        Assumes every result's metadata carries the chunk text under "text".
        """
        if self._cross_encoder is None:
            # Load once and reuse; reloading the model per query is slow
            from sentence_transformers import CrossEncoder
            self._cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

        pairs = [(query, r["metadata"]["text"]) for r in results]
        scores = self._cross_encoder.predict(pairs)

        for result, score in zip(results, scores):
            result["rerank_score"] = float(score)

        return sorted(results, key=lambda x: x["rerank_score"], reverse=True)

    def delete(self, ids: List[str], namespace: str = ""):
        """Delete vectors by ID."""
        self.index.delete(ids=ids, namespace=namespace)

    def delete_by_filter(self, filter: Dict, namespace: str = ""):
        """Delete vectors matching filter."""
        self.index.delete(filter=filter, namespace=namespace)
```

### Template 2: Qdrant Implementation

```python
from qdrant_client import QdrantClient
from qdrant_client.http import models
from typing import List, Dict, Optional

class QdrantVectorStore:
    def __init__(
        self,
        url: str = "localhost",
        port: int = 6333,
        collection_name: str = "documents",
        vector_size: int = 1536
    ):
        self.client = QdrantClient(url=url, port=port)
        self.collection_name = collection_name

        # Create collection if not exists
        collections = self.client.get_collections().collections
        if collection_name not in [c.name for c in collections]:
            self.client.create_collection(
                collection_name=collection_name,
                vectors_config=models.VectorParams(
                    size=vector_size,
                    distance=models.Distance.COSINE
                ),
                # Optional: enable quantization for memory efficiency
                quantization_config=models.ScalarQuantization(
                    scalar=models.ScalarQuantizationConfig(
                        type=models.ScalarType.INT8,
                        quantile=0.99,
                        always_ram=True
                    )
                )
            )

    def upsert(self, points: List[Dict]) -> int:
        """
        Upsert points.
        points: [{"id": str/int, "vector": List[float], "payload": dict}]
        Qdrant accepts unsigned integers or UUID strings as point IDs.
        """
        qdrant_points = [
            models.PointStruct(
                id=p["id"],
                vector=p["vector"],
                payload=p.get("payload", {})
            )
            for p in points
        ]

        self.client.upsert(
            collection_name=self.collection_name,
            points=qdrant_points
        )
        return len(points)

    def search(
        self,
        query_vector: List[float],
        limit: int = 10,
        filter: Optional[models.Filter] = None,
        score_threshold: Optional[float] = None
    ) -> List[Dict]:
        """Search for similar vectors."""
        # Note: recent qdrant-client releases deprecate `search`
        # in favor of `query_points`; check the version you have installed.
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,
            limit=limit,
            query_filter=filter,
            score_threshold=score_threshold
        )

        return [
            {
                "id": r.id,
                "score": r.score,
                "payload": r.payload
            }
            for r in results
        ]

    def search_with_filter(
        self,
        query_vector: List[float],
        must_conditions: Optional[List[Dict]] = None,
        should_conditions: Optional[List[Dict]] = None,
        must_not_conditions: Optional[List[Dict]] = None,
        limit: int = 10
    ) -> List[Dict]:
        """Search with complex filters.

        Each condition is {"key": str, "value": Any} and becomes an exact match.
        """
        def to_conditions(raw: Optional[List[Dict]]) -> List[models.FieldCondition]:
            return [
                models.FieldCondition(
                    key=c["key"],
                    match=models.MatchValue(value=c["value"])
                )
                for c in raw or []
            ]

        must = to_conditions(must_conditions)
        should = to_conditions(should_conditions)
        must_not = to_conditions(must_not_conditions)

        # Build a filter only when at least one clause was supplied
        query_filter = None
        if must or should or must_not:
            query_filter = models.Filter(
                must=must or None,
                should=should or None,
                must_not=must_not or None
            )

        return self.search(query_vector, limit=limit, filter=query_filter)

    def search_with_sparse(
        self,
        dense_vector: List[float],
        sparse_vector: Dict[int, float],
        limit: int = 10,
        dense_weight: float = 0.7
    ) -> List[Dict]:
        """Placeholder for hybrid search with dense and sparse vectors.

        As written, only the named "dense" vector is queried;
        sparse_vector and dense_weight are accepted but not used yet.
        """
        # Requires collection with named vectors
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=models.NamedVector(
                name="dense",
                vector=dense_vector
            ),
            limit=limit
        )
        return [{"id": r.id, "score": r.score, "payload": r.payload} for r in results]
```

### Template 3: pgvector with PostgreSQL

```python
import json
import asyncpg
from typing import List, Dict, Optional


def _to_pgvector(embedding: List[float]) -> str:
    """Serialize an embedding to pgvector's text form, e.g. '[0.1,0.2]'.

    Used because no asyncpg codec is registered for the vector type here;
    pgvector's Python package also offers a register_vector helper for asyncpg.
    """
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


class PgVectorStore:
    def __init__(self, connection_string: str):
        self.connection_string = connection_string

    async def init(self):
        """Initialize connection pool and extension."""
        self.pool = await asyncpg.create_pool(self.connection_string)

        async with self.pool.acquire() as conn:
            # Enable extension
            await conn.execute("CREATE EXTENSION IF NOT EXISTS vector")

            # Create table
            await conn.execute("""
                CREATE TABLE IF NOT EXISTS documents (
                    id TEXT PRIMARY KEY,
                    content TEXT,
                    metadata JSONB,
                    embedding vector(1536)
                )
            """)

            # Create index (HNSW for better performance)
            await conn.execute("""
                CREATE INDEX IF NOT EXISTS documents_embedding_idx
                ON documents
                USING hnsw (embedding vector_cosine_ops)
                WITH (m = 16, ef_construction = 64)
            """)

    async def upsert(self, documents: List[Dict]):
        """Upsert documents with embeddings."""
        async with self.pool.acquire() as conn:
            await conn.executemany(
                """
                INSERT INTO documents (id, content, metadata, embedding)
                VALUES ($1, $2, $3, $4)
                ON CONFLICT (id) DO UPDATE SET
                    content = EXCLUDED.content,
                    metadata = EXCLUDED.metadata,
                    embedding = EXCLUDED.embedding
                """,
                [
                    (
                        doc["id"],
                        doc["content"],
                        # asyncpg expects JSONB parameters as JSON text by default
                        json.dumps(doc.get("metadata", {})),
                        _to_pgvector(doc["embedding"])
                    )
                    for doc in documents
                ]
            )

    async def search(
        self,
        query_embedding: List[float],
        limit: int = 10,
        filter_metadata: Optional[Dict] = None
    ) -> List[Dict]:
        """Search for similar documents."""
        query = """
            SELECT id, content, metadata,
                   1 - (embedding <=> $1::vector) as similarity
            FROM documents
        """

        params: list = [_to_pgvector(query_embedding)]

        if filter_metadata:
            conditions = []
            for key, value in filter_metadata.items():
                # Bind both key and value as parameters so neither is
                # interpolated into the SQL string (avoids SQL injection).
                # ->> yields text, so values are compared as strings.
                params.extend([key, str(value)])
                conditions.append(
                    f"metadata->>${len(params) - 1}::text = ${len(params)}::text"
                )
            query += " WHERE " + " AND ".join(conditions)

        query += f" ORDER BY embedding <=> $1::vector LIMIT ${len(params) + 1}"
        params.append(limit)

        async with self.pool.acquire() as conn:
            rows = await conn.fetch(query, *params)

        return [
            {
                "id": row["id"],
                "content": row["content"],
                # JSONB comes back as a JSON string unless a codec is registered
                "metadata": row["metadata"],
                "score": row["similarity"]
            }
            for row in rows
        ]

    async def hybrid_search(
        self,
        query_embedding: List[float],
        query_text: str,
        limit: int = 10,
        vector_weight: float = 0.5
    ) -> List[Dict]:
        """Hybrid search combining vector and full-text.

        The two scores are on different scales (cosine similarity vs ts_rank),
        so treat vector_weight as a tuning knob rather than a true proportion.
        """
        async with self.pool.acquire() as conn:
            rows = await conn.fetch(
                """
                WITH vector_results AS (
                    SELECT id, content, metadata,
                           1 - (embedding <=> $1::vector) as vector_score
                    FROM documents
                    ORDER BY embedding <=> $1::vector
                    LIMIT $3::int * 2
                ),
                text_results AS (
                    SELECT id, content, metadata,
                           ts_rank(to_tsvector('english', content),
                                   plainto_tsquery('english', $2)) as text_score
                    FROM documents
                    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
                    ORDER BY text_score DESC
                    LIMIT $3::int * 2
                )
                SELECT
                    COALESCE(v.id, t.id) as id,
                    COALESCE(v.content, t.content) as content,
                    COALESCE(v.metadata, t.metadata) as metadata,
                    COALESCE(v.vector_score, 0) * $4::float8 +
                    COALESCE(t.text_score, 0) * (1 - $4::float8) as combined_score
                FROM vector_results v
                FULL OUTER JOIN text_results t ON v.id = t.id
                ORDER BY combined_score DESC
                LIMIT $3::int
                """,
                _to_pgvector(query_embedding), query_text, limit, vector_weight
            )

        return [dict(row) for row in rows]
```

### Template 4: Weaviate Implementation

```python
# Written against the v3 Python client (weaviate.Client);
# the v4 client replaced this interface with a collections-based API.
import weaviate
from weaviate.util import generate_uuid5
from typing import List, Dict, Optional

class WeaviateVectorStore:
    def __init__(
        self,
        url: str = "http://localhost:8080",
        class_name: str = "Document"
    ):
        self.client = weaviate.Client(url=url)
        self.class_name = class_name
        self._ensure_schema()

    def _ensure_schema(self):
        """Create schema if not exists."""
        schema = {
            "class": self.class_name,
            "vectorizer": "none",  # We provide vectors
            "properties": [
                {"name": "content", "dataType": ["text"]},
                {"name": "source", "dataType": ["string"]},
                {"name": "chunk_id", "dataType": ["int"]}
            ]
        }

        if not self.client.schema.exists(self.class_name):
            self.client.schema.create_class(schema)

    def upsert(self, documents: List[Dict]):
        """Batch upsert documents."""
        with self.client.batch as batch:
            batch.batch_size = 100

            for doc in documents:
                batch.add_data_object(
                    data_object={
                        "content": doc["content"],
                        "source": doc.get("source", ""),
                        "chunk_id": doc.get("chunk_id", 0)
                    },
                    class_name=self.class_name,
                    # Deterministic UUID, so re-adding the same id overwrites
                    uuid=generate_uuid5(doc["id"]),
                    vector=doc["embedding"]
                )

    def search(
        self,
        query_vector: List[float],
        limit: int = 10,
        where_filter: Optional[Dict] = None
    ) -> List[Dict]:
        """Vector search."""
        query = (
            self.client.query
            .get(self.class_name, ["content", "source", "chunk_id"])
            .with_near_vector({"vector": query_vector})
            .with_limit(limit)
            .with_additional(["distance", "id"])
        )

        if where_filter:
            query = query.with_where(where_filter)

        results = query.do()

        return [
            {
                "id": item["_additional"]["id"],
                "content": item["content"],
                "source": item["source"],
                # Convert distance to a similarity-style score
                "score": 1 - item["_additional"]["distance"]
            }
            for item in results["data"]["Get"][self.class_name]
        ]

    def hybrid_search(
        self,
        query: str,
        query_vector: List[float],
        limit: int = 10,
        alpha: float = 0.5  # 0 = keyword, 1 = vector
    ) -> List[Dict]:
        """Hybrid search combining BM25 and vector."""
        results = (
            self.client.query
            .get(self.class_name, ["content", "source"])
            .with_hybrid(query=query, vector=query_vector, alpha=alpha)
            .with_limit(limit)
            .with_additional(["score"])
            .do()
        )

        return [
            {
                "content": item["content"],
                "source": item["source"],
                "score": item["_additional"]["score"]
            }
            for item in results["data"]["Get"][self.class_name]
        ]
```

## Best Practices

### Do's

- **Use appropriate index** - HNSW for most cases
- **Tune parameters** - ef_search, nprobe for recall/speed
- **Implement hybrid search** - Combine with keyword search
- **Monitor recall** - Measure search quality
- **Pre-filter when possible** - Reduce search space

### Don'ts

- **Don't skip evaluation** - Measure before optimizing
- **Don't over-index** - Start with flat, scale up
- **Don't ignore latency** - P99 matters for UX
- **Don't forget costs** - Vector storage adds up
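The "Monitor recall" and "Start with flat" practices can be combined into a simple evaluation loop: run an exact (flat) search as ground truth, then compare an approximate index's results against it. The following is a minimal sketch in pure Python, not part of the original templates. The helper names `cosine_distance`, `flat_search`, and `recall_at_k` are hypothetical, the tiny corpus is made-up illustration data, and the `approx` list stands in for whatever IDs an ANN index would return.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, 1 - (A·B)/(‖A‖‖B‖)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: cosine is undefined for zero vectors; treat as distance 1
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def flat_search(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Exact O(n) search: score every vector and keep the k closest IDs."""
    ranked = sorted(corpus, key=lambda doc_id: cosine_distance(query, corpus[doc_id]))
    return ranked[:k]


def recall_at_k(approx_ids: Sequence[str], exact_ids: Sequence[str]) -> float:
    """Fraction of the exact top-k that the approximate result also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Made-up 2-D corpus for illustration only
corpus = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
query = [1.0, 0.0]

exact = flat_search(query, corpus, k=2)   # ground truth from the flat scan
approx = ["a", "c"]                       # stand-in for an ANN index's answer
print(recall_at_k(approx, exact))         # 0.5
```

In practice the same comparison would be averaged over a sample of real queries, and rerun after changing parameters such as ef_search or nprobe to see how recall trades against latency.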