Install
Terminal · npx
$ npx skills add https://github.com/vercel-labs/agent-skills --skill vercel-react-best-practices
Works with Paperclip
How Data Scraper Agent fits into a Paperclip company.

Data Scraper Agent drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
Source file
SKILL.md · 764 lines · markdown
---
name: data-scraper-agent
description: Build a fully automated AI-powered data collection agent for any public source (job boards, prices, news, GitHub, sports, anything). Scrapes on a schedule, enriches data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs 100% free on GitHub Actions. Use when the user wants to monitor, collect, or track any public data automatically.
origin: community
---

# Data Scraper Agent

Build a production-ready, AI-powered data collection agent for any public data source.
Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.

**Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase**

## When to Activate

- User wants to scrape or monitor any public website or API
- User says "build a bot that checks...", "monitor X for me", "collect data from..."
- User wants to track jobs, prices, news, repos, sports scores, events, listings
- User asks how to automate data collection without paying for hosting
- User wants an agent that gets smarter over time based on their decisions

## Core Concepts

### The Three Layers

Every data scraper agent has three layers:

```
COLLECT   →  ENRICH        →  STORE
   │            │               │
Scraper      AI (LLM)        Database
runs on      scores,         Notion /
schedule     summarises      Sheets /
             & classifies    Supabase
```

### Free Stack

| Layer | Tool | Why |
|---|---|---|
| **Scraping** | `requests` + `BeautifulSoup` | No cost, covers 80% of public sites |
| **JS-rendered sites** | `playwright` (free) | When HTML scraping fails |
| **AI enrichment** | Gemini Flash via REST API | Free tier: 500 req/day, 1M tokens/day |
| **Storage** | Notion API | Free tier, great UI for review |
| **Schedule** | GitHub Actions cron | Free for public repos |
| **Learning** | JSON feedback file in repo | Zero infra, persists in git |

### AI Model Fallback Chain

Build agents to auto-fallback across Gemini models on quota exhaustion:

```
gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)
```

### Batch API Calls for Efficiency

Never call the LLM once per item. Always batch:

```python
# `items` is the scraped list; `call_ai` stands in for the Gemini client.

# BAD: 33 API calls for 33 items
for item in items:
    result = call_ai(item)  # 33 calls → hits rate limit

# GOOD: 7 API calls for 33 items (batch size 5)
for i in range(0, len(items), 5):
    results = call_ai(items[i:i + 5])  # 7 calls → stays within free tier
```

---

## Workflow

### Step 1: Understand the Goal

Ask the user:

1. **What to collect:** "What data source? URL / API / RSS / public endpoint?"
2. **What to extract:** "What fields matter? Title, price, URL, date, score?"
3. **How to store:** "Where should results go? Notion, Google Sheets, Supabase, or local file?"
4. **How to enrich:** "Do you want AI to score, summarise, classify, or match each item?"
5. **Frequency:** "How often should it run? Every hour, daily, weekly?"

Common examples to prompt:
- Job boards → score relevance to resume
- Product prices → alert on drops
- GitHub repos → summarise new releases
- News feeds → classify by topic + sentiment
- Sports results → extract stats to tracker
- Events calendar → filter by interest

---

### Step 2: Design the Agent Architecture

Generate this directory structure for the user:

```
my-agent/
├── config.yaml              # User customises this (keywords, filters, preferences)
├── profile/
│   └── context.md           # User context the AI uses (resume, interests, criteria)
├── scraper/
│   ├── __init__.py
│   ├── main.py              # Orchestrator: scrape → enrich → store
│   ├── filters.py           # Rule-based pre-filter (fast, before AI)
│   └── sources/
│       ├── __init__.py
│       └── source_name.py   # One file per data source
├── ai/
│   ├── __init__.py
│   ├── client.py            # Gemini REST client with model fallback
│   ├── pipeline.py          # Batch AI analysis
│   ├── jd_fetcher.py        # Fetch full content from URLs (optional)
│   └── memory.py            # Learn from user feedback
├── storage/
│   ├── __init__.py
│   └── notion_sync.py       # Or sheets_sync.py / supabase_sync.py
├── data/
│   └── feedback.json        # User decision history (auto-updated)
├── .env.example
├── setup.py                 # One-time DB/schema creation
├── enrich_existing.py       # Backfill AI scores on old rows
├── requirements.txt
└── .github/
    └── workflows/
        └── scraper.yml      # GitHub Actions schedule
```

---

### Step 3: Build the Scraper Source

Template for any data source:

```python
# scraper/sources/my_source.py
"""
[Source Name]: scrapes [what] from [where].
Method: [REST API / HTML scraping / RSS feed]
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
from scraper.filters import is_relevant

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",
}


def fetch() -> list[dict]:
    """
    Returns a list of items with consistent schema.
    Each item must have at minimum: name, url, date_found.
    """
    results = []

    # ---- REST API source ----
    resp = requests.get("https://api.example.com/items", headers=HEADERS, timeout=15)
    if resp.status_code == 200:
        for item in resp.json().get("results", []):
            if not is_relevant(item.get("title", "")):
                continue
            results.append(_normalise(item))

    return results


def _normalise(raw: dict) -> dict:
    """Convert raw API/HTML data to the standard schema."""
    return {
        "name": raw.get("title", ""),
        "url": raw.get("link", ""),
        "source": "MySource",
        "date_found": datetime.now(timezone.utc).date().isoformat(),
        # add domain-specific fields here
    }
```

**HTML scraping pattern:**
```python
soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select("[class*='listing']"):
    heading, anchor = card.select_one("h2, h3"), card.select_one("a[href]")
    if heading is None or anchor is None:
        continue  # skip cards that lack a title or a link
    title = heading.get_text(strip=True)
    link = anchor["href"]
    if not link.startswith("http"):
        link = f"https://example.com{link}"
```

**RSS feed pattern:**
```python
import xml.etree.ElementTree as ET
root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
    title = item.findtext("title", "")
    link = item.findtext("link", "")
```

---

### Step 4: Build the Gemini AI Client

```python
# ai/client.py
import os, json, time, requests

_last_call = 0.0

MODEL_FALLBACK = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
]


def generate(prompt: str, model: str = "", rate_limit: float = 7.0) -> dict:
    """Call Gemini with auto-fallback on 429/404. Returns parsed JSON or {}."""
    global _last_call

    api_key = os.environ.get("GEMINI_API_KEY", "")
    if not api_key:
        return {}

    # Space calls at least `rate_limit` seconds apart.
    elapsed = time.time() - _last_call
    if elapsed < rate_limit:
        time.sleep(rate_limit - elapsed)

    # Try the requested model first, then the rest of the fallback chain.
    models = ([model] + [m for m in MODEL_FALLBACK if m != model]) if model else MODEL_FALLBACK
    _last_call = time.time()

    for m in models:
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{m}:generateContent?key={api_key}"
        payload = {
            "contents": [{"parts": [{"text": prompt}]}],
            "generationConfig": {
                "responseMimeType": "application/json",
                "temperature": 0.3,
                "maxOutputTokens": 2048,
            },
        }
        try:
            resp = requests.post(url, json=payload, timeout=30)
            if resp.status_code == 200:
                return _parse(resp)
            if resp.status_code in (429, 404):
                # Quota exhausted or model unavailable: try the next model.
                time.sleep(1)
                continue
            return {}
        except requests.RequestException:
            return {}

    return {}


def _parse(resp) -> dict:
    try:
        text = (
            resp.json()
            .get("candidates", [{}])[0]
            .get("content", {})
            .get("parts", [{}])[0]
            .get("text", "")
            .strip()
        )
        # Strip a Markdown code fence if the model wrapped its JSON in one.
        if text.startswith("```"):
            text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
        return json.loads(text)
    except (ValueError, KeyError, IndexError, AttributeError):
        # ValueError covers json.JSONDecodeError; IndexError covers empty lists.
        return {}
```

---

### Step 5: Build the AI Pipeline (Batch)

```python
# ai/pipeline.py
import json
import yaml
from pathlib import Path
from ai.client import generate

def analyse_batch(items: list[dict], context: str = "", preference_prompt: str = "") -> list[dict]:
    """Analyse items in batches. Returns items enriched with AI fields."""
    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
    model = config.get("ai", {}).get("model", "gemini-2.5-flash")
    rate_limit = config.get("ai", {}).get("rate_limit_seconds", 7.0)
    min_score = config.get("ai", {}).get("min_score", 0)
    batch_size = config.get("ai", {}).get("batch_size", 5)

    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    print(f"  [AI] {len(items)} items → {len(batches)} API calls")

    enriched = []
    for i, batch in enumerate(batches):
        print(f"  [AI] Batch {i + 1}/{len(batches)}...")
        prompt = _build_prompt(batch, context, preference_prompt, config)
        result = generate(prompt, model=model, rate_limit=rate_limit)

        analyses = result.get("analyses", [])
        for j, item in enumerate(batch):
            ai = analyses[j] if j < len(analyses) else {}
            if ai:
                try:
                    score = max(0, min(100, int(ai.get("score", 0))))
                except (TypeError, ValueError):
                    score = 0  # the model returned a non-numeric score
                if min_score and score < min_score:
                    continue
                enriched.append({**item, "ai_score": score, "ai_summary": ai.get("summary", ""), "ai_notes": ai.get("notes", "")})
            else:
                # No analysis came back for this item: keep it un-enriched.
                enriched.append(item)

    return enriched


def _build_prompt(batch, context, preference_prompt, config):
    priorities = config.get("priorities", [])
    items_text = "\n\n".join(
        f"Item {i+1}: {json.dumps({k: v for k, v in item.items() if not k.startswith('_')})}"
        for i, item in enumerate(batch)
    )

    return f"""Analyse these {len(batch)} items and return a JSON object.

# Items
{items_text}

# User Context
{context[:800] if context else "Not provided"}

# User Priorities
{chr(10).join(f"- {p}" for p in priorities)}

{preference_prompt}

# Instructions
Return: {{"analyses": [{{"score": <0-100>, "summary": "<2 sentences>", "notes": "<why this matches or doesn't>"}} for each item in order]}}
Be concise. Score 90+=excellent match, 70-89=good, 50-69=ok, <50=weak."""
```

---

### Step 6: Build the Feedback Learning System

```python
# ai/memory.py
"""Learn from user decisions to improve future scoring."""
import json
from pathlib import Path

FEEDBACK_PATH = Path(__file__).parent.parent / "data" / "feedback.json"


def load_feedback() -> dict:
    if FEEDBACK_PATH.exists():
        try:
            return json.loads(FEEDBACK_PATH.read_text())
        except (json.JSONDecodeError, OSError):
            pass
    return {"positive": [], "negative": []}


def save_feedback(fb: dict):
    FEEDBACK_PATH.parent.mkdir(parents=True, exist_ok=True)
    FEEDBACK_PATH.write_text(json.dumps(fb, indent=2))


def build_preference_prompt(feedback: dict, max_examples: int = 15) -> str:
    """Convert feedback history into a prompt bias section."""
    lines = []
    if feedback.get("positive"):
        lines.append("# Items the user LIKED (positive signal):")
        for e in feedback["positive"][-max_examples:]:
            lines.append(f"- {e}")
    if feedback.get("negative"):
        lines.append("\n# Items the user SKIPPED/REJECTED (negative signal):")
        for e in feedback["negative"][-max_examples:]:
            lines.append(f"- {e}")
    if lines:
        lines.append("\nUse these patterns to bias scoring on new items.")
    return "\n".join(lines)
```

**Integration with your storage layer:** after each run, query your DB for items with positive/negative status and call `save_feedback()` with the extracted patterns.

---

### Step 7: Build Storage (Notion example)

```python
# storage/notion_sync.py
import os
from notion_client import Client
from notion_client.errors import APIResponseError

_client = None

def get_client():
    global _client
    if _client is None:
        _client = Client(auth=os.environ["NOTION_TOKEN"])
    return _client

def get_existing_urls(db_id: str) -> set[str]:
    """Fetch all URLs already stored; used for deduplication."""
    client, seen, cursor = get_client(), set(), None
    while True:
        # Pass start_cursor only on follow-up pages.
        kwargs = {"start_cursor": cursor} if cursor else {}
        resp = client.databases.query(database_id=db_id, page_size=100, **kwargs)
        for page in resp["results"]:
            url = page["properties"].get("URL", {}).get("url") or ""
            if url:
                seen.add(url)
        if not resp["has_more"]:
            break
        cursor = resp["next_cursor"]
    return seen

def push_item(db_id: str, item: dict) -> bool:
    """Push one item to Notion. Returns True on success."""
    props = {
        "Name": {"title": [{"text": {"content": item.get("name", "")[:100]}}]},
        "URL": {"url": item.get("url")},
        "Source": {"select": {"name": item.get("source", "Unknown")}},
        "Date Found": {"date": {"start": item.get("date_found")}},
        "Status": {"select": {"name": "New"}},
    }
    # AI fields
    if item.get("ai_score") is not None:
        props["AI Score"] = {"number": item["ai_score"]}
    if item.get("ai_summary"):
        props["Summary"] = {"rich_text": [{"text": {"content": item["ai_summary"][:2000]}}]}
    if item.get("ai_notes"):
        props["Notes"] = {"rich_text": [{"text": {"content": item["ai_notes"][:2000]}}]}

    try:
        get_client().pages.create(parent={"database_id": db_id}, properties=props)
        return True
    except APIResponseError as e:
        print(f"[notion] Push failed: {e}")
        return False

def sync(db_id: str, items: list[dict]) -> tuple[int, int]:
    """Push new items only. Returns (added, skipped); skipped includes failed pushes."""
    existing = get_existing_urls(db_id)
    added = skipped = 0
    for item in items:
        if item.get("url") in existing:
            skipped += 1
            continue
        if push_item(db_id, item):
            added += 1
            existing.add(item["url"])
        else:
            skipped += 1
    return added, skipped
```

---

### Step 8: Orchestrate in main.py

```python
# scraper/main.py
import os, sys, yaml
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

from scraper.sources import my_source          # add your sources

# NOTE: This example uses Notion. If storage.provider is "sheets" or "supabase",
# replace this import with storage.sheets_sync or storage.supabase_sync and update
# the env var and sync() call accordingly.
from storage.notion_sync import sync

SOURCES = [
    ("My Source", my_source.fetch),
]

def ai_enabled():
    return bool(os.environ.get("GEMINI_API_KEY"))

def main():
    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
    provider = config.get("storage", {}).get("provider", "notion")

    # Resolve the storage target identifier from env based on provider
    if provider == "notion":
        db_id = os.environ.get("NOTION_DATABASE_ID")
        if not db_id:
            print("ERROR: NOTION_DATABASE_ID not set"); sys.exit(1)
    else:
        # Extend here for sheets (SHEET_ID) or supabase (SUPABASE_TABLE) etc.
        print(f"ERROR: provider '{provider}' not yet wired in main.py"); sys.exit(1)

    all_items = []

    for name, fetch_fn in SOURCES:
        try:
            items = fetch_fn()
            print(f"[{name}] {len(items)} items")
            all_items.extend(items)
        except Exception as e:
            # One failing source must not stop the others.
            print(f"[{name}] FAILED: {e}")

    # Deduplicate by URL
    seen, deduped = set(), []
    for item in all_items:
        if (url := item.get("url", "")) and url not in seen:
            seen.add(url); deduped.append(item)

    print(f"Unique items: {len(deduped)}")

    if ai_enabled() and deduped:
        from ai.memory import load_feedback, build_preference_prompt
        from ai.pipeline import analyse_batch

        # load_feedback() reads data/feedback.json written by your feedback sync script.
        # To keep it current, implement a separate feedback_sync.py that queries your
        # storage provider for items with positive/negative statuses and calls save_feedback().
        feedback = load_feedback()
        preference = build_preference_prompt(feedback)
        context_path = Path(__file__).parent.parent / "profile" / "context.md"
        context = context_path.read_text() if context_path.exists() else ""
        deduped = analyse_batch(deduped, context=context, preference_prompt=preference)
    else:
        print("[AI] Skipped: GEMINI_API_KEY not set or no items to analyse")

    added, skipped = sync(db_id, deduped)
    print(f"Done: {added} new, {skipped} skipped")

if __name__ == "__main__":
    main()
```

---

### Step 9: GitHub Actions Workflow

```yaml
# .github/workflows/scraper.yml
name: Data Scraper Agent

on:
  schedule:
    - cron: "0 */3 * * *"  # every 3 hours; adjust to your needs
  workflow_dispatch:        # allow manual trigger

permissions:
  contents: write   # required for the feedback-history commit step

jobs:
  scrape:
    runs-on: ubuntu-latest
    timeout-minutes: 20

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - run: pip install -r requirements.txt

      # Uncomment if Playwright is enabled in requirements.txt
      # - name: Install Playwright browsers
      #   run: python -m playwright install chromium --with-deps

      - name: Run agent
        env:
          NOTION_TOKEN: ${{ secrets.NOTION_TOKEN }}
          NOTION_DATABASE_ID: ${{ secrets.NOTION_DATABASE_ID }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: python -m scraper.main

      - name: Commit feedback history
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/feedback.json || true
          git diff --cached --quiet || git commit -m "chore: update feedback history"
          git push
```

---

### Step 10: config.yaml Template

```yaml
# Customise this file; no code changes needed

# What to collect (pre-filter before AI)
filters:
  required_keywords: []      # item must contain at least one
  blocked_keywords: []       # item must not contain any

# Your priorities (the AI uses these for scoring)
priorities:
  - "example priority 1"
  - "example priority 2"

# Storage
storage:
  provider: "notion"         # notion | sheets | supabase | sqlite

# Feedback learning
feedback:
  positive_statuses: ["Saved", "Applied", "Interested"]
  negative_statuses: ["Skip", "Rejected", "Not relevant"]

# AI settings
ai:
  enabled: true
  model: "gemini-2.5-flash"
  min_score: 0               # filter out items below this score
  rate_limit_seconds: 7      # seconds between API calls
  batch_size: 5              # items per API call
```

---

## Common Scraping Patterns

### Pattern 1: REST API (easiest)
```python
resp = requests.get(url, params={"q": query}, headers=HEADERS, timeout=15)
items = resp.json().get("results", [])
```

### Pattern 2: HTML Scraping
```python
soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select(".listing-card"):
    title = card.select_one("h2").get_text(strip=True)
    href = card.select_one("a")["href"]
```

### Pattern 3: RSS Feed
```python
import xml.etree.ElementTree as ET
root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
    title = item.findtext("title", "")
    link = item.findtext("link", "")
    pub_date = item.findtext("pubDate", "")
```

### Pattern 4: Paginated API
```python
results = []
page = 1
while True:
    resp = requests.get(url, params={"page": page, "limit": 50}, timeout=15)
    data = resp.json()
    items = data.get("results", [])
    if not items:
        break
    for item in items:
        results.append(_normalise(item))
    if not data.get("has_more"):
        break
    page += 1
```

### Pattern 5: JS-Rendered Pages (Playwright)
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    page.wait_for_selector(".listing")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")
```

---

## Anti-Patterns to Avoid

| Anti-pattern | Problem | Fix |
|---|---|---|
| One LLM call per item | Hits rate limits instantly | Batch 5 items per call |
| Hardcoded keywords in code | Not reusable | Move all config to `config.yaml` |
| Scraping without rate limit | IP ban | Add `time.sleep(1)` between requests |
| Storing secrets in code | Security risk | Always use `.env` + GitHub Secrets |
| No deduplication | Duplicate rows pile up | Always check URL before pushing |
| Ignoring `robots.txt` | Legal/ethical risk | Respect crawl rules; use public APIs when available |
| JS-rendered sites with `requests` | Empty response | Use Playwright or look for the underlying API |
| `maxOutputTokens` too low | Truncated JSON, parse error | Use 2048+ for batch responses |

---

## Free Tier Limits Reference

| Service | Free Limit | Typical Usage |
|---|---|---|
| Gemini Flash Lite | 30 RPM, 1500 RPD | ~56 req/day at 3-hr intervals |
| Gemini 2.0 Flash | 15 RPM, 1500 RPD | Good fallback |
| Gemini 2.5 Flash | 10 RPM, 500 RPD | Use sparingly |
| GitHub Actions | Unlimited (public repos) | ~20 min/day |
| Notion API | Unlimited | ~200 writes/day |
| Supabase | 500MB DB, 2GB transfer | Fine for most agents |
| Google Sheets API | 300 req/min | Works for small agents |

---

## Requirements Template

```
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
python-dotenv==1.0.1
pyyaml==6.0.2
notion-client==2.2.1   # if using Notion
# playwright==1.40.0   # uncomment for JS-rendered sites
```

---

## Quality Checklist

Before marking the agent complete:

- [ ] `config.yaml` controls all user-facing settings, with no hardcoded values
- [ ] `profile/context.md` holds user-specific context for AI matching
- [ ] Deduplication by URL before every storage push
- [ ] Gemini client has model fallback chain (4 models)
- [ ] Batch size ≤ 5 items per API call
- [ ] `maxOutputTokens` ≥ 2048
- [ ] `.env` is in `.gitignore`
- [ ] `.env.example` provided for onboarding
- [ ] `setup.py` creates DB schema on first run
- [ ] `enrich_existing.py` backfills AI scores on old rows
- [ ] GitHub Actions workflow commits `feedback.json` after each run
- [ ] README covers: setup in < 5 minutes, required secrets, customisation

---

## Real-World Examples

```
"Build me an agent that monitors Hacker News for AI startup funding news"
"Scrape product prices from 3 e-commerce sites and alert when they drop"
"Track new GitHub repos tagged with 'llm' or 'agents' and summarise each one"
"Collect Chief of Staff job listings from LinkedIn and Cutshort into Notion"
"Monitor a subreddit for posts mentioning my company and classify sentiment"
"Scrape new academic papers from arXiv on a topic I care about daily"
"Track sports fixture results and keep a running table in Google Sheets"
"Build a real estate listing watcher that alerts on new properties under ₹1 Cr"
```

---

## Reference Implementation

A complete working agent built with this exact architecture would scrape 4+ sources,
batch Gemini calls, learn from Applied/Rejected decisions stored in Notion, and run
100% free on GitHub Actions. Follow Steps 1–10 above to build your own.
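
### Appendix: a possible `scraper/filters.py`

Step 3 imports `is_relevant` from `scraper/filters.py`, and Step 10 defines `filters.required_keywords` and `filters.blocked_keywords`, but the skill never shows the filter module itself. The sketch below is one way it might look, not the author's implementation. It assumes case-insensitive substring matching and takes the keyword lists as optional arguments so the example stays self-contained. In a real agent the lists would presumably be read from `config.yaml`, so that the one-argument call in Step 3 works.

```python
# scraper/filters.py (hypothetical sketch; this file is not shown in the skill)
from __future__ import annotations


def is_relevant(
    text: str,
    required_keywords: list[str] | None = None,
    blocked_keywords: list[str] | None = None,
) -> bool:
    """Cheap keyword pre-filter that runs before any AI call.

    Assumed semantics, taken from the comments in config.yaml:
    - blocked_keywords: reject the text if it contains any of them
    - required_keywords: if non-empty, accept only text containing at least one
    """
    haystack = text.lower()
    if any(kw.lower() in haystack for kw in blocked_keywords or []):
        return False
    required = required_keywords or []
    if not required:
        return True  # an empty list means "no keyword requirement"
    return any(kw.lower() in haystack for kw in required)


if __name__ == "__main__":
    titles = ["Senior Python Engineer", "Unpaid Python Internship", "Sales Manager"]
    kept = [t for t in titles if is_relevant(t, ["python"], ["unpaid"])]
    print(kept)
```

Checking blocked keywords first means a single blocked term rejects an item even when a required keyword also matches. Running the filter before the AI pipeline keeps the batch count, and therefore the number of Gemini calls, as low as possible.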