Claude Agent Skill · by Affaan M

Data Scraper Agent

Install Data Scraper Agent skill for Claude Code from affaan-m/everything-claude-code.

Install
Terminal · npx
$npx skills add https://github.com/vercel-labs/agent-skills --skill vercel-react-best-practices
Works with Paperclip

How Data Scraper Agent fits into a Paperclip company.

Data Scraper Agent drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

S
SaaS FactoryPaired

Pre-configured AI company — 18 agents, 18 skills, one-time purchase.

$27$59
Explore pack
Source file
SKILL.md764 lines
Expand
---name: data-scraper-agentdescription: Build a fully automated AI-powered data collection agent for any public source — job boards, prices, news, GitHub, sports, anything. Scrapes on a schedule, enriches data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs 100% free on GitHub Actions. Use when the user wants to monitor, collect, or track any public data automatically.origin: community--- # Data Scraper Agent Build a production-ready, AI-powered data collection agent for any public data source.Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time. **Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase** ## When to Activate - User wants to scrape or monitor any public website or API- User says "build a bot that checks...", "monitor X for me", "collect data from..."- User wants to track jobs, prices, news, repos, sports scores, events, listings- User asks how to automate data collection without paying for hosting- User wants an agent that gets smarter over time based on their decisions ## Core Concepts ### The Three Layers Every data scraper agent has three layers: ```COLLECT → ENRICH → STORE  │           │        │Scraper    AI (LLM)  Databaseruns on    scores/   Notion /schedule   summarises Sheets /           & classifies Supabase``` ### Free Stack | Layer | Tool | Why ||---|---|---|| **Scraping** | `requests` + `BeautifulSoup` | No cost, covers 80% of public sites || **JS-rendered sites** | `playwright` (free) | When HTML scraping fails || **AI enrichment** | Gemini Flash via REST API | 500 req/day, 1M tokens/day — free || **Storage** | Notion API | Free tier, great UI for review || **Schedule** | GitHub Actions cron | Free for public repos || **Learning** | JSON feedback file in repo | Zero infra, persists in git | ### AI Model Fallback Chain Build agents to auto-fallback across Gemini models on quota exhaustion: ```gemini-2.0-flash-lite (30 RPM) →gemini-2.0-flash (15 RPM) →gemini-2.5-flash (10 RPM) →gemini-flash-lite-latest (fallback)``` ### Batch API Calls for Efficiency Never call the LLM once per item. Always batch: ```python# BAD: 33 API calls for 33 itemsfor item in items:    result = call_ai(item)  # 33 calls → hits rate limit # GOOD: 7 API calls for 33 items (batch size 5)for batch in chunks(items, size=5):    results = call_ai(batch)  # 7 calls → stays within free tier``` --- ## Workflow ### Step 1: Understand the Goal Ask the user: 1. **What to collect:** "What data source? URL / API / RSS / public endpoint?"2. **What to extract:** "What fields matter? Title, price, URL, date, score?"3. **How to store:** "Where should results go? Notion, Google Sheets, Supabase, or local file?"4. **How to enrich:** "Do you want AI to score, summarise, classify, or match each item?"5. **Frequency:** "How often should it run? Every hour, daily, weekly?" Common examples to prompt:- Job boards → score relevance to resume- Product prices → alert on drops- GitHub repos → summarise new releases- News feeds → classify by topic + sentiment- Sports results → extract stats to tracker- Events calendar → filter by interest --- ### Step 2: Design the Agent Architecture Generate this directory structure for the user: ```my-agent/├── config.yaml              # User customises this (keywords, filters, preferences)├── profile/│   └── context.md           # User context the AI uses (resume, interests, criteria)├── scraper/│   ├── __init__.py│   ├── main.py              # Orchestrator: scrape → enrich → store│   ├── filters.py           # Rule-based pre-filter (fast, before AI)│   └── sources/│       ├── __init__.py│       └── source_name.py   # One file per data source├── ai/│   ├── __init__.py│   ├── client.py            # Gemini REST client with model fallback│   ├── pipeline.py          # Batch AI analysis│   ├── jd_fetcher.py        # Fetch full content from URLs (optional)│   └── memory.py            # Learn from user feedback├── storage/│   ├── __init__.py│   └── notion_sync.py       # Or sheets_sync.py / supabase_sync.py├── data/│   └── feedback.json        # User decision history (auto-updated)├── .env.example├── setup.py                 # One-time DB/schema creation├── enrich_existing.py       # Backfill AI scores on old rows├── requirements.txt└── .github/    └── workflows/        └── scraper.yml      # GitHub Actions schedule``` --- ### Step 3: Build the Scraper Source Template for any data source: ```python# scraper/sources/my_source.py"""[Source Name] — scrapes [what] from [where].Method: [REST API / HTML scraping / RSS feed]"""import requestsfrom bs4 import BeautifulSoupfrom datetime import datetime, timezonefrom scraper.filters import is_relevant HEADERS = {    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",}  def fetch() -> list[dict]:    """    Returns a list of items with consistent schema.    Each item must have at minimum: name, url, date_found.    """    results = []     # ---- REST API source ----    resp = requests.get("https://api.example.com/items", headers=HEADERS, timeout=15)    if resp.status_code == 200:        for item in resp.json().get("results", []):            if not is_relevant(item.get("title", "")):                continue            results.append(_normalise(item))     return results  def _normalise(raw: dict) -> dict:    """Convert raw API/HTML data to the standard schema."""    return {        "name": raw.get("title", ""),        "url": raw.get("link", ""),        "source": "MySource",        "date_found": datetime.now(timezone.utc).date().isoformat(),        # add domain-specific fields here    }``` **HTML scraping pattern:**```pythonsoup = BeautifulSoup(resp.text, "lxml")for card in soup.select("[class*='listing']"):    title = card.select_one("h2, h3").get_text(strip=True)    link = card.select_one("a")["href"]    if not link.startswith("http"):        link = f"https://example.com{link}"``` **RSS feed pattern:**```pythonimport xml.etree.ElementTree as ETroot = ET.fromstring(resp.text)for item in root.findall(".//item"):    title = item.findtext("title", "")    link = item.findtext("link", "")``` --- ### Step 4: Build the Gemini AI Client ```python# ai/client.pyimport os, json, time, requests _last_call = 0.0 MODEL_FALLBACK = [    "gemini-2.0-flash-lite",    "gemini-2.0-flash",    "gemini-2.5-flash",    "gemini-flash-lite-latest",]  def generate(prompt: str, model: str = "", rate_limit: float = 7.0) -> dict:    """Call Gemini with auto-fallback on 429. Returns parsed JSON or {}."""    global _last_call     api_key = os.environ.get("GEMINI_API_KEY", "")    if not api_key:        return {}     elapsed = time.time() - _last_call    if elapsed < rate_limit:        time.sleep(rate_limit - elapsed)     models = [model] + [m for m in MODEL_FALLBACK if m != model] if model else MODEL_FALLBACK    _last_call = time.time()     for m in models:        url = f"https://generativelanguage.googleapis.com/v1beta/models/{m}:generateContent?key={api_key}"        payload = {            "contents": [{"parts": [{"text": prompt}]}],            "generationConfig": {                "responseMimeType": "application/json",                "temperature": 0.3,                "maxOutputTokens": 2048,            },        }        try:            resp = requests.post(url, json=payload, timeout=30)            if resp.status_code == 200:                return _parse(resp)            if resp.status_code in (429, 404):                time.sleep(1)                continue            return {}        except requests.RequestException:            return {}     return {}  def _parse(resp) -> dict:    try:        text = (            resp.json()            .get("candidates", [{}])[0]            .get("content", {})            .get("parts", [{}])[0]            .get("text", "")            .strip()        )        if text.startswith("```"):            text = text.split("\n", 1)[-1].rsplit("```", 1)[0]        return json.loads(text)    except (json.JSONDecodeError, KeyError):        return {}``` --- ### Step 5: Build the AI Pipeline (Batch) ```python# ai/pipeline.pyimport jsonimport yamlfrom pathlib import Pathfrom ai.client import generate def analyse_batch(items: list[dict], context: str = "", preference_prompt: str = "") -> list[dict]:    """Analyse items in batches. Returns items enriched with AI fields."""    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())    model = config.get("ai", {}).get("model", "gemini-2.5-flash")    rate_limit = config.get("ai", {}).get("rate_limit_seconds", 7.0)    min_score = config.get("ai", {}).get("min_score", 0)    batch_size = config.get("ai", {}).get("batch_size", 5)     batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]    print(f"  [AI] {len(items)} items → {len(batches)} API calls")     enriched = []    for i, batch in enumerate(batches):        print(f"  [AI] Batch {i + 1}/{len(batches)}...")        prompt = _build_prompt(batch, context, preference_prompt, config)        result = generate(prompt, model=model, rate_limit=rate_limit)         analyses = result.get("analyses", [])        for j, item in enumerate(batch):            ai = analyses[j] if j < len(analyses) else {}            if ai:                score = max(0, min(100, int(ai.get("score", 0))))                if min_score and score < min_score:                    continue                enriched.append({**item, "ai_score": score, "ai_summary": ai.get("summary", ""), "ai_notes": ai.get("notes", "")})            else:                enriched.append(item)     return enriched  def _build_prompt(batch, context, preference_prompt, config):    priorities = config.get("priorities", [])    items_text = "\n\n".join(        f"Item {i+1}: {json.dumps({k: v for k, v in item.items() if not k.startswith('_')})}"        for i, item in enumerate(batch)    )     return f"""Analyse these {len(batch)} items and return a JSON object. # Items{items_text} # User Context{context[:800] if context else "Not provided"} # User Priorities{chr(10).join(f"- {p}" for p in priorities)} {preference_prompt} # InstructionsReturn: {{"analyses": [{{"score": <0-100>, "summary": "<2 sentences>", "notes": "<why this matches or doesn't>"}} for each item in order]}}Be concise. Score 90+=excellent match, 70-89=good, 50-69=ok, <50=weak."""``` --- ### Step 6: Build the Feedback Learning System ```python# ai/memory.py"""Learn from user decisions to improve future scoring."""import jsonfrom pathlib import Path FEEDBACK_PATH = Path(__file__).parent.parent / "data" / "feedback.json"  def load_feedback() -> dict:    if FEEDBACK_PATH.exists():        try:            return json.loads(FEEDBACK_PATH.read_text())        except (json.JSONDecodeError, OSError):            pass    return {"positive": [], "negative": []}  def save_feedback(fb: dict):    FEEDBACK_PATH.parent.mkdir(parents=True, exist_ok=True)    FEEDBACK_PATH.write_text(json.dumps(fb, indent=2))  def build_preference_prompt(feedback: dict, max_examples: int = 15) -> str:    """Convert feedback history into a prompt bias section."""    lines = []    if feedback.get("positive"):        lines.append("# Items the user LIKED (positive signal):")        for e in feedback["positive"][-max_examples:]:            lines.append(f"- {e}")    if feedback.get("negative"):        lines.append("\n# Items the user SKIPPED/REJECTED (negative signal):")        for e in feedback["negative"][-max_examples:]:            lines.append(f"- {e}")    if lines:        lines.append("\nUse these patterns to bias scoring on new items.")    return "\n".join(lines)``` **Integration with your storage layer:** after each run, query your DB for items with positive/negative status and call `save_feedback()` with the extracted patterns. --- ### Step 7: Build Storage (Notion example) ```python# storage/notion_sync.pyimport osfrom notion_client import Clientfrom notion_client.errors import APIResponseError _client = None def get_client():    global _client    if _client is None:        _client = Client(auth=os.environ["NOTION_TOKEN"])    return _client def get_existing_urls(db_id: str) -> set[str]:    """Fetch all URLs already stored — used for deduplication."""    client, seen, cursor = get_client(), set(), None    while True:        resp = client.databases.query(database_id=db_id, page_size=100, **{"start_cursor": cursor} if cursor else {})        for page in resp["results"]:            url = page["properties"].get("URL", {}).get("url", "")            if url: seen.add(url)        if not resp["has_more"]: break        cursor = resp["next_cursor"]    return seen def push_item(db_id: str, item: dict) -> bool:    """Push one item to Notion. Returns True on success."""    props = {        "Name": {"title": [{"text": {"content": item.get("name", "")[:100]}}]},        "URL": {"url": item.get("url")},        "Source": {"select": {"name": item.get("source", "Unknown")}},        "Date Found": {"date": {"start": item.get("date_found")}},        "Status": {"select": {"name": "New"}},    }    # AI fields    if item.get("ai_score") is not None:        props["AI Score"] = {"number": item["ai_score"]}    if item.get("ai_summary"):        props["Summary"] = {"rich_text": [{"text": {"content": item["ai_summary"][:2000]}}]}    if item.get("ai_notes"):        props["Notes"] = {"rich_text": [{"text": {"content": item["ai_notes"][:2000]}}]}     try:        get_client().pages.create(parent={"database_id": db_id}, properties=props)        return True    except APIResponseError as e:        print(f"[notion] Push failed: {e}")        return False def sync(db_id: str, items: list[dict]) -> tuple[int, int]:    existing = get_existing_urls(db_id)    added = skipped = 0    for item in items:        if item.get("url") in existing:            skipped += 1; continue        if push_item(db_id, item):            added += 1; existing.add(item["url"])        else:            skipped += 1    return added, skipped``` --- ### Step 8: Orchestrate in main.py ```python# scraper/main.pyimport os, sys, yamlfrom pathlib import Pathfrom dotenv import load_dotenv load_dotenv() from scraper.sources import my_source          # add your sources # NOTE: This example uses Notion. If storage.provider is "sheets" or "supabase",# replace this import with storage.sheets_sync or storage.supabase_sync and update# the env var and sync() call accordingly.from storage.notion_sync import sync SOURCES = [    ("My Source", my_source.fetch),] def ai_enabled():    return bool(os.environ.get("GEMINI_API_KEY")) def main():    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())    provider = config.get("storage", {}).get("provider", "notion")     # Resolve the storage target identifier from env based on provider    if provider == "notion":        db_id = os.environ.get("NOTION_DATABASE_ID")        if not db_id:            print("ERROR: NOTION_DATABASE_ID not set"); sys.exit(1)    else:        # Extend here for sheets (SHEET_ID) or supabase (SUPABASE_TABLE) etc.        print(f"ERROR: provider '{provider}' not yet wired in main.py"); sys.exit(1)     config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())    all_items = []     for name, fetch_fn in SOURCES:        try:            items = fetch_fn()            print(f"[{name}] {len(items)} items")            all_items.extend(items)        except Exception as e:            print(f"[{name}] FAILED: {e}")     # Deduplicate by URL    seen, deduped = set(), []    for item in all_items:        if (url := item.get("url", "")) and url not in seen:            seen.add(url); deduped.append(item)     print(f"Unique items: {len(deduped)}")     if ai_enabled() and deduped:        from ai.memory import load_feedback, build_preference_prompt        from ai.pipeline import analyse_batch         # load_feedback() reads data/feedback.json written by your feedback sync script.        # To keep it current, implement a separate feedback_sync.py that queries your        # storage provider for items with positive/negative statuses and calls save_feedback().        feedback = load_feedback()        preference = build_preference_prompt(feedback)        context_path = Path(__file__).parent.parent / "profile" / "context.md"        context = context_path.read_text() if context_path.exists() else ""        deduped = analyse_batch(deduped, context=context, preference_prompt=preference)    else:        print("[AI] Skipped — GEMINI_API_KEY not set")     added, skipped = sync(db_id, deduped)    print(f"Done — {added} new, {skipped} existing") if __name__ == "__main__":    main()``` --- ### Step 9: GitHub Actions Workflow ```yaml# .github/workflows/scraper.ymlname: Data Scraper Agent on:  schedule:    - cron: "0 */3 * * *"  # every 3 hours — adjust to your needs  workflow_dispatch:        # allow manual trigger permissions:  contents: write   # required for the feedback-history commit step jobs:  scrape:    runs-on: ubuntu-latest    timeout-minutes: 20     steps:      - uses: actions/checkout@v4       - uses: actions/setup-python@v5        with:          python-version: "3.11"          cache: "pip"       - run: pip install -r requirements.txt       # Uncomment if Playwright is enabled in requirements.txt      # - name: Install Playwright browsers      #   run: python -m playwright install chromium --with-deps       - name: Run agent        env:          NOTION_TOKEN: ${{ secrets.NOTION_TOKEN }}          NOTION_DATABASE_ID: ${{ secrets.NOTION_DATABASE_ID }}          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}        run: python -m scraper.main       - name: Commit feedback history        run: |          git config user.name "github-actions[bot]"          git config user.email "github-actions[bot]@users.noreply.github.com"          git add data/feedback.json || true          git diff --cached --quiet || git commit -m "chore: update feedback history"          git push``` --- ### Step 10: config.yaml Template ```yaml# Customise this file — no code changes needed # What to collect (pre-filter before AI)filters:  required_keywords: []      # item must contain at least one  blocked_keywords: []       # item must not contain any # Your priorities — AI uses these for scoringpriorities:  - "example priority 1"  - "example priority 2" # Storagestorage:  provider: "notion"         # notion | sheets | supabase | sqlite # Feedback learningfeedback:  positive_statuses: ["Saved", "Applied", "Interested"]  negative_statuses: ["Skip", "Rejected", "Not relevant"] # AI settingsai:  enabled: true  model: "gemini-2.5-flash"  min_score: 0               # filter out items below this score  rate_limit_seconds: 7      # seconds between API calls  batch_size: 5              # items per API call``` --- ## Common Scraping Patterns ### Pattern 1: REST API (easiest)```pythonresp = requests.get(url, params={"q": query}, headers=HEADERS, timeout=15)items = resp.json().get("results", [])``` ### Pattern 2: HTML Scraping```pythonsoup = BeautifulSoup(resp.text, "lxml")for card in soup.select(".listing-card"):    title = card.select_one("h2").get_text(strip=True)    href = card.select_one("a")["href"]``` ### Pattern 3: RSS Feed```pythonimport xml.etree.ElementTree as ETroot = ET.fromstring(resp.text)for item in root.findall(".//item"):    title = item.findtext("title", "")    link = item.findtext("link", "")    pub_date = item.findtext("pubDate", "")``` ### Pattern 4: Paginated API```pythonpage = 1while True:    resp = requests.get(url, params={"page": page, "limit": 50}, timeout=15)    data = resp.json()    items = data.get("results", [])    if not items:        break    for item in items:        results.append(_normalise(item))    if not data.get("has_more"):        break    page += 1``` ### Pattern 5: JS-Rendered Pages (Playwright)```pythonfrom playwright.sync_api import sync_playwright with sync_playwright() as p:    browser = p.chromium.launch()    page = browser.new_page()    page.goto(url)    page.wait_for_selector(".listing")    html = page.content()    browser.close() soup = BeautifulSoup(html, "lxml")``` --- ## Anti-Patterns to Avoid | Anti-pattern | Problem | Fix ||---|---|---|| One LLM call per item | Hits rate limits instantly | Batch 5 items per call || Hardcoded keywords in code | Not reusable | Move all config to `config.yaml` || Scraping without rate limit | IP ban | Add `time.sleep(1)` between requests || Storing secrets in code | Security risk | Always use `.env` + GitHub Secrets || No deduplication | Duplicate rows pile up | Always check URL before pushing || Ignoring `robots.txt` | Legal/ethical risk | Respect crawl rules; use public APIs when available || JS-rendered sites with `requests` | Empty response | Use Playwright or look for the underlying API || `maxOutputTokens` too low | Truncated JSON, parse error | Use 2048+ for batch responses | --- ## Free Tier Limits Reference | Service | Free Limit | Typical Usage ||---|---|---|| Gemini Flash Lite | 30 RPM, 1500 RPD | ~56 req/day at 3-hr intervals || Gemini 2.0 Flash | 15 RPM, 1500 RPD | Good fallback || Gemini 2.5 Flash | 10 RPM, 500 RPD | Use sparingly || GitHub Actions | Unlimited (public repos) | ~20 min/day || Notion API | Unlimited | ~200 writes/day || Supabase | 500MB DB, 2GB transfer | Fine for most agents || Google Sheets API | 300 req/min | Works for small agents | --- ## Requirements Template ```requests==2.31.0beautifulsoup4==4.12.3lxml==5.1.0python-dotenv==1.0.1pyyaml==6.0.2notion-client==2.2.1   # if using Notion# playwright==1.40.0   # uncomment for JS-rendered sites``` --- ## Quality Checklist Before marking the agent complete: - [ ] `config.yaml` controls all user-facing settings — no hardcoded values- [ ] `profile/context.md` holds user-specific context for AI matching- [ ] Deduplication by URL before every storage push- [ ] Gemini client has model fallback chain (4 models)- [ ] Batch size ≤ 5 items per API call- [ ] `maxOutputTokens` ≥ 2048- [ ] `.env` is in `.gitignore`- [ ] `.env.example` provided for onboarding- [ ] `setup.py` creates DB schema on first run- [ ] `enrich_existing.py` backfills AI scores on old rows- [ ] GitHub Actions workflow commits `feedback.json` after each run- [ ] README covers: setup in < 5 minutes, required secrets, customisation --- ## Real-World Examples ```"Build me an agent that monitors Hacker News for AI startup funding news""Scrape product prices from 3 e-commerce sites and alert when they drop""Track new GitHub repos tagged with 'llm' or 'agents' — summarise each one""Collect Chief of Staff job listings from LinkedIn and Cutshort into Notion""Monitor a subreddit for posts mentioning my company — classify sentiment""Scrape new academic papers from arXiv on a topic I care about daily""Track sports fixture results and keep a running table in Google Sheets""Build a real estate listing watcher — alert on new properties under ₹1 Cr"``` --- ## Reference Implementation A complete working agent built with this exact architecture would scrape 4+ sources,batch Gemini calls, learn from Applied/Rejected decisions stored in Notion, and run100% free on GitHub Actions. Follow Steps 1–9 above to build your own.