Claude Agent Skill · by Jamditis

Web Scraping

Install the Web Scraping skill for Claude Code from jamditis/claude-skills-journalism.

Install
Terminal · npx
$ npx skills add https://github.com/jamditis/claude-skills-journalism --skill web-scraping
Works with Paperclip

How Web Scraping fits into a Paperclip company.

Web Scraping drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

Source file
SKILL.md (618 lines)
---
name: web-scraping
description: Web scraping with anti-bot bypass, content extraction, undocumented APIs and poison pill detection. Use when extracting content from websites, handling paywalls, implementing scraping cascades or processing social media. Covers requests, trafilatura, Playwright with stealth mode, yt-dlp and instaloader patterns.
---

# Web scraping methodology

Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling.

## Scraping cascade architecture

Implement multiple extraction strategies with automatic fallback:

```python
from abc import ABC, abstractmethod
from typing import Optional
import random

import requests
from bs4 import BeautifulSoup
import trafilatura

# For .py files
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

# For .ipynb files
import asyncio
from playwright.async_api import async_playwright


class ScrapingResult:
    def __init__(self, content: str, title: str, method: str):
        self.content = content
        self.title = title
        self.method = method  # Track which method succeeded


class Scraper(ABC):
    @abstractmethod
    def fetch(self, url: str) -> Optional[ScrapingResult]: ...


class TrafilaturaScraper(Scraper):
    """Fast, lightweight extraction for standard articles."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            downloaded = trafilatura.fetch_url(url)
            if not downloaded:
                return None

            content = trafilatura.extract(
                downloaded,
                include_comments=False,
                include_tables=True,
                favor_recall=True
            )

            if not content or len(content) < 100:
                return None

            # Extract title separately
            soup = BeautifulSoup(downloaded, 'html.parser')
            title = soup.find('title')
            title_text = title.get_text() if title else ''

            return ScrapingResult(content, title_text, 'trafilatura')
        except Exception:
            return None


class RequestsScraper(Scraper):
    """HTTP requests with rotating user agents."""

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        headers = {
            'User-Agent': random.choice(self.USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }

        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Remove script/style elements
            for element in soup(['script', 'style', 'nav', 'footer', 'aside']):
                element.decompose()

            # Find main content
            main = soup.find('main') or soup.find('article') or soup.find('body')
            content = main.get_text(separator='\n', strip=True) if main else ''

            title = soup.find('title')
            title_text = title.get_text() if title else ''

            if len(content) < 100:
                return None

            return ScrapingResult(content, title_text, 'requests')
        except Exception:
            return None


class PlaywrightScraper(Scraper):
    """Heavy JavaScript rendering with stealth mode for anti-bot bypass."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                context = browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = context.new_page()

                # Apply stealth to avoid detection
                stealth_sync(page)

                page.goto(url, wait_until='networkidle', timeout=60000)

                # Wait for content to load
                page.wait_for_timeout(2000)

                # Extract content
                content = page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')

                title = page.title()

                browser.close()

                if len(content) < 100:
                    return None

                return ScrapingResult(content, title, 'playwright')
        except Exception:
            return None


class PlaywrightScraperAsync:
    """Async Playwright scraper for Jupyter notebooks (.ipynb files).

    Jupyter notebooks run their own event loop, so sync Playwright won't work.
    Use this async version with `await` in notebook cells.
    """

    async def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            async with async_playwright() as p:
                browser = await p.chromium.launch(headless=True)
                context = await browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = await context.new_page()

                # Note: playwright-stealth async version
                # from playwright_stealth import stealth_async
                # await stealth_async(page)

                await page.goto(url, wait_until='networkidle', timeout=60000)

                # Wait for content to load
                await page.wait_for_timeout(2000)

                # Extract content
                content = await page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')

                title = await page.title()

                await browser.close()

                if len(content) < 100:
                    return None

                return ScrapingResult(content, title, 'playwright_async')
        except Exception:
            return None

# Usage in Jupyter notebook cells:
# scraper = PlaywrightScraperAsync()
# result = await scraper.fetch('https://example.com')


class ScrapingCascade:
    """Try multiple scrapers in order until one succeeds."""

    def __init__(self):
        self.scrapers = [
            TrafilaturaScraper(),
            RequestsScraper(),
            PlaywrightScraper(),
        ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        for scraper in self.scrapers:
            result = scraper.fetch(url)
            if result:
                return result
        return None
```
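Ordered cheapest-first, the cascade only pays for a full browser when the lighter methods fail. A minimal usage sketch of the classes above (the URL is a placeholder):

```python
# Minimal usage sketch; the URL is a placeholder.
cascade = ScrapingCascade()
result = cascade.fetch('https://example.com/some-article')

if result:
    print(f'Extracted {len(result.content)} chars via {result.method}: {result.title}')
else:
    print('All scrapers failed')
```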
## Undocumented APIs

### Finding undocumented APIs

Use browser developer tools to discover APIs:

1. **Open developer tools** (right-click → Inspect, or F12)
2. **Go to the Network tab** to monitor all requests
3. **Filter by Fetch/XHR** to show only API calls
4. **Trigger the action** you want to capture (search, scroll, click)
5. **Analyze the response** — usually JSON with key-value pairs
6. **Copy as cURL** (right-click the request)
7. **Convert to code** using [curlconverter.com](https://curlconverter.com/)

### Stripping down API requests

When you copy a cURL from dev tools, it includes many parameters. Strip it down as follows (a sketch that automates the header test appears after this list):

1. **Remove unnecessary cookies** — test without them first
2. **Keep authentication tokens** if required
3. **Identify the input parameters** you can modify (like `prefix` for search terms)
4. **Test parameter values** — some expire, so periodically verify
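The "test without them first" step can be automated: re-send the request while dropping one captured header at a time and keep only what the API actually requires. A minimal sketch; `minimal_headers` and its `check` predicate are hypothetical helpers, not part of any library:

```python
from typing import Callable

import requests


def minimal_headers(
    url: str,
    headers: dict,
    params: dict,
    check: Callable[[requests.Response], bool],
) -> dict:
    """Hypothetical helper: drop captured headers one at a time,
    keeping only those the API actually requires.

    `check` is a caller-supplied predicate, e.g.
    lambda r: r.status_code == 200.
    """
    required = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in required.items() if k != name}
        response = requests.get(url, headers=trial, params=params, timeout=30)
        if check(response):
            required.pop(name)  # The API works without it; drop it for good
    return required
```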
### Example: Reverse-engineering an autocomplete API

```python
import requests
import time


def search_suggestions(keyword: str) -> dict:
    """
    Get autocompleted search suggestions from an undocumented API.
    Stripped down from browser dev tools capture.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    params = {
        'prefix': keyword,
        'suggestion-type': ['WIDGET', 'KEYWORD'],
        'alias': 'aps',
        'plain-mid': '1',
    }

    response = requests.get(
        'https://completion.amazon.com/api/2017/suggestions',
        params=params,
        headers=headers
    )
    return response.json()


# Collect suggestions for multiple keywords
keywords = ['a', 'b', 'cookie', 'sock']
data = []

for keyword in keywords:
    result = search_suggestions(keyword)
    for suggestion in result.get('suggestions', []):
        suggestion['search_word'] = keyword  # track seed keyword
        data.append(suggestion)
    time.sleep(1)  # rate limit yourself
```

*Source: [Leon Yin, "Finding Undocumented APIs," Inspect Element](https://inspectelement.org/apis.html), 2023*

## Poison pill detection

Detect paywalls, anti-bot pages, and other failures:

```python
import re
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class PoisonPillType(Enum):
    PAYWALL = 'paywall'
    CAPTCHA = 'captcha'
    RATE_LIMIT = 'rate_limit'
    CLOUDFLARE = 'cloudflare'
    LOGIN_REQUIRED = 'login_required'
    NOT_FOUND = 'not_found'
    NONE = 'none'


@dataclass
class PoisonPillResult:
    detected: bool
    type: PoisonPillType
    confidence: float
    details: str


class PoisonPillDetector:
    PATTERNS = {
        PoisonPillType.PAYWALL: [
            r'subscribe to continue',
            r'subscription required',
            r'become a member',
            r'sign up to read',
            r'you\'ve reached your limit',
            r'article limit reached',
        ],
        PoisonPillType.CAPTCHA: [
            r'verify you are human',
            r'captcha',
            r'robot verification',
            r'prove you\'re not a robot',
        ],
        PoisonPillType.RATE_LIMIT: [
            r'too many requests',
            r'rate limit exceeded',
            r'slow down',
            r'429',
        ],
        PoisonPillType.CLOUDFLARE: [
            r'checking your browser',
            r'cloudflare',
            r'ddos protection',
            r'please wait while we verify',
        ],
        PoisonPillType.LOGIN_REQUIRED: [
            r'sign in to continue',
            r'log in required',
            r'create an account',
        ],
    }

    PAYWALL_DOMAINS = {
        'nytimes.com': PoisonPillType.PAYWALL,
        'wsj.com': PoisonPillType.PAYWALL,
        'washingtonpost.com': PoisonPillType.PAYWALL,
        'ft.com': PoisonPillType.PAYWALL,
        'bloomberg.com': PoisonPillType.PAYWALL,
    }

    def detect(self, url: str, content: str, status_code: int = 200) -> PoisonPillResult:
        # Check status code
        if status_code == 429:
            return PoisonPillResult(True, PoisonPillType.RATE_LIMIT, 1.0, 'HTTP 429')
        if status_code == 403:
            return PoisonPillResult(True, PoisonPillType.CLOUDFLARE, 0.8, 'HTTP 403')
        if status_code == 404:
            return PoisonPillResult(True, PoisonPillType.NOT_FOUND, 1.0, 'HTTP 404')

        # Check known paywall domains
        domain = urlparse(url).netloc.replace('www.', '')
        for paywall_domain, pill_type in self.PAYWALL_DOMAINS.items():
            if paywall_domain in domain:
                # Check if content is suspiciously short (paywall truncation)
                if len(content) < 500:
                    return PoisonPillResult(True, pill_type, 0.9, f'Short content from {domain}')

        # Pattern matching
        content_lower = content.lower()
        for pill_type, patterns in self.PATTERNS.items():
            for pattern in patterns:
                if re.search(pattern, content_lower):
                    return PoisonPillResult(True, pill_type, 0.7, f'Pattern match: {pattern}')

        return PoisonPillResult(False, PoisonPillType.NONE, 0.0, '')
```
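In practice the detector runs on whatever the cascade returns, so a "successful" scrape that actually hit a paywall can still be flagged. A minimal sketch reusing the classes above (the URL is a placeholder):

```python
# Minimal sketch wiring the detector into the cascade; URL is a placeholder.
url = 'https://example.com/story'
cascade = ScrapingCascade()
detector = PoisonPillDetector()

result = cascade.fetch(url)
if result:
    pill = detector.detect(url, result.content)
    if pill.detected:
        print(f'Poison pill: {pill.type.value} ({pill.confidence:.0%}): {pill.details}')
    else:
        print(f'Clean extraction via {result.method}')
```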
## Social media scraping

### YouTube with yt-dlp

```python
import yt_dlp
from pathlib import Path


def download_video_metadata(url: str) -> dict:
    """Extract metadata without downloading video."""
    ydl_opts = {
        'skip_download': True,
        'quiet': True,
        'no_warnings': True,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return {
            'title': info.get('title'),
            'description': info.get('description'),
            'duration': info.get('duration'),
            'upload_date': info.get('upload_date'),
            'view_count': info.get('view_count'),
            'channel': info.get('channel'),
            'thumbnail': info.get('thumbnail'),
        }


def download_video(url: str, output_dir: Path, audio_only: bool = False) -> Path:
    """Download video or audio."""
    output_template = str(output_dir / '%(title)s.%(ext)s')

    ydl_opts = {
        'outtmpl': output_template,
        'quiet': True,
    }

    if audio_only:
        ydl_opts['format'] = 'bestaudio/best'
        ydl_opts['postprocessors'] = [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
        }]

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        filename = ydl.prepare_filename(info)
        if audio_only:
            filename = filename.rsplit('.', 1)[0] + '.mp3'
        return Path(filename)


def get_transcript(url: str) -> list[dict]:
    """Extract auto-generated or manual subtitles."""
    ydl_opts = {
        'skip_download': True,
        'writesubtitles': True,
        'writeautomaticsub': True,
        'subtitleslangs': ['en'],
        'quiet': True,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)

        # Check for subtitles
        subtitles = info.get('subtitles', {})
        auto_captions = info.get('automatic_captions', {})

        # Prefer manual subtitles over auto-generated
        subs = subtitles.get('en') or auto_captions.get('en')
        if not subs:
            return []

        # Get the vtt or json format
        for sub in subs:
            if sub['ext'] in ['vtt', 'json3']:
                # Download and parse subtitle file
                # ... implementation depends on format
                pass

        return []
```
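The subtitle parsing above is left open. A minimal sketch of the VTT branch, assuming each entry in `subs` carries a `url` field pointing at the subtitle file (the regex is a simplification of the WebVTT format, aimed at `HH:MM:SS.mmm` cue timings):

```python
import re

import requests


def parse_vtt(vtt_text: str) -> list[dict]:
    """Parse WebVTT cues into {start, end, text} dicts. Simplified sketch."""
    # Matches "00:00:01.000 --> 00:00:04.000" followed by the cue text
    pattern = re.compile(
        r'(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\n(.+?)(?:\n\n|\Z)',
        re.DOTALL,
    )
    return [
        {'start': start, 'end': end, 'text': text.strip()}
        for start, end, text in pattern.findall(vtt_text)
    ]

# Inside get_transcript's loop, the vtt branch could then become
# (assuming the 'url' key):
#     response = requests.get(sub['url'], timeout=30)
#     return parse_vtt(response.text)
```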
### Instagram with instaloader

```python
import instaloader
from pathlib import Path


class InstagramScraper:
    def __init__(self, username: str = None, session_file: str = None):
        self.loader = instaloader.Instaloader(
            download_videos=True,
            download_video_thumbnails=False,
            download_geotags=False,
            download_comments=False,
            save_metadata=True,
            compress_json=False,
        )

        if session_file and Path(session_file).exists():
            self.loader.load_session_from_file(username, session_file)

    def get_profile_posts(self, username: str, limit: int = 50) -> list[dict]:
        """Get recent posts from a profile."""
        profile = instaloader.Profile.from_username(self.loader.context, username)
        posts = []

        for i, post in enumerate(profile.get_posts()):
            if i >= limit:
                break

            posts.append({
                'shortcode': post.shortcode,
                'url': f'https://instagram.com/p/{post.shortcode}/',
                'caption': post.caption,
                'timestamp': post.date_utc.isoformat(),
                'likes': post.likes,
                'comments': post.comments,
                'is_video': post.is_video,
                'video_url': post.video_url if post.is_video else None,
            })

        return posts

    def download_post(self, shortcode: str, output_dir: Path):
        """Download a single post's media."""
        post = instaloader.Post.from_shortcode(self.loader.context, shortcode)
        self.loader.download_post(post, target=str(output_dir))
```

### TikTok with yt-dlp

```python
import yt_dlp
from pathlib import Path


def scrape_tiktok_profile(username: str, limit: int = 50) -> list[dict]:
    """Scrape TikTok profile videos."""
    profile_url = f'https://tiktok.com/@{username}'

    ydl_opts = {
        'quiet': True,
        'extract_flat': True,  # Don't download, just get info
        'playlistend': limit,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(profile_url, download=False)
        videos = []

        for entry in info.get('entries', []):
            videos.append({
                'id': entry.get('id'),
                'title': entry.get('title'),
                'url': entry.get('url'),
                'timestamp': entry.get('timestamp'),
                'view_count': entry.get('view_count'),
            })

        return videos


def download_tiktok_video(url: str, output_dir: Path) -> Path:
    """Download a single TikTok video."""
    ydl_opts = {
        'outtmpl': str(output_dir / '%(id)s.%(ext)s'),
        'quiet': True,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return Path(ydl.prepare_filename(info))
```
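The two functions compose into a list-then-download loop. A short usage sketch (the username and output directory are placeholders):

```python
from pathlib import Path

# Usage sketch; username and output directory are placeholders.
out = Path('downloads/tiktok')
out.mkdir(parents=True, exist_ok=True)

for video in scrape_tiktok_profile('example_user', limit=5):
    if video['url']:
        download_tiktok_video(video['url'], out)
```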
## Request patterns

### Rotating user agents and headers

```python
import time

import requests
from fake_useragent import UserAgent


class RequestManager:
    def __init__(self):
        self.ua = UserAgent()
        self.session = requests.Session()

    def get_headers(self) -> dict:
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        }

    def fetch(self, url: str, retry_count: int = 3) -> requests.Response:
        for attempt in range(retry_count):
            try:
                response = self.session.get(
                    url,
                    headers=self.get_headers(),
                    timeout=30
                )
                response.raise_for_status()
                return response
            except requests.RequestException:
                if attempt == retry_count - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff
```

### Respectful scraping with delays

```python
import time
import random
from urllib.parse import urlparse


class PoliteRequester:
    def __init__(self, min_delay: float = 1.0, max_delay: float = 3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request_per_domain = {}

    def wait_for_domain(self, url: str):
        domain = urlparse(url).netloc
        last_request = self.last_request_per_domain.get(domain, 0)

        elapsed = time.time() - last_request
        delay = random.uniform(self.min_delay, self.max_delay)

        if elapsed < delay:
            time.sleep(delay - elapsed)

        self.last_request_per_domain[domain] = time.time()
```

## Ethical considerations

- Always check `robots.txt` before scraping (see the sketch after this list)
- Respect rate limits and add delays between requests
- Don't scrape personal data without consent
- Cache responses to avoid redundant requests
- Identify yourself with a descriptive User-Agent when appropriate
- Stop if you receive explicit blocking signals
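The robots.txt check can be done with the standard library alone. A minimal sketch; the `MyResearchBot` user agent string is a placeholder for your own:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, user_agent: str = 'MyResearchBot') -> bool:
    """Check a URL against the site's robots.txt before fetching."""
    root = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f'{root.scheme}://{root.netloc}/robots.txt')
    try:
        parser.read()
    except OSError:
        return True  # robots.txt unreachable; proceed with extra caution
    return parser.can_fetch(user_agent, url)
```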