Name: Baoyu Url To Markdown
Author: Jimliu
Install
Terminal · npx
$npx skills add https://github.com/jimliu/baoyu-skills --skill baoyu-url-to-markdown
Works with Paperclip
How Baoyu Url To Markdown fits into a Paperclip company.

Baoyu Url To Markdown drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
SaaS FactoryPaired
Pre-configured AI company — 18 agents, 18 skills, one-time purchase.
$27$59
Explore pack
Source file
SKILL.md324 linesmarkdown
Expand
1---2name: baoyu-url-to-markdown3description: Fetch any URL and convert to markdown using baoyu-fetch CLI (Chrome CDP with site-specific adapters). Built-in adapters for X/Twitter, YouTube transcripts, Hacker News threads, and generic pages via Defuddle. Handles login/CAPTCHA via interaction wait modes. Use when user wants to save a webpage as markdown.4version: 1.60.05metadata:6  openclaw:7    homepage: https://github.com/JimLiu/baoyu-skills#baoyu-url-to-markdown8    requires:9      anyBins:10        - bun11        - npx12---13 14# URL to Markdown15 16Fetches any URL via `baoyu-fetch` CLI (Chrome CDP + site-specific adapters) and converts it to clean markdown.17 18## CLI Setup19 20**Important**: The CLI source is vendored in the `scripts/vendor/baoyu-fetch/` subdirectory of this skill.21 22**Agent Execution Instructions**:231. Determine this SKILL.md file's directory path as `{baseDir}`242. CLI entry point = `{baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts`253. Resolve `${BUN_X}` runtime: if `bun` installed → `bun`; if `npx` available → `npx -y bun`; else suggest installing bun264. `${READER}` = `${BUN_X} {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts`275. Replace all `${READER}` in this document with the resolved value28 29## Preferences (EXTEND.md)30 31Check EXTEND.md existence (priority order):32 33```bash34# macOS, Linux, WSL, Git Bash35test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project"36test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg"37test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"38```39 40```powershell41# PowerShell (Windows)42if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" }43$xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" }44if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" }45if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }46```47 48| Path | Location |49|------|----------|50| `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | Project directory |51| `$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | User home |52 53| Result | Action |54|--------|--------|55| Found | Read, parse, apply settings |56| Not found | **MUST** run first-time setup (see below) — do NOT silently create defaults |57 58**EXTEND.md Supports**: Download media by default | Default output directory59 60### First-Time Setup (BLOCKING)61 62**CRITICAL**: When EXTEND.md is not found, you **MUST use `AskUserQuestion`** to ask the user for their preferences before creating EXTEND.md. **NEVER** create EXTEND.md with defaults without asking. This is a **BLOCKING** operation — do NOT proceed with any conversion until setup is complete.63 64Use `AskUserQuestion` with ALL questions in ONE call:65 66**Question 1** — header: "Media", question: "How to handle images and videos in pages?"67- "Ask each time (Recommended)" — After saving markdown, ask whether to download media68- "Always download" — Always download media to local imgs/ and videos/ directories69- "Never download" — Keep original remote URLs in markdown70 71**Question 2** — header: "Output", question: "Default output directory?"72- "url-to-markdown (Recommended)" — Save to ./url-to-markdown/{domain}/{slug}.md73- (User may choose "Other" to type a custom path)74 75**Question 3** — header: "Save", question: "Where to save preferences?"76- "User (Recommended)" — ~/.baoyu-skills/ (all projects)77- "Project" — .baoyu-skills/ (this project only)78 79After user answers, create EXTEND.md at the chosen location, confirm "Preferences saved to [path]", then continue.80 81Full reference: [references/config/first-time-setup.md](references/config/first-time-setup.md)82 83### Supported Keys84 85| Key | Default | Values | Description |86|-----|---------|--------|-------------|87| `download_media` | `ask` | `ask` / `1` / `0` | `ask` = prompt each time, `1` = always download, `0` = never |88| `default_output_dir` | empty | path or empty | Default output directory (empty = `./url-to-markdown/`) |89 90**EXTEND.md → CLI mapping**:91| EXTEND.md key | CLI argument | Notes |92|---------------|-------------|-------|93| `download_media: 1` | `--download-media` | Requires `--output` to be set |94| `default_output_dir: ./posts/` | Agent constructs `--output ./posts/{domain}/{slug}.md` | Agent generates path, not a direct CLI flag |95 96**Value priority**:971. CLI arguments (`--download-media`, `--output`)982. EXTEND.md993. Skill defaults100 101## Features102 103- Chrome CDP for full JavaScript rendering via `baoyu-fetch` CLI104- Site-specific adapters: X/Twitter, YouTube, Hacker News, generic (Defuddle)105- Automatic adapter selection based on URL, or force with `--adapter`106- Interaction gate detection: Cloudflare, reCAPTCHA, hCAPTCHA, custom challenges107- Two capture modes: headless (default) or interactive with wait-for-interaction108- Clean markdown output with YAML front matter109- Structured JSON output available via `--format json`110- X/Twitter: extracts tweets, threads, and X Articles with media111- YouTube: transcript/caption extraction, chapters, cover images112- Hacker News: threaded comment parsing with proper nesting113- Generic: Defuddle extraction with Readability fallback114- Download images and videos to local directories115- Chrome profile persistence for authenticated sessions116- Debug artifact output for troubleshooting117 118## Usage119 120```bash121# Default: headless capture, markdown to stdout122${READER} <url>123 124# Save to file125${READER} <url> --output article.md126 127# Save with media download128${READER} <url> --output article.md --download-media129 130# Headless mode (explicit)131${READER} <url> --headless --output article.md132 133# Wait for interaction (login/CAPTCHA) — auto-detect and continue134${READER} <url> --wait-for interaction --output article.md135 136# Wait for interaction — manual control (Enter to continue)137${READER} <url> --wait-for force --output article.md138 139# JSON output140${READER} <url> --format json --output article.json141 142# Force specific adapter143${READER} <url> --adapter youtube --output transcript.md144 145# Connect to existing Chrome146${READER} <url> --cdp-url http://localhost:9222 --output article.md147 148# Debug artifacts149${READER} <url> --output article.md --debug-dir ./debug/150```151 152## Options153 154| Option | Description |155|--------|-------------|156| `<url>` | URL to fetch |157| `--output <path>` | Output file path (default: stdout) |158| `--format <type>` | Output format: `markdown` (default) or `json` |159| `--json` | Shorthand for `--format json` |160| `--adapter <name>` | Force adapter: `x`, `youtube`, `hn`, or `generic` (default: auto-detect) |161| `--headless` | Force headless Chrome (no visible window) |162| `--wait-for <mode>` | Interaction wait mode: `none` (default), `interaction`, or `force` |163| `--wait-for-interaction` | Alias for `--wait-for interaction` |164| `--wait-for-login` | Alias for `--wait-for interaction` |165| `--timeout <ms>` | Page load timeout (default: 30000) |166| `--interaction-timeout <ms>` | Login/CAPTCHA wait timeout (default: 600000 = 10 min) |167| `--interaction-poll-interval <ms>` | Poll interval for interaction checks (default: 1500) |168| `--download-media` | Download images/videos to local `imgs/` and `videos/`, rewrite markdown links. Requires `--output` |169| `--media-dir <dir>` | Base directory for downloaded media (default: same as `--output` directory) |170| `--cdp-url <url>` | Reuse existing Chrome DevTools Protocol endpoint |171| `--browser-path <path>` | Custom Chrome/Chromium binary path |172| `--chrome-profile-dir <path>` | Chrome user data directory (default: `BAOYU_CHROME_PROFILE_DIR` env or `./baoyu-skills/chrome-profile`) |173| `--debug-dir <dir>` | Write debug artifacts (document.json, markdown.md, page.html, network.json) |174 175## Capture Modes176 177| Mode | Behavior | Use When |178|------|----------|----------|179| Default | Headless Chrome, auto-extract on network idle | Public pages, static content |180| `--headless` | Explicit headless (same as default) | Clarify intent |181| `--wait-for interaction` | Opens visible Chrome, auto-detects login/CAPTCHA gates, waits for them to clear, then continues | Login-required, CAPTCHA-protected |182| `--wait-for force` | Opens visible Chrome, auto-detects OR accepts Enter keypress to continue | Complex flows, lazy loading, paywalls |183 184**Interaction gate auto-detection**:185- Cloudflare Turnstile / "just a moment" pages186- Google reCAPTCHA187- hCaptcha188- Custom challenge / verification screens189 190**Wait-for-interaction workflow**:1911. Run with `--wait-for interaction` → Chrome opens visibly1922. CLI auto-detects login/CAPTCHA gates1933. User completes login or solves CAPTCHA in the browser1944. CLI auto-detects gate cleared → captures page1955. If `--wait-for force` is used, user can also press Enter to trigger capture manually196 197## Agent Quality Gate198 199**CRITICAL**: The agent must treat default headless capture as provisional. Some sites render differently in headless mode and can silently return low-quality content without causing the CLI to fail.200 201After every headless run, the agent **MUST** inspect the saved markdown output.202 203### Quality checks the agent must perform204 2051. Confirm the markdown title matches the target page, not a generic site shell2062. Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error2073. Watch for obvious failure signs:208   - `Application error`209   - `This page could not be found`210   - Login, signup, subscribe, or verification shells211   - Extremely short markdown for a page that should be long-form212   - Raw framework payloads or mostly boilerplate content2134. If the result is low quality, incomplete, or clearly wrong, do **not** accept the run as successful just because the CLI exited with code 0214 215**Tip**: Use `--format json` to get structured output including `status`, `login.state`, and `interaction` fields for programmatic quality assessment. A `"status": "needs_interaction"` response means the page requires manual interaction.216 217### Recovery workflow the agent must follow218 2191. First run headless (default) unless there is already a clear reason to use interaction mode2202. Review markdown quality immediately after the run2213. If the content is low quality or indicates login/CAPTCHA:222   - `--wait-for interaction` for auto-detected gates (login, CAPTCHA, Cloudflare)223   - `--wait-for force` when the page needs manual browsing, scroll loading, or complex interaction2244. If `--wait-for` is used, tell the user exactly what to do:225   - If login is required, ask them to sign in in the browser226   - If CAPTCHA appears, ask them to solve it227   - If the page needs time to load, ask them to wait until content is visible228   - For `--wait-for force`: tell them to press Enter when ready2295. If JSON output shows `"status": "needs_interaction"`, switch to `--wait-for interaction` automatically230 231## Output Path Generation232 233The agent must construct the output file path since `baoyu-fetch` does not auto-generate paths.234 235**Algorithm**:2361. Determine base directory from EXTEND.md `default_output_dir` or default `./url-to-markdown/`2372. Extract domain from URL (e.g., `example.com`)2383. Generate slug from URL path or page title (kebab-case, 2-6 words)2394. Construct: `{base_dir}/{domain}/{slug}/{slug}.md` — each URL gets its own directory so media files stay isolated2405. Conflict resolution: append timestamp `{slug}-YYYYMMDD-HHMMSS/{slug}-YYYYMMDD-HHMMSS.md`241 242Pass the constructed path to `--output`. Media files (`--download-media`) are saved into subdirectories next to the markdown file, keeping each URL's assets self-contained.243 244## Output Format245 246Markdown output to stdout (or file with `--output`) as clean markdown text.247 248JSON output (`--format json`) returns structured data including:249- `adapter` — which adapter handled the URL250- `status` — `"ok"` or `"needs_interaction"`251- `login` — login state detection (`logged_in`, `logged_out`, `unknown`)252- `interaction` — interaction gate details (kind, provider, prompt)253- `document` — structured content (url, title, author, publishedAt, content blocks, metadata)254- `media` — collected media assets with url, kind, role255- `markdown` — converted markdown text256- `downloads` — media download results (when `--download-media` used)257 258When `--download-media` is enabled:259- Images are saved to `imgs/` next to the output file (or in `--media-dir`)260- Videos are saved to `videos/` next to the output file (or in `--media-dir`)261- Markdown media links are rewritten to local relative paths262 263## Built-in Adapters264 265| Adapter | URLs | Key Features |266|---------|------|-------------|267| `x` | x.com, twitter.com | Tweets, threads, X Articles, media, login detection |268| `youtube` | youtube.com, youtu.be | Transcript/captions, chapters, cover image, metadata |269| `hn` | news.ycombinator.com | Threaded comments, story metadata, nested replies |270| `generic` | Any URL (fallback) | Defuddle extraction, Readability fallback, auto-scroll, network idle detection |271 272Adapter is auto-selected based on URL. Use `--adapter <name>` to override.273 274## Media Download Workflow275 276Based on `download_media` setting in EXTEND.md:277 278| Setting | Behavior |279|---------|----------|280| `1` (always) | Run CLI with `--download-media --output <path>` |281| `0` (never) | Run CLI with `--output <path>` (no media download) |282| `ask` (default) | Follow the ask-each-time flow below |283 284### Ask-Each-Time Flow285 2861. Run CLI **without** `--download-media` with `--output <path>` → markdown saved2872. Check saved markdown for remote media URLs (`https://` in image/video links)2883. **If no remote media found** → done, no prompt needed2894. **If remote media found** → use `AskUserQuestion`:290   - header: "Media", question: "Download N images/videos to local files?"291   - "Yes" — Download to local directories292   - "No" — Keep remote URLs2935. If user confirms → run CLI **again** with `--download-media --output <same-path>` (overwrites markdown with localized links)294 295## Environment Variables296 297| Variable | Description |298|----------|-------------|299| `BAOYU_CHROME_PROFILE_DIR` | Chrome user data directory (can also use `--chrome-profile-dir`) |300 301**Troubleshooting**: Chrome not found → use `--browser-path`. Timeout → increase `--timeout`. Login/CAPTCHA pages → use `--wait-for interaction`. Debug → use `--debug-dir` to inspect captured HTML and network logs.302 303### YouTube Notes304 305- YouTube adapter extracts transcripts/captions automatically when available306- Transcript format: `[MM:SS] Text segment` with chapter headings307- Transcript availability depends on YouTube exposing a caption track. Videos with captions disabled or restricted playback may produce description-only output308- Use `--wait-for force` if the page needs time to finish loading player metadata309 310### X/Twitter Notes311 312- Extracts single tweets, threads, and X Articles313- Auto-detects login state; if logged out and content requires auth, JSON output will show `"status": "needs_interaction"`314- Use `--wait-for interaction` for login-protected content315 316### Hacker News Notes317 318- Parses threaded comments with proper nesting and reply hierarchy319- Includes story metadata (title, URL, author, score, comment count)320- Shows comment deletion/dead status321 322## Extension Support323 324Custom configurations via EXTEND.md. See **Preferences** section for paths and supported options.
Related skills
Baoyu Article Illustrator

Baoyu-article-illustrator analyzes article content and automatically identifies positions where visual aids would enhance understanding, then generates illustra
Baoyu Comic

baoyu-comic generates original educational comics from markdown content with customizable art styles (ligne-claire, manga, realistic, ink-brush, chalk, minimali
Baoyu Compress Image

Baoyu-compress-image compresses images to WebP or PNG format using the best available system tool (sips, cwebp, ImageMagick, or Sharp) selected based on what's