name: Using web scraping
description: Search and scrape public web content with headless Chrome and DuckDuckGo, following safe practices.

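Web Scraping Skill: Chrome (Playwright) + DuckDuckGo

A privacy-conscious, agent-oriented web scraping skill that uses headless Chrome (Playwright/Puppeteer) and DuckDuckGo for search. It focuses on reliable navigation, structured text extraction, and respect for robots.txt and rate limits.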

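When to use

Collecting public web content for summarization, metadata extraction, or link discovery.
Running queries through DuckDuckGo when you want a privacy-respecting search source.
Not for bypassing paywalls, scraping private or login-gated content, or violating terms of service.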

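Safety and etiquette

Always check and respect /robots.txt before scraping a site.
Rate-limit requests (default: 1 request/second) and send a polite, identifiable User-Agent string.
Avoid executing arbitrary user-supplied JavaScript on scraped pages.
Scrape public content only; if a page requires login, return login_required instead of trying to get around it.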
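The robots.txt and rate-limit rules above can be sketched as two small helpers. This is a minimal illustration, not a complete Robots Exclusion Protocol implementation: the names `parseRobots`, `isAllowed`, and `createRateLimiter` are hypothetical, matching is by plain path prefix (no `*` or `$` wildcards), and the longest matching rule wins.

```javascript
// Parse robots.txt text into groups of { agents, rules }.
// Simplified: only User-agent, Allow, and Disallow lines are interpreted.
function parseRobots(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*/, '').trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if ((field === 'allow' || field === 'disallow') && current && value) {
        current.rules.push({ allow: field === 'allow', path: value });
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// Return true if `path` may be fetched by `userAgent` under `robotsTxt`.
function isAllowed(robotsTxt, userAgent, path) {
  const groups = parseRobots(robotsTxt);
  const ua = userAgent.toLowerCase();
  // Prefer groups that name this agent; otherwise fall back to "*"
  const specific = groups.filter((g) =>
    g.agents.some((a) => a && a !== '*' && ua.includes(a)));
  const chosen = specific.length
    ? specific
    : groups.filter((g) => g.agents.includes('*'));
  let best = null;
  for (const group of chosen) {
    for (const rule of group.rules) {
      if (!path.startsWith(rule.path)) continue;
      const longer = !best || rule.path.length > best.path.length;
      const tieAllow = best && rule.path.length === best.path.length && rule.allow;
      if (longer || tieAllow) best = rule;
    }
  }
  return best ? best.allow : true; // no matching rule means allowed
}

// Rate limiter: each call reserves the next slot and returns how many
// milliseconds the caller should wait before sending its request.
function createRateLimiter(minIntervalMs = 1000) {
  let nextSlot = 0;
  return function reserve(now = Date.now()) {
    const start = Math.max(now, nextSlot);
    nextSlot = start + minIntervalMs;
    return start - now;
  };
}
```

A caller would fetch `/robots.txt` once per host, pass its text to `isAllowed` before each navigation, and sleep for the number of milliseconds returned by the limiter before each request.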

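Capabilities

Search DuckDuckGo and return the top N result links.
Visit each result page in headless Chrome and extract the title, meta description, main text (or a best-effort extraction of the article text), and canonical URL.
Return results as structured JSON for downstream consumers.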
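As a rough sketch of the extraction and JSON steps listed above, the helpers below read the four fields from a Playwright `page` and normalize them into one record. The names `extractFields` and `buildResult`, the 2-second timeouts, and the extra `canonical` key in the output are assumptions made for illustration.

```javascript
// Read raw fields from an already-loaded Playwright page.
// Missing elements resolve to null instead of throwing.
async function extractFields(page) {
  const attr = (selector, name) =>
    page.locator(selector).first()
      .getAttribute(name, { timeout: 2000 })
      .catch(() => null);
  return {
    title: await page.title(),
    description: await attr('meta[name="description"]', 'content'),
    canonical: await attr('link[rel="canonical"]', 'href'),
    text: await page.locator('article, main, #content').first()
      .innerText({ timeout: 2000 })
      .catch(() => null),
  };
}

// Normalize raw fields into a JSON-friendly record.
function buildResult(pageUrl, fields) {
  // Collapse whitespace in single-line fields; empty strings become null
  const oneLine = (s) =>
    typeof s === 'string' && s.trim() ? s.replace(/\s+/g, ' ').trim() : null;
  let canonical = null;
  if (fields.canonical) {
    try {
      // Resolve relative canonical links against the page URL
      canonical = new URL(fields.canonical, pageUrl).href;
    } catch {
      canonical = null;
    }
  }
  return {
    url: pageUrl,
    title: oneLine(fields.title),
    description: oneLine(fields.description),
    text: typeof fields.text === 'string' && fields.text.trim()
      ? fields.text.trim()
      : null,
    canonical,
  };
}
```

Keeping `buildResult` free of browser calls means the output shape can be checked without launching Chrome.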

Examples

Node.js (Playwright)

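const { chromium } = require('playwright');

async function ddgSearchAndScrape(query) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage({ userAgent: 'open-skills-bot/1.0' });

    // DuckDuckGo search. Assumption: the HTML-only endpoint is used because its
    // markup matches the `.result__title a` selector; adjust both if the markup changes.
    await page.goto('https://html.duckduckgo.com/html/?q=' + encodeURIComponent(query));
    await page.waitForSelector('.result__title a');

    // Take the URL of the top result
    const rawHref = await page.getAttribute('.result__title a', 'href');
    if (!rawHref) return [];
    // Result links may be redirects of the form //duckduckgo.com/l/?uddg=<target>; unwrap them
    const link = new URL(rawHref, 'https://duckduckgo.com');
    const href = link.searchParams.get('uddg') || link.href;

    // Visit the result and extract the title, meta description, and main text
    await page.goto(href, { waitUntil: 'domcontentloaded' });
    const title = await page.title();
    const description = await page.locator('meta[name="description"]').first()
      .getAttribute('content', { timeout: 2000 }).catch(() => null);
    const article = await page.locator('article, main, #content').first()
      .innerText({ timeout: 2000 }).catch(() => null);

    return [{ url: href, title, description, text: article }];
  } finally {
    // Always release the browser, even if navigation or extraction throws
    await browser.close();
  }
}

// Usage
// ddgSearchAndScrape('open-source agent runtimes').then(console.log);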
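The example above visits only the first hit, while the skill describes returning the top N result links. One possible way to collect them is sketched below. It assumes the results page exposes links under `.result__title a` and that some hrefs are DuckDuckGo redirects carrying the target in a `uddg` query parameter; `resolveResultUrl`, `topResultUrls`, and `collectTopResults` are hypothetical helper names.

```javascript
// Unwrap a DuckDuckGo redirect href (assumed form:
// //duckduckgo.com/l/?uddg=<encoded target>); other hrefs pass through.
function resolveResultUrl(href) {
  const u = new URL(href, 'https://duckduckgo.com');
  const isRedirect = u.hostname.endsWith('duckduckgo.com') && u.pathname === '/l/';
  const target = isRedirect ? u.searchParams.get('uddg') : null;
  return target || u.href;
}

// Keep the first n distinct http(s) URLs, in result order.
function topResultUrls(hrefs, n) {
  const seen = new Set();
  const out = [];
  for (const href of hrefs) {
    if (!href) continue;
    let url;
    try {
      url = resolveResultUrl(href);
    } catch {
      continue; // skip hrefs that are not valid URLs
    }
    if (!/^https?:/.test(url) || seen.has(url)) continue;
    seen.add(url);
    out.push(url);
    if (out.length >= n) break;
  }
  return out;
}

// Read every result href from a loaded results page (Playwright `page`).
async function collectTopResults(page, n = 5) {
  const hrefs = await page.locator('.result__title a').evaluateAll(
    (links) => links.map((a) => a.getAttribute('href')));
  return topResultUrls(hrefs, n);
}
```

Each returned URL can then be visited in turn, with the rate limit applied between navigations.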

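Agent prompt (copy/paste)

You are an agent with a web scraping skill. For any `search:` task, use DuckDuckGo to find relevant pages, then open each page in a headless Chrome instance (Playwright/Puppeteer) and extract the `title`, `meta description`, `main text`, and `canonical` URL. Always:
- Check and respect robots.txt
- Rate-limit requests (≤1 req/sec)
- Use a clear `User-Agent` and do not execute arbitrary page JS
Return results as JSON: [{url,title,description,text}], or `login_required` if the page requires authentication.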

Quick setup

Node: run npm i playwright, then npx playwright install to download the browser binaries.
Python: run pip install playwright, then playwright install.

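Tips

Use page.route to block heavy resources (such as images and fonts) when you only need text.
Respect site terms, and use exponential backoff for retries.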
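Both tips can be sketched in a few lines. The blocked resource types, the 500 ms base delay, the 30-second cap, and the helper names `blockHeavyResources`, `backoffDelay`, and `withRetry` are illustrative assumptions, not values prescribed by this skill.

```javascript
// Resource types to skip when only text is needed
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

// Abort requests for heavy resources on a Playwright page; let the rest through.
async function blockHeavyResources(page) {
  await page.route('**/*', (route) =>
    BLOCKED_TYPES.has(route.request().resourceType())
      ? route.abort()
      : route.continue());
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run `task`, retrying on failure with exponential backoff.
async function withRetry(task, { retries = 3, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the error
      await new Promise((resolve) =>
        setTimeout(resolve, backoffDelay(attempt, baseMs)));
    }
  }
}
```

For example, `await withRetry(() => page.goto(url, { waitUntil: 'domcontentloaded' }))` would retry a failed navigation up to three times. Adding random jitter to the delay is a common refinement when many workers retry at once.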

See also

using-youtube-download.md: media-specific scraping and download examples.