Basic crawl
Provide one or more seed URLs and the crawler follows links from there.
Controlling scope
Domain restriction
Set same_domain_only: true (the default) to stay within the seed domain. The crawler then ignores links to external domains.
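To make the domain restriction concrete, here is a minimal sketch of the kind of check it implies. It assumes "same domain" means an exact hostname match with the seed; the text does not say how the crawler treats subdomains, so the function name and the matching rule are illustrative only.

```python
from urllib.parse import urlparse


def is_same_domain(seed_url: str, link_url: str) -> bool:
    """Return True when link_url has the same hostname as seed_url.

    Assumption: "same domain" is an exact hostname match. The real
    crawler may handle subdomains differently.
    """
    seed_host = urlparse(seed_url).hostname  # hostname is lowercased by urlparse
    link_host = urlparse(link_url).hostname
    return seed_host is not None and seed_host == link_host


seed = "https://docs.example.com/"
links = [
    "https://docs.example.com/guide",
    "https://DOCS.example.com/api",
    "https://other.example.org/page",
]

# Keep only the links a same-domain crawl would follow under this assumption.
kept = [url for url in links if is_same_domain(seed, url)]
print(kept)
```

Under this rule the first two links are kept and the third is dropped.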
Depth limit
max_depth controls how many link-hops away from the seeds the crawler goes. Default is 2.
A depth of 0 fetches only the seed pages. A depth of 1 fetches the seeds plus the pages linked directly from them.
Page budget
max_pages caps the total number of pages fetched across the entire crawl. Default is 10, maximum is 50.
Once max_pages is reached, the crawl stops, even if more pages are still reachable.
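The interaction between max_depth and max_pages can be illustrated with a toy model. The sketch below walks an in-memory link graph breadth-first. It is not the crawler's actual implementation: the traversal order, the simulate_crawl function, and the sample site are all assumptions made for illustration.

```python
from collections import deque


def simulate_crawl(links, seeds, max_depth=2, max_pages=10):
    """Breadth-first walk over an in-memory link graph.

    links maps each page to the pages it links to. Returns the pages
    "fetched", in order. Toy model only; the real crawl order may differ.
    """
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(unique_seeds)
    queue = deque((seed, 0) for seed in unique_seeds)  # seeds sit at depth 0
    fetched = []
    while queue and len(fetched) < max_pages:  # page budget caps the whole crawl
        page, depth = queue.popleft()
        fetched.append(page)
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return fetched


site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/docs/api"],
    "/blog": ["/blog/2023"],
    "/blog/2023": ["/blog/2023/post"],
}

print(simulate_crawl(site, ["/"], max_depth=0))               # ['/']
print(simulate_crawl(site, ["/"], max_depth=1))               # ['/', '/docs', '/blog']
print(simulate_crawl(site, ["/"], max_depth=2, max_pages=4))  # stops after 4 pages
```

In the last call, six pages are reachable within two hops, but the budget of four ends the walk early.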
Output formats
The crawl surface supports html, markdown, and json outputs. You can request multiple formats in a single crawl.
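As a rough sketch, a request body asking for several formats could be assembled like this. The field names urls and formats are hypothetical, since the text does not show the request shape; same_domain_only, max_depth, and max_pages and their defaults come from the sections above.

```python
import json

ALLOWED_FORMATS = {"html", "markdown", "json"}  # the formats named in the text
MAX_PAGES_LIMIT = 50  # documented maximum for max_pages


def build_crawl_request(seeds, formats, max_depth=2, max_pages=10,
                        same_domain_only=True):
    """Build a crawl request body as a plain dict.

    "urls" and "formats" are hypothetical field names used for illustration.
    """
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    if max_pages > MAX_PAGES_LIMIT:
        raise ValueError(f"max_pages cannot exceed {MAX_PAGES_LIMIT}")
    return {
        "urls": list(seeds),        # hypothetical field name
        "formats": list(formats),   # hypothetical field name
        "same_domain_only": same_domain_only,
        "max_depth": max_depth,
        "max_pages": max_pages,
    }


body = build_crawl_request(
    ["https://docs.example.com/"],
    formats=["markdown", "json"],
    max_pages=5,
)
print(json.dumps(body, indent=2))
```

Validating locally before sending keeps an obviously invalid request, such as an unknown format or a max_pages above 50, from consuming a round trip.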
Async crawls
Large crawls always run as async jobs. Submit with POST /v1/jobs.
Poll GET /v1/jobs/{id} for completion. See Async Jobs.
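The submit-then-poll pattern can be sketched as a small loop. The response shape is an assumption: the status field and the values completed and failed are placeholders, so check Async Jobs for the real ones. The HTTP call is passed in as a callable, so the loop can be tried without a network connection.

```python
import time


def wait_for_job(get_job, job_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll get_job(job_id) until the job reaches an assumed terminal state.

    get_job is any callable that performs GET /v1/jobs/{id} and returns the
    parsed response as a dict. The "status" field and its values are
    assumptions, not confirmed API details.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        if job.get("status") in {"completed", "failed"}:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)  # wait before the next poll


# Stand-in for the real HTTP call: two in-progress responses, then a final one.
fake_responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "completed"},
])
result = wait_for_job(lambda _id: next(fake_responses), "job_123",
                      interval=0, sleep=lambda seconds: None)
print(result["status"])
```

In real use, get_job would wrap an HTTP client call, and interval should be long enough to avoid hammering the endpoint.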
Crawl budget tips
- Start with max_pages: 5 to verify the crawl finds the right pages before scaling up.
- Lower max_depth to avoid crawling into unrelated sections (e.g. blog archives linked from docs).
- Use multiple seeds to cover isolated sections of a site without needing deep traversal.