The /scrapes endpoints let you trigger ad-hoc scrapes and retrieve their metadata. Requests are synchronous: ScrapeNest dispatches the job and waits for a result (subject to timeout).

Submit a scrape

POST /scrapes (auth required)

Payload (JSON):

{
  "url": "https://example.com",
  "backend": "playwright",
  "include_body": true,
  "take_screenshot": false,
  "timeout": 30,
  "wait_until": "networkidle"
}

Fields:

  • url (required): URL to fetch.
  • backend (required): one of httpx, playwright, scrapling.
  • include_body (default true, alias saveHtml): return HTML content.
  • take_screenshot (default false, alias takeScreenshot): capture screenshot.
  • timeout (optional): positive seconds; wait budget for the scrape itself.
  • wait_until (optional): load | domcontentloaded | networkidle (Playwright-like).
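The field rules above can be checked client-side before submitting. A minimal sketch in Python, assuming nothing beyond the documented fields — the helper name and the choice to validate locally are illustrative, not part of the API:

```python
import json

# Allowed values taken from the field descriptions above.
VALID_BACKENDS = {"httpx", "playwright", "scrapling"}
VALID_WAIT_UNTIL = {"load", "domcontentloaded", "networkidle"}

def build_scrape_payload(url, backend, include_body=True,
                         take_screenshot=False, timeout=None, wait_until=None):
    """Build and locally validate a POST /scrapes JSON payload."""
    if backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {sorted(VALID_BACKENDS)}")
    if timeout is not None and timeout <= 0:
        raise ValueError("timeout must be positive seconds")
    if wait_until is not None and wait_until not in VALID_WAIT_UNTIL:
        raise ValueError(f"wait_until must be one of {sorted(VALID_WAIT_UNTIL)}")
    payload = {
        "url": url,
        "backend": backend,
        "include_body": include_body,
        "take_screenshot": take_screenshot,
    }
    # Optional fields are omitted rather than sent as null.
    if timeout is not None:
        payload["timeout"] = timeout
    if wait_until is not None:
        payload["wait_until"] = wait_until
    return json.dumps(payload)
```

Validating before the request surfaces mistakes immediately instead of as a 422 from the server.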

Response (success example, HTTP 200):

{
  "id": "scrape-id",
  "jobId": "scrape-id",
  "resultId": "result-id",
  "resultKey": "s3/key",
  "status": "succeeded",
  "ok": true,
  "statusCode": 200,
  "headers": { "content-type": "text/html" },
  "html": "<html>...</html>",
  "htmlKey": "s3/html/key",
  "screenshot": null,
  "screenshotKey": null,
  "errorCode": null,
  "errorMessage": null
}

Failure shapes reuse the same envelope with ok: false, status: "failed", and errorCode / errorMessage. HTTP status codes:

  • 422: request validation failed, or the scrape ran but failed.
  • 504: scrape result timed out (errorCode: "timeout").
  • 502: gateway error (errorCode: "gateway_error").
  • 401: missing/invalid token.

The service adds 15 seconds to the provided timeout while waiting for the result, so your client-side read timeout should budget for both the fetch itself and result propagation.
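That budgeting can be made explicit in the client. A small sketch — the 15-second buffer comes from the text above, while the extra network slack is an assumed value, not something the API specifies:

```python
# Documented: the service waits (timeout + 15 s) for the result.
RESULT_WAIT_BUFFER = 15

def client_read_timeout(scrape_timeout, network_slack=5):
    """Seconds a client should wait on POST /scrapes before giving up:
    the scrape's own budget, the service's 15 s result-wait buffer,
    plus a little slack for network round trips (assumed value)."""
    return scrape_timeout + RESULT_WAIT_BUFFER + network_slack
```

With the example payload's timeout of 30, the client would wait 50 seconds before treating the request as hung.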

List scrapes

GET /scrapes (auth required)

Query params:

  • status: pending|succeeded|failed
  • backend: httpx|playwright|scrapling
  • q: substring match on URL
  • Pagination: page (default 1), perPage (default 20, max 100)
  • Sorting: sort (createdAt|completedAt), dir (asc|desc)
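Assembling the query string from these parameters is mechanical; one possible sketch, assuming a placeholder base URL (the real host is deployment-specific) and clamping perPage to the documented maximum of 100:

```python
from urllib.parse import urlencode

def list_scrapes_url(base="https://api.example.com", status=None, backend=None,
                     q=None, page=1, per_page=20, sort="createdAt", dir="desc"):
    """Assemble a GET /scrapes URL; optional filters are omitted when unset."""
    params = {"page": page, "perPage": min(per_page, 100), "sort": sort, "dir": dir}
    if status:
        params["status"] = status
    if backend:
        params["backend"] = backend
    if q:
        params["q"] = q
    return f"{base}/scrapes?{urlencode(params)}"
```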

Response:

{
  "items": [
    {
      "id": "uuid",
      "url": "https://example.com",
      "backend": "playwright",
      "status": "succeeded",
      "statusCode": 200,
      "saveHtml": true,
      "takeScreenshot": false,
      "createdAt": "2024-01-01T12:00:00Z",
      "completedAt": "2024-01-01T12:00:05Z",
      "screenshotUrl": null,
      "errorCode": null
    }
  ],
  "total": 1,
  "page": 1,
  "perPage": 20,
  "pageCount": 1
}
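The pagination fields relate in the usual way: pageCount is the ceiling of total divided by perPage. A sketch of iterating every page, assuming the helper names are illustrative:

```python
import math

def page_count(total, per_page):
    """pageCount as returned by GET /scrapes: ceil(total / perPage)."""
    return math.ceil(total / per_page) if per_page else 0

def pages_to_fetch(total, per_page):
    """1-indexed page numbers needed to see every item."""
    return list(range(1, page_count(total, per_page) + 1))
```

For example, 41 results at the default perPage of 20 spread over pages 1 through 3.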

Get a scrape

GET /scrapes/{id} (auth required)

Returns the same metadata as the list plus errorMessage. Scrape payloads (HTML, screenshot, headers) are only included in the immediate POST /scrapes response.