
# Batch Document Intake

> Process many files through Extract, Schema, Tables, or Split with polling and retry-safe writes.

## Goal

Move from one-off document extraction to a repeatable intake queue for folders, customer uploads, or backfills.

## Use This Workflow

```mermaid
sequenceDiagram
    participant App
    participant Pulse
    participant Store
    App->>Pulse: POST /batch/extract
    Pulse-->>App: batch_job_id
    App->>Pulse: GET /job/{batch_job_id}
    Pulse-->>Store: child extraction artifacts
    App->>Store: upsert normalized records
```

Use batch when you already have a set of URLs or a storage prefix to process. Use webhooks when job completion should notify your backend automatically instead of being discovered by polling.
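The polling step in the diagram can be sketched as a small loop. This is a minimal sketch, not the Pulse client: `fetch_status` is a hypothetical caller-supplied function that performs `GET /job/{batch_job_id}` and returns a status string, and the terminal status names are assumptions to check against the endpoint reference.

```python
import time


def poll_job(fetch_status, job_id, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Poll until the job reaches a terminal status or attempts run out.

    fetch_status is a hypothetical caller-supplied function that performs
    GET /job/{job_id} and returns the job's status string.
    """
    # Assumed terminal status names; confirm them in the endpoint reference.
    terminal = {"completed", "failed", "canceled"}
    for attempt in range(max_attempts):
        status = fetch_status(job_id)
        if status in terminal:
            return status
        # Do not sleep after the final attempt; fail fast instead.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {max_attempts} polls")
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any HTTP library and makes it easy to exercise in tests without waiting.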

## Request

```bash
curl -X POST https://api.runpulse.com/batch/extract \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "file_urls": [
        "https://platform.runpulse.com/api/examples/637e5678-30b1-45fa-acc4-877f2d636419/pdf",
        "https://platform.runpulse.com/api/examples/18ed11c2-dbce-4bf5-8385-102c55d13480/pdf"
      ]
    },
    "output": {
      "s3_prefix": "s3://customer-results/pulse/extractions/"
    },
    "extract_options": {
      "storage": {"enabled": true}
    },
    "workers": 4
  }'
```
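The same request can be assembled from Python using only the standard library. This is a sketch under stated assumptions: the helper names `build_batch_request` and `submit_batch` are hypothetical, the payload mirrors the curl example field for field, and the response is assumed to be a JSON object (the sequence diagram suggests it carries a `batch_job_id`, but the exact shape is not confirmed here).

```python
import json
import urllib.request

API_URL = "https://api.runpulse.com/batch/extract"  # endpoint from the curl example


def build_batch_request(api_key, file_urls, s3_prefix, workers=4):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "input": {"file_urls": list(file_urls)},
        "output": {"s3_prefix": s3_prefix},
        "extract_options": {"storage": {"enabled": True}},
        "workers": workers,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def submit_batch(request, timeout=30):
    """Send the request and return the decoded JSON body.

    Assumption: the response body is a JSON object.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating construction from sending lets you log or persist the exact payload alongside your intake rows before anything goes over the network.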

## Intake Record

Create one row per source document before you submit the batch, then fill in the job and extraction IDs as they become available:

| Field                | Why                                             |
| -------------------- | ----------------------------------------------- |
| `source_document_id` | Idempotency key from your app.                  |
| `source_url`         | Reproduce or debug the input.                   |
| `batch_job_id`       | Track the parent job.                           |
| `child_job_id`       | Track individual file status.                   |
| `extraction_id`      | Chain into Schema, Tables, Split, or retrieval. |
| `status`             | Drive retry and review queues.                  |
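One way to make these writes retry-safe is an upsert keyed on `source_document_id`. The sketch below uses SQLite from the Python standard library purely for illustration; the table name `intake`, the helper `upsert_record`, and the status values are assumptions, and the `ON CONFLICT ... DO UPDATE` clause requires SQLite 3.24 or newer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS intake (
    source_document_id TEXT PRIMARY KEY,
    source_url         TEXT NOT NULL,
    batch_job_id       TEXT,
    child_job_id       TEXT,
    extraction_id      TEXT,
    status             TEXT NOT NULL DEFAULT 'pending'
)
"""

# COALESCE keeps an ID that was already stored when a later write omits it.
UPSERT = """
INSERT INTO intake (source_document_id, source_url, batch_job_id,
                    child_job_id, extraction_id, status)
VALUES (:source_document_id, :source_url, :batch_job_id,
        :child_job_id, :extraction_id, :status)
ON CONFLICT(source_document_id) DO UPDATE SET
    batch_job_id  = COALESCE(excluded.batch_job_id, intake.batch_job_id),
    child_job_id  = COALESCE(excluded.child_job_id, intake.child_job_id),
    extraction_id = COALESCE(excluded.extraction_id, intake.extraction_id),
    status        = excluded.status
"""


def upsert_record(conn, source_document_id, source_url, status="pending",
                  batch_job_id=None, child_job_id=None, extraction_id=None):
    """Insert the row, or update it in place if the source ID already exists."""
    conn.execute(UPSERT, {
        "source_document_id": source_document_id,
        "source_url": source_url,
        "batch_job_id": batch_job_id,
        "child_job_id": child_job_id,
        "extraction_id": extraction_id,
        "status": status,
    })
    conn.commit()
```

Because the source ID is the primary key, replaying the same write after a crash or a duplicate webhook delivery updates the existing row instead of creating a second one.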

## Checks

* Set a worker count your downstream systems can absorb.
* Treat batch completion as orchestration; each child job can still fail independently.
* Retry failed child documents by source ID, not by blind resubmission.
* Save extraction IDs before running downstream Schema or Tables steps.
* Keep a webhook path for production and a polling path for local recovery.
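Retrying by source ID, as the checks above recommend, might look like the following. It is a sketch only: `plan_retries` is a hypothetical helper, the `"failed"` status name is assumed, and `attempts` is an illustrative counter that is not part of the intake table above.

```python
def plan_retries(records, max_attempts=3):
    """Return source URLs to resubmit, keyed by source_document_id.

    records: iterable of dicts with source_document_id, source_url, status,
    and attempts (a hypothetical counter you would track yourself).
    """
    retry = {}
    for record in records:
        if record["status"] != "failed":
            continue  # skip pending, completed, or in-flight documents
        if record.get("attempts", 0) >= max_attempts:
            continue  # leave exhausted documents for the review queue
        # Keying by source ID collapses duplicate failure rows into one retry.
        retry[record["source_document_id"]] = record["source_url"]
    return retry
```

The returned URLs can be submitted as a new, smaller batch, so documents that already succeeded are never processed twice.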

## Related

<CardGroup cols={3}>
  <Card title="Batch Processing" icon="layer-group" href="/api-reference/endpoint/batch-overview">
    Full endpoint reference.
  </Card>

  <Card title="Production Webhooks" icon="webhook" href="/cookbooks/webhooks-production">
    Event-driven completion.
  </Card>

  <Card title="S3 Storage Pipeline" icon="aws" href="/cookbooks/byos-s3-ingestion">
    Process cloud storage prefixes.
  </Card>
</CardGroup>
