Split Document
Identify which pages of a document contain each topic/section. Takes an existing extraction and a list of topics, then uses AI to identify which PDF pages contain content related to each topic.
The result is persisted with a split_id that can be used with
the /schema endpoint (split mode) for targeted schema extraction on
specific page groups.
Set async: true to return immediately with a job_id for polling.
To split many extractions at once, see Batch Split or the Batch Processing guide.
Overview
The /split endpoint analyzes a saved extraction and uses AI to map pages to your defined topics; the resulting split_id lets you apply per-topic schemas. This is useful for:
- Processing multi-section documents (e.g., annual reports, contracts)
- Applying different schemas to different parts of a document
- Organizing large documents by content type
Splitting requires a saved extraction (for example, one created by /extract with storage enabled, which is the default).
Async Mode
Set async: true to return immediately with a job ID for polling. See Polling for Results for details.
Request
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| extraction_id | uuid | Yes | ID of the saved extraction to split |
| split_config | object | XOR | Inline split configuration with topics |
| split_config_id | uuid | XOR | Reference to a saved split configuration |
| async | boolean | No | If true, returns immediately with a job_id for polling. Default: false. |
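The XOR rule in the table (exactly one of split_config or split_config_id) can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK; the helper name build_split_payload and its run_async parameter are hypothetical.

```python
from typing import Any, Optional


def build_split_payload(
    extraction_id: str,
    split_config: Optional[dict[str, Any]] = None,
    split_config_id: Optional[str] = None,
    run_async: bool = False,
) -> dict[str, Any]:
    """Build a /split request body, enforcing the XOR rule on the config fields."""
    # Exactly one of split_config / split_config_id may be supplied.
    if (split_config is None) == (split_config_id is None):
        raise ValueError("Provide exactly one of split_config or split_config_id")

    payload: dict[str, Any] = {"extraction_id": extraction_id}
    if split_config is not None:
        payload["split_config"] = split_config
    else:
        payload["split_config_id"] = split_config_id
    if run_async:
        # async defaults to false, so it is only included when requested.
        payload["async"] = True
    return payload
```

Failing fast here avoids spending a round trip on a request that would presumably come back as a 400.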
Inline Config (split_config)
| Field | Type | Required | Description |
|---|---|---|---|
| split_config.split_input | array | Yes | List of topics to identify. Also accepts legacy name topics for backward compatibility. |
split_input array:
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique identifier for the topic |
| description | string | No | Description of what content belongs to this topic |
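Because each topic name must be unique, it can help to assemble split_config from a plain list and reject duplicates locally. This is a sketch under that assumption; make_split_config is a hypothetical helper, and the server may apply further validation that is not shown here.

```python
from typing import Optional


def make_split_config(topics: list[tuple[str, Optional[str]]]) -> dict:
    """Turn (name, description) pairs into an inline split_config object."""
    seen: set[str] = set()
    split_input = []
    for name, description in topics:
        if not name:
            raise ValueError("Each topic needs a non-empty name")
        if name in seen:
            raise ValueError(f"Duplicate topic name: {name}")
        seen.add(name)
        item = {"name": name}
        if description:
            # description is optional, so it is omitted when not given
            item["description"] = description
        split_input.append(item)
    return {"split_input": split_input}
```

For example, make_split_config([("financial_statements", "Balance sheet and income statement pages"), ("appendix", None)]) yields a config with two topics, the second without a description.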
Response
Synchronous Response (200)
| Field | Type | Description |
|---|---|---|
| split_id | uuid | Unique identifier for this split result |
| split_output | object | Contains splits, a mapping of topic names to arrays of 1-indexed page numbers |
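Since splits maps each topic name to an array of 1-indexed page numbers, a common follow-up step is to collapse those arrays into contiguous page ranges. The sketch below assumes the response has already been decoded into a dict; the helper names and the sample values are illustrative only.

```python
def to_page_ranges(pages: list[int]) -> list[tuple[int, int]]:
    """Collapse page numbers into inclusive (start, end) ranges."""
    ranges: list[tuple[int, int]] = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            # Page continues the current run, so extend its end.
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges


def summarize_splits(response: dict) -> dict[str, list[tuple[int, int]]]:
    """Map each topic name to its page ranges."""
    splits = response["split_output"]["splits"]
    return {topic: to_page_ranges(pages) for topic, pages in splits.items()}


# Made-up response values, used only to exercise the helpers.
sample = {
    "split_id": "00000000-0000-0000-0000-000000000000",
    "split_output": {"splits": {"financial_statements": [1, 2, 3, 7, 8, 10]}},
}
summary = summarize_splits(sample)
```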
Async Response (202)
| Field | Type | Description |
|---|---|---|
| job_id | string | Job ID for polling |
| status | string | "pending" |
| message | string | Human-readable description |
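A polling loop for the async response might look like the sketch below. It assumes a caller-supplied fetch_job function (hypothetical) that wraps GET /job/{jobId} and returns the decoded job object. Only the "pending" status is documented on this page, so every other status is treated as final here; adjust in_progress if the job endpoint reports additional in-progress values.

```python
import time
from typing import Any, Callable


def poll_job(
    fetch_job: Callable[[str], dict[str, Any]],
    job_id: str,
    in_progress: tuple[str, ...] = ("pending",),
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_job until the job leaves an in-progress status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") not in in_progress:
            return job  # assumed to be a final state (success or failure)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still in progress after {timeout_s}s")
        sleep(interval_s)
```

Injecting fetch_job and sleep keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.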
Example Usage
Split with Inline Config
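A request with an inline configuration could be prepared as in the following sketch. The base URL, the API key placeholder, and the Bearer authorization header are assumptions made for illustration; substitute the values from your account and the Authorizations section. The topic names and descriptions are examples only.

```python
import json
import urllib.request

# Assumptions for illustration only: the base URL and the Authorization
# header scheme are placeholders, not values taken from this page.
BASE_URL = "https://api.example.com"
API_KEY = "YOUR_API_KEY"


def build_inline_split_request(extraction_id: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST /split request with an inline config."""
    payload = {
        "extraction_id": extraction_id,
        "split_config": {
            "split_input": [
                {
                    "name": "financial_statements",
                    "description": "Balance sheet, income statement, and cash flow pages",
                },
                {
                    "name": "risk_factors",
                    "description": "Sections discussing business and market risks",
                },
            ]
        },
    }
    return urllib.request.Request(
        f"{BASE_URL}/split",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling send(build_inline_split_request(extraction_id)) with a real extraction ID would perform the request; separating the two steps lets the request be inspected or logged first.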
Split with Saved Config Reference
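When a split configuration has already been saved, the body carries split_config_id instead of an inline split_config. The sketch below only builds the JSON body; the two IDs are placeholders, and async is set to show how the two options combine.

```python
import json

# Placeholder IDs; replace with a real extraction ID and saved config ID.
payload = {
    "extraction_id": "<extraction-uuid>",
    "split_config_id": "<split-config-uuid>",
    "async": True,  # optional; omit to process synchronously
}

# JSON body to send with POST /split.
body = json.dumps(payload)
```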
Example Response
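The shape of a synchronous response, as described in the response table, can be illustrated as follows. The split_id and page numbers are invented for this sketch, and a real response may carry additional fields such as billing information.

```python
import json

# Illustrative response body; the ID and page numbers are invented.
raw = """
{
  "split_id": "00000000-0000-0000-0000-000000000000",
  "split_output": {
    "splits": {
      "financial_statements": [12, 13, 14, 15],
      "risk_factors": [4, 5, 6]
    }
  }
}
"""

result = json.loads(raw)
splits = result["split_output"]["splits"]

# Page numbers are 1-indexed, so zero or negative values would indicate a problem.
for topic, pages in splits.items():
    if not all(isinstance(p, int) and p >= 1 for p in pages):
        raise ValueError(f"Unexpected page numbers for topic {topic}")
    print(f"{topic}: pages {pages}")
```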
Using Split Results
After splitting, use the split_id with the /schema endpoint (split mode) to apply per-topic schemas.
Error Responses
| Status | Error | Description |
|---|---|---|
| 400 | Invalid request | Missing required fields or invalid topic format |
| 401 | Unauthorized | Invalid or missing API key |
| 404 | Extraction not found | The extraction_id doesn’t exist or you don’t have access |
| 429 | Rate limit exceeded | Too many requests |
| 500 | Processing error | Split processing failed |
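One way a client might react to these status codes is sketched below. Treating 429 and 500 as retryable, and the backoff parameters, are assumptions of this example rather than documented behavior; the remaining codes point to a problem with the request itself.

```python
import random

# Error names copied from the error table above.
SPLIT_ERRORS = {
    400: "Invalid request",
    401: "Unauthorized",
    404: "Extraction not found",
    429: "Rate limit exceeded",
    500: "Processing error",
}

# Assumption: only rate limiting and server-side failures are worth retrying.
RETRYABLE = {429, 500}


def should_retry(status: int) -> bool:
    """Return True when repeating the same request might succeed."""
    return status in RETRYABLE


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0) -> float:
    """Exponential backoff with full jitter for a zero-based attempt number."""
    return random.uniform(0, min(cap_s, base_s * 2 ** attempt))
```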
Best Practices
Use descriptive topic names
Topic names label each page group in the split result and when you use /schema (split mode). Use clear, descriptive names like financial_statements rather than section_1.
Provide detailed descriptions
A topic's description states what content belongs to it, so make each one specific enough to distinguish that topic from the others.
Use async for large documents
For large documents, set async: true to avoid request timeouts. See Polling for Results.
Authorizations
Body
Request body for splitting a document into topics.
Provide EITHER split_config (inline) OR split_config_id (reference).
extraction_id (uuid, required): ID of the saved extraction to split.
split_config (object): Inline split configuration with topics. Required if split_config_id is not provided.
split_config_id (uuid): Reference to a saved split configuration. Use this instead of providing split_config inline.
async (boolean): If true, returns immediately with a job_id for polling via GET /job/{jobId}. Otherwise processes synchronously.
Response
Split result with page assignments (when async=false or omitted)
Result of document splitting with page assignments.
split_id (uuid): Unique identifier for this split result. Use this ID with the /schema endpoint (split mode) to apply schemas to specific page groups.
split_output (object): Page assignments per topic.
Number of credits consumed by this request. Only present when the organization has the credit billing system enabled.
Billing tier and cumulative usage information for the calling org, including this split run.