
# Split Document

> Identify which pages of a document contain each topic/section. Takes an
existing extraction and a list of topics, then uses AI to identify which
PDF pages contain content related to each topic.

The result is persisted with a `split_id` that can be used with the
`/schema` endpoint (split mode) for targeted schema extraction on
specific page groups.

Set `async: true` to receive a `202` response with a `job_id` for polling.


## Overview

<Info>
  **Pipeline Step 2 (optional)** — Split requires a prior [extraction](/api-reference/endpoint/extract). After splitting, use [schema extraction](/api-reference/endpoint/schema) with `split_id` to apply per-topic schemas.
</Info>

Identify which pages of a document contain each topic or section. The `/split` endpoint analyzes a saved extraction and uses AI to map pages to your defined topics. This is useful for:

* Processing multi-section documents (e.g., annual reports, contracts)
* Applying different schemas to different parts of a document
* Organizing large documents by content type

<Note>
  This endpoint operates on **saved extractions** (created via `/extract` with storage enabled, which is the default).
</Note>

<Note>
  To split many extractions at once, use [Batch Split](/api-reference/endpoint/batch-overview#batch-split).
</Note>

### Async Mode

Set `async: true` to return immediately with a job ID for polling. See [Polling for Results](/api-reference/endpoint/poll) for details.

```json theme={null}
{
  "extraction_id": "abc123-def456",
  "split_config": { "split_input": [...] },
  "async": true
}
```
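
The polling loop that follows an async submission can be sketched as below. This is a minimal sketch, not the SDK's implementation: `wait_for_job` and its `fetch_job` callable are hypothetical names, and `fetch_job` stands in for whatever call you use to read job status. The in-progress statuses (`pending`, `processing`) come from the async response schema; terminal status names are not listed on this page, so the helper treats any other status as terminal.

```python
import time
from typing import Any, Callable, Dict, Iterable


def wait_for_job(
    fetch_job: Callable[[str], Dict[str, Any]],  # hypothetical: returns the job as a dict
    job_id: str,
    in_progress: Iterable[str] = ("pending", "processing"),
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job leaves an in-progress status or attempts run out."""
    waiting = set(in_progress)
    for _ in range(max_attempts):
        job = fetch_job(job_id)
        if job.get("status") not in waiting:
            return job  # assumption: any other status is terminal
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} still in progress after {max_attempts} polls")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.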

***

## Request

### Request Body

| Field             | Type    | Required | Description                                                                                                   |
| ----------------- | ------- | -------- | ------------------------------------------------------------------------------------------------------------- |
| `extraction_id`   | uuid    | Yes      | ID of the saved extraction to split                                                                           |
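| `split_config`    | object  | One of   | Inline split configuration with topics. Provide exactly one of `split_config` or `split_config_id`.           |
| `split_config_id` | uuid    | One of   | Reference to a saved split configuration. Provide exactly one of `split_config` or `split_config_id`.         |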
| `async`           | boolean | No       | If `true`, returns immediately with a `job_id` for [polling](/api-reference/endpoint/poll). Default: `false`. |
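
The table above implies that a request carries either an inline config or a saved config reference, never both. A small client-side check along these lines can catch mistakes before the request is sent. `build_split_request` is a hypothetical helper, not part of the SDK; it only assembles the JSON body using the field names from the table.

```python
from typing import Any, Dict, List, Optional


def build_split_request(
    extraction_id: str,
    split_input: Optional[List[Dict[str, str]]] = None,
    split_config_id: Optional[str] = None,
    run_async: bool = False,
) -> Dict[str, Any]:
    """Assemble a /split request body with exactly one config source."""
    # Both missing or both present is invalid.
    if (split_input is None) == (split_config_id is None):
        raise ValueError("provide exactly one of split_input or split_config_id")
    body: Dict[str, Any] = {"extraction_id": extraction_id}
    if split_input is not None:
        body["split_config"] = {"split_input": split_input}
    else:
        body["split_config_id"] = split_config_id
    if run_async:
        body["async"] = True  # omitted otherwise; the documented default is false
    return body
```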

### Inline Config (`split_config`)

| Field                      | Type  | Required | Description                                                                               |
| -------------------------- | ----- | -------- | ----------------------------------------------------------------------------------------- |
| `split_config.split_input` | array | Yes      | List of topics to identify. Also accepts legacy name `topics` for backward compatibility. |

Each topic in the `split_input` array:

| Field         | Type   | Required | Description                                       |
| ------------- | ------ | -------- | ------------------------------------------------- |
| `name`        | string | Yes      | Unique identifier for the topic                   |
| `description` | string | No       | Description of what content belongs to this topic |
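
Because topic names must be unique and `name` is required, it can help to validate the list locally before calling the endpoint. The following is an illustrative sketch; `validate_topics` is a hypothetical helper, and the server may apply additional rules not shown here.

```python
from typing import Dict, List


def validate_topics(topics: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means no issues were found."""
    problems: List[str] = []
    if not topics:
        problems.append("at least one topic is required")
    seen = set()
    for i, topic in enumerate(topics):
        name = topic.get("name")
        if not isinstance(name, str) or not name.strip():
            problems.append(f"topic {i}: 'name' is required")
        elif name in seen:
            problems.append(f"topic {i}: duplicate name '{name}'")
        else:
            seen.add(name)
    return problems
```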

***

## Response

### Synchronous Response (200)

| Field          | Type   | Description                                                                      |
| -------------- | ------ | -------------------------------------------------------------------------------- |
| `split_id`     | uuid   | Unique identifier for this split result                                          |
| `split_output` | object | Contains `splits` — a mapping of topic names to arrays of 1-indexed page numbers |

### Async Response (202)

| Field     | Type   | Description                                        |
| --------- | ------ | -------------------------------------------------- |
| `job_id`  | string | Job ID for [polling](/api-reference/endpoint/poll) |
| `status`  | string | Initial job status: `"pending"` or `"processing"`  |
| `message` | string | Human-readable description                         |

***

## Example Usage

### Split with Inline Config

<CodeGroup>
  ```python Python theme={null}
  from pulse import Pulse

  client = Pulse(api_key="YOUR_API_KEY")

  split_result = client.split(
      extraction_id="abc123-def456-ghi789",
      split_config={
          "split_input": [
              {
                  "name": "financial_statements",
                  "description": "Balance sheets, income statements, cash flow statements"
              },
              {
                  "name": "executive_summary",
                  "description": "Letter to shareholders, company overview, highlights"
              },
              {
                  "name": "risk_factors",
                  "description": "Risk disclosures, forward-looking statements"
              }
          ]
      }
  )

  print(f"Split ID: {split_result.split_id}")
  for topic, pages in split_result.split_output.splits.items():
      print(f"  {topic}: pages {pages}")
  ```

  ```typescript TypeScript theme={null}
  import { PulseClient } from "pulse-ts-sdk";

  const client = new PulseClient({ apiKey: "YOUR_API_KEY" });

  const splitResult = await client.split({
      extraction_id: "abc123-def456-ghi789",
      split_config: {
          split_input: [
              { name: "financial_statements", description: "Balance sheets, income statements" },
              { name: "executive_summary", description: "Letter to shareholders" },
              { name: "risk_factors", description: "Risk disclosures" },
          ],
      },
  });

  console.log("Split ID:", splitResult.split_id);
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/split \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "extraction_id": "abc123-def456-ghi789",
      "split_config": {
        "split_input": [
          {"name": "financial_statements", "description": "Balance sheets, income statements"},
          {"name": "executive_summary", "description": "Letter to shareholders"},
          {"name": "risk_factors", "description": "Risk disclosures"}
        ]
      }
    }'
  ```
</CodeGroup>

### Split with Saved Config Reference

<CodeGroup>
  ```python Python theme={null}
  # Assumes `client` was created as in the inline-config example above.
  split_result = client.split(
      extraction_id="abc123-def456-ghi789",
      split_config_id="config-uuid-456"
  )
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/split \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"extraction_id": "abc123-def456-ghi789", "split_config_id": "config-uuid-456"}'
  ```
</CodeGroup>

### Example Response

```json theme={null}
{
  "split_id": "split-uuid-123",
  "split_output": {
    "splits": {
      "financial_statements": [15, 16, 17, 18, 19, 20],
      "executive_summary": [1, 2, 3, 4],
      "risk_factors": [25, 26, 27, 28, 29, 30]
    }
  }
}
```
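
Since `splits` maps each topic to a list of 1-indexed page numbers, a common follow-up is to collapse those lists into page ranges or to find pages that no topic claimed. The helpers below are a sketch operating on the plain `splits` dictionary. Both function names are hypothetical, and `page_count` is an assumption supplied by the caller, since the split response does not include the document's total page count.

```python
from typing import Dict, List, Tuple


def to_page_ranges(pages: List[int]) -> List[Tuple[int, int]]:
    """Collapse page numbers into sorted, inclusive (start, end) runs."""
    ranges: List[Tuple[int, int]] = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], page)  # extend the current run
        else:
            ranges.append((page, page))  # start a new run
    return ranges


def unassigned_pages(splits: Dict[str, List[int]], page_count: int) -> List[int]:
    """List pages in 1..page_count that no topic includes."""
    assigned = {p for pages in splits.values() for p in pages}
    return [p for p in range(1, page_count + 1) if p not in assigned]
```

For the example response above, `to_page_ranges` yields `[(15, 20)]` for `financial_statements`, and with an assumed `page_count` of 30, `unassigned_pages` returns pages 5 through 14 and 21 through 24.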

***

## Using Split Results

After splitting, use the `split_id` with the [`/schema`](/api-reference/endpoint/schema) endpoint (split mode) to apply per-topic schemas:

```python theme={null}
# Reuses `client` and `split_result` from the split call above.
split_id = split_result.split_id

schema_result = client.schema(
    split_id=split_id,
    split_schema_config={
        "financial_statements": {
            "schema": {"type": "object", "properties": {"revenue": {"type": "number"}}},
            "schema_prompt": "Extract financial data"
        },
        "risk_factors": {
            "schema": {"type": "object", "properties": {"risk_description": {"type": "string"}}}
        }
    }
)
```
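
The keys of `split_schema_config` are topic names from the split, so a quick consistency check before calling `/schema` can surface typos. This is a sketch under assumptions: `check_schema_topics` is a hypothetical helper, and whether a topic can come back with an empty page list, or how `/schema` treats unknown topics, is not specified on this page.

```python
from typing import Any, Dict, List


def check_schema_topics(
    splits: Dict[str, List[int]],
    split_schema_config: Dict[str, Any],
) -> Dict[str, List[str]]:
    """Report schema keys that do not match a split topic or match an empty one."""
    unknown = sorted(set(split_schema_config) - set(splits))
    empty = sorted(
        topic for topic in split_schema_config
        if topic in splits and not splits[topic]  # assumption: empty lists are possible
    )
    return {"unknown": unknown, "empty": empty}
```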

***

## Error Responses

| Status | Error                | Description                                                |
| ------ | -------------------- | ---------------------------------------------------------- |
| 400    | Invalid request      | Missing required fields or invalid topic format            |
| 401    | Unauthorized         | Invalid or missing API key                                 |
| 404    | Extraction not found | The `extraction_id` doesn't exist or you don't have access |
| 429    | Rate limit exceeded  | Too many requests                                          |
| 500    | Processing error     | Split processing failed                                    |
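
One way to act on these status codes is to retry only the ones that may succeed later. The sketch below uses capped exponential backoff. The choice of retryable statuses is an assumption rather than documented behavior: retrying `429` is conventional, while retrying `500` may or may not help for a failed split. `retry_delay` is a hypothetical helper.

```python
from typing import Optional

# Assumption: these statuses are worth retrying; 400, 401, and 404 are not.
RETRYABLE_STATUSES = {429, 500}


def retry_delay(
    status: int,
    attempt: int,
    base_s: float = 1.0,
    cap_s: float = 30.0,
) -> Optional[float]:
    """Return seconds to wait before retry number `attempt` (0-based), or None."""
    if status not in RETRYABLE_STATUSES:
        return None  # client errors need a corrected request, not a retry
    return min(cap_s, base_s * (2 ** attempt))
```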

***

## Best Practices

<AccordionGroup>
  <Accordion title="Use descriptive topic names">
    Topic names become keys in the response and are used with `/schema` (split mode). Use clear, descriptive names like `financial_statements` rather than `section_1`.
  </Accordion>

  <Accordion title="Provide detailed descriptions">
    The description field helps the AI accurately identify relevant pages. Be specific about what content belongs to each topic.
  </Accordion>

  <Accordion title="Use async for large documents">
    For documents with many pages, set `async: true` to avoid request timeouts. See [Polling for Results](/api-reference/endpoint/poll).
  </Accordion>
</AccordionGroup>


## OpenAPI

````yaml POST /split
openapi: 3.1.0
info:
  title: Pulse API
  description: >-
    Production-grade document extraction service that transforms complex
    documents into structured, AI-ready data. This specification is the single
    source of truth for the Pulse extraction APIs.
  version: 1.0.0
  contact:
    name: Pulse Support
    email: support@trypulse.ai
    url: https://docs.runpulse.com
servers:
  - url: https://api.runpulse.com
    description: Production server
security:
  - ApiKeyAuth: []
paths:
  /split:
    post:
      tags:
        - Split
      summary: Split Document
      description: |
        Identify which pages of a document contain each topic/section. Takes an
        existing extraction and a list of topics, then uses AI to identify which
        PDF pages contain content related to each topic.

        The result is persisted with a `split_id` that can be used with the
        `/schema` endpoint (split mode) for targeted schema extraction on
        specific page groups.

        Set `async: true` to return 202 with a job_id for polling.
      operationId: splitDocument
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/SplitInput'
      responses:
        '200':
          description: Split result with page assignments (when async=false or omitted)
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/SplitResponse'
        '202':
          description: Split job accepted (when async=true)
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/AsyncSubmissionResponse'
        '400':
          $ref: '#/components/responses/BadRequest'
        '401':
          $ref: '#/components/responses/Unauthorized'
        '404':
          $ref: '#/components/responses/NotFound'
        '429':
          $ref: '#/components/responses/TooManyRequests'
        '500':
          $ref: '#/components/responses/InternalServerError'
      x-codeSamples:
        - lang: python
          label: Python SDK
          source: |
            from pulse import Pulse

            client = Pulse(api_key="YOUR_API_KEY")

            # Async split (recommended)
            response = client.split(
                extraction_id="your-extraction-id",
                split_config={
                    "split_input": [
                        {"name": "Introduction", "description": "Overview section"},
                        {"name": "Financials", "description": "Financial data"}
                    ]
                },
                async_=True
            )
            print(response.job_id)  # poll via client.jobs.get_job(job_id=...)

            # Sync split
            response = client.split(
                extraction_id="your-extraction-id",
                split_config={
                    "split_input": [
                        {"name": "Introduction", "description": "Overview section"},
                        {"name": "Financials", "description": "Financial data"}
                    ]
                }
            )
            print(response.split_id)
        - lang: typescript
          label: TypeScript SDK
          source: |
            import { PulseClient } from "pulse-ts-sdk";

            const client = new PulseClient({
                apiKey: "YOUR_API_KEY"
            });

            // Async split (recommended)
            const response = await client.split({
                extraction_id: "your-extraction-id",
                split_config: {
                    split_input: [
                        { name: "Introduction", description: "Overview section" },
                        { name: "Financials", description: "Financial data" }
                    ]
                },
                async: true
            });
            console.log(response.job_id); // poll via client.jobs.getJob(...)

            // Sync split
            const syncResp = await client.split({
                extraction_id: "your-extraction-id",
                split_config: {
                    split_input: [
                        { name: "Introduction", description: "Overview section" },
                        { name: "Financials", description: "Financial data" }
                    ]
                }
            });
            console.log(syncResp.split_id);
        - lang: bash
          label: curl
          source: |
            curl -X POST https://api.runpulse.com/split \
              -H "x-api-key: YOUR_API_KEY" \
              -H "Content-Type: application/json" \
              -d '{
                "extraction_id": "your-extraction-id",
                "split_config": {
                  "split_input": [
                    {"name": "Introduction", "description": "Overview section"},
                    {"name": "Financials", "description": "Financial data"}
                  ]
                }
              }'
components:
  schemas:
    SplitInput:
      type: object
      description: Request body for splitting a document into topics.
      required:
        - extraction_id
      properties:
        extraction_id:
          type: string
          format: uuid
          description: ID of the saved extraction to split.
        split_config:
          type: object
          description: >-
            Inline split configuration with topics. Required if split_config_id
            is not provided.
          properties:
            topics:
              type: array
              items:
                type: object
                required:
                  - name
                properties:
                  name:
                    type: string
                    description: Unique identifier for the topic.
                  description:
                    type: string
                    description: Description of what content belongs to this topic.
              minItems: 1
        split_config_id:
          type: string
          format: uuid
          description: Reference to a saved split configuration.
        async:
          type: boolean
          default: false
          description: >-
            If true, returns 202 with a job_id for polling via GET /job/{job_id}.
            Otherwise processes synchronously.
    SplitResponse:
      type: object
      description: Result of document splitting with page assignments.
      required:
        - split_id
        - split_output
      properties:
        split_id:
          type: string
          format: uuid
          description: >-
            Unique identifier for this split result. Use with `/schema` endpoint
            (split mode) to apply per-topic schemas.
        split_output:
          type: object
          description: Page assignments per topic.
          properties:
            splits:
              type: object
              description: Mapping of topic names to arrays of 1-indexed page numbers.
              additionalProperties:
                type: array
                items:
                  type: integer
                  minimum: 1
    AsyncSubmissionResponse:
      type: object
      description: >-
        Acknowledgement returned when a request is submitted for asynchronous
        processing. Poll GET /job/{job_id} to check status and retrieve results.
      required:
        - job_id
        - status
      properties:
        job_id:
          type: string
          description: Identifier assigned to the asynchronous job.
        status:
          type: string
          description: Initial status reported by the server.
          enum:
            - pending
            - processing
        message:
          type: string
          description: Human-readable description of the accepted job.
    ErrorResponse:
      type: object
      properties:
        error:
          type: object
          properties:
            code:
              type: string
              description: Error code (e.g., FILE_001, AUTH_002)
            message:
              type: string
              description: Human-readable error message
            details:
              type: object
              description: Additional error context
  responses:
    BadRequest:
      description: Bad request - Invalid parameters
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    Unauthorized:
      description: Unauthorized - Invalid or missing API key
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    NotFound:
      description: Resource not found
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    TooManyRequests:
      description: Rate limit exceeded
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    InternalServerError:
      description: Internal server error
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
  securitySchemes:
    ApiKeyAuth:
      type: apiKey
      in: header
      name: x-api-key
      description: API key for authentication

````