# Split Document

> Identify which pages of a document contain each topic or section.
> Takes an existing extraction and a list of topics, then uses AI to
> identify which PDF pages contain content related to each topic.

The result is persisted with a `split_id` that can be used with
the `/schema` endpoint (split mode) for targeted schema extraction on
specific page groups.

Set `async: true` to return immediately with a `job_id` for polling.

To split many extractions at once, see [Batch Split](/api-reference/endpoint/batch-overview#batch-split)
or the [Batch Processing guide](/batch).

## Overview

<Info>
  **Pipeline Step 2 (optional)** — Split requires a prior [extraction](/api-reference/endpoint/extract). After splitting, use [schema extraction](/api-reference/endpoint/schema) with `split_id` to apply per-topic schemas.
</Info>

Identify which pages of a document contain each topic or section. The `/split` endpoint analyzes a saved extraction and uses AI to map pages to your defined topics. This is useful for:

* Processing multi-section documents (e.g., annual reports, contracts)
* Applying different schemas to different parts of a document
* Organizing large documents by content type

<Note>
  This endpoint operates on **saved extractions** (created via `/extract` with storage enabled, which is the default).
</Note>

<Note>
  To split many extractions at once, use [Batch Split](/api-reference/endpoint/batch-overview#batch-split).
</Note>

### Async Mode

Set `async: true` to return immediately with a job ID for polling. See [Polling for Results](/api-reference/endpoint/poll) for details.

```json theme={null}
{
  "extraction_id": "abc123-def456",
  "split_config": { "split_input": [...] },
  "async": true
}
```

***

## Request

### Request Body

| Field             | Type    | Required | Description                                                                                                   |
| ----------------- | ------- | -------- | ------------------------------------------------------------------------------------------------------------- |
| `extraction_id`   | uuid    | Yes      | ID of the saved extraction to split                                                                           |
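| `split_config`    | object  | One of   | Inline split configuration with topics. Required if `split_config_id` is not provided.                        |
| `split_config_id` | uuid    | One of   | Reference to a saved split configuration. Use this instead of providing `split_config` inline.                |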
| `async`           | boolean | No       | If `true`, returns immediately with a `job_id` for [polling](/api-reference/endpoint/poll). Default: `false`. |
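
The either/or rule for `split_config` and `split_config_id` can be checked before a request is sent. The sketch below is one possible client-side helper, not part of the Pulse SDK: `build_split_request` and its `run_async` parameter are hypothetical names, and how the server responds when both fields are supplied is not stated on this page.

```python
from typing import Any, Optional


def build_split_request(
    extraction_id: str,
    split_config: Optional[dict[str, Any]] = None,
    split_config_id: Optional[str] = None,
    run_async: bool = False,
) -> dict[str, Any]:
    """Assemble a /split request body (hypothetical helper, not an SDK function)."""
    # Exactly one of the two config fields must be supplied.
    if (split_config is None) == (split_config_id is None):
        raise ValueError("Provide exactly one of split_config or split_config_id")

    body: dict[str, Any] = {"extraction_id": extraction_id}
    if split_config is not None:
        body["split_config"] = split_config
    else:
        body["split_config_id"] = split_config_id
    if run_async:
        # Left out otherwise, since the documented default is false.
        body["async"] = True
    return body
```

The returned dictionary can be serialized as the JSON body of `POST /split`.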

### Inline Config (`split_config`)

| Field                      | Type  | Required | Description                                                                               |
| -------------------------- | ----- | -------- | ----------------------------------------------------------------------------------------- |
| `split_config.split_input` | array | Yes      | List of topics to identify. Also accepts legacy name `topics` for backward compatibility. |

Each topic in the `split_input` array:

| Field         | Type   | Required | Description                                       |
| ------------- | ------ | -------- | ------------------------------------------------- |
| `name`        | string | Yes      | Unique identifier for the topic                   |
| `description` | string | No       | Description of what content belongs to this topic |
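
Topic lists can be sanity-checked locally before calling the endpoint. The following is a minimal sketch under stated assumptions: `validate_topics` is a hypothetical helper, the at-least-one-topic rule comes from `minItems: 1` in the OpenAPI schema on this page, and rejecting empty names is an assumption rather than documented server behavior.

```python
def validate_topics(topics: list[dict]) -> list[dict]:
    """Check a split_input list against the field rules in the table above."""
    if not topics:
        raise ValueError("split_input needs at least one topic")

    seen: set[str] = set()
    for topic in topics:
        name = topic.get("name")
        # `name` is required; treating an empty string as invalid is an assumption.
        if not isinstance(name, str) or not name:
            raise ValueError("every topic needs a non-empty string 'name'")
        # Names become keys in the response, so duplicates would collide.
        if name in seen:
            raise ValueError(f"duplicate topic name: {name!r}")
        seen.add(name)

        description = topic.get("description")
        if description is not None and not isinstance(description, str):
            raise ValueError(f"description for {name!r} must be a string")
    return topics
```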

***

## Response

### Synchronous Response (200)

| Field          | Type   | Description                                                                      |
| -------------- | ------ | -------------------------------------------------------------------------------- |
| `split_id`     | uuid   | Unique identifier for this split result                                          |
| `split_output` | object | Contains `splits` — a mapping of topic names to arrays of 1-indexed page numbers |

### Async Response (202)

| Field     | Type   | Description                                        |
| --------- | ------ | -------------------------------------------------- |
| `job_id`  | string | Job ID for [polling](/api-reference/endpoint/poll) |
| `status`  | string | `"pending"`                                        |
| `message` | string | Human-readable description                         |
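
After a 202 acknowledgement, the client polls until the job finishes. The loop below is a sketch, not the SDK's implementation: `wait_for_job` is a hypothetical name, `fetch_job` stands in for whatever call you use to hit `GET /job/{job_id}`, and the assumption that the job payload carries a `status` field with the values from the OpenAPI enum (`pending`, `processing`, `completed`, `failed`, `canceled`) should be checked against the polling reference.

```python
import time
from typing import Any, Callable

# Statuses after which polling can stop (taken from the enum in the spec).
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_job(
    fetch_job: Callable[[str], dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_job(job_id) until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

Passing the fetch function in keeps the loop independent of any HTTP library and makes it easy to exercise with a stub.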

***

## Example Usage

### Split with Inline Config

<CodeGroup>
  ```python Python theme={null}
  from pulse import Pulse

  client = Pulse(api_key="YOUR_API_KEY")

  split_result = client.split(
      extraction_id="abc123-def456-ghi789",
      split_config={
          "split_input": [
              {
                  "name": "financial_statements",
                  "description": "Balance sheets, income statements, cash flow statements"
              },
              {
                  "name": "executive_summary",
                  "description": "Letter to shareholders, company overview, highlights"
              },
              {
                  "name": "risk_factors",
                  "description": "Risk disclosures, forward-looking statements"
              }
          ]
      }
  )

  print(f"Split ID: {split_result.split_id}")
  for topic, pages in split_result.split_output.splits.items():
      print(f"  {topic}: pages {pages}")
  ```

  ```typescript TypeScript theme={null}
  import { PulseClient } from "pulse-ts-sdk";

  const client = new PulseClient({ apiKey: "YOUR_API_KEY" });

  const splitResult = await client.split({
      extraction_id: "abc123-def456-ghi789",
      split_config: {
          split_input: [
              { name: "financial_statements", description: "Balance sheets, income statements" },
              { name: "executive_summary", description: "Letter to shareholders" },
              { name: "risk_factors", description: "Risk disclosures" },
          ],
      },
  });

  console.log("Split ID:", splitResult.split_id);
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/split \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "extraction_id": "abc123-def456-ghi789",
      "split_config": {
        "split_input": [
          {"name": "financial_statements", "description": "Balance sheets, income statements"},
          {"name": "executive_summary", "description": "Letter to shareholders"},
          {"name": "risk_factors", "description": "Risk disclosures"}
        ]
      }
    }'
  ```
</CodeGroup>

### Split with Saved Config Reference

<CodeGroup>
  ```python Python theme={null}
  from pulse import Pulse

  client = Pulse(api_key="YOUR_API_KEY")

  split_result = client.split(
      extraction_id="abc123-def456-ghi789",
      split_config_id="config-uuid-456"
  )
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/split \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"extraction_id": "abc123-def456-ghi789", "split_config_id": "config-uuid-456"}'
  ```
</CodeGroup>

### Example Response

```json theme={null}
{
  "split_id": "split-uuid-123",
  "split_output": {
    "splits": {
      "financial_statements": [15, 16, 17, 18, 19, 20],
      "executive_summary": [1, 2, 3, 4],
      "risk_factors": [25, 26, 27, 28, 29, 30]
    }
  }
}
```
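
Page arrays like these are often easier to work with as contiguous ranges or as a per-page lookup. The helpers below are an illustrative sketch operating on a plain dictionary shaped like `split_output.splits`. `to_ranges` and `topics_by_page` are hypothetical names, and whether one page can be assigned to several topics is not stated on this page, so the inversion allows for it.

```python
def to_ranges(pages: list[int]) -> list[tuple[int, int]]:
    """Collapse page numbers into inclusive (start, end) runs."""
    ranges: list[tuple[int, int]] = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            # Extend the current run by one page.
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges


def topics_by_page(splits: dict[str, list[int]]) -> dict[int, list[str]]:
    """Invert the topic -> pages mapping into page -> topics."""
    index: dict[int, list[str]] = {}
    for topic, pages in splits.items():
        for page in pages:
            index.setdefault(page, []).append(topic)
    return dict(sorted(index.items()))


splits = {
    "financial_statements": [15, 16, 17, 18, 19, 20],
    "executive_summary": [1, 2, 3, 4],
    "risk_factors": [25, 26, 27, 28, 29, 30],
}
for topic, pages in splits.items():
    print(topic, to_ranges(pages))
```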

***

## Using Split Results

After splitting, use the `split_id` with the [`/schema`](/api-reference/endpoint/schema) endpoint (split mode) to apply per-topic schemas:

```python theme={null}
# `client` and `split_result` come from the split examples above
split_id = split_result.split_id

schema_result = client.schema(
    split_id=split_id,
    split_schema_config={
        "financial_statements": {
            "schema": {"type": "object", "properties": {"revenue": {"type": "number"}}},
            "schema_prompt": "Extract financial data"
        },
        "risk_factors": {
            "schema": {"type": "object", "properties": {"risk_description": {"type": "string"}}}
        }
    }
)
```
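
Because schema keys must match topic names from the split, a quick local check can catch typos before the `/schema` call. This is a sketch with a hypothetical helper name, `check_schema_topics`. Treating an unknown key as an error is an assumption about what you want locally, not documented server behavior, and topics without a schema are only reported, since the example above also leaves `executive_summary` without one.

```python
def check_schema_topics(
    splits: dict[str, list[int]],
    split_schema_config: dict[str, dict],
) -> list[str]:
    """Return topics that have pages but no schema; raise on unknown topic keys."""
    unknown = sorted(set(split_schema_config) - set(splits))
    if unknown:
        raise KeyError(f"schema config references topics not in the split: {unknown}")
    # Topics with assigned pages that no schema will be applied to.
    return sorted(
        topic
        for topic, pages in splits.items()
        if pages and topic not in split_schema_config
    )
```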

***

## Error Responses

| Status | Error                | Description                                                |
| ------ | -------------------- | ---------------------------------------------------------- |
| 400    | Invalid request      | Missing required fields or invalid topic format            |
| 401    | Unauthorized         | Invalid or missing API key                                 |
| 404    | Extraction not found | The `extraction_id` doesn't exist or you don't have access |
| 429    | Rate limit exceeded  | Too many requests                                          |
| 500    | Processing error     | Split processing failed                                    |
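
Of the statuses above, 429 is the usual candidate for an automatic retry, while 400, 401, and 404 indicate a request that needs fixing. The sketch below shows one common policy, exponential backoff with full jitter. The helper names are hypothetical, and retrying on 500 is an assumption: this page does not say whether a failed split may have been partially processed or billed, so you may prefer to limit retries to 429.

```python
import random
from typing import Callable

# Assumed retry policy; narrow this to {429} if retrying 500s is not appropriate.
RETRYABLE_STATUSES = {429, 500}


def should_retry(status_code: int) -> bool:
    """Return True for statuses this sketch treats as transient."""
    return status_code in RETRYABLE_STATUSES


def backoff_delay(
    attempt: int,
    base_s: float = 1.0,
    cap_s: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter backoff: a random delay below min(cap_s, base_s * 2**attempt)."""
    return rng() * min(cap_s, base_s * (2 ** attempt))
```

`attempt` starts at 0 for the first retry, and the cap keeps later delays bounded.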

***

## Best Practices

<AccordionGroup>
  <Accordion title="Use descriptive topic names">
    Topic names become keys in the response and are used with `/schema` (split mode). Use clear, descriptive names like `financial_statements` rather than `section_1`.
  </Accordion>

  <Accordion title="Provide detailed descriptions">
    The description field helps the AI accurately identify relevant pages. Be specific about what content belongs to each topic.
  </Accordion>

  <Accordion title="Use async for large documents">
    For documents with many pages, set `async: true` to avoid request timeouts. See [Polling for Results](/api-reference/endpoint/poll).
  </Accordion>
</AccordionGroup>


## OpenAPI

````yaml POST /split
openapi: 3.1.0
info:
  title: Pulse API Structure
  version: 0.1.0
  description: >-
    Canonical contract for the Pulse extraction APIs. This specification is the
    single source of truth for shared request/response models that client and
    server packages consume.
servers:
  - url: https://api.runpulse.com
    description: Default Pulse API base URL
security:
  - ApiKey: []
paths:
  /split:
    post:
      tags:
        - Split
      summary: Split a document into topics
      description: >-
        Identify which pages of a document contain each topic/section.

        Takes an existing extraction and a list of topics, then uses AI to

        identify which PDF pages contain content related to each topic.


        The result is persisted with a `split_id` that can be used with

        the `/schema` endpoint (split mode) for targeted schema extraction on

        specific page groups.


        Set `async: true` to return immediately with a job_id for polling.


        To split many extractions at once, see [Batch
        Split](api:POST/batch/split)

        or the [Batch Processing guide](/batch).
      operationId: splitDocument
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/SplitInput'
      responses:
        '200':
          description: Split result with page assignments (when async=false or omitted)
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/SplitResponse'
        '202':
          description: Split job accepted (when async=true)
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/AsyncSubmissionResponse'
        '400':
          description: Invalid request parameters
        '401':
          description: Authentication failed or missing API key
        '404':
          description: Extraction not found
        '429':
          description: Rate limit exceeded
        '500':
          description: Internal server error
components:
  schemas:
    SplitInput:
      type: object
      description: |-
        Request body for splitting a document into topics.
        Provide EITHER `split_config` (inline) OR `split_config_id` (reference).
      required:
        - extraction_id
      properties:
        extraction_id:
          type: string
          format: uuid
          description: ID of the saved extraction to split.
        split_config:
          description: >-
            Inline split configuration with topics. Required if split_config_id
            is not provided.
          allOf:
            - $ref: '#/components/schemas/SplitConfig'
        split_config_id:
          type: string
          format: uuid
          description: >-
            Reference to a saved split configuration. Use this instead of
            providing split_config inline.
        async:
          type: boolean
          default: false
          description: >-
            If true, returns immediately with a job_id for polling via GET
            /job/{jobId}. Otherwise processes synchronously.
    SplitResponse:
      type: object
      description: Result of document splitting with page assignments.
      required:
        - split_id
        - split_output
      allOf:
        - $ref: '#/components/schemas/SplitResultCore'
    AsyncSubmissionResponse:
      type: object
      description: >-
        Acknowledgement returned when a request is submitted for asynchronous
        processing. Poll `GET /job/{job_id}` to check status and retrieve
        results.
      required:
        - job_id
        - status
      properties:
        job_id:
          type: string
          description: Identifier assigned to the asynchronous job.
        status:
          type: string
          description: Initial status reported by the server.
          enum:
            - pending
            - processing
            - completed
            - failed
            - canceled
        message:
          type: string
          description: Human-readable description of the accepted job.
        queuedAt:
          type: string
          format: date-time
          deprecated: true
          description: >-
            **Deprecated** — Timestamp indicating when the job was accepted.
            Retained for backward compatibility. Use `GET /job/{jobId}` for
            timing details.
        credits_used:
          type: number
          format: float
          nullable: true
          description: >-
            Number of credits consumed by this request. Only present when the
            organization has the credit billing system enabled.
    SplitConfig:
      type: object
      description: Inline split configuration with topic definitions.
      required:
        - split_input
      properties:
        split_input:
          type: array
          items:
            $ref: '#/components/schemas/TopicDefinition'
          minItems: 1
          description: >-
            List of topics to identify in the document. Each topic will be
            assigned page numbers where relevant content is found.
    SplitResultCore:
      type: object
      description: >-
        Core split result fields shared by the synchronous `/split` endpoint and
        the pipeline split step.
      properties:
        split_id:
          type: string
          format: uuid
          description: >-
            Unique identifier for this split result. Use this ID with the
            `/schema` endpoint (split mode) to apply schemas to specific page
            groups.
        split_output:
          description: Page assignments per topic.
          allOf:
            - $ref: '#/components/schemas/SplitOutput'
        credits_used:
          type: number
          format: float
          nullable: true
          description: >-
            Number of credits consumed by this request. Only present when the
            organization has the credit billing system enabled.
        plan_info:
          allOf:
            - $ref: '#/components/schemas/PlanInfo'
          description: >-
            Billing tier and cumulative usage information for the calling org,
            including this split run.
    TopicDefinition:
      type: object
      description: Definition of a topic to identify in the document.
      required:
        - name
      properties:
        name:
          type: string
          description: >-
            Unique identifier for the topic. Used as the key in the response and
            for referencing in subsequent schema extraction.
        description:
          type: string
          description: >-
            Optional description of what content belongs to this topic. Helps
            the AI identify relevant pages more accurately.
    SplitOutput:
      type: object
      description: >-
        Page assignments per topic. Each key is a topic name and the value is an
        array of 1-indexed page numbers.
      properties:
        splits:
          type: object
          description: Mapping of topic names to arrays of page numbers.
          additionalProperties:
            type: array
            items:
              type: integer
              minimum: 1
    PlanInfo:
      type: object
      description: >-
        Cumulative billing snapshot for the calling organization. Sourced from
        the `pulse-org-stats` aggregate table maintained asynchronously by the
        org-stats Lambda; the in-flight request's contribution is added on top
        so every response reflects post-request state. Returned by every
        endpoint that consumes credits (extract, schema, tables, split, form,
        and their batch / pipeline equivalents).
      properties:
        tier:
          type: string
          description: Billing tier, e.g. `"trial"`, `"growth"`, `"pulse_ultra_2"`.
        total_credits_used:
          type: number
          format: float
          description: >-
            Total credits consumed by the organization to date, including this
            request. The primary billing metric going forward.
        pages_used:
          type: integer
          minimum: 0
          description: >-
            Total pages processed by the organization to date, including this
            request. Kept for backward compatibility with clients that haven't
            migrated to `total_credits_used`.
        note:
          type: string
          description: >-
            Optional human-readable note about billing state for this response
            (e.g. trial credits remaining). Omitted when no note applies.
  securitySchemes:
    ApiKey:
      type: apiKey
      in: header
      name: x-api-key
      x-fern-header:
        name: apiKey
        env: PULSE_API_KEY

````