POST /schema
Python SDK
from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Single mode — apply schema to entire document
response = client.schema(
    extraction_id="your-extraction-id",
    schema_config={
        "input_schema": {
            "type": "object",
            "properties": {
                "total": {"type": "number"},
                "vendor": {"type": "string"}
            }
        },
        "schema_prompt": "Extract invoice total and vendor"
    }
)
print(response.schema_output)

# Split mode — different schemas per topic
response = client.schema(
    split_id="your-split-id",
    split_schema_config={
        "Introduction": {
            "schema": {"type": "object", "properties": {"summary": {"type": "string"}}},
            "schema_prompt": "Summarize the introduction"
        },
        "Financials": {
            "schema": {"type": "object", "properties": {"revenue": {"type": "number"}}},
            "schema_prompt": "Extract financial figures"
        }
    }
)
print(response.schema_output)

Example response (200):
{
  "schema_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "version": 2,
  "schema_output": {
    "values": {},
    "citations": {}
  },
  "extraction_ids": [
    "3c90c3cc-0d44-4b50-8888-8dd25736052a"
  ],
  "excel_output_url": "<string>"
}

Overview

Pipeline Step 2 or 3 — Schema requires a prior extraction. For split mode, it also requires a prior split. The mode is inferred from the input fields you provide.
Apply a schema to previously extracted documents to get structured data output. This endpoint supports multiple modes, inferred from the input:
  • Single mode — provide extraction_id to apply one schema to a single document
  • Multi-extraction mode — provide a batch extract ID as extraction_id (auto-detected) or an explicit extraction_ids list to combine content from multiple documents and apply the schema to the composite
  • Split mode — provide split_id to apply per-topic schemas to page groups from a prior /split
  • Excel template mode — provide excel_template (base64 .xlsx) in schema_config instead of input_schema to auto-generate the schema from column headers and receive a filled Excel file
This endpoint operates on saved extractions (created via /extract with storage enabled, which is the default).
To apply schemas across many extractions or splits at once, use Batch Schema. It supports both single and split modes.
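
The mode inference described above can be mirrored client-side as a quick sanity check before sending a request. This is a minimal sketch, not the service's actual logic: infer_mode is a hypothetical helper, and the precedence it applies when several ID fields are present at once is an assumption, since this page does not say how such requests are resolved.

```python
from __future__ import annotations


def infer_mode(body: dict) -> str:
    """Guess which mode a /schema request body would trigger (client-side sketch)."""
    has_single = bool(body.get("extraction_id"))
    has_multi = bool(body.get("extraction_ids"))
    has_split = bool(body.get("split_id"))

    if not (has_single or has_multi or has_split):
        raise ValueError("Must provide extraction_id, extraction_ids, or split_id")
    # Assumption: split_id wins over extraction_ids, which wins over extraction_id.
    if has_split:
        return "split"
    if has_multi:
        return "multi-extraction"
    # A batch extract ID passed as extraction_id is only detected server-side,
    # so from the client's point of view it looks like single mode.
    return "single"


def uses_excel_template(body: dict) -> bool:
    """True when schema_config carries an excel_template instead of input_schema."""
    return bool((body.get("schema_config") or {}).get("excel_template"))
```

Excel template mode is reported separately because it combines with the other modes rather than replacing them.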

Async Mode

Set async: true to return immediately with a job ID for polling. See Polling for Results for details.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| async | boolean | No | If true, returns immediately with a job_id for polling. Default: false. |
Async Response (200):
| Field | Type | Description |
| --- | --- | --- |
| job_id | string | Job ID for polling |
| status | string | "pending" |
| message | string | Human-readable description |
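
The async flow above can be sketched as a small polling loop. fetch_job is a hypothetical callable that you supply (for example, a wrapper around an authenticated GET /job/{jobId} request, the polling path named in the Body reference). Only the "pending" status appears on this page, so treating any other status as finished is an assumption; adjust pending_statuses if the job API reports other in-progress states.

```python
from __future__ import annotations

import time
from typing import Callable


def poll_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    pending_statuses: tuple = ("pending",),
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_job(job_id) until the status is no longer pending or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        # Assumption: any status outside pending_statuses means the job is done.
        if job.get("status") not in pending_statuses:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still pending after {timeout}s")
        sleep(interval)
```

Passing the fetcher and the sleep function in as arguments keeps the loop independent of any particular HTTP client and easy to exercise in tests.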

Mode Reference

Request

Apply one schema to an entire extraction.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| extraction_id | uuid | Yes | ID of a saved extraction, or a batch extract job ID (auto-detected; see Multi-Extraction Mode) |
| extraction_ids | uuid[] | No | Explicit list of extraction IDs to combine (see Multi-Extraction Mode) |
| schema_config | object | XOR | Inline schema (see Schema Config) |
| schema_config_id | uuid | XOR | Reference to a saved schema configuration |
| async | boolean | No | Default: false |

Schema Config

Provide either input_schema or excel_template — not both.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| input_schema | object | XOR | JSON Schema defining the structured data to extract |
| excel_template | string (base64) | XOR | Base64-encoded .xlsx template. Column headers are used to auto-generate the JSON Schema, and a filled copy is returned (see Excel Template Mode) |
| schema_prompt | string | No | Natural language instructions to guide extraction |
| effort | boolean | No | Enable extended reasoning for complex documents |
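
The XOR rule in the table above can be checked locally before a request is sent, which avoids a round trip that would end in a 400. The sketch below is a hypothetical client-side helper (validate_schema_config is not part of the SDK as far as this page shows) and covers only the rules stated here.

```python
from __future__ import annotations


def validate_schema_config(schema_config: dict) -> None:
    """Raise if schema_config breaks the rules listed in the Schema Config table."""
    has_schema = schema_config.get("input_schema") is not None
    has_template = schema_config.get("excel_template") is not None

    if has_schema and has_template:
        raise ValueError("Cannot provide both input_schema and excel_template")
    if not (has_schema or has_template):
        raise ValueError("Provide either input_schema or excel_template")

    # effort is documented as a boolean, so reject values such as "high" or 1.
    effort = schema_config.get("effort")
    if effort is not None and not isinstance(effort, bool):
        raise TypeError("effort must be a boolean")
```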

Response (200)

| Field | Type | Description |
| --- | --- | --- |
| schema_id | uuid | Unique identifier for this schema version |
| version | integer | Schema version number |
| schema_output | object | { values: {...}, citations: {...} } |
| extraction_ids | uuid[] | Present when multiple extractions were combined; lists all source extraction IDs |
| excel_output_url | string | API path to download the filled Excel template (only present when excel_template was provided) |

Example — Inline Schema

from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

schema_result = client.schema(
    extraction_id="abc123-def456-ghi789",
    schema_config={
        "input_schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total_amount": {"type": "number"},
                "vendor_name": {"type": "string"},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "amount": {"type": "number"}
                        }
                    }
                }
            },
            "required": ["invoice_number", "total_amount"]
        },
        "schema_prompt": "Extract all invoice details including line items"
    }
)

print(schema_result.schema_output)

Example — Saved Config Reference

schema_result = client.schema(
    extraction_id="abc123-def456-ghi789",
    schema_config_id="config-uuid-123"
)

Example Response

{
  "schema_id": "schema-uuid-456",
  "version": 2,
  "schema_output": {
    "values": {
      "invoice_number": "INV-2024-001",
      "total_amount": 1250.00,
      "vendor_name": "Acme Corp",
      "line_items": [
        {"description": "Consulting Services", "amount": 1000.00},
        {"description": "Travel Expenses", "amount": 250.00}
      ]
    },
    "citations": {
      "invoice_number": {"page": 1, "bbox": [100, 50, 200, 70]},
      "total_amount": {"page": 1, "bbox": [400, 500, 500, 520]}
    }
  }
}
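
To work with a response like the one above, it can help to pair each extracted value with its citation. The following is a minimal sketch that assumes citations are keyed by top-level field name, as in the example; fields without a citation entry (line_items here) come back with None. cited_fields is a hypothetical helper, not an SDK function.

```python
from __future__ import annotations

from typing import Any, Optional


def cited_fields(schema_output: dict) -> list[tuple[str, Any, Optional[dict]]]:
    """Return (field, value, citation) triples; citation is None when absent."""
    values = schema_output.get("values", {})
    citations = schema_output.get("citations", {})
    return [(name, value, citations.get(name)) for name, value in values.items()]


example_output = {
    "values": {
        "invoice_number": "INV-2024-001",
        "total_amount": 1250.00,
        "line_items": [{"description": "Consulting Services", "amount": 1000.00}],
    },
    "citations": {
        "invoice_number": {"page": 1, "bbox": [100, 50, 200, 70]},
        "total_amount": {"page": 1, "bbox": [400, 500, 500, 520]},
    },
}

for name, value, citation in cited_fields(example_output):
    page = citation["page"] if citation else "n/a"
    print(f"{name}: {value!r} (page {page})")
```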

Multi-Extraction Mode

Combine content from multiple documents and apply a single schema to the composite, producing one merged result. This is useful when the data you need spans several files (e.g., a loss summary in one file and exposure data in another).
This is different from Batch Schema, which applies the same schema to each document independently (one result per document). Use multi-extraction when you need to cross-reference or merge data from multiple source files into a single output.
There are two ways to trigger multi-extraction:
  1. Batch extract auto-detection — Pass a batch extract batch_job_id as extraction_id. The system detects it as a batch parent and automatically combines all completed child extractions.
  2. Explicit list — Pass an extraction_ids array with the specific extraction IDs to combine.
Citations in multi-extraction results use the extraction_id-bb_id format (e.g., abc123-txt-1) to disambiguate bounding boxes across source documents.
from pulse import Pulse
from pulse.types.schema_config import SchemaConfig

client = Pulse(api_key="YOUR_API_KEY")

# Option 1: Pass a batch extract job ID (auto-detected)
schema_result = client.schema(
    extraction_id="<batch_job_id>",
    schema_config=SchemaConfig(
        input_schema={
            "type": "object",
            "properties": {
                "policy_period": {"type": "string"},
                "exposure": {"type": "number"},
            },
        },
        schema_prompt="Combine data from both documents",
    ),
)

# Option 2: Pass an explicit list of extraction IDs
schema_result = client.schema(
    extraction_ids=["extraction-1-uuid", "extraction-2-uuid"],
    schema_config=SchemaConfig(
        input_schema={
            "type": "object",
            "properties": {
                "policy_period": {"type": "string"},
                "exposure": {"type": "number"},
            },
        },
    ),
)

# Response includes the list of source extraction IDs
print(schema_result.extraction_ids)
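
Because extraction IDs are UUIDs that contain hyphens themselves, splitting an extraction_id-bb_id citation reference on the first hyphen is not reliable. One way to take such a reference apart, sketched here under the assumption that you have the extraction_ids list from the response, is to match the reference against the known IDs. split_citation_ref is a hypothetical helper.

```python
from __future__ import annotations


def split_citation_ref(ref: str, extraction_ids: list[str]) -> tuple[str, str]:
    """Split an 'extraction_id-bb_id' reference into (extraction_id, bb_id)."""
    # Try longer IDs first so an ID that is a prefix of another cannot shadow it.
    for ext_id in sorted(extraction_ids, key=len, reverse=True):
        prefix = ext_id + "-"
        if ref.startswith(prefix) and len(ref) > len(prefix):
            return ext_id, ref[len(prefix):]
    raise ValueError(f"{ref!r} does not start with a known extraction ID")


# Using the short example ID from the text above:
print(split_citation_ref("abc123-txt-1", ["abc123", "def456"]))
```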

Excel Template Mode

Instead of writing a JSON Schema by hand, provide an Excel template (.xlsx) with the column headers you want filled. The system auto-generates the JSON Schema from the template’s structure, applies it to the extraction, and returns a filled copy of the original template.
import base64
from pulse import Pulse
from pulse.types.schema_config import SchemaConfig

client = Pulse(api_key="YOUR_API_KEY")

with open("template.xlsx", "rb") as f:
    template_b64 = base64.b64encode(f.read()).decode()

schema_result = client.schema(
    extraction_id="abc123-def456-ghi789",
    schema_config=SchemaConfig(
        excel_template=template_b64,
        schema_prompt="Extract policy period data into the template columns",
    ),
)

# Download the filled Excel file; the result is iterated and written chunk by chunk
excel_chunks = client.download_schema_excel(schema_result.schema_id)
with open("filled_output.xlsx", "wb") as f:
    for chunk in excel_chunks:
        f.write(chunk)
The response includes excel_output_url (e.g., /schema/{schema_id}/excel) — an authenticated API path for downloading the filled template. Use client.download_schema_excel(schema_id) in the SDK or make an authenticated GET request.
Excel template mode and multi-extraction mode can be combined — pass a batch extract ID or extraction_ids along with excel_template to fill a template from multiple source documents.

Download Filled Excel — GET /schema/{schemaId}/excel

When a schema extraction was created with excel_template, the filled output can be downloaded from this authenticated endpoint. It requires the same API key used for other endpoints, and the caller must belong to the organization that owns the underlying extraction.
| Status | Description |
| --- | --- |
| 200 | Returns the filled .xlsx file as a binary download |
| 401 | Authentication failed or missing API key |
| 404 | Schema not found, or no Excel output (was excel_template provided in the original request?) |
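
Outside the SDK, the download is a plain authenticated GET. The sketch below builds that request with the standard library, using the x-api-key header from the Authorizations section. The base URL is not given on this page, so base_url is a required argument and any value you pass for it is your own assumption; build_excel_download_request is a hypothetical helper.

```python
from __future__ import annotations

from urllib.request import Request


def build_excel_download_request(base_url: str, schema_id: str, api_key: str) -> Request:
    """Build (but do not send) the GET /schema/{schemaId}/excel request."""
    url = f"{base_url.rstrip('/')}/schema/{schema_id}/excel"
    return Request(url, headers={"x-api-key": api_key}, method="GET")
```

To send it, pass the request to urllib.request.urlopen and write the bytes from response.read() to an .xlsx file. urlopen raises urllib.error.HTTPError for the 401 and 404 cases in the table above.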

Error Responses

| Status | Error | Description |
| --- | --- | --- |
| 400 | Invalid request | Must provide extraction_id, extraction_ids, or split_id |
| 400 | Invalid schema | Schema must follow JSON Schema / OpenAPI 3.0 format |
| 400 | Mutually exclusive | Cannot provide both input_schema and excel_template |
| 401 | Unauthorized | Invalid or missing API key |
| 404 | Not found | Extraction, batch job, or split not found |
| 500 | Processing error | Schema extraction failed |
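
A caller will usually want to turn these status codes into an action. The sketch below copies the meanings from the table above into a lookup; the retry rule is an assumption (this page does not say which failures are transient), so treat it as a starting point rather than documented behavior. Both function names are hypothetical.

```python
from __future__ import annotations

# Status codes and meanings taken from the Error Responses table.
ERROR_MEANINGS = {
    400: "Invalid request, invalid schema, or mutually exclusive fields",
    401: "Invalid or missing API key",
    404: "Extraction, batch job, or split not found",
    500: "Schema extraction failed",
}


def describe_error(status: int) -> str:
    """Return the documented meaning of a status code, if there is one."""
    return ERROR_MEANINGS.get(status, f"Unexpected status {status}")


def should_retry(status: int) -> bool:
    # Assumption: only server-side failures (5xx) are worth retrying;
    # 4xx responses need a corrected request or credentials instead.
    return 500 <= status < 600
```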

Best Practices

  • Set effort: true for documents with complex layouts or tables, or when initial extraction quality is low.
  • Add natural language instructions via schema_prompt to guide the extraction, especially for ambiguous fields.
  • If your schema has many fields or the document is large, set async: true to avoid timeouts. See Polling for Results.
  • For split mode, first call /split to get page groups, then call this endpoint with split_id and split_schema_config.
  • When the data you need spans multiple documents, use Batch Extract to extract all files, then pass the batch_job_id as extraction_id to this endpoint. The system auto-detects the batch parent and combines content from all child extractions.
  • If your output is an Excel spreadsheet, skip the JSON Schema definition and provide the empty .xlsx template directly via excel_template. The column headers define the schema, and you get a filled copy back via excel_output_url.

Extract

Extract content from a document

Split Document

Split a document into topic-based page groups

Batch Processing

Apply schema across many documents in parallel

Authorizations

| Name | Type | In | Required | Description |
| --- | --- | --- | --- | --- |
| x-api-key | string | header | Yes | API key for authentication |

Body

application/json

Request body for schema extraction. Mode is inferred from the input:

  • Provide extraction_id for single-mode or multi-extraction (auto-detected). If the ID belongs to a batch extract, its child extractions are combined automatically.
  • Provide extraction_ids for an explicit list of extractions to combine.
  • Provide split_id + split_schema_config for split-mode extraction.
| Field | Type | Description |
| --- | --- | --- |
| extraction_id | string<uuid> | ID of a saved extraction OR a batch extract job. When a batch extract ID is provided, the system auto-detects it and combines all completed child extractions into a single schema application. |
| extraction_ids | string<uuid>[] | Explicit list of extraction IDs to combine. The markdown and bounding boxes from all extractions are merged and the schema is applied to the composite content. Citations use extraction_id-bb_id format to disambiguate across source documents. |
| split_id | string<uuid> | ID of saved split (for split mode). |
| schema_config | object | Inline schema configuration for single mode. Provide input_schema (JSON Schema) OR excel_template (base64 .xlsx), not both. |
| schema_config_id | string<uuid> | Reference to a saved schema configuration (for single mode). |
| split_schema_config | object | Per-topic schema configurations for split mode. Keys must match topic names from the split. |
| async | boolean (default: false) | If true, returns 202 with a job_id for polling via GET /job/{jobId}. |
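
The rule that split_schema_config keys must match topic names from the split is easy to break with a typo. A minimal client-side check might look like the sketch below. It assumes you already hold the list of topic names returned by your earlier /split call; check_split_schema_config is a hypothetical helper, and whether every topic needs its own entry is not stated on this page, so topics without a config are reported rather than rejected.

```python
from __future__ import annotations


def check_split_schema_config(split_schema_config: dict, topic_names: list[str]) -> list[str]:
    """Raise on keys that are not split topics; return topics left without a config."""
    unknown = sorted(set(split_schema_config) - set(topic_names))
    if unknown:
        raise ValueError(f"split_schema_config keys not found in the split: {unknown}")
    # Informational only: these topics have no schema configured.
    return [name for name in topic_names if name not in split_schema_config]


config = {
    "Introduction": {"schema": {"type": "object"}, "schema_prompt": "Summarize the introduction"},
    "Financials": {"schema": {"type": "object"}, "schema_prompt": "Extract financial figures"},
}
print(check_split_schema_config(config, ["Introduction", "Financials", "Appendix"]))
```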

Response

Schema extraction result (when async=false or omitted). Shape depends on the mode used (single vs split).

Response for single schema extraction mode.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| schema_id | string<uuid> | Yes | Unique identifier for this schema version. |
| version | integer | Yes | Version number of this schema for the extraction. Required range: x >= 1. |
| schema_output | object | Yes | Extracted values and citations. |
| extraction_ids | string<uuid>[] | No | Present when multiple extractions were combined (via batch extract auto-detection or explicit extraction_ids input). Lists all source extraction IDs that contributed to the result. |
| excel_output_url | string | No | API path to download the filled Excel template (e.g. /schema/{schema_id}/excel). Requires the same API key authentication. Only present when excel_template was provided in the request. |