Overview
The Extract → Split → Schema pipeline is the most powerful processing mode in Pulse. After extracting the document, it splits the pages into topic-based sections and then applies a different schema to each section. This is ideal for long, multi-section documents where different parts contain different kinds of data.
When to Use
- Annual reports — Financials, Leadership, and Outlook each have different data to extract
- Multi-section contracts — different clause types (indemnification, IP rights, payment terms) need different schemas
- Research papers — Abstract, Methodology, Results, and Conclusion each have distinct structure
- Insurance documents — policy details, claims history, and coverage schedules are all different
- Regulatory filings — mixed sections like company overview, financial statements, risk factors
How to Use in the Playground
Set page range, figure extraction, chunking, and other options on the Configuration tab — same as Extract Only.
Switch to the Split step. Add topics with names and descriptions — Pulse uses these to assign pages to topics based on document content.
The split step assigns whole pages to topics. A page belongs to the topic that best matches its content. Pages can only belong to one topic.
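The one-topic-per-page rule can be checked on a split result. The following is a minimal sketch, not Pulse's implementation: the topic list shape (`name`, `description`) and the helper `find_page_conflicts` are hypothetical names chosen for illustration, and the `splits` mapping is assumed to follow the `{ topic: [pages] }` form that the split step returns.

```python
from collections import defaultdict

# Hypothetical topic definitions: a name plus a description that guides the split.
topics = [
    {"name": "Financials", "description": "Revenue, income, and financial statements"},
    {"name": "Leadership", "description": "Executives and board members"},
]


def find_page_conflicts(splits: dict[str, list[int]]) -> dict[int, list[str]]:
    """Return pages listed under more than one topic (expected to be empty)."""
    owners = defaultdict(list)
    for topic, pages in splits.items():
        for page in pages:
            owners[page].append(topic)
    # Keep only pages claimed by two or more topics.
    return {page: names for page, names in owners.items() if len(names) > 1}


# Assumed shape of a split result: topic name -> list of page numbers.
splits = {"Financials": [1, 2, 3], "Leadership": [4, 5]}
print(find_page_conflicts(splits))  # {}
```

An empty result means every page has a single owner. A non-empty result would list each shared page with the topics that claim it.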
For each topic, define a JSON Schema tailored to the data you expect in that section. Each topic gets its own schema and optional prompt.
Example schema for a Financials topic:
{
  "type": "object",
  "properties": {
    "total_revenue": { "type": "number", "description": "Total revenue for the fiscal year" },
    "net_income": { "type": "number", "description": "Net income after taxes" },
    "revenue_growth_pct": { "type": "number", "description": "Year-over-year revenue growth %" }
  },
  "required": ["total_revenue"]
}

Example schema for a Leadership topic:
{
  "type": "object",
  "properties": {
    "ceo": { "type": "string", "description": "Name of the CEO" },
    "board_members": {
      "type": "array",
      "items": { "type": "string" },
      "description": "List of board member names"
    }
  }
}
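To see what a per-topic schema constrains, the sketch below runs a deliberately simplified check of extracted values against the financial schema above. It is an illustration only: `check_against_schema` is a hypothetical helper that covers just `required` and top-level `type`, not full JSON Schema validation, and it does not reflect how Pulse validates output internally.

```python
# Maps JSON Schema type names to Python types (simplified; only the types used here).
JSON_TYPES = {"number": (int, float), "string": str, "array": list, "object": dict}

financials_schema = {
    "type": "object",
    "properties": {
        "total_revenue": {"type": "number"},
        "net_income": {"type": "number"},
        "revenue_growth_pct": {"type": "number"},
    },
    "required": ["total_revenue"],
}


def check_against_schema(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue  # optional field that was not extracted
        type_name = spec.get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # type not covered by this simplified check
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly for numbers.
        is_bool_as_number = type_name == "number" and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_number:
            problems.append(f"{key}: expected {type_name}, got {type(value).__name__}")
    return problems


print(check_against_schema({"total_revenue": 1250.5, "net_income": 310}, financials_schema))
print(check_against_schema({"net_income": "12"}, financials_schema))
```

The first call prints an empty list. The second reports the missing `total_revenue` and the string-valued `net_income`. For real validation, a full JSON Schema validator is the better choice.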
What You Get Back
Everything from Extract, plus:

| Field | Description |
|---|---|
| `split_output.splits` | Page assignments, for example `{ "Financials": [1,2,3], "Leadership": [4,5] }` |
| `split_id` | Saved split result ID |
| `results` | Per-topic structured output with values and citations for each topic |
| `schema_id` | Saved schema result ID |
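The fields above can be combined into a per-topic view once a response has been parsed into a dictionary. This is a sketch under assumptions: the field names come from the table, but the exact nesting, the idea that `results` is keyed by topic name, and every value in `sample_response` are illustrative placeholders rather than real Pulse output. `summarize_by_topic` is a hypothetical helper.

```python
def summarize_by_topic(response: dict) -> dict[str, dict]:
    """Pair each topic's page list with its structured result, if any."""
    splits = response.get("split_output", {}).get("splits", {})
    results = response.get("results", {})  # assumed to be keyed by topic name
    return {
        topic: {"pages": pages, "result": results.get(topic)}
        for topic, pages in splits.items()
    }


# Placeholder response; IDs and values are made up for illustration.
sample_response = {
    "split_output": {"splits": {"Financials": [1, 2, 3], "Leadership": [4, 5]}},
    "split_id": "split-placeholder",
    "results": {"Financials": {"total_revenue": 1250.5}},
    "schema_id": "schema-placeholder",
}

for topic, info in summarize_by_topic(sample_response).items():
    print(topic, info["pages"], info["result"])
```

Topics without a matching entry in `results` come back with `None`, which makes it easy to spot sections that were split out but produced no structured output.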
API Usage
- Python
- TypeScript
- curl
