Documentation Index Fetch the complete documentation index at: https://docs.runpulse.com/llms.txt
Use this file to discover all available pages before exploring further.
Overview
The Extract → Split → Schema pipeline is the most powerful processing mode in Pulse. After extracting the document, it splits the pages into topic-based sections and then applies a different schema to each section. This is ideal for long, multi-section documents where different parts contain different kinds of data.
When to Use
Annual reports — Financials, Leadership, and Outlook each have different data to extract
Multi-section contracts — different clause types (indemnification, IP rights, payment terms) need different schemas
Research papers — Abstract, Methodology, Results, and Conclusion each have distinct structure
Insurance documents — policy details, claims history, and coverage schedules are all different
Regulatory filings — mixed sections like company overview, financial statements, risk factors
If your entire document uses one schema , use Extract → Schema instead — it’s simpler and faster.
How to Use in the Playground
Set page range, figure extraction, chunking, and other options on the
Configuration tab — same as
Extract Only .
Switch to the Split step. Add topics with names and descriptions — Pulse uses these to assign pages to topics based on document content.
Topic Description
Financials Revenue, expenses, and profit data Leadership Executive team and board of directors Outlook Future plans, projections, and guidance
The split step assigns whole pages to topics. A page belongs to the topic that best matches its content. Pages can only belong to one topic.
For each topic, define a JSON Schema tailored to the data you expect in that section. Each topic gets its own schema and optional prompt.
Example — Financials schema:
{
"type" : "object" ,
"properties" : {
"total_revenue" : { "type" : "number" , "description" : "Total revenue for the fiscal year" },
"net_income" : { "type" : "number" , "description" : "Net income after taxes" },
"revenue_growth_pct" : { "type" : "number" , "description" : "Year-over-year revenue growth %" }
},
"required" : [ "total_revenue" ]
}
Example — Leadership schema:
{
"type" : "object" ,
"properties" : {
"ceo" : { "type" : "string" , "description" : "Name of the CEO" },
"board_members" : {
"type" : "array" ,
"items" : { "type" : "string" },
"description" : "List of board member names"
}
}
}
Click Extract All . The pipeline chains all three steps automatically:
Extract — converts the document to markdown
Split — assigns pages to topics based on content
Schema — runs each topic’s schema against its assigned pages
Results appear organized by topic. Switch between topics to see:
Page assignments — which pages belong to each topic
Structured output — the JSON extracted for each topic
Citations — where in the document each value was found
What You Get Back
Everything from Extract , plus:
Field Description split_output.splitsPage assignments — { "Financials": [1,2,3], "Leadership": [4,5] } split_idSaved split result ID resultsPer-topic structured output with values and citations for each topic schema_idSaved schema result ID
API Usage
from pulse import Pulse
client = Pulse( api_key = "YOUR_API_KEY" )
# Step 1: Extract
extract_result = client.extract(
file = open ( "annual_report.pdf" , "rb" ),
async_ = True ,
storage = { "enabled" : True }
)
extraction_id = extract_result.extraction_id
# Step 2: Split
split_result = client.split(
extraction_id = extraction_id,
split_config = {
"split_input" : [
{ "name" : "Financials" , "description" : "Revenue, expenses, and profit data" },
{ "name" : "Leadership" , "description" : "Executive team and board members" },
{ "name" : "Outlook" , "description" : "Future plans and projections" }
]
}
)
split_id = split_result.split_id
print ( f "Split: { split_result.split_output } " )
# Step 3: Schema per topic
schema_result = client.schema(
split_id = split_id,
split_schema_config = {
"Financials" : {
"schema" : {
"type" : "object" ,
"properties" : {
"total_revenue" : { "type" : "number" },
"net_income" : { "type" : "number" }
}
},
"schema_prompt" : "Extract financial metrics"
},
"Leadership" : {
"schema" : {
"type" : "object" ,
"properties" : {
"ceo" : { "type" : "string" },
"board_members" : { "type" : "array" , "items" : { "type" : "string" }}
}
},
"schema_prompt" : "Extract leadership information"
},
"Outlook" : {
"schema" : {
"type" : "object" ,
"properties" : {
"guidance" : { "type" : "string" },
"growth_target" : { "type" : "number" }
}
},
"schema_prompt" : "Extract forward-looking statements"
}
}
)
for topic, data in schema_result.results.items():
print ( f " { topic } : { data } " )
import { PulseClient } from "pulse-ts-sdk" ;
import fs from "fs" ;
const client = new PulseClient ({
apiKey: "YOUR_API_KEY"
});
// Step 1: Extract (async)
const extractResult = await client . extract ({
file: fs . createReadStream ( "annual_report.pdf" ),
async: true ,
storage: { enabled: true }
});
const extractionId = extractResult . extraction_id ;
// Step 2: Split
const splitResult = await client . split ({
extraction_id: extractionId ,
split_config: {
split_input: [
{ name: "Financials" , description: "Revenue, expenses, and profit data" },
{ name: "Leadership" , description: "Executive team and board members" },
{ name: "Outlook" , description: "Future plans and projections" }
]
}
});
const splitId = splitResult . split_id ;
// Step 3: Schema per topic
const schemaResult = await client . schema ({
split_id: splitId ,
split_schema_config: {
Financials: {
schema: { type: "object" , properties: { total_revenue: { type: "number" } } },
schema_prompt: "Extract financial metrics"
},
Leadership: {
schema: { type: "object" , properties: { ceo: { type: "string" } } },
schema_prompt: "Extract leadership information"
},
Outlook: {
schema: { type: "object" , properties: { guidance: { type: "string" } } },
schema_prompt: "Extract forward-looking statements"
}
}
});
console . log ( schemaResult . results );
# Step 1: Extract
curl -X POST https://api.runpulse.com/extract \
-H "x-api-key: YOUR_API_KEY" \
-F "file=@annual_report.pdf" \
-F "async=true" \
-F 'storage={"enabled": true}'
# Save extraction_id from response
# Step 2: Split
curl -X POST https://api.runpulse.com/split \
-H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"extraction_id": "EXTRACTION_ID",
"split_config": {
"split_input": [
{"name": "Financials", "description": "Revenue, expenses, and profit data"},
{"name": "Leadership", "description": "Executive team and board members"},
{"name": "Outlook", "description": "Future plans and projections"}
]
}
}'
# Save split_id from response
# Step 3: Schema per topic
curl -X POST https://api.runpulse.com/schema \
-H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"split_id": "SPLIT_ID",
"split_schema_config": {
"Financials": {
"schema": {"type": "object", "properties": {"total_revenue": {"type": "number"}}},
"schema_prompt": "Extract financial metrics"
},
"Leadership": {
"schema": {"type": "object", "properties": {"ceo": {"type": "string"}}},
"schema_prompt": "Extract leadership information"
}
}
}'
You don’t have to add schema after splitting. If you just want to know which pages belong to which topic — without structured extraction — you can stop after the split step. This is useful for document triage or routing.
# Extract → Split only (no schema)
split_result = client.split(
extraction_id = extraction_id,
split_config = {
"split_input" : [
{ "name" : "Relevant" , "description" : "Pages related to our query" },
{ "name" : "Not Relevant" , "description" : "Background or boilerplate pages" }
]
}
)
print (split_result.split_output)
# {"splits": {"Relevant": [1,3,5], "Not Relevant": [2,4,6]}}
Split API Reference Full API documentation for the /split endpoint
Schema API Reference Full API documentation for the /schema endpoint (single and split mode)