
Overview

The Extract → Split → Schema pipeline is the most powerful processing mode in Pulse. After extracting the document, it splits the pages into topic-based sections and then applies a different schema to each section. This is ideal for long, multi-section documents where different parts contain different kinds of data.

When to Use

  • Annual reports — Financials, Leadership, and Outlook each have different data to extract
  • Multi-section contracts — different clause types (indemnification, IP rights, payment terms) need different schemas
  • Research papers — Abstract, Methodology, Results, and Conclusion each have distinct structure
  • Insurance documents — policy details, claims history, and coverage schedules are all different
  • Regulatory filings — mixed sections like company overview, financial statements, risk factors

If your entire document uses one schema, use Extract → Schema instead. It is simpler and faster.

How to Use in the Playground

1. Configure extraction settings

Set page range, figure extraction, chunking, and other options on the Configuration tab, the same as for Extract Only.

2. Define split topics

Switch to the Split step. Add topics with names and descriptions. Pulse uses these to assign pages to topics based on document content.

| Topic | Description |
| --- | --- |
| Financials | Revenue, expenses, and profit data |
| Leadership | Executive team and board of directors |
| Outlook | Future plans, projections, and guidance |

The split step assigns whole pages to topics. A page belongs to the topic that best matches its content. Pages can only belong to one topic.
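
The one-topic-per-page rule can be checked on the client side. The sketch below is a minimal, hypothetical helper (`find_page_conflicts` is not part of the Pulse SDK) that takes a mapping of topic name to page numbers, the shape used for split results in this guide, and reports any page listed under more than one topic. The page numbers are illustrative.

```python
from collections import defaultdict


def find_page_conflicts(splits):
    """Return {page: [topics]} for pages assigned to more than one topic."""
    topics_by_page = defaultdict(list)
    for topic, pages in splits.items():
        for page in pages:
            topics_by_page[page].append(topic)
    return {page: topics for page, topics in topics_by_page.items() if len(topics) > 1}


# Illustrative split result: topic name -> page numbers
splits = {"Financials": [1, 2, 3], "Leadership": [4, 5], "Outlook": [6]}
conflicts = find_page_conflicts(splits)
print(conflicts or "each page belongs to exactly one topic")
```

An empty result means the assignment is a clean partition of the listed pages.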

3. Define per-topic schemas

For each topic, define a JSON Schema tailored to the data you expect in that section. Each topic gets its own schema and optional prompt.

Example (Financials schema):
{
  "type": "object",
  "properties": {
    "total_revenue": { "type": "number", "description": "Total revenue for the fiscal year" },
    "net_income": { "type": "number", "description": "Net income after taxes" },
    "revenue_growth_pct": { "type": "number", "description": "Year-over-year revenue growth %" }
  },
  "required": ["total_revenue"]
}

Example (Leadership schema):
{
  "type": "object",
  "properties": {
    "ceo": { "type": "string", "description": "Name of the CEO" },
    "board_members": {
      "type": "array",
      "items": { "type": "string" },
      "description": "List of board member names"
    }
  }
}
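
Before relying on a topic's output downstream, it can help to check it against the schema you defined. The sketch below is a deliberately small checker, not a full JSON Schema validator: it only looks at `required` and at the top-level `type` of each property, for the few types used in the examples above. `check_against_schema` and the sample values are hypothetical; a dedicated validator library would be the more robust choice.

```python
# Maps the JSON Schema type names used in this guide to Python types.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "array": list,
    "object": dict,
}


def check_against_schema(data, schema):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for field in schema.get("required", []):
        if field not in data:
            problems.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field not in data:
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly for numbers.
        if expected and (not isinstance(value, expected) or isinstance(value, bool)):
            problems.append(f"{field}: expected {spec['type']}")
    return problems


financials_schema = {
    "type": "object",
    "properties": {
        "total_revenue": {"type": "number"},
        "net_income": {"type": "number"},
        "revenue_growth_pct": {"type": "number"},
    },
    "required": ["total_revenue"],
}

# Made-up sample output, for illustration only.
sample = {"net_income": 12.5, "revenue_growth_pct": "8%"}
for problem in check_against_schema(sample, financials_schema):
    print(problem)
```

Here the sample is flagged twice: `total_revenue` is required but absent, and `revenue_growth_pct` arrived as a string rather than a number.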

4. Upload and extract

Click Extract All. The pipeline chains all three steps automatically:
  • Extract — converts the document to markdown
  • Split — assigns pages to topics based on content
  • Schema — runs each topic’s schema against its assigned pages

5. Review results

Results appear organized by topic. Switch between topics to see:
  • Page assignments — which pages belong to each topic
  • Structured output — the JSON extracted for each topic
  • Citations — where in the document each value was found

What You Get Back

Everything from Extract, plus:

| Field | Description |
| --- | --- |
| split_output.splits | Page assignments, e.g. { "Financials": [1,2,3], "Leadership": [4,5] } |
| split_id | Saved split result ID |
| results | Per-topic structured output with values and citations for each topic |
| schema_id | Saved schema result ID |
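
Since `split_output.splits` maps each topic to its pages, a common follow-up is the reverse lookup: which topic does a given page belong to? A minimal sketch, assuming the mapping has the shape shown above (`page_to_topic` is a hypothetical helper, not an SDK function):

```python
def page_to_topic(splits):
    """Invert {topic: [pages]} into {page: topic}."""
    lookup = {}
    for topic, pages in splits.items():
        for page in pages:
            lookup[page] = topic
    return lookup


splits = {"Financials": [1, 2, 3], "Leadership": [4, 5]}
lookup = page_to_topic(splits)
print(lookup[4])  # Leadership
```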

API Usage

    from pulse_python_sdk import Pulse
    
    client = Pulse(api_key="YOUR_API_KEY")
    
    # Step 1: Extract (the with-block closes the file once the call returns)
    with open("annual_report.pdf", "rb") as f:
        extract_result = client.extract(
            file=f,
            async_=True,
            storage={"enabled": True}
        )
    extraction_id = extract_result.extraction_id
    
    # Step 2: Split
    split_result = client.split.document(
        extraction_id=extraction_id,
        split_config={
            "topics": [
                {"name": "Financials", "description": "Revenue, expenses, and profit data"},
                {"name": "Leadership", "description": "Executive team and board members"},
                {"name": "Outlook", "description": "Future plans and projections"}
            ]
        }
    )
    split_id = split_result.split_id
    print(f"Split: {split_result.split_output}")
    
    # Step 3: Schema per topic
    schema_result = client.schema.extract_schema(
        split_id=split_id,
        split_schema_config={
            "Financials": {
                "schema": {
                    "type": "object",
                    "properties": {
                        "total_revenue": {"type": "number"},
                        "net_income": {"type": "number"}
                    }
                },
                "schema_prompt": "Extract financial metrics"
            },
            "Leadership": {
                "schema": {
                    "type": "object",
                    "properties": {
                        "ceo": {"type": "string"},
                        "board_members": {"type": "array", "items": {"type": "string"}}
                    }
                },
                "schema_prompt": "Extract leadership information"
            },
            "Outlook": {
                "schema": {
                    "type": "object",
                    "properties": {
                        "guidance": {"type": "string"},
                        "growth_target": {"type": "number"}
                    }
                },
                "schema_prompt": "Extract forward-looking statements"
            }
        }
    )
    
    for topic, data in schema_result.results.items():
        print(f"{topic}: {data}")
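
In the example above, the keys of `split_schema_config` are the same strings as the topic names passed to the split step. Assuming that correspondence is required (this guide does not say what happens on a mismatch), a small pre-flight check can catch typos before the schema call. `missing_and_extra_topics` is a hypothetical helper, not part of the SDK:

```python
def missing_and_extra_topics(topics, split_schema_config):
    """Compare split topic names with the keys of the per-topic schema config."""
    topic_names = {t["name"] for t in topics}
    config_names = set(split_schema_config)
    missing = sorted(topic_names - config_names)  # topics with no schema
    extra = sorted(config_names - topic_names)    # schemas with no matching topic
    return missing, extra


topics = [
    {"name": "Financials", "description": "Revenue, expenses, and profit data"},
    {"name": "Leadership", "description": "Executive team and board members"},
    {"name": "Outlook", "description": "Future plans and projections"},
]
# "Outlok" is a deliberate typo to show what the check reports.
config = {"Financials": {"schema": {}}, "Leadership": {"schema": {}}, "Outlok": {"schema": {}}}

missing, extra = missing_and_extra_topics(topics, config)
print(missing, extra)  # ['Outlook'] ['Outlok']
```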
    

Skipping Schema (Extract → Split Only)

You don’t have to run the schema step after splitting. If you only want to know which pages belong to which topic, without structured extraction, you can stop after the split step. This is useful for document triage or routing.

    # Extract → Split only (no schema)
    split_result = client.split.document(
        extraction_id=extraction_id,
        split_config={
            "topics": [
                {"name": "Relevant", "description": "Pages related to our query"},
                {"name": "Not Relevant", "description": "Background or boilerplate pages"}
            ]
        }
    )
    
    print(split_result.split_output)
    # {"splits": {"Relevant": [1,3,5], "Not Relevant": [2,4,6]}}
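
For triage, the split output can drive routing directly. The sketch below assumes `split_output` is (or can be converted to) a plain dict shaped like the comment above. `pages_for` is a hypothetical helper, and what you do with the selected pages afterwards is left out.

```python
def pages_for(split_output, topic):
    """Return the sorted page numbers assigned to a topic, or [] if absent."""
    return sorted(split_output.get("splits", {}).get(topic, []))


split_output = {"splits": {"Relevant": [1, 3, 5], "Not Relevant": [2, 4, 6]}}

relevant = pages_for(split_output, "Relevant")
print(f"Send {len(relevant)} pages for review: {relevant}")
```

Returning an empty list for an unknown topic keeps the routing code free of key checks.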