Goal

Extract different types of structured data from different sections of a long report without forcing one huge schema across the whole document.

Sample Document

Use the built-in 10-K Annual Report Platform example. The example includes a saved Split output with three topics: financial statement schedules (including exhibits), signatures, and certifications.

Use This Workflow

Use Extract -> Split -> Schema when sections differ in vocabulary, layout, or output requirements.

Split Topics

Topic | Description
Financial Statement Schedules | Financial statement schedules, exhibits, and supporting tables.
Signatures | Signature blocks, officers, titles, and signing dates.
Certifications | Officer certifications and compliance attestations.

Platform Steps

1. Extract the report

Upload the report and run Extract. Use async for long PDFs.

2. Add Split

Add topics with clear names and descriptions. Run Split and inspect page assignments.

3. Add per-topic schemas

Define a different schema for each topic. Keep each schema narrow and specific.

4. Save presets

Save the split config and per-topic schema configs once the workflow works across multiple reports.
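
The page-assignment inspection in step 2 can be partly automated. The sketch below assumes you have already copied the assignments into a plain dict mapping each topic name to a list of page numbers; that shape, the helper name `check_page_assignments`, and the sample data are hypothetical and not the Split response format. It reports pages claimed by more than one topic and pages claimed by none.

```python
from collections import defaultdict


def check_page_assignments(assignments, total_pages):
    """Return (overlaps, unassigned) for a topic -> pages mapping.

    overlaps: {page: sorted topic names} for pages claimed by two or more topics.
    unassigned: sorted pages in 1..total_pages that no topic claims.
    """
    topics_by_page = defaultdict(set)
    for topic, pages in assignments.items():
        for page in pages:
            topics_by_page[page].add(topic)

    overlaps = {
        page: sorted(topics)
        for page, topics in sorted(topics_by_page.items())
        if len(topics) > 1
    }
    unassigned = [
        page for page in range(1, total_pages + 1) if page not in topics_by_page
    ]
    return overlaps, unassigned


# Hypothetical assignments for illustration only, not real Split output.
example = {
    "Financial Statement Schedules": [1, 2, 3],
    "Signatures": [3, 4],
    "Certifications": [6],
}
overlaps, unassigned = check_page_assignments(example, total_pages=6)
print(overlaps)    # pages assigned to more than one topic
print(unassigned)  # pages assigned to no topic
```

An overlap usually points to topic descriptions that are too similar. Unassigned pages are not necessarily a problem, since some sections of a report may fall outside every topic.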

Python

from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

extract_result = client.extract(
    file_url="https://platform.runpulse.com/api/examples/0514fc05-8b0a-4a3b-9b9b-18ac89fc04e5/pdf",
    async_=True,
)

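# With async_=True the call returns before the extraction has finished.
# In production, poll extract_result.job_id until the job completes, then read
# the extraction_id from the job result. The line below assumes it is already
# available.
extraction_id = extract_result.extraction_id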

split_result = client.split(
    extraction_id=extraction_id,
    split_config={
        "split_input": [
            {"name": "Financial Statement Schedules", "description": "Financial statement schedules, exhibits, and supporting tables"},
            {"name": "Signatures", "description": "Signature blocks, officers, titles, and signing dates"},
            {"name": "Certifications", "description": "Officer certifications and compliance attestations"},
        ]
    },
)

schema_result = client.schema(
    split_id=split_result.split_id,
    split_schema_config={
        "Financial Statement Schedules": {
            "schema": {
                "type": "object",
                "properties": {
                    "schedule_names": {"type": "array", "items": {"type": "string"}},
                    "exhibit_numbers": {"type": "array", "items": {"type": "string"}},
                },
            },
            "schema_prompt": "Extract schedule names and exhibit identifiers.",
        },
        "Signatures": {
            "schema": {
                "type": "object",
                "properties": {
                    "signatories": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "title": {"type": "string"},
                                "date": {"type": "string"},
                            },
                        },
                    },
                },
            }
        },
        "Certifications": {
            "schema": {
                "type": "object",
                "properties": {
                    "certifying_officers": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "Officers who signed certifications",
                    }
                },
            }
        },
    },
)

print(schema_result.results)
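
The comment in the sample above says to poll the job when `async_=True`, but does not show how. The following is a minimal polling sketch, not the Pulse SDK's own method: `wait_for_job`, the `fetch_job` callable, the dict-shaped job, and the `"completed"` / `"failed"` status strings are all assumptions made for illustration. Replace `fetch_job` with whatever call your SDK version provides for reading job status.

```python
import time


def wait_for_job(fetch_job, job_id, interval_s=2.0, timeout_s=600.0, sleep=time.sleep):
    """Poll fetch_job(job_id) until the job reaches a terminal status.

    fetch_job is a caller-supplied function (hypothetical here) that returns
    a dict with a "status" key. The status names are assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"job {job_id} failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s} seconds")
        sleep(interval_s)


# Stand-in for a real status call: pending once, then completed.
_responses = iter([
    {"status": "pending"},
    {"status": "completed", "extraction_id": "ext_demo"},
])


def fake_fetch(job_id):
    return next(_responses)


job = wait_for_job(fake_fetch, "job_demo", sleep=lambda seconds: None)
print(job["extraction_id"])
```

Passing `sleep` as a parameter keeps the helper testable without real delays. The timeout guards against a job that never reaches a terminal status.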

Checks

  • Split topics should be mutually exclusive and easy to describe.
  • Inspect page assignments before trusting schema output.
  • Keep per-topic schemas smaller than a single all-purpose schema.
  • If the same topic appears across many documents, save it as a split preset.
  • For regulated review, store the split output and schema version IDs with the downstream record so reviewers can reproduce the exact extraction.
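
The last check can be sketched as a small audit record stored next to the downstream data. This is one possible shape, under stated assumptions: `build_audit_record` and `schema_fingerprint` are hypothetical helpers, and the SHA-256 fingerprint is a local stand-in for a schema version ID, not an identifier issued by the platform. If you have the platform's schema version IDs, store those instead of or alongside the fingerprint.

```python
import hashlib
import json


def schema_fingerprint(schema_config):
    """Hash a schema config deterministically, independent of key order."""
    canonical = json.dumps(schema_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_record(extraction_id, split_id, split_schema_config, results):
    """Bundle the IDs, per-topic schema fingerprints, and results for review."""
    return {
        "extraction_id": extraction_id,
        "split_id": split_id,
        "schema_fingerprints": {
            topic: schema_fingerprint(config)
            for topic, config in split_schema_config.items()
        },
        "results": results,
    }


# Illustrative values only; the IDs and results below are made up.
signature_config = {
    "schema": {
        "type": "object",
        "properties": {"signatories": {"type": "array"}},
    }
}
record = build_audit_record(
    extraction_id="ext_demo",
    split_id="split_demo",
    split_schema_config={"Signatures": signature_config},
    results={"Signatures": {"signatories": []}},
)
print(json.dumps(record, indent=2))
```

Because the fingerprint is computed from a canonical JSON form, two configs with the same content but different key order produce the same value, so a changed fingerprint signals a real schema change.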

Extract -> Split -> Schema

Full Platform walkthrough.

Chaining Steps

Understand the extraction_id to split_id handoff.

Sample Documents

Use a long sample PDF to test topic splits.