
Overview

The Extract → Schema pipeline adds structured data extraction on top of the base extraction. You define a JSON Schema describing the fields you want, and Pulse extracts them from the entire document as structured JSON with citations.

When to Use

  • Invoice processing — extract vendor name, invoice number, line items, totals
  • Form extraction — pull fields from applications, tax forms, insurance claims
  • Contract parsing — extract parties, dates, clauses, obligations
  • Single-structure documents — any document where one schema covers the entire content

If your document has distinct sections that need different schemas (e.g., an annual report with Financials, Leadership, and Outlook), use Extract → Split → Schema instead.

How to Use in the Playground

1. Configure extraction settings

Set page range, figure extraction, chunking, and other options on the Configuration tab, same as Extract Only.

2. Define your schema

Switch to the Schema step in the pipeline tabs. Define a JSON Schema describing the fields you want to extract:

{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string", "description": "The invoice identifier" },
    "vendor_name": { "type": "string", "description": "Name of the vendor or seller" },
    "total_amount": { "type": "number", "description": "Total amount due" },
    "due_date": { "type": "string", "description": "Payment due date" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "integer" },
          "unit_price": { "type": "number" }
        }
      }
    }
  },
  "required": ["invoice_number", "vendor_name"]
}

Optionally add a schema prompt to guide the extraction, for example: “Extract billing details from this invoice. Line items should include all products listed.”

3. Upload and extract

Click Extract All. The pipeline runs both steps automatically:

  • Extract — processes the document into markdown
  • Schema — applies your schema to the extracted content

4. Review structured output

Results appear in the Schema tab as structured JSON. Each extracted field includes citations pointing to the exact location in the document where the value was found.

What You Get Back

Everything from Extract, plus:

Field                      Description
schema_output.values       Extracted field values matching your JSON Schema
schema_output.citations    Source locations for each extracted value
schema_id                  Saved schema result ID
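
Fields that are not found come back as null, so it can help to scan schema_output.values for gaps before passing the result downstream. The following is a minimal sketch, assuming the values have already been converted to a plain Python dict. The helper name find_null_fields and the sample payload are made up for illustration and are not part of the Pulse SDK.

```python
from typing import Any


def find_null_fields(values: Any, prefix: str = "") -> list[str]:
    """Return dotted paths of every field whose value is None."""
    missing: list[str] = []
    if isinstance(values, dict):
        for key, value in values.items():
            path = f"{prefix}.{key}" if prefix else key
            if value is None:
                missing.append(path)
            else:
                missing.extend(find_null_fields(value, path))
    elif isinstance(values, list):
        for index, item in enumerate(values):
            path = f"{prefix}[{index}]"
            if item is None:
                missing.append(path)
            else:
                missing.extend(find_null_fields(item, path))
    return missing


# Hypothetical payload shaped like the invoice schema on this page.
sample_values = {
    "invoice_number": "INV-1001",
    "vendor_name": "Acme Supplies",
    "total_amount": 250.0,
    "due_date": None,
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 100.0},
        {"description": "Shipping", "quantity": None, "unit_price": 50.0},
    ],
}

print(find_null_fields(sample_values))
```

Reporting paths rather than a simple yes/no makes it easier to decide whether a gap is acceptable (an optional due date) or a sign that the field description needs to be more specific.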

Schema Tips

The description property in your JSON Schema helps Pulse understand what to look for. Be specific:

// Good
"vendor_name": { "type": "string", "description": "Full legal name of the vendor or seller, as shown in the invoice header" }

// Less helpful
"vendor_name": { "type": "string" }

Mark fields as required when you know they’ll always be present. Optional fields are returned as null if not found.

For tables or lists in the document (line items, attendees, clauses), use "type": "array" with an items schema.

The schema prompt gives the extraction model additional context. Use it to clarify ambiguities or specify preferences.

If your schema has many nested fields or the document layout is complex, enable Effort mode in the extraction settings for higher accuracy.
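
The tips above can be checked mechanically before a schema is submitted. The sketch below is a small local linter written for illustration only: lint_schema is a hypothetical helper, not a Pulse feature, and it only covers the three checks named in the tips (descriptions present, required names defined, arrays given an items schema).

```python
def lint_schema(schema: dict, path: str = "$") -> list[str]:
    """Collect warnings for schema patterns the tips advise against."""
    warnings: list[str] = []
    schema_type = schema.get("type")
    if schema_type == "object":
        properties = schema.get("properties", {})
        # Every name in "required" should be a declared property.
        for name in schema.get("required", []):
            if name not in properties:
                warnings.append(
                    f"{path}: required field '{name}' is not defined in properties"
                )
        for name, sub_schema in properties.items():
            child = f"{path}.{name}"
            if not sub_schema.get("description"):
                warnings.append(f"{child}: missing description")
            warnings.extend(lint_schema(sub_schema, child))
    elif schema_type == "array":
        items = schema.get("items")
        if not isinstance(items, dict):
            warnings.append(f"{path}: array has no items schema")
        else:
            warnings.extend(lint_schema(items, f"{path}[]"))
    return warnings


# Deliberately flawed example schema to show each warning.
draft_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {
            "type": "string",
            "description": "Full legal name of the vendor or seller",
        },
        "line_items": {"type": "array", "description": "Products listed"},
        "total_amount": {"type": "number"},
    },
    "required": ["vendor_name", "invoice_number"],
}

for warning in lint_schema(draft_schema):
    print(warning)
```

A missing description is reported as a warning rather than an error because the schema is still valid JSON Schema; the tips only say that descriptions improve extraction quality.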

API Usage

from pulse_python_sdk import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Step 1: Extract the document. The context manager closes the file handle
# once the upload call returns.
with open("invoice.pdf", "rb") as f:
    extract_result = client.extract(
        file=f,
        async_=True,
        storage={"enabled": True},
    )

# With async_=True the extraction runs in the background. Poll for completion
# (polling is not shown here), then get the extraction_id.
extraction_id = extract_result.extraction_id

# Step 2: Apply the schema to the completed extraction
schema_result = client.schema.extract_schema(
    extraction_id=extraction_id,
    schema_config={
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "vendor_name": {"type": "string"},
                "total_amount": {"type": "number"},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "integer"},
                            "unit_price": {"type": "number"},
                        },
                    },
                },
            },
            "required": ["invoice_number", "vendor_name"],
        },
        "schema_prompt": "Extract all billing details from this invoice",
    },
)

# Extracted values and their citations (see "What You Get Back")
print(schema_result.schema_output)
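
If the schema was first written in the Playground, it may be convenient to keep it as JSON text and build the schema_config dict from it, so the Playground and the API call use the same definition. A minimal sketch follows, mirroring only the "schema" and "schema_prompt" keys shown above. The helper name build_schema_config is hypothetical, and treating the prompt as optional in the API is an assumption based on it being optional in the Playground.

```python
import json
from typing import Optional


def build_schema_config(schema_json: str, schema_prompt: Optional[str] = None) -> dict:
    """Parse a JSON Schema string and wrap it in a schema_config dict."""
    schema = json.loads(schema_json)  # raises json.JSONDecodeError on bad JSON
    config = {"schema": schema}
    if schema_prompt:
        # Assumption: the prompt can be left out when no extra guidance is needed.
        config["schema_prompt"] = schema_prompt
    return config


# Schema text as it might be copied out of the Playground editor.
invoice_schema_json = """
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string", "description": "The invoice identifier" },
    "vendor_name": { "type": "string", "description": "Name of the vendor or seller" }
  },
  "required": ["invoice_number", "vendor_name"]
}
"""

schema_config = build_schema_config(
    invoice_schema_json,
    schema_prompt="Extract all billing details from this invoice",
)
print(json.dumps(schema_config, indent=2))
```

The resulting dict has the same shape as the schema_config argument in the example above, so it could be passed in its place.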
    

Iterating on Your Schema

You don’t need to re-extract the document to try a different schema. Use a Schema-Only Rerun instead; see Reruns for details.