Deprecation Notice: The structured_output parameter on /extract is deprecated. Use the /schema endpoint after extraction instead. The schema format and design principles on this page still apply — just pass your schema to /schema via schema_config instead of to /extract via structured_output.

Overview

This guide covers best practices for designing JSON schemas used with the /schema endpoint (recommended) or the legacy structured_output parameter on /extract.

Schema Format

The input_schema field of schema_config (and the legacy structured_output.schema field) uses the JSON Schema specification (OpenAPI 3.1 compatible). This is the same schema format used by OpenAI’s structured outputs and by other LLM providers.

Key JSON Schema Properties

| Property | Description | Example |
| --- | --- | --- |
| type | Data type of the field | "string", "number", "boolean", "object", "array" |
| properties | Define fields for an object | {"name": {"type": "string"}} |
| items | Define schema for array elements | {"type": "object", "properties": {...}} |
| required | List of required field names | ["name", "email"] |
| description | Human-readable description to guide extraction | "Customer's full name" |
| format | Hint for string formatting | "date", "email", "uri" |

Don’t write schemas by hand! Use the Schema Editor in the Pulse Platform to generate and refine schemas interactively.
The Schema Editor provides two powerful ways to create schemas:

1. Generate from Prompt

Describe what you want to extract in natural language, and the editor will generate a properly formatted JSON Schema for you.
“Extract the account holder name, account number, statement period, opening and closing balances, and all transactions with date, description, and amount.”

2. Interactive Editor

  • Visually add, remove, and reorder fields
  • Set field types and descriptions
  • Mark fields as required
  • Preview the generated schema in real-time
  • Test against sample documents
Once you’re happy with your schema, copy it directly into your API requests. The recommended approach is a two-step flow:
  1. Extract the document via /extract to get an extraction_id
  2. Apply a schema via /schema using the extraction_id
The schema_config object contains:
| Field | Type | Description |
| --- | --- | --- |
| input_schema | object | JSON schema defining the structure of data to extract |
| schema_prompt | string | Natural language instructions to guide extraction |
| effort | boolean | Enable extended reasoning for complex documents |
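
As a minimal sketch of how these fields fit together, the helper below assembles a /schema request body as a plain dictionary. The name build_schema_request is hypothetical (it is not part of the Pulse SDK), and leaving optional keys out when they are unset is an assumption about what the API accepts.

```python
import json
from typing import Any, Optional


def build_schema_request(
    extraction_id: str,
    input_schema: dict[str, Any],
    schema_prompt: Optional[str] = None,
    effort: Optional[bool] = None,
) -> dict[str, Any]:
    """Assemble a /schema request body (hypothetical helper, not an SDK call)."""
    schema_config: dict[str, Any] = {"input_schema": input_schema}
    # Assumption: optional fields are simply omitted when not set.
    if schema_prompt is not None:
        schema_config["schema_prompt"] = schema_prompt
    if effort is not None:
        schema_config["effort"] = effort
    return {"extraction_id": extraction_id, "schema_config": schema_config}


body = build_schema_request(
    "abc123-...",  # placeholder ID, as in the examples on this page
    {"type": "object", "properties": {"total": {"type": "number"}}},
    schema_prompt="Extract invoice details",
)
print(json.dumps(body, indent=2))
```

Keeping the body as a plain dictionary means the same structure can be serialized for a raw HTTP request or passed as schema_config to an SDK call.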

Bank Statement Example

Here’s an example extracting key fields from a bank statement:

Step 1: Extract
POST /extract
{"file_url": "https://...bank_statement.pdf"}
# → {"extraction_id": "abc123-...", "markdown": "...", ...}
Step 2: Apply Schema
POST /schema
{
  "extraction_id": "abc123-...",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "account_holder": {
          "type": "string",
          "description": "Name of the account holder"
        },
        "account_number": {
          "type": "string",
          "description": "Bank account number"
        },
        "opening_balance": {
          "type": "number",
          "description": "Balance at the start of the statement period"
        },
        "closing_balance": {
          "type": "number",
          "description": "Balance at the end of the statement period"
        }
      },
      "required": ["account_holder", "account_number", "opening_balance", "closing_balance"]
    }
  }
}
Response (schema_output):
{
  "schema_id": "schema-uuid-456",
  "version": 1,
  "schema_output": {
    "values": {
      "account_holder": "JAMES C. MORRISON",
      "account_number": "12345678",
      "opening_balance": 69.96,
      "closing_balance": 586.71
    },
    "citations": {
      "account_holder": {"page": 1, "bbox": [100, 50, 300, 70]}
    }
  }
}
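
In the response above only account_holder carries a citation, so client code should not assume every value has one. A minimal sketch, assuming the response has already been parsed into a dictionary shaped like the example; the helper name value_with_citation is hypothetical.

```python
from typing import Any, Optional

# Parsed response, copied from the example above.
response = {
    "schema_id": "schema-uuid-456",
    "version": 1,
    "schema_output": {
        "values": {
            "account_holder": "JAMES C. MORRISON",
            "account_number": "12345678",
            "opening_balance": 69.96,
            "closing_balance": 586.71,
        },
        "citations": {
            "account_holder": {"page": 1, "bbox": [100, 50, 300, 70]},
        },
    },
}


def value_with_citation(resp: dict[str, Any], field: str) -> tuple[Any, Optional[dict]]:
    """Return (value, citation) for a field; either may be None if absent."""
    output = resp.get("schema_output", {})
    value = output.get("values", {}).get(field)
    citation = output.get("citations", {}).get(field)
    return value, citation


for name in ("account_holder", "closing_balance"):
    value, citation = value_with_citation(response, name)
    page = citation["page"] if citation else "n/a"
    print(f"{name}: {value!r} (page: {page})")
```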

SDK Examples

from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Step 1: Extract
response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
)

# Step 2: Apply schema
schema = {
    "type": "object",
    "properties": {
        "account_holder": {
            "type": "string",
            "description": "Name of the account holder"
        },
        "account_number": {
            "type": "string",
            "description": "Bank account number"
        },
        "opening_balance": {"type": "number"},
        "closing_balance": {"type": "number"}
    },
    "required": ["account_holder", "account_number"]
}

schema_result = client.schema(
    extraction_id=response.extraction_id,
    schema_config={
        "input_schema": schema,
        "schema_prompt": "Extract bank statement details"
    }
)

print(f"Account Holder: {schema_result.schema_output['values']['account_holder']}")
print(f"Balance: {schema_result.schema_output['values']['closing_balance']}")

Schema Format

Schemas follow the JSON Schema specification. Each field is defined with:
| Property | Description |
| --- | --- |
| type | Data type: string, number, boolean, object, array |
| description | Human-readable description to guide extraction |
| format | Optional format hint (e.g., date, email, uri) |
| required | Array of required field names (for objects) |
| items | Schema for array elements |
| properties | Nested field definitions (for objects) |

Data Types

| Type | Description | Example Value |
| --- | --- | --- |
| string | Text values | "John Doe" |
| number | Numeric values (integer or decimal) | 99.99 |
| boolean | True/false values | true |
| object | Nested structures with properties | {"name": {"type": "string"}} |
| array | Lists with items defining element schema | {"type": "array", "items": {...}} |
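
To make the type names concrete, the sketch below checks a Python value against one of the five types listed in the table. The mapping to Python types follows general JSON semantics and is an assumption, not something the Pulse API exposes; matches_type is a hypothetical helper.

```python
from typing import Any

# JSON Schema type name -> Python types accepted for it (assumed mapping).
PYTHON_TYPES = {
    "string": (str,),
    "number": (int, float),
    "boolean": (bool,),
    "object": (dict,),
    "array": (list,),
}


def matches_type(value: Any, type_name: str) -> bool:
    """Return True if value is an instance of the named schema type."""
    if type_name not in PYTHON_TYPES:
        raise ValueError(f"Unsupported type: {type_name!r}")
    # bool is a subclass of int in Python, so exclude it from "number".
    if type_name == "number" and isinstance(value, bool):
        return False
    return isinstance(value, PYTHON_TYPES[type_name])


print(matches_type(99.99, "number"))       # True
print(matches_type(True, "number"))        # False
print(matches_type("John Doe", "string"))  # True
```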

Schema Design Principles

1. Start Simple

Begin with basic fields and gradually add complexity:
{
  "extraction_id": "abc123-...",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"}
      },
      "required": ["invoice_number", "total"]
    }
  }
}
Then expand with nested objects and arrays:
{
  "extraction_id": "abc123-...",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "date": {"type": "string", "format": "date"},
        "vendor": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "address": {"type": "string"}
          }
        },
        "line_items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "description": {"type": "string"},
              "amount": {"type": "number"}
            }
          }
        },
        "total": {"type": "number"}
      },
      "required": ["invoice_number", "total"]
    },
    "schema_prompt": "Extract all invoice details including vendor information and itemized charges."
  }
}
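
The same progression can be scripted when schemas are built in code. A minimal sketch, assuming schemas are held as plain dictionaries; add_fields is a hypothetical helper that returns an extended copy, so the simpler version stays available for comparison.

```python
import copy
from typing import Any


def add_fields(
    schema: dict[str, Any],
    properties: dict[str, Any],
    required: tuple[str, ...] = (),
) -> dict[str, Any]:
    """Return a copy of an object schema with extra properties added."""
    extended = copy.deepcopy(schema)
    extended.setdefault("properties", {}).update(copy.deepcopy(properties))
    current = extended.setdefault("required", [])
    for name in required:
        if name not in current:
            current.append(name)
    return extended


simple = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}

expanded = add_fields(
    simple,
    {
        "date": {"type": "string", "format": "date"},
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
            },
        },
    },
)

print(sorted(expanded["properties"]))  # ['date', 'invoice_number', 'total', 'vendor']
print(sorted(simple["properties"]))    # ['invoice_number', 'total']
```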

2. Use Descriptions

Add description fields to guide extraction:
{
  "properties": {
    "invoice_number": {
      "type": "string",
      "description": "The unique invoice identifier, usually at the top of the document"
    },
    "bill_to": {
      "type": "string", 
      "description": "Customer billing address"
    },
    "remit_to": {
      "type": "string",
      "description": "Payment remittance address"
    }
  }
}
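
One way to keep descriptions from being forgotten is a small check that walks the schema and lists fields without one. This is a sketch under the assumption that schemas are plain dictionaries; fields_missing_description is a hypothetical helper, not a Pulse feature.

```python
from typing import Any, Iterator


def fields_missing_description(schema: dict[str, Any], path: str = "") -> Iterator[str]:
    """Yield dotted paths of properties that lack a non-empty description."""
    for name, field in schema.get("properties", {}).items():
        field_path = f"{path}.{name}" if path else name
        if not field.get("description"):
            yield field_path
        # Recurse into nested objects and into array element schemas.
        if field.get("type") == "object":
            yield from fields_missing_description(field, field_path)
        elif field.get("type") == "array" and isinstance(field.get("items"), dict):
            yield from fields_missing_description(field["items"], f"{field_path}[]")


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The unique invoice identifier, usually at the top of the document",
        },
        "bill_to": {"type": "string"},
        "line_items": {
            "type": "array",
            "description": "Itemized charges",
            "items": {
                "type": "object",
                "properties": {"amount": {"type": "number"}},
            },
        },
    },
}

print(list(fields_missing_description(schema)))
# ['bill_to', 'line_items[].amount']
```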

3. Use schema_prompt for Context

The schema_prompt field provides natural language guidance to help the model understand nuances:
{
  "extraction_id": "abc123-...",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "contract_type": {"type": "string"},
        "effective_date": {"type": "string", "format": "date"},
        "parties": {"type": "array", "items": {"type": "string"}},
        "key_terms": {"type": "array", "items": {"type": "string"}}
      }
    },
    "schema_prompt": "Extract contract details. For key_terms, focus on payment terms, termination clauses, and liability limitations. Format dates as YYYY-MM-DD."
  }
}

Common Schema Patterns

Invoice / Financial Documents

{
  "extraction_id": "your-extraction-id",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "document_type": {"type": "string"},
        "document_number": {"type": "string"},
        "date": {"type": "string", "format": "date"},
        "due_date": {"type": "string", "format": "date"},
        "vendor": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "address": {"type": "string"},
            "tax_id": {"type": "string"}
          }
        },
        "customer": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "account_number": {"type": "string"}
          }
        },
        "line_items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "description": {"type": "string"},
              "quantity": {"type": "number"},
              "unit_price": {"type": "number"},
              "amount": {"type": "number"}
            },
            "required": ["description", "amount"]
          }
        },
        "subtotal": {"type": "number"},
        "tax": {"type": "number"},
        "total": {"type": "number"}
      },
      "required": ["document_number", "total"]
    },
    "schema_prompt": "Extract all invoice details. Include all line items. Format currency as numbers without symbols."
  }
}
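
Once values come back for a schema like this, simple arithmetic checks can flag likely extraction mistakes. The sketch below is one possible way to post-process the values, not a Pulse feature; check_invoice_totals is a hypothetical helper and the sample values are made up for illustration.

```python
import math
from typing import Any


def check_invoice_totals(values: dict[str, Any], tolerance: float = 0.01) -> list[str]:
    """Return a list of human-readable inconsistencies (empty if none found)."""
    problems = []
    items = values.get("line_items") or []
    subtotal = values.get("subtotal")
    tax = values.get("tax")
    total = values.get("total")

    # Only compare figures that were actually extracted.
    if items and subtotal is not None:
        line_sum = sum(item.get("amount") or 0 for item in items)
        if not math.isclose(line_sum, subtotal, abs_tol=tolerance):
            problems.append(f"line items sum to {line_sum:.2f}, subtotal is {subtotal:.2f}")
    if None not in (subtotal, tax, total):
        if not math.isclose(subtotal + tax, total, abs_tol=tolerance):
            problems.append(f"subtotal + tax is {subtotal + tax:.2f}, total is {total:.2f}")
    return problems


# Illustrative values only; not taken from a real document.
sample = {
    "line_items": [
        {"description": "Widget", "amount": 40.0},
        {"description": "Gadget", "amount": 60.0},
    ],
    "subtotal": 100.0,
    "tax": 8.0,
    "total": 110.0,
}
print(check_invoice_totals(sample))
# ['subtotal + tax is 108.00, total is 110.00']
```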

Legal Documents

{
  "extraction_id": "your-extraction-id",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "document_title": {"type": "string"},
        "case_number": {"type": "string"},
        "parties": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "role": {"type": "string"},
              "representation": {"type": "string"}
            },
            "required": ["name", "role"]
          }
        },
        "dates": {
          "type": "object",
          "properties": {
            "filed": {"type": "string", "format": "date"},
            "effective": {"type": "string", "format": "date"},
            "expiration": {"type": "string", "format": "date"}
          }
        },
        "signatures": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "title": {"type": "string"},
              "date": {"type": "string", "format": "date"}
            }
          }
        }
      },
      "required": ["document_title"]
    },
    "schema_prompt": "Extract legal document details. Include all parties and their roles."
  }
}

Medical Records

{
  "extraction_id": "your-extraction-id",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "patient": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "dob": {"type": "string", "format": "date", "description": "Date of birth"},
            "mrn": {"type": "string", "description": "Medical record number"}
          },
          "required": ["name", "mrn"]
        },
        "encounter": {
          "type": "object",
          "properties": {
            "date": {"type": "string", "format": "date"},
            "provider": {"type": "string"},
            "location": {"type": "string"}
          }
        },
        "chief_complaint": {"type": "string"},
        "medications": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "dosage": {"type": "string"},
              "frequency": {"type": "string"}
            }
          }
        },
        "diagnoses": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "code": {"type": "string", "description": "ICD-10 code"},
              "description": {"type": "string"}
            }
          }
        },
        "plan": {"type": "string"}
      },
      "required": ["patient", "encounter"]
    },
    "schema_prompt": "Extract patient encounter details. Include all medications and diagnoses with their codes."
  }
}

Advanced Techniques

Conditional Extraction

Use schema_prompt to guide conditional extraction:
{
  "extraction_id": "your-extraction-id",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "contract_type": {"type": "string", "description": "Type of contract: lease, purchase, service, etc."},
        "terms": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "type": {"type": "string"},
              "value": {"type": "string"}
            }
          }
        }
      },
      "required": ["contract_type"]
    },
    "schema_prompt": "First identify the contract_type. If it's a 'lease', extract rental amount and duration as terms. If it's a 'purchase', extract price and closing date as terms."
  }
}
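
When the list of contract types grows, the conditional instructions can be generated from data instead of written by hand. A minimal sketch that reproduces the schema_prompt above from a mapping; conditional_prompt and TERMS_BY_CONTRACT_TYPE are hypothetical names used only for illustration.

```python
# Contract type -> terms to extract for it, taken from the prompt above.
TERMS_BY_CONTRACT_TYPE = {
    "lease": ["rental amount", "duration"],
    "purchase": ["price", "closing date"],
}


def conditional_prompt(terms_by_type: dict[str, list[str]]) -> str:
    """Build a schema_prompt with one conditional instruction per contract type."""
    parts = ["First identify the contract_type."]
    for contract_type, terms in terms_by_type.items():
        joined = " and ".join(terms)
        parts.append(f"If it's a '{contract_type}', extract {joined} as terms.")
    return " ".join(parts)


prompt = conditional_prompt(TERMS_BY_CONTRACT_TYPE)
print(prompt)
```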

Hierarchical Data

For documents with deeply nested structures:
{
  "extraction_id": "your-extraction-id",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "organization": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "departments": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "head": {"type": "string"},
                  "employee_count": {"type": "number"}
                }
              }
            }
          }
        }
      }
    },
    "schema_prompt": "Extract the organizational hierarchy including all departments."
  }
}
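
Because deeply nested schemas can be rejected as too complex, it can help to measure nesting before sending a schema. The sketch below counts levels of object nesting; this page does not state a depth limit, so any threshold you apply is your own assumption, and nesting_depth is a hypothetical helper.

```python
from typing import Any


def nesting_depth(schema: dict[str, Any]) -> int:
    """Count levels of object nesting; arrays pass through to their items."""
    if schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        return nesting_depth(schema["items"])
    if schema.get("type") != "object":
        return 0
    children = schema.get("properties", {}).values()
    return 1 + max((nesting_depth(child) for child in children), default=0)


# The input_schema from the example above.
org_schema = {
    "type": "object",
    "properties": {
        "organization": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "departments": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "head": {"type": "string"},
                            "employee_count": {"type": "number"},
                        },
                    },
                },
            },
        }
    },
}

print(nesting_depth(org_schema))  # 3
```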

Performance Tips

Keep Schemas Focused

Extract only what you need. Avoid extracting entire documents as single fields.

Use Descriptions

Add description fields to guide the model on ambiguous fields or specific formats.

Leverage schema_prompt

Use schema_prompt to provide context that can’t be expressed in the schema structure alone.

Migration from Legacy Parameters

Both the schema / schema_prompt top-level parameters and the structured_output parameter on /extract are deprecated. Use the /schema endpoint after extraction instead.

Before (Deprecated — top-level schema on /extract)

{
  "file": "@document.pdf",
  "schema": {"invoice_number": "string", "total": "number"},
  "schema_prompt": "Extract invoice details"
}

Before (Deprecated — structured_output on /extract)

{
  "file": "@document.pdf",
  "structured_output": {
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": "string" },
        "total": { "type": "number" }
      },
      "required": ["invoice_number", "total"]
    },
    "schema_prompt": "Extract invoice details"
  }
}

After (Recommended: /schema endpoint)

# Step 1: Extract
POST /extract  {"file_url": "https://..."}
# → {"extraction_id": "abc123-...", "markdown": "...", ...}

# Step 2: Apply schema
POST /schema
{
  "extraction_id": "abc123-...",
  "schema_config": {
    "input_schema": {
      "type": "object",
      "properties": {
        "invoice_number": {
          "type": "string",
          "description": "The unique invoice identifier"
        },
        "total": {
          "type": "number",
          "description": "Total invoice amount"
        }
      },
      "required": ["invoice_number", "total"]
    },
    "schema_prompt": "Extract invoice details"
  }
}
The API still accepts structured_output on /extract for backward compatibility, but all new integrations should use the /schema endpoint.
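
The mapping between the deprecated request bodies and the /schema body can be sketched as a small conversion function. Both helper names are hypothetical. The sketch assumes the legacy top-level schema is a flat field-to-type mapping, as in the first "Before" example; whether that shorthand implied required fields is not stated here, so none are added.

```python
from typing import Any


def shorthand_to_json_schema(shorthand: dict[str, str]) -> dict[str, Any]:
    """Convert {"field": "type"} shorthand into a JSON Schema object."""
    return {
        "type": "object",
        "properties": {name: {"type": type_name} for name, type_name in shorthand.items()},
    }


def migrate_extract_body(legacy: dict[str, Any], extraction_id: str) -> dict[str, Any]:
    """Build a /schema request body from a deprecated /extract request body."""
    if "structured_output" in legacy:
        old = legacy["structured_output"]
        input_schema = old["schema"]  # already JSON Schema
    else:
        old = legacy
        input_schema = shorthand_to_json_schema(legacy["schema"])
    schema_config: dict[str, Any] = {"input_schema": input_schema}
    if "schema_prompt" in old:
        schema_config["schema_prompt"] = old["schema_prompt"]
    return {"extraction_id": extraction_id, "schema_config": schema_config}


legacy_body = {
    "file": "@document.pdf",
    "schema": {"invoice_number": "string", "total": "number"},
    "schema_prompt": "Extract invoice details",
}
migrated = migrate_extract_body(legacy_body, "abc123-...")
print(migrated["schema_config"]["input_schema"]["properties"])
# {'invoice_number': {'type': 'string'}, 'total': {'type': 'number'}}
```

The extraction_id still has to come from a separate /extract call; the conversion only reshapes the schema portion of the request.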

Error Handling

Common Schema Errors

| Error | Cause | Solution |
| --- | --- | --- |
| Invalid JSON | Syntax error in schema | Validate JSON syntax |
| Unknown type | Using unsupported data type | Use string, number, boolean, or nested objects/arrays |
| Too complex | Deeply nested structure | Simplify schema, flatten where possible |
| No matches | Fields don’t match document | Adjust field names, use schema_prompt for guidance |
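
The first two errors in the table can be caught locally before a request is sent. A minimal pre-flight sketch, assuming the supported types are exactly the five listed on this page; find_schema_problems is a hypothetical helper and its messages are not the API's own error strings.

```python
import json
from typing import Any

SUPPORTED_TYPES = {"string", "number", "boolean", "object", "array"}


def find_schema_problems(schema_text: str) -> list[str]:
    """Return syntax and unknown-type problems found in a schema string."""
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as exc:
        return [f"Invalid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]
    problems: list[str] = []
    _check_types(schema, "$", problems)
    return problems


def _check_types(node: Any, path: str, problems: list[str]) -> None:
    """Recursively record nodes whose type is missing or unsupported."""
    if not isinstance(node, dict):
        problems.append(f"{path}: expected a schema object")
        return
    type_name = node.get("type")
    if not isinstance(type_name, str) or type_name not in SUPPORTED_TYPES:
        problems.append(f"{path}: unknown type {type_name!r}")
    properties = node.get("properties", {})
    if isinstance(properties, dict):
        for name, child in properties.items():
            _check_types(child, f"{path}.{name}", problems)
    if "items" in node:
        _check_types(node["items"], f"{path}[]", problems)


print(find_schema_problems('{"type": "object", "properties": {"total": {"type": "decimal"}}}'))
# ["$.total: unknown type 'decimal'"]
```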

Debugging Tips

  1. Start with a minimal schema and add fields incrementally
  2. Use schema_prompt to provide context and clarify ambiguous fields
  3. Check the extracted markdown without a schema first to see what content is available
  4. Verify field names match document terminology
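
Tips 3 and 4 can be combined into a rough check that looks for each field name in the extracted markdown. This is only a heuristic sketch: a field can still be extracted when its label differs from the field name, and fields_not_in_markdown is a hypothetical helper. The sample markdown is made up for illustration.

```python
def fields_not_in_markdown(field_names: list[str], markdown: str) -> list[str]:
    """Return field names whose words do not appear verbatim in the markdown."""
    text = markdown.lower()
    missing = []
    for name in field_names:
        # Compare "closing_balance" against "closing balance", case-insensitively.
        phrase = name.replace("_", " ").lower()
        if phrase not in text:
            missing.append(name)
    return missing


# Illustrative markdown only; in practice use the markdown returned by /extract.
markdown = "# Statement\nAccount Holder: JAMES C. MORRISON\nClosing Balance: 586.71\n"
print(fields_not_in_markdown(["account_holder", "closing_balance", "iban"], markdown))
# ['iban']
```

Fields that show up in the result are candidates for renaming or for extra guidance through description or schema_prompt.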

Best Practices Summary

Do:
  • Use the /schema endpoint for all new integrations
  • Provide descriptive schema_prompt instructions
  • Use descriptive field names matching document terminology
  • Start simple and iterate
  • Test with real documents
  • Use appropriate data types (number for numeric values)

Don’t:
  • Use structured_output on /extract (deprecated; use /schema instead)
  • Use the deprecated top-level schema parameter
  • Create overly complex nested structures
  • Use generic field names
  • Extract entire documents as single fields
  • Assume all fields will always exist

Next Steps

  • Quickstart Guide: See more examples
  • Schema Endpoint: Apply schemas to extracted documents