Overview

Schemas allow you to extract structured data from documents in a consistent format. This guide covers the supported data types, schema design principles, common patterns, validation, and troubleshooting.

Schema Basics

A schema is a JSON object that defines the structure of data you want to extract:
{
  "field_name": "data_type",
  "nested_object": {
    "sub_field": "string"
  },
  "array_field": ["string"]
}
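
As a minimal sketch, the same kind of schema can be held as a Python dictionary and walked recursively to list every field path with its type. The helper name `iter_fields` and the dotted path notation are illustrative assumptions, not part of any library.

```python
def iter_fields(schema, prefix=""):
    """Yield (path, type_string) pairs for every leaf field in a schema."""
    for name, spec in schema.items():
        path = f"{prefix}{name}"
        if isinstance(spec, dict):
            # Nested object: descend with a dotted prefix
            yield from iter_fields(spec, prefix=f"{path}.")
        elif isinstance(spec, list):
            # Array: the single element describes every item
            item = spec[0]
            if isinstance(item, dict):
                yield from iter_fields(item, prefix=f"{path}[].")
            else:
                yield f"{path}[]", item
        else:
            yield path, spec


schema = {
    "field_name": "string",
    "nested_object": {"sub_field": "string"},
    "array_field": ["string"],
}

fields = list(iter_fields(schema))
for path, type_name in fields:
    print(f"{path}: {type_name}")
```

Listing the paths this way is a quick check that the schema has the shape you intended before you send it anywhere.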

Data Types

Primitive Types

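| Type    | Description       | Example      |
|---------|-------------------|--------------|
| string  | Text values       | "John Doe"   |
| integer | Whole numbers     | 42           |
| float   | Decimal numbers   | 99.99        |
| boolean | True/false values | true         |
| date    | Date values       | "2024-01-15" |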

Complex Types

| Type   | Description       | Example                             |
|--------|-------------------|-------------------------------------|
| object | Nested structures | {"name": "string", "age": "integer"} |
| array  | Lists of items    | ["string"] or [{"item": "string"}]  |
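
To make the type names concrete, here is a rough sketch of how an extracted value could be checked against a schema entry in Python. The mapping in `CHECKS`, the `conforms` helper, and the assumption that dates arrive as ISO 8601 strings (as in the example "2024-01-15") are illustrative and not confirmed behavior of the extraction service.

```python
from datetime import date


def _is_date(value):
    """Accept ISO 8601 calendar dates such as "2024-01-15" (assumed format)."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Assumed mapping from schema type names to Python-side checks
CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "date": _is_date,
}


def conforms(value, spec):
    """Return True if value matches a primitive, object, or array spec."""
    if isinstance(spec, dict):
        return isinstance(value, dict) and all(
            key in value and conforms(value[key], sub) for key, sub in spec.items()
        )
    if isinstance(spec, list):
        return isinstance(value, list) and all(conforms(v, spec[0]) for v in value)
    return CHECKS[spec](value)


print(conforms({"name": "John Doe", "age": 42}, {"name": "string", "age": "integer"}))
print(conforms(["a", "b"], ["string"]))
print(conforms("42", "integer"))
```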

Schema Design Principles

1. Start Simple

Begin with basic fields and gradually add complexity:
// Start with this
{
  "invoice_number": "string",
  "total": "float"
}

// Then expand to this
{
  "invoice_number": "string",
  "date": "date",
  "vendor": {
    "name": "string",
    "address": "string"
  },
  "line_items": [{
    "description": "string",
    "amount": "float"
  }],
  "total": "float"
}

2. Use Descriptive Names

Field names should match document terminology:
// Good - matches invoice terminology
{
  "invoice_number": "string",
  "bill_to": "string",
  "remit_to": "string"
}

// Avoid - generic names
{
  "number": "string",
  "address1": "string",
  "address2": "string"
}

3. Handle Optional Fields

Mark a field as nullable by appending |null to its type when the value might be absent from the document:
{
  "required_field": "string",
  "optional_field": "string|null",
  "optional_number": "float|null"
}
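
The nullable convention can be sketched in Python as follows. The helpers `nullable`, `is_nullable`, and `missing_required` are hypothetical names, and the sample `result` is made-up data that assumes an extraction result mirrors the schema and uses None for null.

```python
def nullable(type_name):
    """Return the type with a |null option appended, if not already present."""
    return type_name if is_nullable(type_name) else f"{type_name}|null"


def is_nullable(type_name):
    """True if the type string lists null as one of its options."""
    return "null" in type_name.split("|")


def missing_required(result, schema):
    """List top-level required fields whose extracted value is missing or None."""
    return [
        name
        for name, spec in schema.items()
        if isinstance(spec, str) and not is_nullable(spec) and result.get(name) is None
    ]


schema = {
    "required_field": "string",
    "optional_field": nullable("string"),
    "optional_number": nullable("float"),
}

# Hypothetical extraction result, for illustration only
result = {"required_field": None, "optional_field": None, "optional_number": 12.5}

print(schema["optional_field"])
print(missing_required(result, schema))
```

Only `required_field` is reported, because a null value is acceptable for the two fields declared with |null.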

Common Schema Patterns

Financial Documents

{
  "document_type": "string",
  "document_number": "string",
  "date": "date",
  "due_date": "date|null",
  "vendor": {
    "name": "string",
    "address": "string",
    "tax_id": "string|null"
  },
  "customer": {
    "name": "string",
    "account_number": "string|null"
  },
  "line_items": [{
    "description": "string",
    "quantity": "float",
    "unit_price": "float",
    "amount": "float"
  }],
  "subtotal": "float",
  "tax": "float",
  "total": "float",
  "payment_terms": "string|null"
}

Legal Documents

{
  "document_title": "string",
  "case_number": "string|null",
  "parties": [{
    "name": "string",
    "role": "string",
    "representation": "string|null"
  }],
  "dates": {
    "filed": "date",
    "effective": "date",
    "expiration": "date|null"
  },
  "sections": [{
    "number": "string",
    "title": "string",
    "content": "string"
  }],
  "signatures": [{
    "name": "string",
    "title": "string|null",
    "date": "date"
  }]
}

Medical Records

{
  "patient": {
    "name": "string",
    "dob": "date",
    "mrn": "string"
  },
  "encounter": {
    "date": "date",
    "provider": "string",
    "location": "string"
  },
  "chief_complaint": "string",
  "history_of_present_illness": "string",
  "medications": [{
    "name": "string",
    "dosage": "string",
    "frequency": "string",
    "route": "string"
  }],
  "diagnoses": [{
    "code": "string",
    "description": "string",
    "type": "string"
  }],
  "plan": "string"
}

Advanced Techniques

Dynamic Fields

For documents with variable structures:
{
  "metadata": {
    "document_type": "string",
    "date": "date"
  },
  "dynamic_fields": {
    "field_1": "string|null",
    "field_2": "float|null",
    "field_3": "boolean|null"
  }
}
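
Because every dynamic field is nullable, a result will often contain many nulls. The sketch below filters them out after extraction. It assumes the result is a dictionary that mirrors the schema and represents null as None; `present_fields` and the sample data are illustrative.

```python
def present_fields(result, key="dynamic_fields"):
    """Return only the dynamic fields that were actually found in the document."""
    return {
        name: value
        for name, value in result.get(key, {}).items()
        if value is not None  # keep False and 0, which are real values
    }


# Hypothetical extraction result, for illustration only
result = {
    "metadata": {"document_type": "form", "date": "2024-01-15"},
    "dynamic_fields": {"field_1": "abc", "field_2": None, "field_3": False},
}

found = present_fields(result)
print(found)
```

The comparison uses `is not None` rather than truthiness so that a legitimate `false` or `0` is not discarded along with the nulls.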

Conditional Extraction

Use schema_prompt to guide conditional extraction:
schema = {
  "contract_type": "string",
  "terms": [{
    "type": "string",
    "value": "string"
  }]
}

schema_prompt = """
If contract_type is 'lease', extract rental amount and duration as terms.
If contract_type is 'purchase', extract price and closing date as terms.
"""

Hierarchical Data

For documents with nested structures:
{
  "organization": {
    "name": "string",
    "departments": [{
      "name": "string",
      "head": "string",
      "divisions": [{
        "name": "string",
        "manager": "string",
        "employees": ["string"]
      }]
    }]
  }
}
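
Once extracted, hierarchical data is usually easier to work with when flattened into rows. The following sketch walks a result shaped like the schema above. The sample `result` is invented for illustration, and the assumption is that the result mirrors the schema exactly.

```python
def iter_employees(result):
    """Yield (department, division, employee) for every employee in the result."""
    for dept in result["organization"]["departments"]:
        for division in dept["divisions"]:
            for employee in division["employees"]:
                yield dept["name"], division["name"], employee


# Hypothetical extraction result, for illustration only
result = {
    "organization": {
        "name": "Example Corp",
        "departments": [
            {
                "name": "Engineering",
                "head": "A. Head",
                "divisions": [
                    {"name": "Platform", "manager": "B. Manager", "employees": ["C", "D"]},
                    {"name": "Tools", "manager": "E. Manager", "employees": ["F"]},
                ],
            }
        ],
    }
}

rows = list(iter_employees(result))
for row in rows:
    print(row)
```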

Performance Optimization

1. Minimize Schema Complexity

// Efficient - flat structure
{
  "invoice_number": "string",
  "vendor_name": "string",
  "total": "float"
}

// Less efficient - deeply nested
{
  "invoice": {
    "header": {
      "number": "string",
      "vendor": {
        "details": {
          "name": "string"
        }
      }
    }
  }
}
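
One way to quantify the difference between these two schemas is to measure nesting depth. The function below is a small sketch (`schema_depth` is an illustrative name, and the guide does not state a depth limit) that counts how many object levels a schema contains.

```python
def schema_depth(spec):
    """Return the number of nested object levels; leaf type strings count as 0."""
    if isinstance(spec, dict):
        return 1 + max((schema_depth(v) for v in spec.values()), default=0)
    if isinstance(spec, list):
        # An array adds no object level by itself; measure its item spec
        return max((schema_depth(v) for v in spec), default=0)
    return 0


flat = {"invoice_number": "string", "vendor_name": "string", "total": "float"}

nested = {
    "invoice": {
        "header": {
            "number": "string",
            "vendor": {"details": {"name": "string"}},
        }
    }
}

print(schema_depth(flat))    # 1
print(schema_depth(nested))  # 5
```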

2. Extract Only What You Need

// Good - specific fields
{
  "patient_name": "string",
  "diagnosis_code": "string",
  "treatment_plan": "string"
}

// Avoid - too broad
{
  "entire_medical_record": "string"
}

3. Use Appropriate Data Types

// Good - specific types
{
  "quantity": "integer",
  "price": "float",
  "is_taxable": "boolean"
}

// Avoid - everything as string
{
  "quantity": "string",
  "price": "string",
  "is_taxable": "string"
}

Validation and Testing

Schema Validation

Before using a schema, validate its structure:
import json

def validate_schema(schema):
    """Basic schema validation."""
    valid_types = ["string", "integer", "float", "boolean", "date"]
    
    def check_value(value):
        if isinstance(value, str):
            # Check for valid type or null option
            types = value.split("|")
            return all(t in valid_types + ["null"] for t in types)
        elif isinstance(value, dict):
            # Recursive check for objects
            return all(check_value(v) for v in value.values())
        elif isinstance(value, list) and len(value) == 1:
            # Check array type
            return check_value(value[0])
        return False
    
    # A schema must be a JSON object (dict); anything else is invalid
    if not isinstance(schema, dict):
        return False
    return all(check_value(v) for v in schema.values())

# Test your schema
schema = {
    "invoice_number": "string",
    "amount": "float"
}
print(validate_schema(schema))  # True
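
A true/false answer does not say which field is wrong. As a possible extension, the sketch below applies the same rules (known type names, optional null, single-item arrays) but returns a list of error messages with the path of each offending field. The name `find_schema_errors` and the `$` path notation are illustrative choices.

```python
VALID_TYPES = {"string", "integer", "float", "boolean", "date", "null"}


def find_schema_errors(spec, path="$"):
    """Return a list of human-readable problems found in a schema."""
    errors = []
    if isinstance(spec, str):
        unknown = [t for t in spec.split("|") if t not in VALID_TYPES]
        if unknown:
            errors.append(f"{path}: unknown type(s) {unknown}")
    elif isinstance(spec, dict):
        for name, sub in spec.items():
            errors.extend(find_schema_errors(sub, f"{path}.{name}"))
    elif isinstance(spec, list):
        if len(spec) != 1:
            errors.append(f"{path}: array must contain exactly one item spec")
        else:
            errors.extend(find_schema_errors(spec[0], f"{path}[]"))
    else:
        errors.append(f"{path}: unsupported value {spec!r}")
    return errors


bad_schema = {
    "invoice_number": "string",
    "amount": "decimal",                  # not a supported type
    "items": [{"qty": "integer|null"}],
    "tags": [],                           # array without an item spec
}

errors = find_schema_errors(bad_schema)
for message in errors:
    print(message)
```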

Testing Schemas

Test with sample documents:
import json

# Test with increasing complexity
test_schemas = [
    # Level 1: Basic fields
    {"invoice_number": "string", "total": "float"},
    
    # Level 2: Add nested object
    {"invoice_number": "string", "vendor": {"name": "string"}, "total": "float"},
    
    # Level 3: Add arrays
    {"invoice_number": "string", "items": [{"desc": "string", "amount": "float"}], "total": "float"}
]

# `client` is assumed to be an already-initialized API client
for i, schema in enumerate(test_schemas, start=1):
    result = client.extract(file_path="test_invoice.pdf", schema=schema)
    print(f"Schema level {i}: {json.dumps(result, indent=2)}")

Error Handling

Common Schema Errors

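| Error        | Cause                       | Solution                 |
|--------------|-----------------------------|--------------------------|
| Invalid JSON | Syntax error in schema      | Validate JSON syntax     |
| Unknown type | Using unsupported data type | Use only supported types |
| Too complex  | Deeply nested structure     | Simplify schema          |
| No matches   | Fields don’t match document | Adjust field names       |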
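
For the "Invalid JSON" case, Python's standard `json` module can locate the syntax error before the schema is used. This is a minimal sketch; `load_schema` is a hypothetical helper name.

```python
import json


def load_schema(text):
    """Parse schema text and return (schema, error_message)."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    if not isinstance(schema, dict):
        return None, "Schema must be a JSON object"
    return schema, None


good_schema, good_error = load_schema('{"invoice_number": "string", "total": "float"}')
bad_schema, bad_error = load_schema('{"invoice_number": "string",}')  # trailing comma

print(good_schema, good_error)
print(bad_schema, bad_error)
```

Note that the `//` annotations used in this guide's examples are not valid JSON and would be rejected by a strict parser, so remove them before loading a schema from text.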

Debugging Tips

  1. Start with a minimal schema and add fields incrementally
  2. Use schema prompts to provide context
  3. Check the extracted content without a schema first
  4. Verify field names match document terminology
