Overview

Schema extraction lets you define the exact structure of the data you want to extract from a document. Instead of raw text, you receive structured JSON that matches your schema, which makes the output easy to feed into automated workflows and databases.

How It Works

1. Define Schema: Create a JSON schema that describes your desired output structure.
2. Send Request: Include the schema with your extraction request.
3. AI Processing: Our AI analyzes the document and maps content to your schema.
4. Receive Structured Data: Get back clean, structured JSON matching your schema.

Basic Schema Format

Define your schema as a JSON object with field names and data types:
{
  "invoice_number": "string",
  "date": "date",
  "total": "float",
  "items": [{
    "description": "string",
    "quantity": "integer",
    "price": "float"
  }]
}

Supported Data Types

string

Text values of any length
{"name": "string"}

integer

Whole numbers without decimals
{"quantity": "integer"}

float

Numbers with decimal places
{"price": "float"}

date

Date values in various formats
{"due_date": "date"}

boolean

True/false values
{"is_paid": "boolean"}

array

Lists of items
{"tags": ["string"]}
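
Before sending a request, it can help to check locally that a schema only uses the type names listed above. The sketch below is not part of the API: `find_unsupported_types` is a hypothetical helper. It assumes arrays are written as a one-element list, as in the examples on this page, and it tolerates the `|null` form described under Optional Fields.

```python
SUPPORTED_TYPES = {"string", "integer", "float", "date", "boolean"}


def find_unsupported_types(schema, path=""):
    """Return (path, type_name) pairs for type names not in SUPPORTED_TYPES.

    Hypothetical local helper, not part of the extraction API.
    """
    problems = []
    if isinstance(schema, dict):
        # Nested object: check every field.
        for key, value in schema.items():
            child = f"{path}.{key}" if path else key
            problems.extend(find_unsupported_types(value, child))
    elif isinstance(schema, list):
        # Array: the examples use a single-element list describing each item.
        if len(schema) != 1:
            problems.append((path, "array must contain exactly one item type"))
        else:
            problems.extend(find_unsupported_types(schema[0], f"{path}[]"))
    elif isinstance(schema, str):
        # Accept "type" or "type|null"; anything else is reported.
        base, *rest = schema.split("|")
        if base not in SUPPORTED_TYPES or any(part != "null" for part in rest):
            problems.append((path, schema))
    else:
        problems.append((path, repr(schema)))
    return problems


schema = {
    "invoice_number": "string",
    "total": "decimal",  # not one of the supported type names
    "items": [{"quantity": "integer", "price": "float"}],
}
print(find_unsupported_types(schema))
```

A check like this only catches typos in type names early; the API's own schema validation remains the authority on what is accepted.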

Complex Schema Examples

Invoice Processing

Extract detailed invoice information:
schema = {
    "vendor": {
        "name": "string",
        "address": "string",
        "tax_id": "string"
    },
    "customer": {
        "name": "string",
        "address": "string",
        "email": "string"
    },
    "invoice_details": {
        "number": "string",
        "date": "date",
        "due_date": "date",
        "currency": "string"
    },
    "line_items": [{
        "description": "string",
        "quantity": "integer",
        "unit_price": "float",
        "tax_rate": "float",
        "amount": "float"
    }],
    "totals": {
        "subtotal": "float",
        "tax": "float",
        "total": "float"
    },
    "payment_terms": "string",
    "notes": "string"
}

result = client.extract(file_path="invoice.pdf", schema=schema)

Contract Analysis

Extract key contract terms:
schema = {
    "contract_type": "string",
    "parties": [{
        "name": "string",
        "role": "string",
        "representative": "string",
        "address": "string"
    }],
    "dates": {
        "effective_date": "date",
        "expiration_date": "date",
        "signed_date": "date"
    },
    "terms": {
        "payment_amount": "float",
        "payment_schedule": "string",
        "deliverables": ["string"],
        "termination_clause": "string"
    },
    "signatures": [{
        "name": "string",
        "title": "string",
        "date": "date"
    }]
}

result = client.extract(file_path="contract.pdf", schema=schema)

Medical Records

Extract patient information:
schema = {
    "patient": {
        "name": "string",
        "date_of_birth": "date",
        "medical_record_number": "string"
    },
    "visit": {
        "date": "date",
        "provider": "string",
        "chief_complaint": "string"
    },
    "vitals": {
        "blood_pressure": "string",
        "heart_rate": "integer",
        "temperature": "float",
        "weight": "float"
    },
    "diagnoses": [{
        "code": "string",
        "description": "string"
    }],
    "medications": [{
        "name": "string",
        "dosage": "string",
        "frequency": "string"
    }],
    "follow_up": "string"
}

result = client.extract(file_path="medical_record.pdf", schema=schema)

Advanced Features

Optional Fields

Make a field optional by adding null as an alternative type, written as "type|null":
schema = {
    "required_field": "string",
    "optional_field": "string|null",
    "optional_number": "float|null"
}
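
On the application side, nullable fields usually need a fallback. The following is a minimal sketch, assuming the result behaves like a Python dict and that fields not found in the document come back as null (`None` in Python). `apply_defaults` and the sample result are hypothetical.

```python
def apply_defaults(result, defaults):
    """Return a copy of `result` where None or missing keys take values from `defaults`."""
    filled = dict(result)
    for key, default in defaults.items():
        if filled.get(key) is None:
            filled[key] = default
    return filled


# Hypothetical extraction result for the schema above.
result = {"required_field": "ACME", "optional_field": None, "optional_number": 12.5}

clean = apply_defaults(result, {"optional_field": "", "optional_number": 0.0})
print(clean)
```

Only the null field is replaced; values that were extracted are left untouched.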

Schema Prompts

Provide additional context for better extraction:
result = client.extract(
    file_path="document.pdf",
    schema=schema,
    schema_prompt="This is a German invoice. Extract amounts in EUR. Pay special attention to VAT calculations"
)

Nested Objects

Create deeply nested structures:
schema = {
    "company": {
        "name": "string",
        "departments": [{
            "name": "string",
            "manager": {
                "name": "string",
                "email": "string"
            },
            "employees": [{
                "name": "string",
                "role": "string",
                "salary": "float"
            }]
        }]
    }
}
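
Reading a nested result back is easier with a small traversal helper. This sketch assumes the result is a dict shaped like the schema above; `iter_employees` and the sample data are hypothetical, and the helper treats missing or null levels as empty rather than raising.

```python
def iter_employees(result):
    """Yield (department_name, employee) pairs from a result shaped like the schema above."""
    company = result.get("company") or {}
    for department in company.get("departments") or []:
        for employee in department.get("employees") or []:
            yield department.get("name"), employee


# Hypothetical result, for illustration only.
result = {
    "company": {
        "name": "Example GmbH",
        "departments": [
            {
                "name": "Engineering",
                "manager": {"name": "A. Jones", "email": "a.jones@example.com"},
                "employees": [
                    {"name": "B. Smith", "role": "Developer", "salary": 70000.0},
                    {"name": "C. Lee", "role": "QA", "salary": None},
                ],
            }
        ],
    }
}

for department, employee in iter_employees(result):
    print(department, employee["name"], employee["salary"])
```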

Dynamic Arrays

Extract variable-length lists:
schema = {
    "products": [{
        "sku": "string",
        "name": "string",
        "category": "string",
        "price": "float",
        "features": ["string"]
    }]
}

Best Practices

  • Start with basic schemas and add complexity gradually
  • Use clear, descriptive field names
  • Avoid deeply nested structures when possible
  • Test with sample documents first
  • Make optional fields nullable
  • Provide default values in your application
  • Use validation after extraction
  • Consider partial extraction strategies
  • Extract only needed fields
  • Use specific data types (not just “string”)
  • Process pages selectively with the pages parameter
  • Consider chunking for very large schemas
  • Use descriptive field names that match document terminology
  • Provide context with schema prompts
  • Validate critical data points
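
One way to apply "use validation after extraction" is to compare the returned values against the schema's type names. The sketch below is a hypothetical helper, not part of the API. It assumes results are plain dicts and lists, that dates come back as ISO 8601 strings such as "2024-01-15", and that a null value is only acceptable for fields declared with `|null`.

```python
from datetime import date


def _matches(value, base):
    """Check one scalar value against one schema type name."""
    if base == "string":
        return isinstance(value, str)
    if base == "boolean":
        return isinstance(value, bool)
    if base == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if base == "float":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if base == "date":
        # Assumption: dates are ISO 8601 strings.
        try:
            date.fromisoformat(value)
            return True
        except (TypeError, ValueError):
            return False
    return False


def check_types(value, schema, path="result"):
    """Return a list of human-readable mismatches between `value` and `schema`."""
    errors = []
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for key, sub in schema.items():
            errors += check_types(value.get(key), sub, f"{path}.{key}")
    elif isinstance(schema, list):
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        for index, item in enumerate(value):
            errors += check_types(item, schema[0], f"{path}[{index}]")
    else:
        base, *rest = schema.split("|")
        if value is None:
            if "null" not in rest:
                errors.append(f"{path}: missing required {base}")
        elif not _matches(value, base):
            errors.append(f"{path}: expected {base}, got {type(value).__name__}")
    return errors


schema = {"invoice_number": "string", "date": "date", "total": "float", "notes": "string|null"}
# Hypothetical result with two deliberate problems.
result = {"invoice_number": "INV-12345", "date": "15/01/2024", "total": "1234.56", "notes": None}

for problem in check_types(result, schema):
    print(problem)
```

Returning a list of problems instead of raising on the first one makes it easier to log every issue in a document before deciding whether to reject it.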

Common Patterns

Table Extraction

Extract tabular data efficiently:
schema = {
    "table_data": [{
        "row_number": "integer",
        "columns": {
            "product": "string",
            "quantity": "integer",
            "price": "float",
            "total": "float"
        }
    }]
}
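
Extracted rows often end up in a spreadsheet or CSV file. The following sketch flattens a result shaped like the schema above into CSV text using Python's standard `csv` module. `table_to_csv` and the sample row are hypothetical; null cells are written as empty strings.

```python
import csv
import io


def table_to_csv(result):
    """Flatten `table_data` rows (shaped like the schema above) into CSV text."""
    fieldnames = ["row_number", "product", "quantity", "price", "total"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in result.get("table_data") or []:
        columns = row.get("columns") or {}
        record = {"row_number": row.get("row_number")}
        # Copy the nested column values up to the top level of the record.
        record.update({name: columns.get(name) for name in fieldnames[1:]})
        writer.writerow(record)
    return buffer.getvalue()


# Hypothetical result, for illustration only.
result = {
    "table_data": [
        {"row_number": 1, "columns": {"product": "Widget", "quantity": 2, "price": 9.5, "total": 19.0}},
    ]
}
print(table_to_csv(result))
```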

Key-Value Extraction

Extract form-like data:
schema = {
    "form_fields": {
        "field_name_1": "string",
        "field_name_2": "date",
        "field_name_3": "boolean"
    }
}

Multi-Document Extraction

Process related documents:
# Extract matching fields from multiple documents
schema = {
    "document_type": "string",
    "reference_number": "string",
    "date": "date",
    "common_fields": {
        "vendor": "string",
        "amount": "float"
    }
}

# Process invoice and purchase order with same schema
invoice_data = client.extract(file_path="invoice.pdf", schema=schema)
po_data = client.extract(file_path="purchase_order.pdf", schema=schema)
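
Using one schema for both documents makes the results directly comparable. The sketch below shows one possible reconciliation step, assuming both results are dicts shaped like the schema above. `diff_common_fields` and the sample values are hypothetical, and amounts are compared with exact equality, which may be too strict for real data.

```python
def diff_common_fields(first, second):
    """Return {field: (first_value, second_value)} for fields that differ."""
    differences = {}
    if first.get("reference_number") != second.get("reference_number"):
        differences["reference_number"] = (
            first.get("reference_number"),
            second.get("reference_number"),
        )
    first_common = first.get("common_fields") or {}
    second_common = second.get("common_fields") or {}
    # Compare every key that appears in either document.
    for key in sorted(set(first_common) | set(second_common)):
        if first_common.get(key) != second_common.get(key):
            differences[f"common_fields.{key}"] = (
                first_common.get(key),
                second_common.get(key),
            )
    return differences


# Hypothetical results for an invoice and its purchase order.
invoice = {
    "document_type": "invoice",
    "reference_number": "PO-1001",
    "date": "2024-01-15",
    "common_fields": {"vendor": "ACME Corp", "amount": 1234.56},
}
purchase_order = {
    "document_type": "purchase_order",
    "reference_number": "PO-1001",
    "date": "2024-01-02",
    "common_fields": {"vendor": "ACME Corp", "amount": 1200.00},
}
print(diff_common_fields(invoice, purchase_order))
```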

Error Handling

Schema Validation

The API validates your schema before processing:
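# `invalid_schema` stands for any schema the API rejects
try:
    result = client.extract(file_path="doc.pdf", schema=invalid_schema)
except Exception as e:
    if "Invalid schema" in str(e):
        print(f"Schema validation failed: {e}")
    else:
        raise  # Re-raise errors unrelated to schema validation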

Partial Extraction

When a field can’t be found in the document, it is returned as null:
{
  "invoice_number": "INV-12345",
  "date": "2024-01-15",
  "total": 1234.56,
  "optional_field": null
}
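
To decide how to handle a partial result, it helps to know exactly which fields came back empty. The sketch below walks a result and lists the paths of all null values. `null_paths` is a hypothetical helper and assumes the result is made of plain dicts and lists.

```python
def null_paths(value, path=""):
    """Return the dotted paths of every field whose value is None."""
    if value is None:
        return [path]
    found = []
    if isinstance(value, dict):
        for key, item in value.items():
            found += null_paths(item, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, item in enumerate(value):
            found += null_paths(item, f"{path}[{index}]")
    return found


result = {
    "invoice_number": "INV-12345",
    "date": "2024-01-15",
    "total": 1234.56,
    "optional_field": None,
}
print(null_paths(result))
```

The returned list can drive a retry, a manual review queue, or a simple log entry, depending on how critical the missing fields are.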

Integration Examples

Database Storage

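# Extract and store in a database
import sqlite3

schema = {
    "invoice_number": "string",
    "vendor_name": "string",
    "total": "float",
    "date": "date"
}

result = client.extract(file_path="invoice.pdf", schema=schema)

# Insert into the database. sqlite3 is used as an example because it matches
# the "?" placeholders; "invoices.db" is a placeholder path and the invoices
# table is assumed to exist already.
conn = sqlite3.connect("invoices.db")
conn.execute(
    "INSERT INTO invoices (number, vendor, total, date) VALUES (?, ?, ?, ?)",
    (result["invoice_number"], result["vendor_name"], result["total"], result["date"])
)
conn.commit()
conn.close()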

API Integration

# Extract and send to another API
schema = {
    "customer_id": "string",
    "order_items": [{
        "product_id": "string",
        "quantity": "integer"
    }]
}

result = client.extract(file_path="order.pdf", schema=schema)
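
The example above stops after extraction. One possible way to forward the result is sketched below with Python's standard `urllib`. The endpoint URL, the payload layout, and the `build_order_request` helper are all assumptions made for illustration; the receiving API will define its own format and authentication.

```python
import json
import urllib.request


def build_order_request(result, url):
    """Build (but do not send) a JSON POST request carrying the extracted order."""
    payload = {
        "customer_id": result["customer_id"],
        "items": [
            {"product_id": item["product_id"], "quantity": item["quantity"]}
            for item in result.get("order_items") or []
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical extraction result and endpoint, for illustration only.
result = {"customer_id": "C-001", "order_items": [{"product_id": "P-42", "quantity": 3}]}
request = build_order_request(result, "https://orders.example.com/api/orders")

# Sending is left to the caller, for example:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.status)
```

Building the request separately from sending it keeps the mapping from extracted fields to the outgoing payload easy to test without network access.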


## Next Steps

<CardGroup cols={2}>
  <Card title="API Reference" icon="book" href="/api-reference/endpoint/extract">
    See extraction endpoint details
  </Card>
  <Card title="Large Documents" icon="file-lines" href="/core-concepts/large-documents">
    Handle complex extractions
  </Card>
</CardGroup>