Overview

Schema extraction lets you define the exact structure of the data you want to extract from a document. Instead of raw text, you receive structured JSON that matches your schema, which suits automated workflows and database integration.

How It Works

1. Define Schema: Create a JSON schema that describes your desired output structure.
2. Send Request: Include the schema with your extraction request.
3. AI Processing: Our AI analyzes the document and maps content to your schema.
4. Receive Structured Data: Get back clean, structured JSON matching your schema.

Basic Schema Format

Define your schema as a JSON object with field names and data types:
{
  "invoice_number": "string",
  "date": "date",
  "total": "float",
  "items": [{
    "description": "string",
    "quantity": "integer",
    "price": "float"
  }]
}

Supported Data Types

string

Text values of any length
{"name": "string"}

integer

Whole numbers without decimals
{"quantity": "integer"}

float

Numbers with decimal places
{"price": "float"}

date

Date values in various formats
{"due_date": "date"}

boolean

True/false values
{"is_paid": "boolean"}

array

Lists of items
{"tags": ["string"]}
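
Because every leaf in a schema is one of the type names listed above, a typo such as "int" can be caught locally before a request is sent. The following is a minimal sketch of such a check, not part of the API: the function name find_schema_errors is hypothetical, and the rule that an array holds exactly one item schema is an assumption based on the examples in this document. It also accepts the "type|null" form described under Optional Fields.

```python
# Type names taken from the "Supported Data Types" list.
SCALAR_TYPES = {"string", "integer", "float", "date", "boolean"}


def find_schema_errors(schema, path="$"):
    """Return a list of human-readable problems found in a schema."""
    if isinstance(schema, str):
        # Accept "type|null" as well as a bare type name.
        names = [part.strip() for part in schema.split("|")]
        bad = [n for n in names if n not in SCALAR_TYPES and n != "null"]
        return [f"{path}: unknown type {n!r}" for n in bad]
    if isinstance(schema, dict):
        errors = []
        for key, value in schema.items():
            errors.extend(find_schema_errors(value, f"{path}.{key}"))
        return errors
    if isinstance(schema, list):
        # Assumption: an array is written as a list with one item schema.
        if len(schema) != 1:
            return [f"{path}: array must contain exactly one item schema"]
        return find_schema_errors(schema[0], f"{path}[]")
    return [f"{path}: expected a type name, object, or array"]


# Hypothetical schema with two misspelled type names.
for problem in find_schema_errors({"total": "decimal", "items": [{"qty": "int"}]}):
    print(problem)
```

An empty list means every leaf uses a recognized type name. The API's own validation remains the authority on what is accepted.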

Complex Schema Examples

Invoice Processing

Extract detailed invoice information:
schema = {
    "vendor": {
        "name": "string",
        "address": "string",
        "tax_id": "string"
    },
    "customer": {
        "name": "string",
        "address": "string",
        "email": "string"
    },
    "invoice_details": {
        "number": "string",
        "date": "date",
        "due_date": "date",
        "currency": "string"
    },
    "line_items": [{
        "description": "string",
        "quantity": "integer",
        "unit_price": "float",
        "tax_rate": "float",
        "amount": "float"
    }],
    "totals": {
        "subtotal": "float",
        "tax": "float",
        "total": "float"
    },
    "payment_terms": "string",
    "notes": "string"
}

result = client.extract(file_path="invoice.pdf", schema=schema)
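
Extracted invoice figures can be sanity-checked before they enter downstream systems. The sketch below assumes that each line item's amount is a pre-tax line total and that subtotal + tax = total; neither convention is stated by the API, so adjust the checks to your documents. The function name check_invoice_totals and the sample data are hypothetical.

```python
import math


def check_invoice_totals(result, tolerance=0.01):
    """Return a list of inconsistencies between line items and totals."""
    problems = []
    items = result.get("line_items") or []
    totals = result.get("totals") or {}

    # Assumption: "amount" is the pre-tax total of each line.
    line_sum = sum(item["amount"] for item in items if item.get("amount") is not None)
    subtotal = totals.get("subtotal")
    if subtotal is not None and not math.isclose(line_sum, subtotal, abs_tol=tolerance):
        problems.append(f"line items sum to {line_sum:.2f}, subtotal is {subtotal:.2f}")

    tax, total = totals.get("tax"), totals.get("total")
    if None not in (subtotal, tax, total):
        if not math.isclose(subtotal + tax, total, abs_tol=tolerance):
            problems.append(f"subtotal + tax is {subtotal + tax:.2f}, total is {total:.2f}")
    return problems


# Hypothetical extraction result shaped like the invoice schema.
sample = {
    "line_items": [{"amount": 100.0}, {"amount": 50.5}],
    "totals": {"subtotal": 150.5, "tax": 28.6, "total": 179.1},
}
print(check_invoice_totals(sample))
```

The tolerance absorbs rounding differences; fields that came back as null are skipped rather than reported.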

Contract Analysis

Extract key contract terms:
schema = {
    "contract_type": "string",
    "parties": [{
        "name": "string",
        "role": "string",
        "representative": "string",
        "address": "string"
    }],
    "dates": {
        "effective_date": "date",
        "expiration_date": "date",
        "signed_date": "date"
    },
    "terms": {
        "payment_amount": "float",
        "payment_schedule": "string",
        "deliverables": ["string"],
        "termination_clause": "string"
    },
    "signatures": [{
        "name": "string",
        "title": "string",
        "date": "date"
    }]
}

result = client.extract(file_path="contract.pdf", schema=schema)

Medical Records

Extract patient information:
schema = {
    "patient": {
        "name": "string",
        "date_of_birth": "date",
        "medical_record_number": "string"
    },
    "visit": {
        "date": "date",
        "provider": "string",
        "chief_complaint": "string"
    },
    "vitals": {
        "blood_pressure": "string",
        "heart_rate": "integer",
        "temperature": "float",
        "weight": "float"
    },
    "diagnoses": [{
        "code": "string",
        "description": "string"
    }],
    "medications": [{
        "name": "string",
        "dosage": "string",
        "frequency": "string"
    }],
    "follow_up": "string"
}

result = client.extract(file_path="medical_record.pdf", schema=schema)

Advanced Features

Optional Fields

Make fields optional by using null as an alternative type:
schema = {
    "required_field": "string",
    "optional_field": "string|null",
    "optional_number": "float|null"
}
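
Since the "|null" suffix marks which fields may be absent, the same schema can be used after extraction to flag required fields that came back empty. This is a minimal client-side sketch under that reading; required_paths and missing_required are hypothetical helpers, and arrays are skipped because an empty list may be a legitimate result.

```python
def required_paths(schema, prefix=""):
    """Yield dotted paths of fields whose type does not allow null."""
    for key, spec in schema.items():
        path = f"{prefix}{key}"
        if isinstance(spec, dict):
            yield from required_paths(spec, f"{path}.")
        elif isinstance(spec, list):
            continue  # arrays are not checked in this sketch
        elif "null" not in spec.split("|"):
            yield path


def missing_required(schema, result):
    """List required paths that are absent or null in an extraction result."""
    missing = []
    for path in required_paths(schema):
        value = result
        for part in path.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        if value is None:
            missing.append(path)
    return missing


schema = {
    "required_field": "string",
    "optional_field": "string|null",
    "optional_number": "float|null",
}
# Hypothetical result in which the required field was not found.
print(missing_required(schema, {"required_field": None, "optional_number": 3.5}))
```

Field names that contain a dot would need a different path representation.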

Schema Prompts

Provide additional context for better extraction:
result = client.extract(
    file_path="document.pdf",
    schema=schema,
    schema_prompt="This is a German invoice. Extract amounts in EUR. Pay special attention to VAT calculations"
)

Nested Objects

Create deeply nested structures:
schema = {
    "company": {
        "name": "string",
        "departments": [{
            "name": "string",
            "manager": {
                "name": "string",
                "email": "string"
            },
            "employees": [{
                "name": "string",
                "role": "string",
                "salary": "float"
            }]
        }]
    }
}
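
Deeply nested results are often easier to log, diff, or load into a spreadsheet once flattened. The sketch below turns a nested result into dotted paths; the flatten helper and the sample data are hypothetical and only assume that the result mirrors the schema's shape.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {dotted_path: scalar} pairs."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        flat[prefix] = value
    return flat


# Hypothetical result shaped like the company schema.
result = {
    "company": {
        "name": "Acme",
        "departments": [{
            "name": "R&D",
            "manager": {"name": "Ana", "email": "ana@example.com"},
            "employees": [],
        }],
    }
}
for path, value in flatten(result).items():
    print(path, "=", value)
```

Empty lists and empty objects produce no entries, so a department without employees simply has no employees paths.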

Dynamic Arrays

Extract variable-length lists:
schema = {
    "products": [{
        "sku": "string",
        "name": "string",
        "category": "string",
        "price": "float",
        "features": ["string"]
    }]
}

Best Practices

Common Patterns

Table Extraction

Extract tabular data efficiently:
schema = {
    "table_data": [{
        "row_number": "integer",
        "columns": {
            "product": "string",
            "quantity": "integer",
            "price": "float",
            "total": "float"
        }
    }]
}
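
Rows extracted with this pattern can be written out as CSV using only the standard library. The sketch below assumes that every row carries the same column keys and that row_number is always present; table_to_csv and the sample rows are hypothetical.

```python
import csv
import io


def table_to_csv(table_data):
    """Render extracted table rows as CSV text, ordered by row_number."""
    rows = sorted(table_data, key=lambda row: row["row_number"])
    if not rows:
        return ""
    fieldnames = ["row_number", *rows[0]["columns"].keys()]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({"row_number": row["row_number"], **row["columns"]})
    return buffer.getvalue()


# Hypothetical rows as they might appear under "table_data".
table = [
    {"row_number": 2, "columns": {"product": "Bolt", "quantity": 10, "price": 0.5, "total": 5.0}},
    {"row_number": 1, "columns": {"product": "Nut", "quantity": 4, "price": 0.25, "total": 1.0}},
]
print(table_to_csv(table))
```

Null cells are written as empty strings, which is csv.DictWriter's behavior for None values.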

Key-Value Extraction

Extract form-like data:
schema = {
    "form_fields": {
        "field_name_1": "string",
        "field_name_2": "date",
        "field_name_3": "boolean"
    }
}
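
Form data usually needs typed values downstream. The sketch below converts fields declared as "date" into datetime.date objects, assuming dates come back as ISO 8601 strings such as "2024-01-15" (the format shown in the Partial Extraction example); if your results use another format, the parsing step must change. The parse_dates helper is hypothetical and leaves arrays untouched.

```python
from datetime import date


def parse_dates(schema, result):
    """Return a copy of result with ISO date strings converted to date objects."""
    parsed = {}
    for key, value in result.items():
        spec = schema.get(key)
        if isinstance(spec, dict) and isinstance(value, dict):
            parsed[key] = parse_dates(spec, value)
        elif isinstance(spec, str) and spec.split("|")[0] == "date" and isinstance(value, str):
            # Assumption: the value is an ISO 8601 date (YYYY-MM-DD).
            parsed[key] = date.fromisoformat(value)
        else:
            parsed[key] = value  # other types, nulls, and arrays pass through
    return parsed


schema = {
    "form_fields": {
        "field_name_1": "string",
        "field_name_2": "date",
        "field_name_3": "boolean",
    }
}
# Hypothetical extraction result for the schema above.
result = {"form_fields": {"field_name_1": "Jane", "field_name_2": "2024-01-15", "field_name_3": True}}
print(parse_dates(schema, result))
```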

Multi-Document Extraction

Process related documents:
# Extract matching fields from multiple documents
schema = {
    "document_type": "string",
    "reference_number": "string",
    "date": "date",
    "common_fields": {
        "vendor": "string",
        "amount": "float"
    }
}

# Process invoice and purchase order with same schema
invoice_data = client.extract(file_path="invoice.pdf", schema=schema)
po_data = client.extract(file_path="purchase_order.pdf", schema=schema)
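
Once two related documents share a schema, their common fields can be compared directly. The sketch below reports disagreements between two results; compare_documents is a hypothetical helper, the sample data is invented, and the numeric tolerance is an arbitrary choice to absorb rounding.

```python
import math


def compare_documents(first, second, amount_tolerance=0.01):
    """Return {field: (first_value, second_value)} for fields that disagree."""
    mismatches = {}
    a = first.get("common_fields") or {}
    b = second.get("common_fields") or {}
    for field in sorted(set(a) | set(b)):
        left, right = a.get(field), b.get(field)
        both_numeric = isinstance(left, (int, float)) and isinstance(right, (int, float))
        if both_numeric:
            same = math.isclose(left, right, abs_tol=amount_tolerance)
        else:
            same = left == right
        if not same:
            mismatches[field] = (left, right)
    return mismatches


# Hypothetical results for an invoice and its purchase order.
invoice_sample = {"common_fields": {"vendor": "Acme GmbH", "amount": 1200.0}}
po_sample = {"common_fields": {"vendor": "Acme GmbH", "amount": 1250.0}}
print(compare_documents(invoice_sample, po_sample))
```

String fields are compared exactly, so vendor names that differ only in spelling or legal suffix will be reported as mismatches.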

Error Handling

Schema Validation

The API validates your schema before processing:
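# "money" is not a supported data type, so this schema should be rejected
invalid_schema = {"total": "money"}

try:
    result = client.extract(file_path="doc.pdf", schema=invalid_schema)
except Exception as e:
    if "Invalid schema" in str(e):
        print(f"Schema validation failed: {e}")
    else:
        raise  # not a schema problem; let the caller handle it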

Partial Extraction

When a field can't be found in the document, it is returned as null, as optional_field is here:
{
  "invoice_number": "INV-12345",
  "date": "2024-01-15",
  "total": 1234.56,
  "optional_field": null
}

Integration Examples

Database Storage

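# Extract and store in database.
# sqlite3 is assumed here because the query uses "?" placeholders;
# invoices.db and its invoices table are assumed to exist already.
import sqlite3

schema = {
    "invoice_number": "string",
    "vendor_name": "string",
    "total": "float",
    "date": "date"
}

result = client.extract(file_path="invoice.pdf", schema=schema)

# Insert into database
conn = sqlite3.connect("invoices.db")
with conn:  # commits on success, rolls back on error
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO invoices (number, vendor, total, date) VALUES (?, ?, ?, ?)",
        (result["invoice_number"], result["vendor_name"], result["total"], result["date"])
    )
conn.close()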

API Integration

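# Extract and send to another API
import requests

schema = {
    "customer_id": "string",
    "order_items": [{
        "product_id": "string",
        "quantity": "integer"
    }]
}

result = client.extract(file_path="order.pdf", schema=schema)

# Send to order processing API
response = requests.post(
    "https://api.example.com/orders",
    json=result,
    timeout=30  # avoid hanging indefinitely if the API does not respond
)
response.raise_for_status()  # surface HTTP errors instead of ignoring them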

Next Steps