Overview
Schemas allow you to extract structured data from documents in a consistent format. This guide provides comprehensive guidelines for creating effective schemas.
Schema Basics
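A minimal sketch of a schema, written as a Python dictionary and serialized to JSON. The field names (`invoice_number`, `total_amount`, `issue_date`) are illustrative assumptions; the field-name-to-type layout follows the examples in the type tables of this guide.

```python
import json

# Hypothetical invoice schema: each key is a field to extract,
# each value names one of the supported data types.
schema = {
    "invoice_number": "string",
    "total_amount": "float",
    "issue_date": "date",
}

# Serialize the dictionary to a JSON object.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```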
A schema is a JSON object that defines the structure of data you want to extract.
Data Types
Primitive Types
Type | Description | Example |
---|---|---|
string | Text values | "John Doe" |
integer | Whole numbers | 42 |
float | Decimal numbers | 99.99 |
boolean | True/false values | true |
date | Date values | "2024-01-15" |
Complex Types
Type | Description | Example |
---|---|---|
object | Nested structures | {"name": "string", "age": "integer"} |
array | Lists of items | ["string"] or [{"item": "string"}] |
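As a rough illustration of how the primitive and complex types in the tables above might combine, the sketch below builds one schema and labels each top-level field by kind. The field names and the `describe` helper are hypothetical.

```python
# Hypothetical schema mixing primitive and complex types.
schema = {
    "customer": {"name": "string", "age": "integer"},       # object
    "tags": ["string"],                                      # array of primitives
    "line_items": [{"item": "string", "price": "float"}],    # array of objects
    "paid": "boolean",                                       # primitive
}

def describe(node):
    """Return 'object', 'array', or the primitive type name for a schema node."""
    if isinstance(node, dict):
        return "object"
    if isinstance(node, list):
        return "array"
    return node

kinds = {field: describe(node) for field, node in schema.items()}
print(kinds)
```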
Schema Design Principles
1. Start Simple
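One way this principle might look in practice: a first schema with two fields, then a second iteration that extends it. The field names are illustrative assumptions.

```python
# First iteration: only the fields that are easy to verify.
schema_v1 = {
    "invoice_number": "string",
    "total_amount": "float",
}

# Second iteration: extend the working schema rather than rewriting it.
schema_v2 = {
    **schema_v1,
    "issue_date": "date",
    "line_items": [{"description": "string", "amount": "float"}],
}

print(sorted(schema_v2))
```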
Begin with basic fields and gradually add complexity.
2. Use Descriptive Names
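A small contrast, assuming a hypothetical invoice whose printed labels are "Invoice Number", "Due Date", and "Total Amount Due". Both schemas are illustrative.

```python
# Generic names give little indication of which value on the page is meant.
vague_schema = {"number": "string", "date1": "date", "amount": "float"}

# Names that mirror the labels printed on the (hypothetical) invoice.
descriptive_schema = {
    "invoice_number": "string",
    "due_date": "date",
    "total_amount_due": "float",
}

print(list(descriptive_schema))
```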
Field names should match document terminology.
3. Handle Optional Fields
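How a field is marked nullable depends on the extraction API and is not shown here. As a client-side sketch only, the hypothetical `fill_missing` helper below treats any schema field that is absent from a result as `None`, so downstream code never assumes a field exists. The sample result is made up for illustration.

```python
schema = {
    "invoice_number": "string",
    "purchase_order": "string",  # often absent on real invoices
    "total_amount": "float",
}

def fill_missing(result, schema):
    """Return a copy of result with every top-level schema field present (None if missing)."""
    return {field: result.get(field) for field in schema}

# Hypothetical extraction result for a document with no purchase order.
raw_result = {"invoice_number": "INV-001", "total_amount": 99.99}
result = fill_missing(raw_result, schema)
print(result)
```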
Make fields nullable when they might not exist.
Common Schema Patterns
Financial Documents
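A possible starting point for an invoice-style document. Every field name here is an illustrative assumption, not a prescribed layout.

```python
import json

# Hypothetical invoice schema using the supported primitive and complex types.
financial_schema = {
    "invoice_number": "string",
    "issue_date": "date",
    "vendor": {"name": "string", "tax_id": "string"},
    "line_items": [
        {"description": "string", "quantity": "integer", "unit_price": "float"}
    ],
    "total_amount": "float",
    "paid": "boolean",
}

print(json.dumps(financial_schema, indent=2))
```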
Legal Documents
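A comparable sketch for a contract. The field names are assumptions chosen for illustration.

```python
import json

# Hypothetical contract schema.
legal_schema = {
    "contract_title": "string",
    "effective_date": "date",
    "parties": [{"name": "string", "role": "string"}],
    "governing_law": "string",
    "term_months": "integer",
    "auto_renewal": "boolean",
}

print(json.dumps(legal_schema, indent=2))
```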
Medical Records
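A sketch for a clinical visit summary, again with hypothetical field names.

```python
import json

# Hypothetical medical record schema.
medical_schema = {
    "patient": {"name": "string", "date_of_birth": "date"},
    "visit_date": "date",
    "diagnoses": ["string"],
    "medications": [
        {"name": "string", "dosage": "string", "frequency": "string"}
    ],
    "follow_up_required": "boolean",
}

print(json.dumps(medical_schema, indent=2))
```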
Advanced Techniques
Dynamic Fields
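One pattern that may work when the set of fields is not known in advance is an array of label/value pairs alongside the fixed fields. This is an assumed pattern, not a documented feature; the schema, the sample result, and the names `additional_fields`, `label`, and `value` are all hypothetical.

```python
# Hypothetical schema: one fixed field plus a catch-all list of label/value pairs.
schema = {
    "document_title": "string",
    "additional_fields": [{"label": "string", "value": "string"}],
}

# Hypothetical extraction result for that schema.
result = {
    "document_title": "Equipment Inspection",
    "additional_fields": [
        {"label": "Inspector", "value": "J. Smith"},
        {"label": "Serial No.", "value": "A-1234"},
    ],
}

# Convert the pairs to a plain dict for easier lookup.
extras = {pair["label"]: pair["value"] for pair in result["additional_fields"]}
print(extras)
```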
Use dynamic fields for documents with variable structures.
Conditional Extraction
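This guide names `schema_prompt` but does not show how it is passed. The sketch below assumes it travels next to the schema in a request body; the `payload` key names and the prompt wording are assumptions, and no request is actually sent.

```python
import json

schema = {
    "payment_status": "string",
    "late_fee": "float",
}

# Free-text guidance describing when a field should be filled.
schema_prompt = (
    "Extract late_fee only if the document states that payment is overdue; "
    "otherwise leave it empty."
)

# Hypothetical request body: these key names are assumed, not a documented API.
payload = {"schema": schema, "schema_prompt": schema_prompt}
print(json.dumps(payload, indent=2))
```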
Use schema_prompt to guide conditional extraction.
Hierarchical Data
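A sketch of a nested schema built from objects and arrays, with a hypothetical `leaf_paths` helper that lists every primitive field so the hierarchy is easy to review. It assumes each array holds exactly one item definition, as in the `["string"]` examples in this guide.

```python
# Hypothetical organization chart schema.
schema = {
    "company": {
        "name": "string",
        "departments": [
            {
                "name": "string",
                "employees": [{"name": "string", "title": "string"}],
            }
        ],
    }
}

def leaf_paths(node, prefix=""):
    """List dotted paths to every primitive field; '[]' marks an array level."""
    if isinstance(node, dict):
        paths = []
        for key, child in node.items():
            paths += leaf_paths(child, f"{prefix}.{key}" if prefix else key)
        return paths
    if isinstance(node, list):
        # Assumes a single item definition per array.
        return leaf_paths(node[0], prefix + "[]")
    return [prefix]

paths = leaf_paths(schema)
for path in paths:
    print(path)
```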
Use hierarchical schemas for documents with nested structures.
Performance Optimization
1. Minimize Schema Complexity
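Nesting depth is one rough proxy for complexity. The hypothetical `schema_depth` helper below counts nesting levels so that deep schemas can be flagged before use; what counts as too deep is not stated in this guide, so any threshold would be an assumption.

```python
def schema_depth(node):
    """Count nesting levels: primitives are 0, each object or array adds 1."""
    if isinstance(node, dict):
        return 1 + max((schema_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((schema_depth(v) for v in node), default=0)
    return 0

flat = {"name": "string", "total": "float"}
nested = {"order": {"items": [{"product": {"name": "string"}}]}}

flat_depth = schema_depth(flat)
nested_depth = schema_depth(nested)
print(flat_depth, nested_depth)
```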
2. Extract Only What You Need
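A sketch of trimming a larger schema down to the fields a given task actually uses. The schema and the `needed` list are illustrative.

```python
# Hypothetical full schema for an invoice.
full_schema = {
    "invoice_number": "string",
    "issue_date": "date",
    "vendor_name": "string",
    "vendor_address": "string",
    "total_amount": "float",
}

# Fields required by a (hypothetical) payment reconciliation task.
needed = ["invoice_number", "total_amount"]

# Keep only those fields, preserving their declared types.
lean_schema = {field: full_schema[field] for field in needed}
print(lean_schema)
```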
3. Use Appropriate Data Types
Validation and Testing
Schema Validation
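A minimal local check, assuming the supported types are the ones listed in this guide and that an array holds exactly one item definition (an assumption based on the `["string"]` examples). The function names are hypothetical and this is not an official validator.

```python
import json

PRIMITIVES = {"string", "integer", "float", "boolean", "date"}

def validate_schema(node, path="schema"):
    """Return a list of problems; an empty list means the structure looks valid."""
    if isinstance(node, str):
        return [] if node in PRIMITIVES else [f"{path}: unknown type {node!r}"]
    if isinstance(node, dict):
        problems = []
        for key, child in node.items():
            problems += validate_schema(child, f"{path}.{key}")
        return problems
    if isinstance(node, list):
        if len(node) != 1:
            return [f"{path}: array must contain exactly one item definition"]
        return validate_schema(node[0], f"{path}[]")
    return [f"{path}: expected a type name, object, or array"]

def validate_schema_text(text):
    """Check JSON syntax first, then the schema structure."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(schema, dict):
        return ["schema: top level must be a JSON object"]
    return validate_schema(schema)

print(validate_schema_text('{"name": "string", "tags": ["string"]}'))
print(validate_schema_text('{"name": "text"}'))
```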
Before using a schema, validate its structure.
Testing Schemas
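A sketch of checking extraction results from sample documents against the schema's declared types. The results here are hard-coded stand-ins for real extraction output, the file names are made up, and the `mismatches` helper is hypothetical; it covers only top-level primitive fields and skips missing or null values.

```python
from datetime import date

def _is_iso_date(text):
    """True if text parses as an ISO date such as 2024-01-15."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False

# One predicate per primitive type; bool is excluded from the numeric checks.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "date": lambda v: isinstance(v, str) and _is_iso_date(v),
}

def mismatches(result, schema):
    """Return names of top-level primitive fields whose values have the wrong type."""
    bad = []
    for field, expected in schema.items():
        value = result.get(field)
        if value is None or not isinstance(expected, str):
            continue
        if not CHECKS[expected](value):
            bad.append(field)
    return bad

schema = {"invoice_number": "string", "total_amount": "float", "issue_date": "date"}

# Hypothetical extraction outputs for two sample documents.
samples = {
    "invoice_a.pdf": {"invoice_number": "INV-001", "total_amount": 99.99, "issue_date": "2024-01-15"},
    "invoice_b.pdf": {"invoice_number": "INV-002", "total_amount": "99,99", "issue_date": "15/01/2024"},
}

report = {name: mismatches(result, schema) for name, result in samples.items()}
print(report)
```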
Test with sample documents.
Error Handling
Common Schema Errors
Error | Cause | Solution |
---|---|---|
Invalid JSON | Syntax error in schema | Validate JSON syntax |
Unknown type | Using unsupported data type | Use only supported types |
Too complex | Deeply nested structure | Simplify schema |
No matches | Fields don’t match document | Adjust field names |
Debugging Tips
- Start with a minimal schema and add fields incrementally
- Use schema prompts to provide context
- Check extracted content without schema first
- Verify field names match document terminology
Best Practices Summary
DO
- Use descriptive field names
- Start simple and iterate
- Make optional fields nullable
- Test with real documents
- Validate schema structure
- Use appropriate data types
DON'T
- Create overly complex nested structures
- Use generic field names
- Assume all fields will always exist
- Extract entire documents as single fields
- Mix data types inconsistently