Overview
Pulse API is designed to handle documents of any size, from single-page memos to thousand-page reports. This guide covers strategies for efficiently processing large documents while maintaining accuracy and minimizing costs. Very large files are defined as documents with 100+ pages, which require special handling for optimal performance.
Size Thresholds
Understanding these thresholds helps you choose the right processing strategy:
Document Size | Processing Method | Response Type | Recommended Approach |
---|---|---|---|
< 50 pages | Synchronous | Direct JSON | Use /extract endpoint |
50-70 pages | Synchronous/Async | Direct JSON | Consider async for reliability |
> 70 pages | Either | S3 URL | Results delivered via URL |
> 100 pages | Async recommended | S3 URL | Use /extract_async |
> 500 pages | Async required | S3 URL | Consider page ranges |
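As a rough illustration, the thresholds in the table can be encoded in a small helper that picks an approach from a page count. This is a sketch, not part of the Pulse client: the `Strategy` type and `choose_strategy` function are hypothetical names, and the boundaries follow the table's strict inequalities (so exactly 100 pages falls into the "> 70 pages" row).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Hypothetical container mirroring one row of the thresholds table."""
    method: str         # "sync", "sync_or_async", "either", "async_recommended", "async_required"
    response_type: str  # "json" (direct JSON) or "s3_url"
    note: str           # the table's "Recommended Approach" column


def choose_strategy(page_count: int) -> Strategy:
    """Map a page count onto the table row with the highest matching threshold."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count > 500:
        return Strategy("async_required", "s3_url", "Consider page ranges")
    if page_count > 100:
        return Strategy("async_recommended", "s3_url", "Use /extract_async")
    if page_count > 70:
        return Strategy("either", "s3_url", "Results delivered via URL")
    if page_count >= 50:
        return Strategy("sync_or_async", "json", "Consider async for reliability")
    return Strategy("sync", "json", "Use /extract endpoint")
```

Checking the highest threshold first means the overlapping rows (> 70, > 100, > 500) resolve to the most specific recommendation.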
Automatic Optimizations
Smart Async Switching
The production client automatically switches to async mode for large files.
S3 URL Responses
For documents over 70 pages, results are automatically delivered via S3 URL.
Processing Strategies
1. Full Document Processing
Process entire documents when you need complete context.
2. Page Range Processing
Extract specific sections to reduce processing time:
- Single page: "5"
- Range: "10-20"
- Multiple ranges: "1-5,10-15,20"
- Mixed: "1,3,5-10,15"
3. Chunked Processing
For very large documents, process in chunks.
4. Selective Extraction
Extract only what you need to minimize processing.
Async Processing Deep Dive
Starting Async Jobs
Polling for Completion
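A polling loop with exponential backoff might look like the sketch below. The status request itself is passed in as a callable, so nothing here depends on a particular client. The `"status"` field and its `"completed"` and `"failed"` values are assumptions for illustration; check the actual job status response for the real field names.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    get_status: Callable[[], Dict[str, Any]],
    *,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    factor: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call get_status() until the job finishes, waiting longer after each attempt."""
    delay = initial_delay
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        state = status.get("status")  # assumed field name
        if state == "completed":      # assumed terminal value
            return status
        if state == "failed":         # assumed terminal value
            raise RuntimeError(f"job failed: {status}")
        if time.monotonic() + delay > deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(delay)
        # Grow the wait geometrically, capped at max_delay.
        delay = min(delay * factor, max_delay)
```

Capping the delay keeps long-running jobs from being checked too rarely, while the growing interval avoids hammering the API during the first minutes of a large job.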
Parallel Processing
Process multiple large documents simultaneously.
Optimization Techniques
1. Memory Management
For very large responses.
2. Cost Optimization
Minimize pages processed to reduce costs.
3. Performance Monitoring
Track processing performance.
Common Patterns
Legal Document Processing
Financial Report Analysis
Error Handling
Handling Large Document Errors
Best Practices Summary
Use Async for Large Files
- Always use async processing for files > 50 pages
- Let the client auto-detect when to use async
- Implement proper polling with backoff
Optimize Page Ranges
- Process only the pages you need
- Extract TOC first to navigate large documents
- Use chunking for very large documents
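One way to apply chunking is to split a document's page count into consecutive page range strings and submit each range as its own request. The helper below is a hypothetical sketch; the default chunk size of 40 is an arbitrary illustrative value chosen to stay under the 50-page synchronous threshold, and should be tuned per document type.

```python
from typing import List


def chunk_page_ranges(total_pages: int, chunk_size: int = 40) -> List[str]:
    """Split pages 1..total_pages into consecutive range strings like "1-40"."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be at least 1")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A trailing chunk of one page is written as a single page number.
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges
```

Each returned string uses the same syntax as the page range examples, so a 120-page document with a chunk size of 50 yields three requests.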
Handle S3 URLs Properly
- Check for is_url in responses
- Download results promptly (URLs expire in 24 hours)
- Implement retry logic for S3 downloads
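The checks above could be combined into a single helper that returns inline results unchanged and downloads URL-delivered results with retries. This is a sketch: `is_url` comes from this guide, but the `"url"` key that holds the download link is an assumed field name, and `resolve_result` and `fetch_json` are hypothetical helpers built on the standard library.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict


def fetch_json(url: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Download and decode a JSON document from a URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def resolve_result(
    response: Dict[str, Any],
    *,
    fetch: Callable[[str], Dict[str, Any]] = fetch_json,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Return the extraction result, downloading it first if it was delivered by URL."""
    if not response.get("is_url"):
        return response  # result is already inline JSON
    url = response["url"]  # assumed field name for the S3 link
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            last_error = exc
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"download failed after {retries} attempts") from last_error
```

Because the URLs expire, it is safer to call a helper like this as soon as the job completes rather than storing the URL for later.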
Monitor Performance
- Track processing times and costs
- Adjust chunk sizes based on document type
- Use concurrent processing for multiple files
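Concurrent processing and timing can be combined in one small wrapper. The sketch below runs a caller-supplied `process_one` function (a placeholder for whatever submits a document and waits for its result) across several files with a thread pool, recording how long each file took. `process_many` is a hypothetical helper, and the worker count of 4 is an arbitrary default that should respect any API rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Iterable, Tuple


def process_many(
    paths: Iterable[str],
    process_one: Callable[[str], Any],
    max_workers: int = 4,
) -> Tuple[Dict[str, Any], Dict[str, float], Dict[str, Exception]]:
    """Process files concurrently; return (results, seconds per file, errors)."""

    def timed(path: str) -> Tuple[Any, float]:
        start = time.perf_counter()
        result = process_one(path)
        return result, time.perf_counter() - start

    results: Dict[str, Any] = {}
    timings: Dict[str, float] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(timed, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path], timings[path] = future.result()
            except Exception as exc:
                # One failed document should not abort the whole batch.
                errors[path] = exc
    return results, timings, errors
```

Threads suit this workload because each task spends most of its time waiting on network I/O; the per-file timings give a starting point for adjusting chunk sizes.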