Pulse API is designed to handle documents of any size, from single-page memos to thousand-page reports. This guide covers strategies for efficiently processing large documents while maintaining accuracy and minimizing costs.

Very large files are defined as documents with 100+ pages, which require special handling for optimal performance.
For documents over 70 pages — or any response payload above the 5 MB inline threshold — results are returned as a one-time download link of the form https://api.runpulse.com/large_results/{job_id}:
    import os

    import requests

    API_KEY = os.environ["PULSE_API_KEY"]

    result = client.extract(file_url="large_report.pdf")

    if isinstance(result, dict) and result.get("is_url"):
        # One-time-use link. Authenticate with your API key and fetch immediately.
        content_response = requests.get(
            result["url"],
            headers={"x-api-key": API_KEY},
        )
        content_response.raise_for_status()
        actual_content = content_response.json()
/large_results/{job_id} links are single-use and expire 1 hour after the job completes. Once a successful download has streamed (or the hour has elapsed), the link returns 410 Gone. Always persist the payload to your own storage on first read.
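The advice to persist on first read could be sketched as follows. This is a minimal illustration rather than part of the Pulse SDK: `persist_payload` and the destination path are hypothetical names, and the payload is assumed to be the JSON-decoded dict from the download. The sketch writes to a temporary file and then renames it, so a crash mid-write should not leave a truncated copy of data that can no longer be re-downloaded.

```python
import json
import os
import tempfile
from pathlib import Path


def persist_payload(payload: dict, dest: Path) -> Path:
    """Atomically write a decoded result payload to `dest` as JSON.

    Hypothetical helper for illustration; not part of the Pulse SDK.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename into place.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, dest)
    except BaseException:
        # Remove the partial temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return dest
```

It would typically be called right after decoding the response, for example `persist_payload(actual_content, Path("results/large_report.json"))`.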
Extract specific sections to reduce processing time:
    # Process only specific pages
    result = client.extract(
        file_url="manual.pdf",
        pages="1-10,50-55,100",  # First 10 pages, pages 50-55, and page 100
        schema={"introduction": "string", "specifications": {}},
    )
Page Range Syntax (1-indexed, page 1 is the first page):
Single page: "5" — extracts the 5th page
Range: "10-20" — extracts pages 10 through 20 (inclusive)
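To make the syntax concrete, here is a small client-side sketch that expands a page specification into explicit page numbers. It covers the single-page and range forms listed above plus the comma-separated combination used in the earlier example (`"1-10,50-55,100"`). The function name `parse_pages` is hypothetical, and the validation rules are assumptions; the API may accept or reject inputs differently.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-10,50-55,100" into sorted 1-indexed pages.

    Illustrative only: mirrors the documented syntax, not the API's own parser.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty item in page spec: {spec!r}")
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # raises ValueError on non-numeric input
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))  # ranges are inclusive
    return sorted(pages)


print(parse_pages("1-3,7"))  # [1, 2, 3, 7]
```

Checking a specification locally like this can catch typos (for example a reversed range such as `"20-10"`) before a request is sent.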
Extract only what you need to minimize processing:
    # Extract only table of contents first
    toc_result = client.extract(
        file_url="technical_manual.pdf",
        pages="1-5",
        schema={"sections": [{"title": "string", "page": "integer"}]},
    )

    # Then extract specific sections based on TOC
    for section in toc_result["sections"]:
        if section["title"] == "Technical Specifications":
            specs = client.extract(
                file_url="technical_manual.pdf",
                pages=str(section["page"]),
                schema={"specifications": {}},
            )
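The example above requests only the page on which a section starts. When a section spans several pages, one possible extension is to derive a full range from the table of contents. The sketch below assumes that each section starts on a new page, that the page numbers in the TOC match the PDF's 1-indexed page positions (front matter often shifts them), and that the total page count is known. `section_page_ranges` and `last_page` are hypothetical names.

```python
def section_page_ranges(sections: list[dict], last_page: int) -> dict[str, str]:
    """Map each TOC title to a page spec ("start-end" or a single page).

    Assumes a section runs from its start page to the page before the
    next section starts; the final section runs to `last_page`.
    """
    ordered = sorted(sections, key=lambda s: s["page"])
    ranges: dict[str, str] = {}
    for i, section in enumerate(ordered):
        start = section["page"]
        if i + 1 < len(ordered):
            # Never let the range end before it starts (two sections on one page).
            end = max(start, ordered[i + 1]["page"] - 1)
        else:
            end = max(start, last_page)
        ranges[section["title"]] = str(start) if start == end else f"{start}-{end}"
    return ranges


toc = [
    {"title": "Introduction", "page": 1},
    {"title": "Technical Specifications", "page": 50},
    {"title": "Appendix", "page": 56},
]
print(section_page_ranges(toc, last_page=120)["Technical Specifications"])  # 50-55
```

The resulting string could then be passed as the `pages` argument in place of `str(section["page"])`.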
    import time

    import requests


    def robust_large_document_processing(file_path, max_retries=3):
        """Process large documents with retry logic."""
        for attempt in range(max_retries):
            try:
                result = client.extract(
                    file_url=file_path,
                    async_=True,
                    # Per-request timeout. The Python SDK exposes timeout via the
                    # `request_options` dict, not as a top-level kwarg.
                    request_options={"timeout_in_seconds": 600},  # 10 minutes
                )

                # Handle one-time large-result download link
                if result.get("is_url"):
                    # Single-use link: fetch once and persist immediately.
                    # Retry only on transient network errors, not on 410
                    # (already consumed/expired); raise_for_status() lets 410 propagate.
                    response = requests.get(
                        result["url"],
                        headers={"x-api-key": API_KEY},
                        timeout=60,
                    )
                    response.raise_for_status()
                    return response.json()

                return result

            # requests raises its own Timeout class, which is not a built-in TimeoutError.
            except (TimeoutError, requests.exceptions.Timeout):
                if attempt < max_retries - 1:
                    print(f"Timeout on attempt {attempt + 1}, retrying...")
                    time.sleep(10 * (attempt + 1))  # Linear backoff: 10 s, then 20 s
                else:
                    raise
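The retry loop above waits `10 * (attempt + 1)` seconds, which grows linearly. If a true exponential schedule is preferred, the delay could be computed as in the sketch below. The helper name `backoff_delay`, the 120-second cap, and the use of full jitter are all assumptions chosen for illustration, not values recommended by the API.

```python
import random


def backoff_delay(attempt: int, base: float = 10.0, cap: float = 120.0,
                  jitter: bool = True) -> float:
    """Return the wait in seconds before retrying after 0-based `attempt`."""
    # Double the delay on each attempt, but never exceed the cap.
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # Full jitter: pick uniformly in [0, delay] so that many clients
        # retrying at once do not hit the service in lockstep.
        return random.uniform(0, delay)
    return delay


# Deterministic schedule without jitter: 10 s, 20 s, 40 s, 80 s, then capped.
print([backoff_delay(a, jitter=False) for a in range(5)])  # [10.0, 20.0, 40.0, 80.0, 120.0]
```

In the function above, `time.sleep(10 * (attempt + 1))` would become `time.sleep(backoff_delay(attempt))`.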