Overview

Pulse API extracts content from PDFs, images, Office documents, and HTML files, which together cover most common business documents. Every supported file type is processed with the same high accuracy and can be used with structured schema extraction.

Supported Formats

PDF Documents

Extension: .pdf
  • Text-based PDFs (searchable)
  • Image-based PDFs (scanned)
  • Mixed content PDFs
  • Password-protected PDFs (provide password)
  • Multi-page documents

Images

Extensions: .jpg, .jpeg, .png
  • High-resolution scans
  • Photographs of documents
  • Screenshots
  • Multi-page TIFFs
  • Handwritten content (with limitations)

Office Documents

Extensions: .docx, .pptx, .xlsx
  • Microsoft Word documents
  • PowerPoint presentations
  • Excel spreadsheets
  • Embedded images and charts
  • Complex formatting preserved

Web Documents

Extensions: .html, .htm
  • Static HTML pages
  • Saved web pages
  • HTML emails
  • Inline styles preserved
  • Embedded images extracted

Processing Large Files

For very large files, we recommend:
  • Using async processing (/extract_async)
  • Processing specific page ranges
  • Contacting support for optimization strategies
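
One way to apply the page-range recommendation is to split a long document into fixed-size ranges and submit them one at a time. The sketch below only builds the range strings. It assumes the "start-end" format shown in this document's PDF example (pages="1-10"); the helper name page_ranges and the default chunk size of 10 are illustrative, not part of the Pulse API.

```python
def page_ranges(total_pages: int, chunk_size: int = 10) -> list[str]:
    """Split pages 1..total_pages into "start-end" strings (1-indexed, inclusive)."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        # Clamp the last chunk so it never runs past the final page.
        end = min(start + chunk_size - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


print(page_ranges(25))  # ['1-10', '11-20', '21-25']
```

Each string can then be passed as the pages argument of a separate extraction call. Whether single-page ranges such as "5-5" are accepted is not stated here, so keep chunk_size above 1 unless you have verified that.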

Format-Specific Features

PDF Processing

PDFs receive the most comprehensive processing:

# Extract with PDF-specific options.
# `client` is assumed to be an already-configured Pulse API client.
result = client.extract(
    file_path="document.pdf",
    pages="1-10",         # Process only the given page range
    extract_figure=True,  # Extract embedded images
    return_html=False,    # Return markdown instead of HTML
)

Advanced PDF Features:
  • Preserve document structure
  • Extract form fields
  • Handle multi-column layouts
  • Process rotated pages
  • Extract embedded images

Image Processing

Images are processed with OCR:

# High-quality image extraction
result = client.extract(
    file_path="scanned_document.jpg"
)

Image Best Practices:
  • Resolution: Minimum 150 DPI, recommended 300 DPI
  • Format: PNG for text, JPG for photos
  • Size: Keep under 10 MB per image
  • Quality: Avoid blurry or skewed images
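
The size and resolution guidelines above can be checked before upload. The following is a minimal sketch, not part of the Pulse API: the function names are hypothetical, "10MB" is interpreted as 10 × 1024 × 1024 bytes (an assumption), and DPI is estimated from the pixel width and the physical width of the scanned page, which you need to know or assume (8.5 inches for US Letter, for example).

```python
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # "under 10MB"; binary megabytes assumed
MIN_DPI = 150
RECOMMENDED_DPI = 300


def effective_dpi(pixel_width: int, physical_width_inches: float) -> float:
    """Estimate scan resolution from pixel width and physical page width."""
    return pixel_width / physical_width_inches


def check_image(size_bytes: int, pixel_width: int,
                physical_width_inches: float) -> list[str]:
    """Return human-readable warnings; an empty list means no issue was found."""
    warnings = []
    if size_bytes >= MAX_IMAGE_BYTES:
        warnings.append("file is 10MB or larger")
    dpi = effective_dpi(pixel_width, physical_width_inches)
    if dpi < MIN_DPI:
        warnings.append(f"resolution {dpi:.0f} DPI is below the 150 DPI minimum")
    elif dpi < RECOMMENDED_DPI:
        warnings.append(f"resolution {dpi:.0f} DPI is below the recommended 300 DPI")
    return warnings


# A 2550-pixel-wide scan of an 8.5-inch page is 300 DPI, so no warnings.
print(check_image(2_000_000, 2550, 8.5))  # []
```

Blur and skew are not covered here; detecting them requires inspecting the pixel data.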

Office Document Processing

Office files maintain their structure:

# Extract structured data from Excel
schema = {
    "headers": ["string"],  # list of column headers
    "data": [["string"]],   # list of rows, each a list of cell values
}

result = client.extract(
    file_path="spreadsheet.xlsx",
    schema=schema,
)

Office Document Features:
  • Preserve table structures
  • Extract embedded objects
  • Handle multiple sheets/slides
  • Maintain formatting context

File Upload Methods

Direct Upload

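Upload files directly in your API request:

import requests

API_KEY = "YOUR_API_KEY"  # Replace with your Pulse API key

# Direct file upload (recommended for files < 10MB)
with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.runpulse.com/extract",
        headers={"x-api-key": API_KEY},
        files={"file": f},
    )
response.raise_for_status()  # Raise on HTTP error responses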

Pre-upload to S3

For larger files or reusable uploads:

# Upload to S3 first (`requests` must be imported and API_KEY defined).
# The `with` block closes the file handle once the upload finishes.
with open("large_file.pdf", "rb") as f:
    upload_response = requests.post(
        "https://api.runpulse.com/convert",
        headers={"x-api-key": API_KEY},
        files={"file": f},
    )
upload_response.raise_for_status()

file_url = upload_response.json()["s3_object_url"]

# Then extract using the URL
extract_response = requests.post(
    "https://api.runpulse.com/extract",
    headers={"x-api-key": API_KEY},
    json={"file-url": file_url},
)
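
The choice between direct upload and pre-upload can be made from the file size. This is a minimal sketch assuming the 10MB guideline from the direct-upload example is interpreted as 10 × 1024 × 1024 bytes; the function name and the returned labels are hypothetical and only illustrate the decision.

```python
import os

DIRECT_UPLOAD_LIMIT = 10 * 1024 * 1024  # "files < 10MB"; binary megabytes assumed


def choose_upload_method(size_bytes: int, reuse: bool = False) -> str:
    """Pick "direct" for small one-off files, "pre-upload" otherwise."""
    if reuse or size_bytes >= DIRECT_UPLOAD_LIMIT:
        return "pre-upload"
    return "direct"


def choose_for_file(path: str, reuse: bool = False) -> str:
    """Convenience wrapper that reads the size from disk."""
    return choose_upload_method(os.path.getsize(path), reuse=reuse)


print(choose_upload_method(2 * 1024 * 1024))   # direct
print(choose_upload_method(50 * 1024 * 1024))  # pre-upload
```

Setting reuse=True selects pre-upload regardless of size, which matches the "reusable uploads" case above.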

URL-Based Processing

Process files from public URLs:
# Extract from URL
response = requests.post(
    "https://api.runpulse.com/extract",
    headers={"x-api-key": API_KEY},
    json={"file-url": "https://example.com/document.pdf"}
)
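
Because the file must be reachable at a public URL, it can help to reject obviously unusable values before sending the request. The sketch below assumes that only absolute http or https URLs are acceptable, which this page does not state explicitly; build_url_payload is a hypothetical helper, and only the "file-url" key comes from the example above.

```python
from urllib.parse import urlparse


def build_url_payload(url: str) -> dict:
    """Return the JSON body for URL-based extraction, or raise ValueError."""
    parsed = urlparse(url)
    # Require a web scheme and a host; local paths and other schemes are rejected.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return {"file-url": url}


print(build_url_payload("https://example.com/document.pdf"))
```

The returned dictionary can be passed as the json argument of requests.post. This check does not confirm that the URL is actually publicly accessible.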

Unsupported Formats

The following formats are not currently supported:
  • Video files (.mp4, .avi, .mov)
  • Audio files (.mp3, .wav)
  • CAD files (.dwg, .dxf)
  • Legacy Office formats (.doc, .xls, .ppt)
  • Compressed archives (.zip, .rar)
  • Executable files (.exe, .app)
Need support for a specific format? Contact us at support@runpulse.com
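
A client-side check against the extension lists on this page can catch unsupported files before any request is made. This is a minimal sketch with a hypothetical function name; the two sets are copied from the Supported Formats and Unsupported Formats lists above, and anything in neither list (including files with no extension, and .tif/.tiff, which do not appear in the listed image extensions) is reported as "unknown" rather than guessed.

```python
from pathlib import Path

SUPPORTED = {".pdf", ".jpg", ".jpeg", ".png",
             ".docx", ".pptx", ".xlsx", ".html", ".htm"}
UNSUPPORTED = {".mp4", ".avi", ".mov", ".mp3", ".wav", ".dwg", ".dxf",
               ".doc", ".xls", ".ppt", ".zip", ".rar", ".exe", ".app"}


def classify_extension(path: str) -> str:
    """Return "supported", "unsupported", or "unknown" based on the extension."""
    ext = Path(path).suffix.lower()  # case-insensitive; "" when there is none
    if ext in SUPPORTED:
        return "supported"
    if ext in UNSUPPORTED:
        return "unsupported"
    return "unknown"


print(classify_extension("Report.PDF"))   # supported
print(classify_extension("legacy.doc"))   # unsupported
print(classify_extension("document"))     # unknown
```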

Format Detection

Pulse API automatically detects file types based on:
  1. File extension
  2. MIME type
  3. File content analysis

# API automatically detects format
result = client.extract(file_path="document")  # Works even without extension
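
To illustrate how the three signals can be combined, here is a minimal sketch that checks them in the order listed above. It is not Pulse's actual detection logic: the function name, the lookup tables, and the strict priority order are assumptions for illustration. The byte signatures for PDF, PNG, and JPEG are the standard ones; Office files are omitted from the content check because .docx, .pptx, and .xlsx all share the same ZIP header.

```python
from pathlib import Path
from typing import Optional

EXTENSION_FORMATS = {
    ".pdf": "pdf", ".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png",
    ".docx": "docx", ".pptx": "pptx", ".xlsx": "xlsx",
    ".html": "html", ".htm": "html",
}
MIME_FORMATS = {
    "application/pdf": "pdf", "image/jpeg": "jpeg",
    "image/png": "png", "text/html": "html",
}
MAGIC_BYTES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]


def detect_format(filename: str, content_type: Optional[str] = None,
                  head: bytes = b"") -> Optional[str]:
    """Guess a format from extension, then MIME type, then leading bytes."""
    ext = Path(filename).suffix.lower()
    if ext in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[ext]
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";")[0].strip().lower()
        if mime in MIME_FORMATS:
            return MIME_FORMATS[mime]
    for signature, fmt in MAGIC_BYTES:
        if head.startswith(signature):
            return fmt
    return None


# A file with no extension is still recognized from its first bytes.
print(detect_format("document", head=b"%PDF-1.7\n"))  # pdf
```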

Best Practices by Document Type

Troubleshooting

Common Issues

Next Steps