Overview

The Pulse API extracts content from PDFs, images, Office documents, and web pages. Every supported file type can be used with structured schema extraction.

Supported Formats

PDF Documents

Extension: .pdf
  • Text-based PDFs (searchable)
  • Image-based PDFs (scanned)
  • Mixed content PDFs
  • Password-protected PDFs (provide password)
  • Multi-page documents

Images

Extensions: .jpg, .jpeg, .png
  • High-resolution scans
  • Photographs of documents
  • Screenshots
  • Multi-page TIFFs
  • Handwritten content (with limitations)

Office Documents

Extensions: .docx, .pptx, .xlsx
  • Microsoft Word documents
  • PowerPoint presentations
  • Excel spreadsheets
  • Embedded images and charts
  • Complex formatting preserved

Web Documents

Extensions: .html, .htm
  • Static HTML pages
  • Saved web pages
  • HTML emails
  • Inline styles preserved
  • Embedded images extracted
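
The extension lists above can be turned into a simple client-side pre-flight check. The sketch below is not part of the Pulse API: `SUPPORTED_EXTENSIONS` and `format_category` are hypothetical names, and the mapping copies only the extensions listed in this section.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the "Supported Formats" lists above.
SUPPORTED_EXTENSIONS = {
    ".pdf": "pdf",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".docx": "office",
    ".pptx": "office",
    ".xlsx": "office",
    ".html": "web",
    ".htm": "web",
}


def format_category(path: str) -> Optional[str]:
    """Return the format category for a path, or None if the extension is not listed."""
    # Compare case-insensitively so "REPORT.PDF" and "report.pdf" match.
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())
```

Because the API also inspects MIME type and file content, a `None` result here is a hint to double-check the file rather than proof that it will be rejected.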

Processing Large Files

For very large files, we recommend:
  • Using async processing (/extract_async)
  • Processing specific page ranges
  • Contacting support for optimization strategies

Format-Specific Features

PDF Processing

PDFs receive the most comprehensive processing:
# Extract with PDF-specific options
result = client.extract(
    file_path="document.pdf",
    pages="1-10",  # Process specific pages
    extract_figure=True,  # Extract embedded images
    return_html=False  # Get markdown output
)
Advanced PDF Features:
  • Preserve document structure
  • Extract form fields
  • Handle multi-column layouts
  • Process rotated pages
  • Extract embedded images

Image Processing

Images are processed using advanced OCR:
# High-quality image extraction
result = client.extract(
    file_path="scanned_document.jpg"
)
Image Best Practices:
  • Resolution: Minimum 150 DPI, recommended 300 DPI
  • Format: PNG for text, JPG for photos
  • Size: Keep under 10MB per image
  • Quality: Avoid blurry or skewed images
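
The resolution and size guidelines above can be checked before upload. The sketch below estimates DPI from the pixel width and the physical width of the scanned page, and compares the result with the thresholds listed above. `check_scan` is a hypothetical helper, and reading "10MB" as 10 MiB is an assumption.

```python
from typing import List

MIN_DPI = 150                 # minimum from the guidelines above
RECOMMENDED_DPI = 300         # recommended value from the guidelines above
MAX_BYTES = 10 * 1024 * 1024  # "under 10MB", read here as 10 MiB (assumption)


def effective_dpi(width_px: int, width_inches: float) -> float:
    """Dots per inch implied by a pixel width and a physical page width."""
    return width_px / width_inches


def check_scan(width_px: int, width_inches: float, size_bytes: int) -> List[str]:
    """Return a list of warnings; an empty list means the scan meets the guidelines."""
    warnings = []
    dpi = effective_dpi(width_px, width_inches)
    if dpi < MIN_DPI:
        warnings.append(f"resolution {dpi:.0f} DPI is below the {MIN_DPI} DPI minimum")
    elif dpi < RECOMMENDED_DPI:
        warnings.append(f"resolution {dpi:.0f} DPI is below the recommended {RECOMMENDED_DPI} DPI")
    if size_bytes >= MAX_BYTES:
        warnings.append("file is not under the 10MB size guideline")
    return warnings
```

For example, a US Letter page (8.5 inches wide) scanned at 2550 pixels across works out to 300 DPI and produces no resolution warning.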

Office Document Processing

Office files maintain their structure:
# Extract structured data from Excel
schema = {
    "headers": ["string"],
    "data": [["string"]]
}

result = client.extract(
    file_path="spreadsheet.xlsx",
    schema=schema
)
Office Document Features:
  • Preserve table structures
  • Extract embedded objects
  • Handle multiple sheets/slides
  • Maintain formatting context
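
When extracting with the Excel schema above, it can help to validate the returned data before using it. The response format for schema extraction is not shown in this document, so the sketch below assumes the extracted data arrives as a plain dict that mirrors the schema (`headers` as a list of strings, `data` as a list of rows of strings). `matches_table_schema` is a hypothetical helper.

```python
from typing import Any


def matches_table_schema(data: Any) -> bool:
    """Check that data looks like {"headers": [str, ...], "data": [[str, ...], ...]}."""
    if not isinstance(data, dict):
        return False
    headers = data.get("headers")
    rows = data.get("data")
    # "headers" must be a list whose items are all strings.
    if not isinstance(headers, list) or not all(isinstance(h, str) for h in headers):
        return False
    # "data" must be a list of rows, each row a list of strings.
    if not isinstance(rows, list):
        return False
    return all(
        isinstance(row, list) and all(isinstance(cell, str) for cell in row)
        for row in rows
    )
```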

File Upload Methods

Direct Upload

Upload files directly in your API request:
# Direct file upload (recommended for files < 10MB)
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: replace with your own key

with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://dev.api.runpulse.com/extract",
        headers={"x-api-key": API_KEY},
        files={"file": f}
    )

Pre-upload to S3

For larger files or reusable uploads:
# Upload to S3 first (the with-block closes the file handle after the upload)
with open("large_file.pdf", "rb") as f:
    upload_response = requests.post(
        "https://dev.api.runpulse.com/convert",
        headers={"x-api-key": API_KEY},
        files={"file": f}
    )

file_url = upload_response.json()["s3_object_url"]

# Then extract using the URL
extract_response = requests.post(
    "https://dev.api.runpulse.com/extract",
    headers={"x-api-key": API_KEY},
    json={"file-url": file_url}
)

URL-Based Processing

Process files from public URLs:
# Extract from URL
response = requests.post(
    "https://dev.api.runpulse.com/extract",
    headers={"x-api-key": API_KEY},
    json={"file-url": "https://example.com/document.pdf"}
)
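
The choice between the three upload methods can be sketched as a small decision helper. The 10 MB threshold is taken from the direct-upload comment above and read here as 10 MiB, which is an assumption. `choose_upload_method` and its return labels are hypothetical and not part of the Pulse API.

```python
import os
from typing import Optional

# "< 10MB" from the direct-upload guidance, read here as 10 MiB (assumption).
DIRECT_UPLOAD_LIMIT = 10 * 1024 * 1024


def choose_upload_method(source: str, size_bytes: Optional[int] = None) -> str:
    """Return 'url', 'direct', or 'pre-upload' for a file path or URL."""
    # Anything that already lives at an HTTP(S) URL can be passed as "file-url".
    if source.startswith(("http://", "https://")):
        return "url"
    # For local files, fall back to the size on disk when none is given.
    if size_bytes is None:
        size_bytes = os.path.getsize(source)
    return "direct" if size_bytes < DIRECT_UPLOAD_LIMIT else "pre-upload"
```

Passing `size_bytes` explicitly avoids touching the file system, which is convenient when the size is already known from an earlier step.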

Unsupported Formats

The following formats are not currently supported:
  • Video files (.mp4, .avi, .mov)
  • Audio files (.mp3, .wav)
  • CAD files (.dwg, .dxf)
  • Legacy Office formats (.doc, .xls, .ppt)
  • Compressed archives (.zip, .rar)
  • Executable files (.exe, .app)
Need support for a specific format? Contact us at hello@trypulse.ai

Format Detection

Pulse API automatically detects file types based on:
  1. File extension
  2. MIME type
  3. File content analysis
# API automatically detects format
result = client.extract(file_path="document")  # Works even without extension

Best Practices by Document Type

Invoices, Statements, Reports
  • Use PDF format when possible
  • Ensure text is selectable (not scanned)
  • Include all pages for context
  • Use schema extraction for structured data
Manuals, Specifications, Diagrams
  • Use extract_figure=True for diagrams
  • Process in smaller chunks if very large
  • Maintain original format for tables
  • Consider HTML output for preservation
Clinical Notes, Lab Reports, Prescriptions
  • Use high-resolution scans (300+ DPI)
  • Process handwritten content carefully
  • Verify extracted data accuracy

Troubleshooting

Common Issues

Unsupported file format
Solution: Check that your file extension is one of the supported formats
# Correct
client.extract(file_path="document.pdf")

# Incorrect: legacy .doc files are not supported
client.extract(file_path="document.doc")  # Use .docx instead

File too large
Solution: Reduce file size or use async processing
# For large files
client.extract(file_path="large_file.pdf", async_mode=True)

Corrupted or unreadable file
Solution: Verify file integrity and re-save if needed
# Check PDF integrity (run in a shell; pdfinfo ships with poppler-utils)
pdfinfo document.pdf
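
Where `pdfinfo` is not available, a rough integrity check can be done in Python. The sketch below only tests for the `%PDF-` header and an `%%EOF` marker near the end of the file, which catches many truncated downloads but does not validate the PDF structure. `looks_like_complete_pdf` is a hypothetical helper.

```python
def looks_like_complete_pdf(data: bytes) -> bool:
    """Heuristic: a PDF starts with '%PDF-' and ends with an '%%EOF' marker."""
    has_header = data.startswith(b"%PDF-")
    # The end-of-file marker normally sits in the last few bytes of the file.
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer
```

A file that fails this check is a good candidate for re-saving or re-downloading before it is submitted again.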
