# Supported File Types

> File formats supported by Pulse API for document extraction

## Overview

Pulse API extracts content from PDFs, images, Office documents, and HTML files. Every supported file type goes through the same extraction pipeline and can be used with structured schema extraction.

## Supported Formats

<CardGroup cols={2}>
  <Card title="PDF Documents" icon="file-pdf">
    **Extension**: `.pdf`

    * Text-based PDFs (searchable)
    * Image-based PDFs (scanned)
    * Mixed content PDFs
    * Password-protected PDFs (provide password)
    * Multi-page documents
  </Card>

  <Card title="Images" icon="image">
    **Extensions**: `.jpg`, `.jpeg`, `.png`

    * High-resolution scans
    * Photographs of documents
    * Screenshots
    * Multi-page TIFFs
    * Handwritten content (with limitations)
  </Card>

  <Card title="Office Documents" icon="file-word">
    **Extensions**: `.docx`, `.pptx`, `.xlsx`, `.xls`

    * Microsoft Word documents
    * PowerPoint presentations
    * Excel spreadsheets (`.xlsx` and legacy `.xls`)
    * Embedded images and charts
    * Complex formatting preserved
    * [Spreadsheet options](/api-reference/endpoint/extract#spreadsheet-options) for hidden rows/columns/sheets
  </Card>

  <Card title="Web Documents" icon="globe">
    **Extensions**: `.html`, `.htm`

    * Static HTML pages
    * Saved web pages
    * HTML emails
    * Inline styles preserved
    * Embedded images extracted
  </Card>
</CardGroup>

## Processing Large Files

For very large files, we recommend:

* Using async processing (`/extract` with `async: true`); see [Async Processing](/api-reference/async-processing)
* Processing specific page ranges
* Contacting support for optimization strategies
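
One way to process specific page ranges is to split a long document into consecutive ranges and submit each range as its own request. The sketch below only builds the range strings. `page_ranges` is a hypothetical helper, not part of the Pulse SDK, and it assumes the `"start-end"` syntax that the `pages` parameter uses elsewhere on this page. Writing a single trailing page as `"21-21"` is also an assumption.

```python
def page_ranges(total_pages: int, chunk_size: int) -> list[str]:
    """Split pages 1..total_pages into consecutive "start-end" range strings."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")

    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


print(page_ranges(25, 10))
```

Each resulting string could then be passed as the `pages` argument of a separate extraction request.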

## Format-Specific Features

### PDF Processing

PDFs receive the most comprehensive processing:

```python theme={null}
from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Extract with PDF-specific options
result = client.extract(
    file_url="document.pdf",
    pages="1-10",  # Process specific pages
    figure_processing={"description": True}
)
```

**Advanced PDF Features**:

* Preserve document structure
* Extract form fields
* Handle multi-column layouts
* Process rotated pages
* Extract embedded images

### Image Processing

Images are processed using advanced OCR:

```python theme={null}
# High-quality image extraction
result = client.extract(
    file_url="scanned_document.jpg"
)
```

**Image Best Practices**:

* **Resolution**: Minimum 150 DPI, recommended 300 DPI
* **Format**: PNG for text, JPG for photos
* **Size**: Keep under 10 MB per image
* **Quality**: Avoid blurry or skewed images
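
The resolution guideline above translates into pixel dimensions by simple arithmetic: pixels = inches × DPI. The following sketch shows that calculation. The function names and the US Letter page size are illustrative assumptions, not part of the Pulse API.

```python
MIN_DPI = 150          # minimum from the guideline above
RECOMMENDED_DPI = 300  # recommended from the guideline above


def required_pixels(width_in: float, height_in: float, dpi: int = RECOMMENDED_DPI) -> tuple[int, int]:
    """Pixel dimensions needed to scan a page of the given size at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixels: int, inches: float) -> float:
    """Resolution of an existing scan along one dimension."""
    return pixels / inches


# US Letter (8.5 x 11 in) at the recommended and minimum resolutions.
print(required_pixels(8.5, 11))           # (2550, 3300)
print(required_pixels(8.5, 11, MIN_DPI))  # (1275, 1650)

# A 1700-pixel-wide scan of an 8.5-inch-wide page is 200 DPI:
# above the minimum, below the recommendation.
print(effective_dpi(1700, 8.5))           # 200.0
```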

### Office Document Processing

Extraction preserves the structure of Office files:

```python theme={null}
from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Extract from Excel with spreadsheet-specific options
result = client.extract(
    file=open("spreadsheet.xlsx", "rb"),
    spreadsheet={
        "include_hidden_rows": True,
        "include_hidden_cols": True,
        "include_hidden_sheets": False,
    }
)

print(result.markdown)
```

**Office Document Features**:

* Preserve table structures
* Extract embedded objects
* Handle multiple sheets/slides
* Maintain formatting context
* Control hidden rows, columns, and sheets in Excel (see [Spreadsheet Options](/api-reference/endpoint/extract#spreadsheet-options))

## File Upload Methods

### Direct Upload

Upload files directly in your API request:

```python theme={null}
import requests

API_KEY = "YOUR_API_KEY"

# Direct file upload (recommended for files < 10 MB)
with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.runpulse.com/extract",
        headers={"x-api-key": API_KEY},
        files={"file": f}
    )
```
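
Before uploading directly, the file size can be checked against the 10 MB recommendation. This is a minimal sketch: `fits_direct_upload` is a hypothetical helper, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption, since this page does not say whether the limit is binary or decimal.

```python
# Assumption: "10 MB" is interpreted as 10 MiB.
DIRECT_UPLOAD_LIMIT_BYTES = 10 * 1024 * 1024


def fits_direct_upload(size_bytes: int) -> bool:
    """Return True when a file of this size is under the direct-upload recommendation."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return size_bytes < DIRECT_UPLOAD_LIMIT_BYTES


print(fits_direct_upload(2 * 1024 * 1024))   # True
print(fits_direct_upload(25 * 1024 * 1024))  # False
```

In practice the size would come from `os.path.getsize(path)`. Files over the threshold are candidates for async processing.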

### URL-Based Processing

Process files from public URLs:

```python theme={null}
import requests

API_KEY = "YOUR_API_KEY"

# Extract from URL
response = requests.post(
    "https://api.runpulse.com/extract",
    headers={"x-api-key": API_KEY},
    json={"file_url": "https://www.impact-bank.com/user/file/dummy_statement.pdf"}
)
```

## Unsupported Formats

The following formats are **not** currently supported:

* Video files (`.mp4`, `.avi`, `.mov`)
* Audio files (`.mp3`, `.wav`)
* CAD files (`.dwg`, `.dxf`)
* Legacy Word and PowerPoint formats (`.doc`, `.ppt`); legacy Excel (`.xls`) is supported
* Compressed archives (`.zip`, `.rar`)
* Executable files (`.exe`, `.app`)

<Info>
  Need support for a specific format? Contact us at [hello@trypulse.ai](mailto:hello@trypulse.ai)
</Info>
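
A client-side check can catch unsupported extensions before a request is sent. The sketch below uses only the extensions listed under Supported Formats. `is_supported` is a hypothetical helper rather than an SDK function, and TIFF is left out because this page mentions multi-page TIFFs without listing an extension for them.

```python
from pathlib import Path

# Extensions listed under "Supported Formats" on this page.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".jpg", ".jpeg", ".png",
    ".docx", ".pptx", ".xlsx", ".xls",
    ".html", ".htm",
}


def is_supported(filename: str) -> bool:
    """Case-insensitive check of the file extension against the supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ["report.PDF", "book.xls", "notes.doc", "archive.zip"]:
    print(name, is_supported(name))
```

This check is stricter than the API itself: a file with no extension returns `False` here, even though the API may still identify it from its content.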

## Format Detection

Pulse API automatically detects file types based on:

1. File extension
2. MIME type
3. File content analysis

```python theme={null}
# API automatically detects format
result = client.extract(file_url="document")  # Works even without extension
```
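
This page does not document how Pulse implements detection. As a general illustration of extension-first detection with a content fallback, the sketch below checks the extension and then compares the leading bytes against well-known file signatures. `guess_format` and its labels are hypothetical, and the MIME type step is omitted for brevity.

```python
import os

# Well-known leading byte signatures for some formats listed on this page.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip-based (docx/pptx/xlsx)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2 (legacy xls)"),
]


def guess_format(filename: str, head: bytes) -> str:
    """Guess a format from the extension first, then from the leading bytes."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    if ext:
        return ext
    # No extension: fall back to content analysis.
    for signature, label in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return label
    return "unknown"


print(guess_format("document", b"%PDF-1.7\n"))  # pdf
print(guess_format("scan.PNG", b""))            # png
```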

## Best Practices by Document Type

<AccordionGroup>
  <Accordion title="Financial Documents">
    **Invoices, Statements, Reports**

    * Use PDF format when possible
    * Ensure text is selectable (not scanned)
    * Include all pages for context
    * Use schema extraction for structured data
  </Accordion>

  <Accordion title="Legal Documents">
    **Contracts, Agreements, Forms**

    * Maintain original formatting with PDF
    * Process complete documents for context
    * Use high-resolution scans for signatures
    * Process complex layouts with care
  </Accordion>

  <Accordion title="Technical Documents">
    **Manuals, Specifications, Diagrams**

    * Process in smaller chunks if very large
    * Maintain original format for tables
    * Consider HTML output to preserve structure
  </Accordion>

  <Accordion title="Medical Records">
    **Clinical Notes, Lab Reports, Prescriptions**

    * Use high-resolution scans (300+ DPI)
    * Process handwritten content carefully
    * Verify extracted data accuracy
  </Accordion>
</AccordionGroup>

## Troubleshooting

### Common Issues

<AccordionGroup>
  <Accordion title="FILE_001: Invalid file type">
    **Solution**: Check that your file extension matches our supported formats

    ```python theme={null}
    # Correct
    client.extract(file_url="document.pdf")

    # Incorrect
    client.extract(file_url="document.doc")  # Use .docx instead
    ```
  </Accordion>

  <Accordion title="FILE_002: File too large">
    **Solution**: Reduce file size or use async processing

    ```python theme={null}
    # For large files
    client.extract(file_url="large_file.pdf", async_=True)
    ```
  </Accordion>

  <Accordion title="FILE_003: File corrupted">
    **Solution**: Verify file integrity and re-save if needed

    ```bash theme={null}
    # Check PDF integrity
    pdfinfo document.pdf
    ```
  </Accordion>
</AccordionGroup>
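
The error codes above can be mapped to their suggested fixes in client code. This is a sketch only: this page does not specify how the code appears in an error response, so the hypothetical `remedy_for` helper takes the code as a plain string.

```python
# Remedies summarized from the troubleshooting entries above.
REMEDIES = {
    "FILE_001": "Check that the file extension is one of the supported formats.",
    "FILE_002": "Reduce the file size or use async processing.",
    "FILE_003": "Verify file integrity and re-save the file if needed.",
}


def remedy_for(error_code: str) -> str:
    """Look up the suggested fix for a FILE_* error code."""
    return REMEDIES.get(error_code, "Unknown error code; contact support.")


print(remedy_for("FILE_002"))
```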

## Next Steps

<CardGroup cols={2}>
  <Card title="Structured Output" icon="code" href="/api-reference/structured-output-guidelines">
    Extract structured data from any format
  </Card>

  <Card title="Large Documents" icon="file-lines" href="/api-reference/large-documents">
    Handle big files efficiently
  </Card>
</CardGroup>
