> ## Documentation Index
> Fetch the complete documentation index at: https://docs.runpulse.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Extract File

> The primary endpoint for the Pulse API. Parses uploaded documents or remote 
file URLs and returns rich markdown content with optional structured data 
extraction based on user-provided schemas and extraction options.

Set `async: true` to return immediately with a job ID for polling via
`GET /job/{jobId}`. Otherwise processes synchronously.

Both sync and async modes return HTTP 200. When `async` is true the response
body contains `{ job_id, status, message }` instead of the full extraction result.


## Overview

<Info>
  **Pipeline Step 1** — Extract is always the first step. After extraction, you can optionally [split](/api-reference/endpoint/split) the document into topics, apply [schema extraction](/api-reference/endpoint/schema) to get structured data, or use [tables](/api-reference/endpoint/tables) for span-aware table extraction.
</Info>

Extract content from documents. Returns markdown or HTML formatted content with optional structured data extraction.

For large results (typically documents over 70 pages, or any response above 5 MB), the API returns a one-time download link at `https://api.runpulse.com/large_results/{job_id}` instead of inlining the payload. See [Large Document Response](#large-document-response-70-pages) below.

<Note>
  For large documents or batch processing workflows, set `async: true` to process asynchronously and poll for results via [GET /job/{'{'}jobId{'}'}](/api-reference/endpoint/poll).
</Note>

<Note>
  To process many files at once, use [Batch Extract](/api-reference/endpoint/batch-overview#batch-extract). It accepts an S3 prefix, local directory, or list of URLs and runs `/extract` on each file in parallel.
</Note>

### Async Mode

Set `async: true` to return immediately with a job ID for polling:

```json theme={null}
{
  "file_url": "https://example.com/document.pdf",
  "async": true
}
```

**Async Response (200):**

```json theme={null}
{
  "job_id": "abc123-def456",
  "status": "pending",
  "message": "Document processing started"
}
```

Use `GET /job/{job_id}` to poll for completion.

## Request

### Document Source

Provide the document using one of these methods:

| Field      | Type   | Description                                                    |
| ---------- | ------ | -------------------------------------------------------------- |
| `file`     | binary | Document file to upload directly (multipart/form-data).        |
| `file_url` | string | Public or pre-signed URL that Pulse will download and extract. |

### Extraction Options

| Field               | Type          | Default   | Description                                                                                                                                                                                              |
| ------------------- | ------------- | --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model`             | string (enum) | `default` | Extraction model to use. One of `default` or `pulse-ultra-2`. `pulse-ultra-2` uses Pulse's vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes.       |
| `pages`             | string        | -         | Page range filter (1-indexed). Supports segments like `1-2` or mixed ranges like `1-2,5`. Page 1 is the first page.                                                                                      |
| `figure_processing` | object        | -         | Settings that control how figures in the document are processed. These affect the **markdown output directly** and do not produce additional output fields. See [Figure Processing](#figure-processing). |
| `extensions`        | object        | -         | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under `response.extensions.*`. See [Extensions](#extensions).             |
| `spreadsheet`       | object        | -         | Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, and sheets. Only applies to `.xlsx` and `.xls` files. See [Spreadsheet Options](#spreadsheet-options).             |
| `storage`           | object        | -         | Options for persisting extraction artifacts. See [Storage Options](#storage-options).                                                                                                                    |
| `async`             | boolean       | `false`   | If `true`, returns immediately with a `job_id` for polling via `GET /job/{jobId}`.                                                                                                                       |
| `structured_output` | object        | -         | **⚠️ Deprecated** — Use the [`/schema`](/api-reference/endpoint/schema) endpoint after extraction instead. Still works for backward compatibility.                                                       |

### Figure Processing

Settings under `figure_processing` control how figures (images, charts, diagrams) in the document are processed. These settings affect the **markdown output directly** — for example, adding descriptive captions to figures or converting charts into markdown tables. They do **not** create additional output fields in the response.

| Field                           | Type    | Default | Description                                                                 |
| ------------------------------- | ------- | ------- | --------------------------------------------------------------------------- |
| `figure_processing.description` | boolean | `false` | Generate descriptive captions for extracted figures.                        |
| `figure_processing.show_images` | boolean | `false` | Embed base64-encoded images inline in figure tags. Increases response size. |

### Spreadsheet Options

Settings under `spreadsheet` control how Excel workbooks (`.xlsx`, `.xls`) are processed. By default, hidden rows, columns, and sheets are excluded from extraction output.

| Field                               | Type    | Default | Description                                            |
| ----------------------------------- | ------- | ------- | ------------------------------------------------------ |
| `spreadsheet.include_hidden_rows`   | boolean | `false` | Include rows that are hidden in the Excel workbook.    |
| `spreadsheet.include_hidden_cols`   | boolean | `false` | Include columns that are hidden in the Excel workbook. |
| `spreadsheet.include_hidden_sheets` | boolean | `false` | Include sheets that are hidden in the Excel workbook.  |

<Note>
  These settings accept both camelCase (`includeHiddenRows`) and snake\_case (`include_hidden_rows`) formats.
</Note>

### Pulse Ultra 2 Options

These options are available only when `model: pulse-ultra-2` is set. Passing any of them with the default model returns a 400 error listing the offending fields.

| Field                       | Type    | Default | Description                                                                                                                                                                                                                |
| --------------------------- | ------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `refine`                    | boolean | `false` | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds \~1–2s per page. Overridden by `refine_options` if both are provided. |
| `refine_options`            | object  | -       | Granular refinement targets. Takes precedence over the boolean `refine` flag. See below.                                                                                                                                   |
| `refine_options.tables`     | boolean | `false` | Fix table cell values, structure, and headers against the source image.                                                                                                                                                    |
| `refine_options.text`       | boolean | `false` | Fix OCR errors, missing or extra content, and numerical accuracy (tables untouched).                                                                                                                                       |
| `refine_options.formatting` | boolean | `false` | Add strikethrough, italic, bold, super/subscript, and LaTeX formatting (tables untouched).                                                                                                                                 |
| `extract_figure`            | boolean | `false` | Convert charts and data visualizations into HTML `<table>` blocks, wrapped in `<figure-table>` tags. Useful for financial decks, dashboards, and scientific charts.                                                        |
| `figure_description`        | boolean | `false` | Generate a 1–2 paragraph natural-language description of each picture, wrapped in `<figure-description>` tags. Combines well with `extract_figure`.                                                                        |
| `additional_prompt`         | string  | `""`    | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Max 4000 characters.                                                                               |
| `custom_image_prompt`       | string  | `""`    | Extra context appended to the prompt used by `figure_description` and `extract_figure`. Tunes image and chart interpretation. Max 2000 characters.                                                                         |
| `custom_refine_prompt`      | string  | `""`    | Extra context appended to the refinement prompt. Only applies when `refine: true` or `refine_options` is set. Max 2000 characters.                                                                                         |

#### Markdown output additions

When `extract_figure` or `figure_description` is enabled, figures in `response.markdown` include additional tags:

```html theme={null} theme={null}
<figure data-page="1">
  <figure-table>...HTML table for the chart...</figure-table>
  <figure-description>...1–2 paragraph description...</figure-description>
</figure>
```

When `refine` (or `refine_options`) is set, markdown content is post-processed page-by-page; output is cleaner but typically grows \~1.5–3x in size for dense documents. No new tags are introduced.

### Extensions

Settings under `extensions` enable additional processing passes or alternate output formats. Each enabled extension produces a **corresponding output field** under `response.extensions.*`. For example, enabling `extensions.chunking` produces `response.extensions.chunking`, and enabling `extensions.alt_outputs.return_html` produces `response.extensions.alt_outputs.html`.

| Field                                | Type      | Default | Description                                                                                                           |
| ------------------------------------ | --------- | ------- | --------------------------------------------------------------------------------------------------------------------- |
| `extensions.footnote_references`     | boolean   | `false` | Link footnote markers to their corresponding footnote text.                                                           |
| `extensions.chunking`                | object    | -       | Chunking configuration. See below.                                                                                    |
| `extensions.chunking.chunk_types`    | string\[] | -       | List of chunking strategies: `semantic`, `header`, `page`, `recursive`.                                               |
| `extensions.chunking.chunk_size`     | integer   | -       | Maximum characters per chunk.                                                                                         |
| `extensions.alt_outputs`             | object    | -       | Alternate output formats. See below.                                                                                  |
| `extensions.alt_outputs.wlbb`        | boolean   | `false` | Enable word-level bounding boxes (PDF only). Results in `response.extensions.alt_outputs.wlbb`.                       |
| `extensions.alt_outputs.return_html` | boolean   | `false` | Include HTML representation. `response.markdown` is still present; HTML is at `response.extensions.alt_outputs.html`. |
| `extensions.alt_outputs.return_xml`  | boolean   | `false` | Include XML representation (work in progress).                                                                        |

### `pulse-ultra-2` Rate Limits

Requests made with `model: pulse-ultra-2` are subject to dedicated rate limits, separate from standard extraction:

| Limit      | Value          |
| ---------- | -------------- |
| Per minute | 5 extractions  |
| Per hour   | 20 extractions |
| File size  | 50 MB          |
| Concurrent | 2 per API key  |

The concurrent limit is the one that most commonly applies in practice — long-running extractions held open while new requests arrive will trip it first.

### Storage Options

Control whether extractions are saved to your extraction library:

| Field                 | Type          | Default | Description                                                                           |
| --------------------- | ------------- | ------- | ------------------------------------------------------------------------------------- |
| `storage.enabled`     | boolean       | `true`  | Whether to persist extraction artifacts. Set to `false` for temporary extractions.    |
| `storage.folder_name` | string        | -       | Target folder name to save the extraction to. Creates the folder if it doesn't exist. |
| `storage.folder_id`   | string (uuid) | -       | Target folder ID to save the extraction to. Takes precedence over `folder_name`.      |

### Deprecated Fields

The following input fields are deprecated and will be removed in a future version. They are still accepted for backward compatibility.

| Field               | Replacement                                                                                                                                                         |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `show_images`       | Use `figure_processing.show_images`                                                                                                                                 |
| `chunking`          | Use `extensions.chunking.chunk_types` (array instead of comma-separated string)                                                                                     |
| `chunk_size`        | Use `extensions.chunking.chunk_size`                                                                                                                                |
| `return_html`       | Use `extensions.alt_outputs.return_html`                                                                                                                            |
| `structured_output` | Use [`/schema`](/api-reference/endpoint/schema) endpoint after extraction. Pass `extraction_id` + `schema_config`. Accepts `schema`, `schema_prompt`, and `effort`. |
| `schema`            | Use [`/schema`](/api-reference/endpoint/schema) endpoint after extraction                                                                                           |
| `schema_prompt`     | Use [`/schema`](/api-reference/endpoint/schema) endpoint with `schema_config.schema_prompt`                                                                         |
| `custom_prompt`     | No replacement                                                                                                                                                      |
| `thinking`          | No replacement                                                                                                                                                      |

<Note>
  When legacy input fields are used, the API returns a deprecation warning in the `warnings` array directing you to the updated field names. See the [latest documentation](https://docs.runpulse.com/api-reference/endpoint/extract) for details.
</Note>

## Response

The response structure varies based on document size to optimize for different use cases.

### Standard Response (Under 70 Pages)

For documents under 70 pages, results are returned directly in the response body:

```json theme={null}
{
  "markdown": "# Document Title\n\nExtracted content...",
  "page_count": 15,
  "extraction_id": "abc123-def456-ghi789",
  "extraction_url": "https://platform.runpulse.com/dashboard/extractions/abc123",
  "plan_info": {
    "pages_used": 15,
    "tier": "standard",
    "note": "Pulse Ultra"
  },
  "bounding_boxes": {
    "Title": [],
    "Text": [],
    "Tables": [],
    "Images": [],
    "markdown_with_ids": "<p data-bb-text-id=\"txt-1\">..."
  },
  "extensions": {
    "chunking": {
      "semantic": ["chunk 1...", "chunk 2..."],
      "header": ["section 1...", "section 2..."]
    },
    "altOutputs": {
      "html": "<html>...</html>"
    }
  },
  "warnings": []
}
```

#### Response Fields

| Field                           | Type          | Description                                                                                                                                                         |
| ------------------------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `markdown`                      | string        | Clean markdown content extracted from the document. Always present.                                                                                                 |
| `page_count`                    | integer       | Total number of pages processed.                                                                                                                                    |
| `extraction_id`                 | string (uuid) | Persisted extraction ID. Present when storage is enabled (default). Use with `/split` and `/schema`.                                                                |
| `extraction_url`                | string        | URL to view the extraction in the Pulse Platform. Present when storage is enabled.                                                                                  |
| `plan_info`                     | object        | Billing information including pages used and plan tier.                                                                                                             |
| `bounding_boxes`                | object        | Detailed bounding box data for document elements. See [Bounding Boxes](/api-reference/bounding-boxes) for details.                                                  |
| `extensions`                    | object        | Output from enabled extensions. Only keys for enabled extensions are present. See below.                                                                            |
| `extensions.chunking`           | object        | Chunk results by strategy (when `extensions.chunking` is enabled).                                                                                                  |
| `extensions.footnoteReferences` | array         | List of detected footnotes with their in-text references (when `extensions.footnote_references` is enabled). See [Footnote References](#footnote-references) below. |
| `extensions.altOutputs.wlbb`    | object        | Word-level bounding boxes (when `extensions.alt_outputs.wlbb` is enabled).                                                                                          |
| `extensions.altOutputs.html`    | string        | HTML representation (when `extensions.alt_outputs.return_html` is enabled).                                                                                         |
| `extensions.altOutputs.xml`     | string        | XML representation (when `extensions.alt_outputs.return_xml` is enabled, WIP).                                                                                      |
| `warnings`                      | array         | Non-fatal warnings generated during extraction, including deprecation notices for legacy input usage.                                                               |

#### Deprecated Response Fields

| Field               | Replacement                                     | Description                                                       |
| ------------------- | ----------------------------------------------- | ----------------------------------------------------------------- |
| `html`              | `extensions.altOutputs.html`                    | Present when legacy `return_html` input is used.                  |
| `chunks`            | `extensions.chunking`                           | Present when legacy `chunking` input is used.                     |
| `plan-info`         | `plan_info`                                     | Present when only legacy inputs are used.                         |
| `structured_output` | Use [`/schema`](/api-reference/endpoint/schema) | Present when deprecated `structured_output` input was used.       |
| `input_schema`      | Use [`/schema`](/api-reference/endpoint/schema) | Echo of the applied schema (deprecated path only).                |
| `schema_error`      | Use [`/schema`](/api-reference/endpoint/schema) | Error message if schema processing failed (deprecated path only). |

### Large Document Response (70+ Pages)

For documents with 70 or more pages — or any response payload above the 5 MB inline threshold — the API returns a one-time download link to `/large_results/{job_id}` instead of inlining the payload. This prevents timeout issues and keeps the immediate response small.

```json theme={null}
{
  "is_url": true,
  "url": "https://api.runpulse.com/large_results/abc123-def456-ghi789",
  "plan_info": {
    "pages_used": 150,
    "tier": "standard"
  }
}
```

#### Large Document Response Fields

| Field       | Type    | Description                                                                                                                                                                                                                                                                                                                             |
| ----------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `is_url`    | boolean | Always `true` for large document responses. Use this to detect URL-based responses.                                                                                                                                                                                                                                                     |
| `url`       | string  | One-time download link of the form `https://api.runpulse.com/large_results/{job_id}`. The link streams the complete extraction result the first time it is fetched and is then invalidated (subsequent reads return `410 Gone`). It also expires 1 hour after the job completes. Authenticate the request with your `x-api-key` header. |
| `plan_info` | object  | Billing information including pages used and plan tier.                                                                                                                                                                                                                                                                                 |

<Warning>
  `/large_results/{job_id}` links are **single-use** and **expire 1 hour** after the job completes. Download and persist the payload immediately — do not pass the URL through queues or share it across workers.
</Warning>

#### Handling Large Document Responses

<CodeGroup>
  ```python Python theme={null}
  import requests
  from pulse import Pulse

  API_KEY = "YOUR_API_KEY"
  client = Pulse(api_key=API_KEY)

  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
  )

  if hasattr(response, "is_url") and response.is_url:
      full_result = requests.get(
          response.url,
          headers={"x-api-key": API_KEY},
      ).json()
      print(full_result["markdown"])
  else:
      print(response.markdown)
  ```

  ```typescript TypeScript theme={null}
  import { PulseClient } from 'pulse-ts-sdk';

  const API_KEY = "YOUR_API_KEY";
  const client = new PulseClient({ apiKey: API_KEY });

  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf"
  });

  if ((response as any).is_url) {
      const fullResult = await fetch((response as any).url, {
          headers: { "x-api-key": API_KEY },
      }).then(r => r.json());
      console.log(fullResult.markdown);
  } else {
      console.log(response.markdown);
  }
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@large_document.pdf"

  # Response: {"is_url": true, "url": "https://api.runpulse.com/large_results/abc123-..."}

  # Fetch the result once (single-use link, valid for 1 hour after job completion)
  curl -H "x-api-key: YOUR_API_KEY" \
    "https://api.runpulse.com/large_results/abc123-..."
  ```
</CodeGroup>

<Note>
  Because `/large_results/{job_id}` is one-time use, persist the result to your own storage on first download. If you need to access the result later, enable `storage.enabled` and retrieve it from your extraction library on the Pulse Platform.
</Note>

## Example Usage

### Basic Extraction

<CodeGroup>
  ```python Python theme={null}
  from pulse import Pulse
  from pulse.types import (
      ExtractRequestFigureProcessing,
      ExtractRequestExtensions,
      ExtractRequestExtensionsAltOutputs,
  )

  client = Pulse(api_key="YOUR_API_KEY")

  # Extract from URL with figure processing and HTML output
  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
      figure_processing=ExtractRequestFigureProcessing(
          description=True,
      ),
      extensions=ExtractRequestExtensions(
          alt_outputs=ExtractRequestExtensionsAltOutputs(
              return_html=True,
          ),
      ),
  )

  print(f"Markdown: {response.markdown}")
  print(f"HTML: {response.extensions.alt_outputs.html}")
  print(f"Extraction ID: {response.extraction_id}")
  ```

  ```typescript TypeScript theme={null}
  import { PulseClient } from 'pulse-ts-sdk';

  const client = new PulseClient({ apiKey: "YOUR_API_KEY" });

  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf",
      figureProcessing: { description: true },
      extensions: { altOutputs: { returnHtml: true } }
  });

  console.log(`Markdown: ${response.markdown}`);
  console.log(`HTML: ${response.extensions?.altOutputs?.html}`);
  console.log(`Extraction ID: ${response.extraction_id}`);
  ```

  ```bash curl theme={null}
  # Extract from URL with figure processing
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "file_url": "https://www.impact-bank.com/user/file/dummy_statement.pdf",
      "figureProcessing": {"description": true},
      "extensions": {"altOutputs": {"returnHtml": true}}
    }'
  ```
</CodeGroup>

### File Upload

<CodeGroup>
  ```python Python theme={null}
  from pulse.types import ExtractRequestFigureProcessing

  # Upload and extract a local file
  with open("document.pdf", "rb") as f:
      response = client.extract(
          file=f,
      figure_processing=ExtractRequestFigureProcessing(
          description=True,
      ),
      )
  ```

  ```typescript TypeScript theme={null}
  import * as fs from 'fs';

  const fileBuffer = fs.readFileSync("document.pdf");
  const blob = new Blob([fileBuffer], { type: 'application/pdf' });

  const response = await client.extract({
      file: blob,
  });
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@document.pdf"
  ```
</CodeGroup>

### Structured Data (Extract → Schema)

<Warning>
  The `structured_output` parameter on `/extract` is **deprecated**. Use the [`/schema`](/api-reference/endpoint/schema) endpoint after extraction instead. This gives you better control, re-runnability, and support for split-mode schemas.
</Warning>

**Recommended two-step approach:**

<CodeGroup>
  ```python Python theme={null}
  # Step 1: Extract the document
  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
  )

  extraction_id = response.extraction_id

  # Step 2: Apply schema separately
  schema_result = client.schema(
      extraction_id=extraction_id,
      schema_config={
          "input_schema": {
              "type": "object",
              "properties": {
                  "total": {"type": "number"},
                  "vendor": {"type": "string"}
              }
          },
          "schema_prompt": "Extract invoice total and vendor"
      }
  )

  print(schema_result.schema_output)
  ```

  ```typescript TypeScript theme={null}
  // Step 1: Extract the document
  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf"
  });

  const extractionId = response.extraction_id;

  // Step 2: Apply schema separately
  const schemaResult = await client.schema({
      extraction_id: extractionId,
      schema_config: {
          input_schema: {
              type: "object",
              properties: {
                  total: { type: "number" },
                  vendor: { type: "string" }
              }
          },
          schema_prompt: "Extract invoice total and vendor"
      }
  });

  console.log(schemaResult.schema_output);
  ```

  ```bash curl theme={null}
  # Step 1: Extract the document
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@invoice.pdf"

  # Response includes extraction_id: "abc123-..."

  # Step 2: Apply schema
  curl -X POST https://api.runpulse.com/schema \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "extraction_id": "abc123-...",
      "schema_config": {
        "input_schema": {"type": "object", "properties": {"total": {"type": "number"}, "vendor": {"type": "string"}}},
        "schema_prompt": "Extract invoice total and vendor"
      }
    }'
  ```
</CodeGroup>

### Page Range and Chunking

<CodeGroup>
  ```python Python theme={null}
  from pulse.types import (
      ExtractRequestExtensions,
      ExtractRequestExtensionsChunking,
  )

  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
      pages="1-5,10",  # 1-indexed
      extensions=ExtractRequestExtensions(
          chunking=ExtractRequestExtensionsChunking(
              chunk_types=["semantic", "page"],
              chunk_size=1000,
          ),
      ),
  )

  # Chunk data is in extensions.chunking
  print(response.extensions.chunking.semantic)
  print(response.extensions.chunking.page)
  ```

  ```typescript TypeScript theme={null}
  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf",
      pages: "1-5,10",  // 1-indexed
      extensions: {
          chunking: {
              chunkTypes: ["semantic", "page"],
              chunkSize: 1000
          }
      }
  });

  // Chunk data is in extensions.chunking
  console.log(response.extensions?.chunking?.semantic);
  console.log(response.extensions?.chunking?.page);
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@document.pdf" \
    -F "pages=1-5,10" \
    -F 'extensions={"chunking": {"chunkTypes": ["semantic", "page"], "chunkSize": 1000}}'
  ```
</CodeGroup>

### Footnote References

Enable `extensions.footnote_references` to detect footnote markers (e.g. `*`, `†`, `1`) in body text and link them to the footnote explanation paragraphs at the bottom of the page. Each result item includes the marker symbol, the bounding-box text ID of the footnote, and the bounding-box text IDs of all body-text paragraphs that reference it.

<CodeGroup>
  ```python Python theme={null}
  from pulse.types import ExtractRequestExtensions

  response = client.extract(
      file_url="https://example.com/research-paper.pdf",
      extensions=ExtractRequestExtensions(
          footnote_references=True,
      ),
  )

  # Footnote links are in extensions.footnote_references
  for ref in response.extensions.footnote_references:
      print(f"Marker: {ref.symbol}")
      print(f"  Footnote: {ref.footnote_text_id}")
      print(f"  Referenced by: {ref.reference_text_ids}")
  ```

  ```typescript TypeScript theme={null}
  const response = await client.extract({
      fileUrl: "https://example.com/research-paper.pdf",
      extensions: {
          footnoteReferences: true
      }
  });

  // Footnote links are in extensions.footnoteReferences
  for (const ref of response.extensions?.footnoteReferences ?? []) {
      console.log(`Marker: ${ref.symbol}`);
      console.log(`  Footnote: ${ref.footnoteTextId}`);
      console.log(`  Referenced by: ${ref.referenceTextIds}`);
  }
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@research-paper.pdf" \
    -F 'extensions={"footnoteReferences": true}'
  ```
</CodeGroup>

#### Example Response

```json theme={null}
{
  "markdown": "...",
  "bounding_boxes": { ... },
  "extensions": {
    "footnoteReferences": [
      {
        "symbol": "*",
        "footnoteTextId": "txt-11",
        "referenceTextIds": ["txt-4", "txt-5", "txt-6", "txt-7", "txt-8"]
      },
      {
        "symbol": "†",
        "footnoteTextId": "txt-12",
        "referenceTextIds": ["txt-8"]
      },
      {
        "symbol": "4",
        "footnoteTextId": "txt-48",
        "referenceTextIds": ["txt-45"]
      }
    ]
  }
}
```

#### Footnote Reference Fields

| Field              | Type      | Description                                                                                                                                                                       |
| ------------------ | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `symbol`           | string    | The footnote marker symbol as detected in the document (e.g. `*`, `†`, `‡`, `1`, `#`).                                                                                            |
| `footnoteTextId`   | string    | The bounding-box text ID (e.g. `txt-11`) of the footnote explanation paragraph. Cross-reference with `bounding_boxes.Footer` to get the footnote's content and position.          |
| `referenceTextIds` | string\[] | Bounding-box text IDs of body-text paragraphs that contain a reference to this footnote. Cross-reference with `bounding_boxes.Text` to get each paragraph's content and position. |

<Info>
  Footnote reference detection uses Azure Document Intelligence for paragraph classification, supplemented by PyMuPDF native text extraction for accurate symbol identification. This handles common OCR confusion between visually similar symbols like `†`/`+` and `‡`/`#`. Supported marker types include numbered (`1`, `2`, `3`), symbolic (`*`, `†`, `‡`, `§`, `#`), and lettered (`a`, `b`, `c`) footnotes.
</Info>

### Excel Spreadsheet Options

<CodeGroup>
  ```python Python theme={null}
  from pulse import Pulse
  from pulse.types import ExtractRequestSpreadsheet

  client = Pulse(api_key="YOUR_API_KEY")

  # Extract from Excel with hidden content included
  response = client.extract(
      file=open("financials.xlsx", "rb"),
      spreadsheet=ExtractRequestSpreadsheet(
          include_hidden_rows=True,
          include_hidden_cols=True,
          include_hidden_sheets=False,
      ),
  )

  print(response.markdown)
  ```

  ```typescript TypeScript theme={null}
  import { PulseClient } from 'pulse-ts-sdk';

  const client = new PulseClient({
      headers: { 'x-api-key': 'YOUR_API_KEY' }
  });

  const response = await client.extract({
      file: fs.createReadStream("financials.xlsx"),
      spreadsheet: {
          includeHiddenRows: true,
          includeHiddenCols: true,
          includeHiddenSheets: false
      }
  });

  console.log(response.markdown);
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@financials.xlsx" \
    -F 'spreadsheet={"includeHiddenRows": true, "includeHiddenCols": true, "includeHiddenSheets": false}'
  ```
</CodeGroup>

### Disable Storage

<CodeGroup>
  ```python Python theme={null}
  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
      storage={"enabled": False}
  )
  ```

  ```typescript TypeScript theme={null}
  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf",
      storage: { enabled: false }
  });
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@document.pdf" \
    -F 'storage={"enabled": false}'
  ```
</CodeGroup>


## OpenAPI

````yaml POST /extract
openapi: 3.1.0
info:
  title: Pulse API
  description: >-
    Production-grade document extraction service that transforms complex
    documents  into structured, AI-ready data. This specification is the single
    source of truth  for the Pulse extraction APIs.
  version: 1.0.0
  contact:
    name: Pulse Support
    email: support@trypulse.ai
    url: https://docs.runpulse.com
servers:
  - url: https://api.runpulse.com
    description: Production server
security:
  - ApiKeyAuth: []
paths:
  /extract:
    post:
      tags:
        - Extract
      summary: Extract Document
      description: >
        The primary endpoint for the Pulse API. Parses uploaded documents or
        remote 

        file URLs and returns rich markdown content with optional structured
        data 

        extraction based on user-provided schemas and extraction options.


        Set `async: true` to return immediately with a job ID for polling via

        `GET /job/{jobId}`. Otherwise processes synchronously.


        Both sync and async modes return HTTP 200. When `async` is true the
        response

        body contains `{ job_id, status, message }` instead of the full
        extraction result.
      operationId: extractDocument
      requestBody:
        required: true
        content:
          multipart/form-data:
            schema:
              $ref: '#/components/schemas/ExtractMultipartInput'
          application/json:
            schema:
              $ref: '#/components/schemas/ExtractJsonInput'
      responses:
        '200':
          description: |
            When `async=false` (default): full extraction result with markdown,
            bounding boxes, chunks, etc.
            When `async=true`: job submission acknowledgement with `job_id`.
          content:
            application/json:
              schema:
                oneOf:
                  - $ref: '#/components/schemas/ExtractResponse'
                  - $ref: '#/components/schemas/AsyncSubmissionResponse'
        '400':
          $ref: '#/components/responses/BadRequest'
        '401':
          $ref: '#/components/responses/Unauthorized'
        '429':
          $ref: '#/components/responses/TooManyRequests'
        '500':
          $ref: '#/components/responses/InternalServerError'
      x-codeSamples:
        - lang: python
          label: Python SDK
          source: |
            from pulse import Pulse
            from pulse.types import (
                ExtractRequestFigureProcessing,
                ExtractRequestExtensions,
                ExtractRequestExtensionsChunking,
                ExtractRequestExtensionsAltOutputs,
            )

            client = Pulse(api_key="YOUR_API_KEY")

            # Basic extraction from URL
            response = client.extract(
                file_url="https://example.com/document.pdf"
            )
            print(response.markdown)
            print(response.extraction_id)

            # With figure processing and extensions
            response = client.extract(
                file_url="https://example.com/document.pdf",
                figure_processing=ExtractRequestFigureProcessing(
                    description=True,
                ),
                extensions=ExtractRequestExtensions(
                    chunking=ExtractRequestExtensionsChunking(
                        chunk_types=["semantic"],
                        chunk_size=1000,
                    ),
                    alt_outputs=ExtractRequestExtensionsAltOutputs(
                        return_html=True,
                    ),
                ),
            )
            print(response.extensions.chunking)
            print(response.extensions.alt_outputs.html)

            # Async extraction
            response = client.extract(
                file_url="https://example.com/document.pdf",
                async_=True
            )
            print(response.job_id)  # poll via client.jobs.get_job(job_id=...)
        - lang: typescript
          label: TypeScript SDK
          source: |
            import { PulseClient } from "pulse-ts-sdk";

            const client = new PulseClient({
                apiKey: "YOUR_API_KEY"
            });

            // Basic extraction from URL
            const response = await client.extract({
                fileUrl: "https://example.com/document.pdf"
            });
            console.log(response.markdown);
            console.log(response.extraction_id);

            // With figure processing and extensions
            const extResp = await client.extract({
                fileUrl: "https://example.com/document.pdf",
                figureProcessing: { description: true },
                extensions: {
                    chunking: { chunkTypes: ["semantic"], chunkSize: 1000 },
                    altOutputs: { returnHtml: true }
                }
            });
            console.log(extResp.extensions?.chunking);
            console.log(extResp.extensions?.altOutputs?.html);

            // Async extraction
            const asyncResp = await client.extract({
                fileUrl: "https://example.com/document.pdf",
                async: true
            });
            console.log(asyncResp.job_id); // poll via client.jobs.getJob(...)
        - lang: bash
          label: curl
          source: |
            # Basic extraction
            curl -X POST https://api.runpulse.com/extract \
              -H "x-api-key: YOUR_API_KEY" \
              -F "file=@document.pdf"

            # With figure processing and extensions
            curl -X POST https://api.runpulse.com/extract \
              -H "x-api-key: YOUR_API_KEY" \
              -F "file=@document.pdf" \
              -F 'figure_processing={"description": true}' \
              -F 'extensions={"chunking": {"chunkTypes": ["semantic"], "chunkSize": 1000}, "altOutputs": {"returnHtml": true}}'

            # Async extraction
            curl -X POST https://api.runpulse.com/extract \
              -H "x-api-key: YOUR_API_KEY" \
              -F "file=@document.pdf" \
              -F "async=true"
components:
  schemas:
    ExtractMultipartInput:
      description: Input schema for multipart/form-data requests (file upload or file_url).
      allOf:
        - $ref: '#/components/schemas/ExtractSourceMultipart'
        - $ref: '#/components/schemas/ExtractOptions'
    ExtractJsonInput:
      description: Input schema for JSON requests (file_url only).
      allOf:
        - $ref: '#/components/schemas/ExtractSourceJson'
        - $ref: '#/components/schemas/ExtractOptions'
    ExtractResponse:
      type: object
      description: >-
        Full extraction result returned by the synchronous `/extract` endpoint.
        Contains the extracted markdown, optional extensions output, bounding
        boxes, and storage metadata.
      properties:
        markdown:
          type: string
          description: >-
            Primary markdown content extracted from the document. Always present
            in the new format.
        extensions:
          type: object
          description: >-
            Output from enabled extensions. Each key corresponds to an extension
            that was enabled in the request under `extensions.*`. Only keys for
            enabled extensions are present.
          properties:
            chunking:
              type: object
              description: >-
                Chunk results by strategy. Present when `extensions.chunking`
                was provided in the request.
              properties:
                semantic:
                  type: array
                  items:
                    type: string
                  description: Semantically-segmented chunks.
                header:
                  type: array
                  items:
                    type: string
                  description: Chunks split by document headers/headings.
                page:
                  type: array
                  items:
                    type: string
                  description: One chunk per page.
                recursive:
                  type: array
                  items:
                    type: string
                  description: Recursively-split chunks respecting size limits.
            merge_tables:
              type: object
              description: >-
                Merge tables result/metadata. Present when
                `extensions.merge_tables` was enabled.
              additionalProperties: true
            footnote_references:
              type: array
              description: >-
                List of detected footnotes with their in-text references.
                Present when `extensions.footnote_references` was enabled. Each
                item links a footnote paragraph to the body-text paragraphs that
                reference it, using bounding-box text IDs.
              items:
                type: object
                properties:
                  symbol:
                    type: string
                    description: The footnote marker symbol (e.g. "*", "†", "1", "#").
                  footnoteTextId:
                    type: string
                    description: >-
                      The bounding-box text ID (e.g. "txt-15") of the footnote
                      explanation paragraph.
                  referenceTextIds:
                    type: array
                    description: >-
                      Bounding-box text IDs of body-text paragraphs that contain
                      a reference to this footnote marker.
                    items:
                      type: string
            alt_outputs:
              type: object
              description: >-
                Alternate output formats. Each key corresponds to an enabled alt
                output.
              properties:
                wlbb:
                  type: object
                  description: >-
                    Word-level bounding box data. Present when
                    `extensions.alt_outputs.wlbb` was enabled and input is a
                    PDF.
                  properties:
                    words:
                      type: array
                      description: List of detected words with their positions.
                      items:
                        type: object
                        properties:
                          id:
                            type: string
                          text:
                            type: string
                          page_number:
                            type: integer
                            minimum: 1
                          bounding_box:
                            type: array
                            items:
                              type: number
                            minItems: 8
                            maxItems: 8
                          average_word_confidence:
                            type: number
                    error:
                      type: string
                      description: Error message if word-level extraction failed.
                html:
                  type: string
                  description: >-
                    HTML representation of the document. Present when
                    `extensions.alt_outputs.return_html` was enabled.
                xml:
                  type: string
                  description: >-
                    XML representation of the document. Present when
                    `extensions.alt_outputs.return_xml` was enabled. (WIP)
        bounding_boxes:
          type: object
          description: >-
            Positional bounding-box data for text, titles, headers, footers,
            images, and tables.
          additionalProperties: true
        extraction_id:
          type: string
          format: uuid
          description: >-
            Persisted extraction ID. Present when storage is enabled (default).
            Use with `/split` and `/schema` endpoints.
        extraction_url:
          type: string
          description: >-
            URL to view the extraction on the Pulse platform. Present when
            storage is enabled.
        page_count:
          type: integer
          minimum: 1
          description: Number of pages processed.
        plan_info:
          type: object
          description: Billing tier and usage information.
          properties:
            tier:
              type: string
              description: Current plan tier name.
            pages_used:
              type: integer
              description: Cumulative pages used after this extraction.
            note:
              type: string
              description: Human-readable plan note.
        warnings:
          type: array
          items:
            type: string
          description: >-
            Non-fatal warnings generated during extraction. Includes deprecation
            notices when legacy input parameters are used, as well as processing
            warnings (e.g. word-level bounding box limitations).
        html:
          type: string
          deprecated: true
          description: >-
            **Deprecated** -- Use `extensions.alt_outputs.html` instead. Present
            when the legacy `return_html` input was used.
        chunks:
          type: object
          deprecated: true
          description: >-
            **Deprecated** -- Use `extensions.chunking` instead. Present when
            the legacy `chunking` input was used.
          properties:
            semantic:
              type: array
              items:
                type: string
            header:
              type: array
              items:
                type: string
            page:
              type: array
              items:
                type: string
            recursive:
              type: array
              items:
                type: string
        plan-info:
          type: object
          deprecated: true
          description: >-
            **Deprecated** -- Use `plan_info` (underscore) instead. Present when
            only legacy input parameters are used.
          properties:
            tier:
              type: string
            pages_used:
              type: integer
            note:
              type: string
        structured_output:
          type: object
          deprecated: true
          description: >-
            **Deprecated** -- Only present when the deprecated
            `structured_output` input parameter was used. Use the `/schema`
            endpoint instead.
          properties:
            values:
              type: object
              additionalProperties: true
            citations:
              type: object
              additionalProperties: true
        input_schema:
          type: object
          deprecated: true
          description: '**Deprecated** -- Echo of the schema that was applied.'
          additionalProperties: true
        schema_error:
          type: string
          deprecated: true
          description: '**Deprecated** -- Error message if schema processing failed.'
      additionalProperties: true
    AsyncSubmissionResponse:
      type: object
      description: >-
        Acknowledgement returned when a request is submitted for asynchronous
        processing. Poll GET /job/{job_id} to check status and retrieve results.
      required:
        - job_id
        - status
      properties:
        job_id:
          type: string
          description: Identifier assigned to the asynchronous job.
        status:
          type: string
          description: Initial status reported by the server.
          enum:
            - pending
            - processing
        message:
          type: string
          description: Human-readable description of the accepted job.
    ExtractSourceMultipart:
      type: object
      description: Document source definition for multipart/form-data requests.
      properties:
        file:
          type: string
          format: binary
          description: Document to upload directly. Required unless file_url is specified.
        file_url:
          type: string
          format: uri
          description: Public or pre-signed URL that Pulse will download and extract.
      required:
        - file
    ExtractOptions:
      type: object
      description: >-
        Common extraction options shared by synchronous and asynchronous
        endpoints.
      properties:
        model:
          type: string
          enum:
            - default
            - pulse-ultra-2
          description: >-
            Extraction model to use. `pulse-ultra-2` uses Pulse's
            vision-language model with built-in refinement, figure/chart
            extraction, and word-level bounding boxes. Omit or pass `default`
            for standard extraction.
        pages:
          type: string
          description: >-
            Page range filter (1-indexed, where page 1 is the first page).
            Supports segments such as `1-2` or mixed ranges like `1-2,5`.
          pattern: ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$
        figure_processing:
          type: object
          description: >-
            Settings that control how figures in the document are processed.
            These affect the markdown output directly (e.g. figure descriptions,
            chart-to-table conversion, image embedding) and do not produce
            additional output fields in the response.
          properties:
            description:
              type: boolean
              default: false
              description: Generate descriptive captions for extracted figures.
            show_images:
              type: boolean
              default: false
              description: >-
                Embed base64-encoded images inline in figure tags in the output.
                Increases response size.
        extensions:
          type: object
          description: >-
            Settings that enable additional processing passes or alternate
            output formats. Each enabled extension produces a corresponding
            output field under `response.extensions.*`.
          properties:
            merge_tables:
              type: boolean
              default: false
              description: Merge tables that span multiple pages into a single table.
            footnote_references:
              type: boolean
              default: false
              description: Link footnote markers to their corresponding footnote text.
            chunking:
              type: object
              description: >-
                Chunking configuration. When provided, the document is split
                into chunks using the specified strategies. Results appear in
                `response.extensions.chunking`.
              properties:
                chunk_types:
                  type: array
                  items:
                    type: string
                    enum:
                      - semantic
                      - header
                      - page
                      - recursive
                  description: >-
                    List of chunking strategies to apply (e.g. `["semantic",
                    "header", "page", "recursive"]`).
                chunk_size:
                  type: integer
                  minimum: 1
                  description: Maximum characters per chunk.
            alt_outputs:
              type: object
              description: >-
                Alternate output format options. Each enabled format produces a
                corresponding field under `response.extensions.alt_outputs`.
              properties:
                wlbb:
                  type: boolean
                  default: false
                  description: >-
                    Enable word-level bounding boxes. Runs an additional OCR
                    model to derive bounding boxes for each word. Only applies
                    to PDFs. Results in `response.extensions.alt_outputs.wlbb`.
                return_html:
                  type: boolean
                  default: false
                  description: >-
                    Include an HTML representation of the document. When
                    enabled, `response.markdown` is still present and the HTML
                    is available at `response.extensions.alt_outputs.html`.
                return_xml:
                  type: boolean
                  default: false
                  description: >-
                    Include an XML representation of the document. Results in
                    `response.extensions.alt_outputs.xml`. (Work in progress.)
        storage:
          type: object
          description: >-
            Options for persisting extraction artifacts. When enabled (default),
            artifacts are saved to storage and a database record is created.
          properties:
            enabled:
              type: boolean
              description: >-
                Whether to persist extraction artifacts. Set to false for
                temporary extractions with no storage or database record.
              default: true
            folder_name:
              type: string
              description: >-
                Target folder name to save the extraction to. Creates the folder
                if it doesn't exist.
            folder_id:
              type: string
              format: uuid
              description: >-
                Target folder ID to save the extraction to. Takes precedence
                over folder_name if both are provided.
        async:
          type: boolean
          description: >-
            If true, returns immediately with a job_id for polling via GET
            /job/{jobId}. Otherwise processes synchronously.
          default: false
        refine:
          type: boolean
          default: false
          description: >-
            Run a full-page OCR and formatting correction pass after extraction.
            Improves accuracy on dense layouts, numerical values, and table
            structure. Adds ~1-2s per page. Overridden by `refine_options` if
            both are provided. Requires `model: pulse-ultra-2`.
        refine_options:
          type: object
          description: >-
            Granular refinement targets. Takes precedence over the boolean
            `refine` flag. Requires `model: pulse-ultra-2`.
          properties:
            tables:
              type: boolean
              default: false
              description: >-
                Fix table cell values, structure, and headers against the source
                image.
            text:
              type: boolean
              default: false
              description: >-
                Fix OCR errors, missing or extra content, and numerical accuracy
                (tables untouched).
            formatting:
              type: boolean
              default: false
              description: >-
                Add strikethrough, italic, bold, super/subscript, and LaTeX
                formatting (tables untouched).
        extract_figure:
          type: boolean
          default: false
          description: >-
            Convert charts and data visualizations into HTML `<table>` blocks,
            wrapped in `<figure-table>` tags inside `response.markdown`. Useful
            for financial decks, dashboards, and scientific charts. Requires
            `model: pulse-ultra-2`.
        figure_description:
          type: boolean
          default: false
          description: >-
            Generate a 1-2 paragraph natural-language description of each
            picture, wrapped in `<figure-description>` tags inside
            `response.markdown`. Combines well with `extract_figure`. Requires
            `model: pulse-ultra-2`.
        additional_prompt:
          type: string
          default: ''
          maxLength: 4000
          description: >-
            Extra context injected into the extraction prompt. Use to steer
            extraction toward a specific domain or attention focus. Requires
            `model: pulse-ultra-2`.
        custom_image_prompt:
          type: string
          default: ''
          maxLength: 2000
          description: >-
            Extra context appended to the prompt used by `figure_description`
            and `extract_figure`. Tunes image and chart interpretation for your
            domain. Requires `model: pulse-ultra-2`.
        custom_refine_prompt:
          type: string
          default: ''
          maxLength: 2000
          description: >-
            Extra context appended to the refinement prompt. Only applies when
            `refine: true` or `refine_options` is set. Requires `model:
            pulse-ultra-2`.
        chunking:
          type: string
          deprecated: true
          description: >-
            **Deprecated** -- Use `extensions.chunking.chunk_types` instead.
            Comma-separated list of chunking strategies.
        chunk_size:
          type: integer
          minimum: 1
          deprecated: true
          description: '**Deprecated** -- Use `extensions.chunking.chunk_size` instead.'
        show_images:
          type: boolean
          deprecated: true
          default: false
          description: '**Deprecated** -- Use `figure_processing.show_images` instead.'
        return_html:
          type: boolean
          deprecated: true
          default: false
          description: '**Deprecated** -- Use `extensions.alt_outputs.return_html` instead.'
    ExtractSourceJson:
      type: object
      description: Document source definition for JSON requests.
      properties:
        file_url:
          type: string
          format: uri
          description: Public or pre-signed URL that Pulse will download and extract.
      required:
        - file_url
    ErrorResponse:
      type: object
      properties:
        error:
          type: object
          properties:
            code:
              type: string
              description: Error code (e.g., FILE_001, AUTH_002)
            message:
              type: string
              description: Human-readable error message
            details:
              type: object
              description: Additional error context
  responses:
    BadRequest:
      description: Bad request - Invalid parameters
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    Unauthorized:
      description: Unauthorized - Invalid or missing API key
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    TooManyRequests:
      description: Rate limit exceeded
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    InternalServerError:
      description: Internal server error
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
  securitySchemes:
    ApiKeyAuth:
      type: apiKey
      in: header
      name: x-api-key
      description: API key for authentication

````