> ## Documentation Index
> Fetch the complete documentation index at: https://docs.runpulse.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Extract File

> The primary endpoint for the Pulse API. Parses uploaded documents or remote 
file URLs and returns rich markdown content with optional structured data 
extraction based on user-provided schemas and extraction options.

Set `async: true` to return immediately with a job ID for polling via
`GET /job/{jobId}`. Otherwise processes synchronously.

Both sync and async modes return HTTP 200. When `async` is true the response
body contains `{ job_id, status, message }` instead of the full extraction result.


## Overview

<Info>
  **Pipeline Step 1** — Extract is always the first step. After extraction, you can optionally [split](/api-reference/endpoint/split) the document into topics, apply [schema extraction](/api-reference/endpoint/schema) to get structured data, or use [tables](/api-reference/endpoint/tables) for span-aware table extraction.
</Info>

Extract content from documents. Returns markdown or HTML formatted content with optional structured data extraction.

For large results (typically documents over 70 pages, or any response above 5 MB), the API returns a one-time download link at `https://api.runpulse.com/large_results/{job_id}` instead of inlining the payload. See [Large Document Response](#large-document-response-70-pages) below.

<Note>
  For large documents or batch processing workflows, set `async: true` to process asynchronously and poll for results via [GET /job/{'{'}jobId{'}'}](/api-reference/endpoint/poll).
</Note>

<Note>
  To process many files at once, use [Batch Extract](/api-reference/endpoint/batch-overview#batch-extract). It accepts an S3 prefix, local directory, or list of URLs and runs `/extract` on each file in parallel.
</Note>

### Async Mode

Set `async: true` to return immediately with a job ID for polling:

```json theme={null}
{
  "file_url": "https://example.com/document.pdf",
  "async": true
}
```

**Async Response (200):**

```json theme={null}
{
  "job_id": "abc123-def456",
  "status": "pending",
  "message": "Document processing started"
}
```

Use `GET /job/{job_id}` to poll for completion.

## Request

### Document Source

Provide the document using one of these methods:

| Field      | Type   | Description                                                    |
| ---------- | ------ | -------------------------------------------------------------- |
| `file`     | binary | Document file to upload directly (multipart/form-data).        |
| `file_url` | string | Public or pre-signed URL that Pulse will download and extract. |

### Extraction Options

| Field               | Type          | Default   | Description                                                                                                                                                                                                                                                                              |
| ------------------- | ------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model`             | string (enum) | `default` | Extraction model to use. One of `default` or `pulse-ultra-2`. `pulse-ultra-2` uses Pulse's vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes.                                                                                       |
| `pages`             | string        | -         | Page range filter (1-indexed). Supports segments like `1-2` or mixed ranges like `1-2,5`. Page 1 is the first page.                                                                                                                                                                      |
| `figure_processing` | object        | -         | Settings that control how figures in the document are processed. These affect the **markdown output directly** and do not produce additional output fields. See [Figure Processing](#figure-processing).                                                                                 |
| `extensions`        | object        | -         | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under `response.extensions.*`. See [Extensions](#extensions).                                                                                             |
| `spreadsheet`       | object        | -         | Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, sheets, and the automatic trimming of empty trailing rows/columns past the last data-bearing cell. Applies to `.xlsx`, `.xlsm`, and `.xls` files. See [Spreadsheet Options](#spreadsheet-options). |
| `storage`           | object        | -         | Options for persisting extraction artifacts. See [Storage Options](#storage-options).                                                                                                                                                                                                    |
| `async`             | boolean       | `false`   | If `true`, returns immediately with a `job_id` for polling via `GET /job/{jobId}`.                                                                                                                                                                                                       |
| `structured_output` | object        | -         | **⚠️ Deprecated** — Use the [`/schema`](/api-reference/endpoint/schema) endpoint after extraction instead. Still works for backward compatibility.                                                                                                                                       |

### Figure Processing

Settings under `figure_processing` control how figures (images, charts, diagrams) and embedded visuals are processed. Applies to both PDFs/images (figures detected from layout) and spreadsheets (charts and embedded images read directly from the workbook). Affects the markdown output and the `bounding_boxes.Images[]` array.

| Field                           | Type    | Default | Description                                                                                                                                                                                                                                                                                |
| ------------------------------- | ------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `figure_processing.description` | boolean | `false` | Generate descriptive captions for extracted visuals. Captions appear under `bounding_boxes.Images[].description` and inline in the markdown output. Applies to both detected charts and non-chart images.                                                                                  |
| `figure_processing.show_images` | boolean | `false` | Return image URLs for extracted visuals. URLs appear under `bounding_boxes.Images[].image_url` and resolve to a Pulse-hosted PNG/JPEG served from [`GET /results/{jobId}/images/{filename}`](/api-reference/endpoint/results-image). Applies to both detected charts and non-chart images. |

<Note>
  For spreadsheets specifically, `show_images: true` collects every embedded chart and image in the workbook and emits one entry per visual under `bounding_boxes.Images`, with chart-specific fields like `chart_type`, `chart_title`, and `source_ranges` populated. See [Bounding Boxes](/api-reference/bounding-boxes#images-array) for the full field list.
</Note>

### Spreadsheet Options

Settings under `spreadsheet` control how Excel workbooks (`.xlsx`, `.xlsm`, `.xls`) are processed. By default, hidden rows, columns, and sheets are excluded from extraction output, and cell values are rendered the way Excel displays them. Phantom-cell trimming is opt-in.

| Field                               | Type    | Default | Description                                                                                                                                                                                                                                                                                                                                                                                                          |
| ----------------------------------- | ------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `spreadsheet.include_hidden_rows`   | boolean | `false` | Include rows that are hidden in the Excel workbook.                                                                                                                                                                                                                                                                                                                                                                  |
| `spreadsheet.include_hidden_cols`   | boolean | `false` | Include columns that are hidden in the Excel workbook.                                                                                                                                                                                                                                                                                                                                                               |
| `spreadsheet.include_hidden_sheets` | boolean | `false` | Include sheets that are hidden in the Excel workbook.                                                                                                                                                                                                                                                                                                                                                                |
| `spreadsheet.use_raw_values`        | boolean | `false` | Emit the underlying numeric value for number cells instead of the Excel display-formatted text — e.g. `1201.67` rather than `$1,202` when the cell uses a rounded currency format. Useful when downstream processing needs exact amounts (cent-level precision) rather than what the workbook shows visually. Percent-formatted cells and dates keep their display rendering. Does not apply to legacy `.xls` files. |
| `spreadsheet.only_data_rows`        | boolean | `false` | When `true`, trim trailing empty rows past the last cell carrying a value or formula. See [Phantom-cell trimming](#phantom-cell-trimming-only_data_rows--only_data_cols) below.                                                                                                                                                                                                                                      |
| `spreadsheet.only_data_cols`        | boolean | `false` | When `true`, trim trailing empty columns past the last cell carrying a value or formula. Same rationale as `only_data_rows`.                                                                                                                                                                                                                                                                                         |

<Note>
  These settings accept both camelCase (`includeHiddenRows`, `onlyDataRows`) and snake\_case (`include_hidden_rows`, `only_data_rows`) formats.
</Note>

#### Phantom-cell trimming (`only_data_rows` / `only_data_cols`)

Excel files exported from claims systems, ERPs, and other automated pipelines routinely declare a "used range" that extends hundreds of thousands of rows past where the data actually ends. A typical case: a 57 MB workbook with only \~500 rows of real data, where the other \~1,000,000 rows are empty cells that exist only because they were once selected and styled. These phantom cells inflate file size by orders of magnitude and can exhaust parser memory on the extraction pipeline.

Set `only_data_rows: true` and `only_data_cols: true` to have Pulse scan each sheet once before parsing, find the largest row and column containing a value or formula, and ignore everything beyond that extent. Surviving cells keep their **original A1 coordinates** (e.g., a value at `B7` in the source is still `B7` in the output), so any citation or bounding box that references a specific cell remains stable. The trim only kicks in on large sheets (≥5 MB of XML per sheet), so small, well-formed workbooks pay no overhead either way.

Both flags default to `false`.

### Pulse Ultra 2 Options

These options are available only when `model: pulse-ultra-2` is set. Passing any of them with the default model returns a 400 error listing the offending fields.

| Field                       | Type    | Default | Description                                                                                                                                                                                                                |
| --------------------------- | ------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `refine`                    | boolean | `false` | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds \~1–2s per page. Overridden by `refine_options` if both are provided. |
| `refine_options`            | object  | -       | Granular refinement targets. Takes precedence over the boolean `refine` flag. See below.                                                                                                                                   |
| `refine_options.tables`     | boolean | `false` | Fix table cell values, structure, and headers against the source image.                                                                                                                                                    |
| `refine_options.text`       | boolean | `false` | Fix OCR errors, missing or extra content, and numerical accuracy (tables untouched).                                                                                                                                       |
| `refine_options.formatting` | boolean | `false` | Add strikethrough, italic, bold, super/subscript, and LaTeX formatting (tables untouched).                                                                                                                                 |
| `extract_figure`            | boolean | `false` | Convert charts and data visualizations into HTML `<table>` blocks, wrapped in `<figure-table>` tags. Useful for financial decks, dashboards, and scientific charts.                                                        |
| `figure_description`        | boolean | `false` | Generate a 1–2 paragraph natural-language description of each picture, wrapped in `<figure-description>` tags. Combines well with `extract_figure`.                                                                        |
| `additional_prompt`         | string  | `""`    | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Max 4000 characters.                                                                               |
| `custom_image_prompt`       | string  | `""`    | Extra context appended to the prompt used by `figure_description` and `extract_figure`. Tunes image and chart interpretation. Max 2000 characters.                                                                         |
| `custom_refine_prompt`      | string  | `""`    | Extra context appended to the refinement prompt. Only applies when `refine: true` or `refine_options` is set. Max 2000 characters.                                                                                         |

#### Markdown output additions

When `extract_figure` or `figure_description` is enabled, figures in `response.markdown` include additional tags:

```html theme={null}
<figure data-page="1">
  <figure-table>...HTML table for the chart...</figure-table>
  <figure-description>...1–2 paragraph description...</figure-description>
</figure>
```

When `refine` (or `refine_options`) is set, markdown content is post-processed page-by-page; output is cleaner but typically grows \~1.5–3x in size for dense documents. No new tags are introduced.

### Extensions

Settings under `extensions` enable additional processing passes or alternate output formats. Each enabled extension produces a **corresponding output field** under `response.extensions.*`. For example, enabling `extensions.chunking` produces `response.extensions.chunking`, and enabling `extensions.alt_outputs.return_html` produces `response.extensions.alt_outputs.html`.

| Field                                | Type      | Default | Description                                                                                                           |
| ------------------------------------ | --------- | ------- | --------------------------------------------------------------------------------------------------------------------- |
| `extensions.footnote_references`     | boolean   | `false` | Link footnote markers to their corresponding footnote text.                                                           |
| `extensions.chunking`                | object    | -       | Chunking configuration. See below.                                                                                    |
| `extensions.chunking.chunk_types`    | string\[] | -       | List of chunking strategies: `semantic`, `header`, `page`, `recursive`.                                               |
| `extensions.chunking.chunk_size`     | integer   | -       | Maximum characters per chunk.                                                                                         |
| `extensions.alt_outputs`             | object    | -       | Alternate output formats. See below.                                                                                  |
| `extensions.alt_outputs.wlbb`        | boolean   | `false` | Enable word-level bounding boxes (PDF only). Results in `response.extensions.alt_outputs.wlbb`.                       |
| `extensions.alt_outputs.return_html` | boolean   | `false` | Include HTML representation. `response.markdown` is still present; HTML is at `response.extensions.alt_outputs.html`. |
| `extensions.alt_outputs.return_xml`  | boolean   | `false` | Include XML representation (work in progress).                                                                        |

### `pulse-ultra-2` Rate Limits

Requests made with `model: pulse-ultra-2` are subject to dedicated rate limits, separate from standard extraction:

| Limit      | Value          |
| ---------- | -------------- |
| Per minute | 5 extractions  |
| Per hour   | 20 extractions |
| File size  | 50 MB          |
| Concurrent | 2 per API key  |

The concurrent limit is the one that most commonly applies in practice — long-running extractions held open while new requests arrive will trip it first.

### Storage Options

Control whether extractions are saved to your extraction library:

| Field                 | Type          | Default | Description                                                                           |
| --------------------- | ------------- | ------- | ------------------------------------------------------------------------------------- |
| `storage.enabled`     | boolean       | `true`  | Whether to persist extraction artifacts. Set to `false` for temporary extractions.    |
| `storage.folder_name` | string        | -       | Target folder name to save the extraction to. Creates the folder if it doesn't exist. |
| `storage.folder_id`   | string (uuid) | -       | Target folder ID to save the extraction to. Takes precedence over `folder_name`.      |

### Deprecated Fields

The following input fields are deprecated and will be removed in a future version. They are still accepted for backward compatibility.

| Field               | Replacement                                                                                                                                                         |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `show_images`       | Use `figure_processing.show_images`                                                                                                                                 |
| `chunking`          | Use `extensions.chunking.chunk_types` (array instead of comma-separated string)                                                                                     |
| `chunk_size`        | Use `extensions.chunking.chunk_size`                                                                                                                                |
| `return_html`       | Use `extensions.alt_outputs.return_html`                                                                                                                            |
| `structured_output` | Use [`/schema`](/api-reference/endpoint/schema) endpoint after extraction. Pass `extraction_id` + `schema_config`. Accepts `schema`, `schema_prompt`, and `effort`. |
| `schema`            | Use [`/schema`](/api-reference/endpoint/schema) endpoint after extraction                                                                                           |
| `schema_prompt`     | Use [`/schema`](/api-reference/endpoint/schema) endpoint with `schema_config.schema_prompt`                                                                         |
| `custom_prompt`     | No replacement                                                                                                                                                      |
| `thinking`          | No replacement                                                                                                                                                      |

<Note>
  When legacy input fields are used, the API returns a deprecation warning in the `warnings` array directing you to the updated field names. See the [latest documentation](https://docs.runpulse.com/api-reference/endpoint/extract) for details.
</Note>

## Response

The response structure varies based on document size to optimize for different use cases.

### Standard Response (Under 70 Pages)

For documents under 70 pages, results are returned directly in the response body:

```json theme={null}
{
  "markdown": "# Document Title\n\nExtracted content...",
  "page_count": 15,
  "extraction_id": "abc123-def456-ghi789",
  "extraction_url": "https://platform.runpulse.com/dashboard/extractions/abc123",
  "credits_used": 1.0,
  "plan_info": {
    "tier": "growth",
    "pages_used": 15,
    "total_credits_used": 49.5,
    "note": "Pulse Ultra"
  },
  "bounding_boxes": {
    "Title": [],
    "Text": [],
    "Tables": [],
    "Images": [
      {
        "id": "excel_image_1_1",
        "visual_type": "chart",
        "image_url": "https://api.runpulse.com/results/abc123-def456-ghi789/images/excel_image_1_1.png",
        "chart_type": "BarChart",
        "chart_title": "Revenue",
        "excel_range": "D2",
        "sheet_name": "Charts"
      }
    ],
    "markdown_with_ids": "<p data-bb-text-id=\"txt-1\">..."
  },
  "extensions": {
    "chunking": {
      "semantic": ["chunk 1...", "chunk 2..."],
      "header": ["section 1...", "section 2..."]
    },
    "altOutputs": {
      "html": "<html>...</html>"
    }
  },
  "warnings": []
}
```

#### Response Fields

| Field                           | Type          | Description                                                                                                                                                                                                                     |
| ------------------------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `markdown`                      | string        | Clean markdown content extracted from the document. Always present.                                                                                                                                                             |
| `page_count`                    | integer       | Total number of pages processed.                                                                                                                                                                                                |
| `extraction_id`                 | string (uuid) | Persisted extraction ID. Present when storage is enabled (default). Use with `/split` and `/schema`.                                                                                                                            |
| `extraction_url`                | string        | URL to view the extraction in the Pulse Platform. Present when storage is enabled.                                                                                                                                              |
| `credits_used`                  | number        | Credits consumed by **this request**. Only present when the org has the credit billing system enabled.                                                                                                                          |
| `plan_info`                     | object        | Billing tier and **cumulative** usage information for the calling org, including this request. Includes `tier`, `total_credits_used` (primary billing metric), `pages_used` (legacy), and an optional `note`.                   |
| `bounding_boxes`                | object        | Typed bounding-box data — `Images`, `Tables`, `Text`, `Title`, `Footer`, plus `markdown_with_ids`. See [Bounding Boxes](/api-reference/bounding-boxes) for the full field list including the chart/image fields under `Images`. |
| `extensions`                    | object        | Output from enabled extensions. Only keys for enabled extensions are present. See below.                                                                                                                                        |
| `extensions.chunking`           | object        | Chunk results by strategy (when `extensions.chunking` is enabled).                                                                                                                                                              |
| `extensions.footnoteReferences` | array         | List of detected footnotes with their in-text references (when `extensions.footnote_references` is enabled). See [Footnote References](#footnote-references) below.                                                             |
| `extensions.altOutputs.wlbb`    | object        | Word-level bounding boxes (when `extensions.alt_outputs.wlbb` is enabled).                                                                                                                                                      |
| `extensions.altOutputs.html`    | string        | HTML representation (when `extensions.alt_outputs.return_html` is enabled).                                                                                                                                                     |
| `extensions.altOutputs.xml`     | string        | XML representation (when `extensions.alt_outputs.return_xml` is enabled, WIP).                                                                                                                                                  |
| `warnings`                      | array         | Non-fatal warnings generated during extraction, including deprecation notices for legacy input usage.                                                                                                                           |

#### Deprecated Response Fields

| Field               | Replacement                                     | Description                                                       |
| ------------------- | ----------------------------------------------- | ----------------------------------------------------------------- |
| `html`              | `extensions.altOutputs.html`                    | Present when legacy `return_html` input is used.                  |
| `chunks`            | `extensions.chunking`                           | Present when legacy `chunking` input is used.                     |
| `plan-info`         | `plan_info`                                     | Present when only legacy inputs are used.                         |
| `structured_output` | Use [`/schema`](/api-reference/endpoint/schema) | Present when deprecated `structured_output` input was used.       |
| `input_schema`      | Use [`/schema`](/api-reference/endpoint/schema) | Echo of the applied schema (deprecated path only).                |
| `schema_error`      | Use [`/schema`](/api-reference/endpoint/schema) | Error message if schema processing failed (deprecated path only). |

### Large Document Response (70+ Pages)

For documents with 70 or more pages — or any response payload above the 5 MB inline threshold — the API returns a one-time download link to `/large_results/{job_id}` instead of inlining the payload. This prevents timeout issues and keeps the immediate response small.

```json theme={null}
{
  "is_url": true,
  "url": "https://api.runpulse.com/large_results/abc123-def456-ghi789",
  "plan_info": {
    "tier": "growth",
    "pages_used": 150,
    "total_credits_used": 312.0
  }
}
```

#### Large Document Response Fields

| Field       | Type    | Description                                                                                                                                                                                                                                                                                                                             |
| ----------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `is_url`    | boolean | Always `true` for large document responses. Use this to detect URL-based responses.                                                                                                                                                                                                                                                     |
| `url`       | string  | One-time download link of the form `https://api.runpulse.com/large_results/{job_id}`. The link streams the complete extraction result the first time it is fetched and is then invalidated (subsequent reads return `410 Gone`). It also expires 1 hour after the job completes. Authenticate the request with your `x-api-key` header. |
| `plan_info` | object  | Billing information including pages used and plan tier.                                                                                                                                                                                                                                                                                 |

<Warning>
  `/large_results/{job_id}` links are **single-use** and **expire 1 hour** after the job completes. Download and persist the payload immediately — do not pass the URL through queues or share it across workers.
</Warning>

#### Handling Large Document Responses

<CodeGroup>
  ```python Python theme={null}
  import requests
  from pulse import Pulse

  API_KEY = "YOUR_API_KEY"
  client = Pulse(api_key=API_KEY)

  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
  )

  if hasattr(response, "is_url") and response.is_url:
      full_result = requests.get(
          response.url,
          headers={"x-api-key": API_KEY},
      ).json()
      print(full_result["markdown"])
  else:
      print(response.markdown)
  ```

  ```typescript TypeScript theme={null}
  import { PulseClient } from 'pulse-ts-sdk';

  const API_KEY = "YOUR_API_KEY";
  const client = new PulseClient({ apiKey: API_KEY });

  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf"
  });

  if ((response as any).is_url) {
      const fullResult = await fetch((response as any).url, {
          headers: { "x-api-key": API_KEY },
      }).then(r => r.json());
      console.log(fullResult.markdown);
  } else {
      console.log(response.markdown);
  }
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@large_document.pdf"

  # Response: {"is_url": true, "url": "https://api.runpulse.com/large_results/abc123-..."}

  # Fetch the result once (single-use link, valid for 1 hour after job completion)
  curl -H "x-api-key: YOUR_API_KEY" \
    "https://api.runpulse.com/large_results/abc123-..."
  ```
</CodeGroup>

<Note>
  Because `/large_results/{job_id}` is one-time use, persist the result to your own storage on first download. If you need to access the result later, enable `storage.enabled` and retrieve it from your extraction library on the Pulse Platform.
</Note>

## Example Usage

### Basic Extraction

<CodeGroup>
  ```python Python theme={null}
  from pulse import Pulse
  from pulse.types import (
      ExtractRequestFigureProcessing,
      ExtractRequestExtensions,
      ExtractRequestExtensionsAltOutputs,
  )

  client = Pulse(api_key="YOUR_API_KEY")

  # Extract from URL with figure processing and HTML output
  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
      figure_processing=ExtractRequestFigureProcessing(
          description=True,
      ),
      extensions=ExtractRequestExtensions(
          alt_outputs=ExtractRequestExtensionsAltOutputs(
              return_html=True,
          ),
      ),
  )

  print(f"Markdown: {response.markdown}")
  print(f"HTML: {response.extensions.alt_outputs.html}")
  print(f"Extraction ID: {response.extraction_id}")
  ```

  ```typescript TypeScript theme={null}
  import { PulseClient } from 'pulse-ts-sdk';

  const client = new PulseClient({ apiKey: "YOUR_API_KEY" });

  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf",
      figureProcessing: { description: true },
      extensions: { altOutputs: { returnHtml: true } }
  });

  console.log(`Markdown: ${response.markdown}`);
  console.log(`HTML: ${response.extensions?.altOutputs?.html}`);
  console.log(`Extraction ID: ${response.extraction_id}`);
  ```

  ```bash curl theme={null}
  # Extract from URL with figure processing
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "file_url": "https://www.impact-bank.com/user/file/dummy_statement.pdf",
      "figureProcessing": {"description": true},
      "extensions": {"altOutputs": {"returnHtml": true}}
    }'
  ```
</CodeGroup>

### File Upload

<CodeGroup>
  ```python Python theme={null}
  from pulse.types import ExtractRequestFigureProcessing

  # Upload and extract a local file
  with open("document.pdf", "rb") as f:
      response = client.extract(
          file=f,
      figure_processing=ExtractRequestFigureProcessing(
          description=True,
      ),
      )
  ```

  ```typescript TypeScript theme={null}
  import * as fs from 'fs';

  const fileBuffer = fs.readFileSync("document.pdf");
  const blob = new Blob([fileBuffer], { type: 'application/pdf' });

  const response = await client.extract({
      file: blob,
  });
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@document.pdf"
  ```
</CodeGroup>

### Structured Data (Extract → Schema)

<Warning>
  The `structured_output` parameter on `/extract` is **deprecated**. Use the [`/schema`](/api-reference/endpoint/schema) endpoint after extraction instead. This gives you better control, re-runnability, and support for split-mode schemas.
</Warning>

**Recommended two-step approach:**

<CodeGroup>
  ```python Python theme={null}
  # Step 1: Extract the document
  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
  )

  extraction_id = response.extraction_id

  # Step 2: Apply schema separately
  schema_result = client.schema(
      extraction_id=extraction_id,
      schema_config={
          "input_schema": {
              "type": "object",
              "properties": {
                  "total": {"type": "number"},
                  "vendor": {"type": "string"}
              }
          },
          "schema_prompt": "Extract invoice total and vendor"
      }
  )

  print(schema_result.schema_output)
  ```

  ```typescript TypeScript theme={null}
  // Step 1: Extract the document
  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf"
  });

  const extractionId = response.extraction_id;

  // Step 2: Apply schema separately
  const schemaResult = await client.schema({
      extraction_id: extractionId,
      schema_config: {
          input_schema: {
              type: "object",
              properties: {
                  total: { type: "number" },
                  vendor: { type: "string" }
              }
          },
          schema_prompt: "Extract invoice total and vendor"
      }
  });

  console.log(schemaResult.schema_output);
  ```

  ```bash curl theme={null}
  # Step 1: Extract the document
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@invoice.pdf"

  # Response includes extraction_id: "abc123-..."

  # Step 2: Apply schema
  curl -X POST https://api.runpulse.com/schema \
    -H "x-api-key: YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "extraction_id": "abc123-...",
      "schema_config": {
        "input_schema": {"type": "object", "properties": {"total": {"type": "number"}, "vendor": {"type": "string"}}},
        "schema_prompt": "Extract invoice total and vendor"
      }
    }'
  ```
</CodeGroup>

### Page Range and Chunking

<CodeGroup>
  ```python Python theme={null}
  from pulse.types import (
      ExtractRequestExtensions,
      ExtractRequestExtensionsChunking,
  )

  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
      pages="1-5,10",  # 1-indexed
      extensions=ExtractRequestExtensions(
          chunking=ExtractRequestExtensionsChunking(
              chunk_types=["semantic", "page"],
              chunk_size=1000,
          ),
      ),
  )

  # Chunk data is in extensions.chunking
  print(response.extensions.chunking.semantic)
  print(response.extensions.chunking.page)
  ```

  ```typescript TypeScript theme={null}
  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf",
      pages: "1-5,10",  // 1-indexed
      extensions: {
          chunking: {
              chunkTypes: ["semantic", "page"],
              chunkSize: 1000
          }
      }
  });

  // Chunk data is in extensions.chunking
  console.log(response.extensions?.chunking?.semantic);
  console.log(response.extensions?.chunking?.page);
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@document.pdf" \
    -F "pages=1-5,10" \
    -F 'extensions={"chunking": {"chunkTypes": ["semantic", "page"], "chunkSize": 1000}}'
  ```
</CodeGroup>

### Footnote References

Enable `extensions.footnote_references` to detect footnote markers (e.g. `*`, `†`, `1`) in body text and link them to the footnote explanation paragraphs at the bottom of the page. Each result item includes the marker symbol, the bounding-box text ID of the footnote, and the bounding-box text IDs of all body-text paragraphs that reference it.

<CodeGroup>
  ```python Python theme={null}
  from pulse.types import ExtractRequestExtensions

  response = client.extract(
      file_url="https://example.com/research-paper.pdf",
      extensions=ExtractRequestExtensions(
          footnote_references=True,
      ),
  )

  # Footnote links are in extensions.footnote_references
  for ref in response.extensions.footnote_references:
      print(f"Marker: {ref.symbol}")
      print(f"  Footnote: {ref.footnote_text_id}")
      print(f"  Referenced by: {ref.reference_text_ids}")
  ```

  ```typescript TypeScript theme={null}
  const response = await client.extract({
      fileUrl: "https://example.com/research-paper.pdf",
      extensions: {
          footnoteReferences: true
      }
  });

  // Footnote links are in extensions.footnoteReferences
  for (const ref of response.extensions?.footnoteReferences ?? []) {
      console.log(`Marker: ${ref.symbol}`);
      console.log(`  Footnote: ${ref.footnoteTextId}`);
      console.log(`  Referenced by: ${ref.referenceTextIds}`);
  }
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@research-paper.pdf" \
    -F 'extensions={"footnoteReferences": true}'
  ```
</CodeGroup>

#### Example Response

```json theme={null}
{
  "markdown": "...",
  "bounding_boxes": { ... },
  "extensions": {
    "footnoteReferences": [
      {
        "symbol": "*",
        "footnoteTextId": "txt-11",
        "referenceTextIds": ["txt-4", "txt-5", "txt-6", "txt-7", "txt-8"]
      },
      {
        "symbol": "†",
        "footnoteTextId": "txt-12",
        "referenceTextIds": ["txt-8"]
      },
      {
        "symbol": "4",
        "footnoteTextId": "txt-48",
        "referenceTextIds": ["txt-45"]
      }
    ]
  }
}
```

#### Footnote Reference Fields

| Field              | Type      | Description                                                                                                                                                                       |
| ------------------ | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `symbol`           | string    | The footnote marker symbol as detected in the document (e.g. `*`, `†`, `‡`, `1`, `#`).                                                                                            |
| `footnoteTextId`   | string    | The bounding-box text ID (e.g. `txt-11`) of the footnote explanation paragraph. Cross-reference with `bounding_boxes.Footer` to get the footnote's content and position.          |
| `referenceTextIds` | string\[] | Bounding-box text IDs of body-text paragraphs that contain a reference to this footnote. Cross-reference with `bounding_boxes.Text` to get each paragraph's content and position. |

<Info>
  Footnote reference detection uses Azure Document Intelligence for paragraph classification, supplemented by PyMuPDF native text extraction for accurate symbol identification. This handles common OCR confusion between visually similar symbols like `†`/`+` and `‡`/`#`. Supported marker types include numbered (`1`, `2`, `3`), symbolic (`*`, `†`, `‡`, `§`, `#`), and lettered (`a`, `b`, `c`) footnotes.
</Info>

### Excel Spreadsheet Options

<CodeGroup>
  ```python Python theme={null}
  from pulse import Pulse
  from pulse.types import ExtractRequestSpreadsheet

  client = Pulse(api_key="YOUR_API_KEY")

  # Extract from Excel with hidden content included
  response = client.extract(
      file=open("financials.xlsx", "rb"),
      spreadsheet=ExtractRequestSpreadsheet(
          include_hidden_rows=True,
          include_hidden_cols=True,
          include_hidden_sheets=False,
      ),
  )

  print(response.markdown)
  ```

  ```typescript TypeScript theme={null}
  import { PulseClient } from 'pulse-ts-sdk';

  const client = new PulseClient({
      headers: { 'x-api-key': 'YOUR_API_KEY' }
  });

  const response = await client.extract({
      file: fs.createReadStream("financials.xlsx"),
      spreadsheet: {
          includeHiddenRows: true,
          includeHiddenCols: true,
          includeHiddenSheets: false
      }
  });

  console.log(response.markdown);
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@financials.xlsx" \
    -F 'spreadsheet={"includeHiddenRows": true, "includeHiddenCols": true, "includeHiddenSheets": false}'
  ```
</CodeGroup>

<Note>
  Workbooks exported from claims systems, ERPs, and other automated pipelines often declare a "used range" that extends hundreds of thousands of rows past where the data actually ends. Set `spreadsheet.only_data_rows: true` and `spreadsheet.only_data_cols: true` to have Pulse trim those trailing empty "phantom" rows and columns before parsing. Surviving cells keep their original A1 coordinates, so any citation or bounding box that references a specific cell remains stable. Both flags default to `false`. See [Spreadsheet Options](/api-reference/extract-options#spreadsheet-options) for the full reference.
</Note>

### Excel Charts and Embedded Images

When you set `figure_processing.show_images: true` on an Excel workbook, every embedded chart and image is collected from the workbook directly and returned under `bounding_boxes.Images[]`. Each entry carries a Pulse-hosted `image_url` you can fetch via [`results.getImage`](/api-reference/endpoint/results-image) (or any HTTP client with your API key) to get the raw PNG/JPEG bytes.

<CodeGroup>
  ```python Python theme={null}
  import re
  from pulse import Pulse
  from pulse.types import ExtractRequestFigureProcessing

  client = Pulse(api_key="YOUR_API_KEY")

  # 1) Extract the workbook with show_images enabled.
  response = client.extract(
      file=open("financials.xlsx", "rb"),
      figure_processing=ExtractRequestFigureProcessing(
          show_images=True,
          description=False,
      ),
  )

  # 2) Walk the typed Images array.
  for img in response.bounding_boxes.images or []:
      print(f"{img.id}: {img.visual_type} '{img.chart_title}' @ {img.excel_range}")
      print(f"    url: {img.image_url}")

  # 3) Fetch the bytes for one chart.
  img = response.bounding_boxes.images[0]
  m = re.search(r"/results/([^/]+)/images/([^/?#]+)", img.image_url)
  job_id, filename = m.group(1), m.group(2)

  chunks = list(client.results.get_image(job_id=job_id, filename=filename))
  with open("chart.png", "wb") as f:
      f.write(b"".join(chunks))
  ```

  ```typescript TypeScript theme={null}
  import { PulseClient } from "pulse-ts-sdk";
  import * as fs from "node:fs";

  const client = new PulseClient({ apiKey: "YOUR_API_KEY" });

  // 1) Extract the workbook with show_images enabled.
  const response = await client.extract({
      file: fs.createReadStream("financials.xlsx"),
      figureProcessing: { showImages: true, description: false },
  });

  // 2) Walk the typed Images array.
  for (const img of response.boundingBoxes?.Images ?? []) {
      console.log(
          `${img.id}: ${img.visualType} '${img.chartTitle}' @ ${img.excelRange}`,
      );
      console.log(`    url: ${img.imageUrl}`);
  }

  // 3) Fetch the bytes for one chart.
  const url = response.boundingBoxes?.Images?.[0]?.imageUrl;
  const m = url?.match(/\/results\/([^/]+)\/images\/([^/?#]+)/);
  const [, jobId, filename] = m!;
  const image = await client.results.getImage({ jobId, filename });
  // Persist `image` per your runtime (e.g. `await image.bytes()`).
  ```

  ```bash curl theme={null}
  # Step 1: extract and capture an image_url from the response.
  curl -sS -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@financials.xlsx" \
    -F 'figure_processing={"show_images": true}' \
    | jq -r '.bounding_boxes.Images[0].image_url'

  # Step 2: fetch the PNG bytes.
  curl -sS -X GET "https://api.runpulse.com/results/$JOB_ID/images/excel_image_1_1.png" \
    -H "x-api-key: YOUR_API_KEY" \
    -o chart.png
  ```
</CodeGroup>

#### Example `bounding_boxes.Images` Entry

```json theme={null}
{
  "id": "excel_image_1_1",
  "visual_type": "chart",
  "page_number": 1,
  "bounding_box": [],
  "image_url": "https://api.runpulse.com/results/13e3e75f-.../images/excel_image_1_1.png",
  "sheet_name": "Charts",
  "excel_range": "D2",
  "chart_type": "BarChart",
  "chart_title": "Revenue",
  "source_ranges": ["'Charts'!$A$2:$A$5", "'Charts'!$B$2:$B$5"],
  "description": "Bar chart showing revenue by quarter."
}
```

See [Bounding Boxes — Images Array](/api-reference/bounding-boxes#images-array) for the full field reference and [Get Result Image](/api-reference/endpoint/results-image) for the auth requirement on `image_url`.

### Disable Storage

<CodeGroup>
  ```python Python theme={null}
  response = client.extract(
      file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
      storage={"enabled": False}
  )
  ```

  ```typescript TypeScript theme={null}
  const response = await client.extract({
      fileUrl: "https://www.impact-bank.com/user/file/dummy_statement.pdf",
      storage: { enabled: false }
  });
  ```

  ```bash curl theme={null}
  curl -X POST https://api.runpulse.com/extract \
    -H "x-api-key: YOUR_API_KEY" \
    -F "file=@document.pdf" \
    -F 'storage={"enabled": false}'
  ```
</CodeGroup>


## OpenAPI

````yaml POST /extract
openapi: 3.1.0
info:
  title: Pulse API
  description: >-
    Production-grade document extraction service that transforms complex
    documents  into structured, AI-ready data. This specification is the single
    source of truth  for the Pulse extraction APIs.
  version: 1.0.0
  contact:
    name: Pulse Support
    email: support@trypulse.ai
    url: https://docs.runpulse.com
servers:
  - url: https://api.runpulse.com
    description: Production server
security:
  - ApiKeyAuth: []
paths:
  /extract:
    post:
      tags:
        - Extract
      summary: Extract Document
      description: >
        The primary endpoint for the Pulse API. Parses uploaded documents or
        remote 

        file URLs and returns rich markdown content with optional structured
        data 

        extraction based on user-provided schemas and extraction options.


        Set `async: true` to return immediately with a job ID for polling via

        `GET /job/{jobId}`. Otherwise processes synchronously.


        Both sync and async modes return HTTP 200. When `async` is true the
        response

        body contains `{ job_id, status, message }` instead of the full
        extraction result.
      operationId: extractDocument
      requestBody:
        required: true
        content:
          multipart/form-data:
            schema:
              $ref: '#/components/schemas/ExtractMultipartInput'
          application/json:
            schema:
              $ref: '#/components/schemas/ExtractJsonInput'
      responses:
        '200':
          description: |
            When `async=false` (default): full extraction result with markdown,
            bounding boxes, chunks, etc.
            When `async=true`: job submission acknowledgement with `job_id`.
          content:
            application/json:
              schema:
                oneOf:
                  - $ref: '#/components/schemas/ExtractResponse'
                  - $ref: '#/components/schemas/AsyncSubmissionResponse'
        '400':
          $ref: '#/components/responses/BadRequest'
        '401':
          $ref: '#/components/responses/Unauthorized'
        '429':
          $ref: '#/components/responses/TooManyRequests'
        '500':
          $ref: '#/components/responses/InternalServerError'
      x-codeSamples:
        - lang: python
          label: Python SDK
          source: |
            from pulse import Pulse
            from pulse.types import (
                ExtractRequestFigureProcessing,
                ExtractRequestExtensions,
                ExtractRequestExtensionsChunking,
                ExtractRequestExtensionsAltOutputs,
            )

            client = Pulse(api_key="YOUR_API_KEY")

            # Basic extraction from URL
            response = client.extract(
                file_url="https://example.com/document.pdf"
            )
            print(response.markdown)
            print(response.extraction_id)

            # With figure processing and extensions
            response = client.extract(
                file_url="https://example.com/document.pdf",
                figure_processing=ExtractRequestFigureProcessing(
                    description=True,
                ),
                extensions=ExtractRequestExtensions(
                    chunking=ExtractRequestExtensionsChunking(
                        chunk_types=["semantic"],
                        chunk_size=1000,
                    ),
                    alt_outputs=ExtractRequestExtensionsAltOutputs(
                        return_html=True,
                    ),
                ),
            )
            print(response.extensions.chunking)
            print(response.extensions.alt_outputs.html)

            # Async extraction
            response = client.extract(
                file_url="https://example.com/document.pdf",
                async_=True
            )
            print(response.job_id)  # poll via client.jobs.get_job(job_id=...)
        - lang: typescript
          label: TypeScript SDK
          source: |
            import { PulseClient } from "pulse-ts-sdk";

            const client = new PulseClient({
                apiKey: "YOUR_API_KEY"
            });

            // Basic extraction from URL
            const response = await client.extract({
                fileUrl: "https://example.com/document.pdf"
            });
            console.log(response.markdown);
            console.log(response.extraction_id);

            // With figure processing and extensions
            const extResp = await client.extract({
                fileUrl: "https://example.com/document.pdf",
                figureProcessing: { description: true },
                extensions: {
                    chunking: { chunkTypes: ["semantic"], chunkSize: 1000 },
                    altOutputs: { returnHtml: true }
                }
            });
            console.log(extResp.extensions?.chunking);
            console.log(extResp.extensions?.altOutputs?.html);

            // Async extraction
            const asyncResp = await client.extract({
                fileUrl: "https://example.com/document.pdf",
                async: true
            });
            console.log(asyncResp.job_id); // poll via client.jobs.getJob(...)
        - lang: bash
          label: curl
          source: |
            # Basic extraction
            curl -X POST https://api.runpulse.com/extract \
              -H "x-api-key: YOUR_API_KEY" \
              -F "file=@document.pdf"

            # With figure processing and extensions
            curl -X POST https://api.runpulse.com/extract \
              -H "x-api-key: YOUR_API_KEY" \
              -F "file=@document.pdf" \
              -F 'figure_processing={"description": true}' \
              -F 'extensions={"chunking": {"chunkTypes": ["semantic"], "chunkSize": 1000}, "altOutputs": {"returnHtml": true}}'

            # Async extraction
            curl -X POST https://api.runpulse.com/extract \
              -H "x-api-key: YOUR_API_KEY" \
              -F "file=@document.pdf" \
              -F "async=true"
components:
  schemas:
    ExtractMultipartInput:
      description: Input schema for multipart/form-data requests (file upload or file_url).
      allOf:
        - $ref: '#/components/schemas/ExtractSourceMultipart'
        - $ref: '#/components/schemas/ExtractOptions'
    ExtractJsonInput:
      description: Input schema for JSON requests (file_url only).
      allOf:
        - $ref: '#/components/schemas/ExtractSourceJson'
        - $ref: '#/components/schemas/ExtractOptions'
    ExtractResponse:
      type: object
      description: >-
        Full extraction result returned by the synchronous `/extract` endpoint.
        Contains the extracted markdown, optional extensions output, bounding
        boxes, and storage metadata.
      properties:
        markdown:
          type: string
          description: >-
            Primary markdown content extracted from the document. Always present
            in the new format.
        extensions:
          type: object
          description: >-
            Output from enabled extensions. Each key corresponds to an extension
            that was enabled in the request under `extensions.*`. Only keys for
            enabled extensions are present.
          properties:
            chunking:
              type: object
              description: >-
                Chunk results by strategy. Present when `extensions.chunking`
                was provided in the request.
              properties:
                semantic:
                  type: array
                  items:
                    type: string
                  description: Semantically-segmented chunks.
                header:
                  type: array
                  items:
                    type: string
                  description: Chunks split by document headers/headings.
                page:
                  type: array
                  items:
                    type: string
                  description: One chunk per page.
                recursive:
                  type: array
                  items:
                    type: string
                  description: Recursively-split chunks respecting size limits.
            merge_tables:
              type: object
              description: >-
                Merge tables result/metadata. Present when
                `extensions.merge_tables` was enabled.
              additionalProperties: true
            footnote_references:
              type: array
              description: >-
                List of detected footnotes with their in-text references.
                Present when `extensions.footnote_references` was enabled. Each
                item links a footnote paragraph to the body-text paragraphs that
                reference it, using bounding-box text IDs.
              items:
                type: object
                properties:
                  symbol:
                    type: string
                    description: The footnote marker symbol (e.g. "*", "†", "1", "#").
                  footnoteTextId:
                    type: string
                    description: >-
                      The bounding-box text ID (e.g. "txt-15") of the footnote
                      explanation paragraph.
                  referenceTextIds:
                    type: array
                    description: >-
                      Bounding-box text IDs of body-text paragraphs that contain
                      a reference to this footnote marker.
                    items:
                      type: string
            alt_outputs:
              type: object
              description: >-
                Alternate output formats. Each key corresponds to an enabled alt
                output.
              properties:
                wlbb:
                  type: object
                  description: >-
                    Word-level bounding box data. Present when
                    `extensions.alt_outputs.wlbb` was enabled and input is a
                    PDF.
                  properties:
                    words:
                      type: array
                      description: List of detected words with their positions.
                      items:
                        type: object
                        properties:
                          id:
                            type: string
                          text:
                            type: string
                          page_number:
                            type: integer
                            minimum: 1
                          bounding_box:
                            type: array
                            items:
                              type: number
                            minItems: 8
                            maxItems: 8
                          average_word_confidence:
                            type: number
                    error:
                      type: string
                      description: Error message if word-level extraction failed.
                html:
                  type: string
                  description: >-
                    HTML representation of the document. Present when
                    `extensions.alt_outputs.return_html` was enabled.
                xml:
                  type: string
                  description: >-
                    XML representation of the document. Present when
                    `extensions.alt_outputs.return_xml` was enabled. (WIP)
        bounding_boxes:
          allOf:
            - $ref: '#/components/schemas/BoundingBoxes'
          description: >-
            Positional bounding-box data for text, titles, headers, footers,
            images, and tables. `Images` carries chart/image visuals (with
            `image_url` when `figure_processing.show_images` is enabled),
            `Tables` the detected tables, and `Text`/`Title`/`Footer` the
            paragraph/title/footer regions. Additional keys (e.g.
            `markdown_with_ids`, `defined_names`) round-trip without being
            typed.
        extraction_id:
          type: string
          format: uuid
          description: >-
            Persisted extraction ID. Present when storage is enabled (default).
            Use with `/split` and `/schema` endpoints.
        extraction_url:
          type: string
          description: >-
            URL to view the extraction on the Pulse platform. Present when
            storage is enabled.
        page_count:
          type: integer
          minimum: 1
          description: Number of pages processed.
        plan_info:
          allOf:
            - $ref: '#/components/schemas/PlanInfo'
          description: >-
            Billing tier and cumulative usage information. Includes
            `total_credits_used` (primary billing metric) and `pages_used`
            (legacy compatibility).
        credits_used:
          type: number
          format: float
          nullable: true
          description: >-
            Credits consumed by **this request**. Only present when the
            organization has the credit billing system enabled.
        warnings:
          type: array
          items:
            type: string
          description: >-
            Non-fatal warnings generated during extraction. Includes deprecation
            notices when legacy input parameters are used, as well as processing
            warnings (e.g. word-level bounding box limitations).
        html:
          type: string
          deprecated: true
          description: >-
            **Deprecated** -- Use `extensions.alt_outputs.html` instead. Present
            when the legacy `return_html` input was used.
        chunks:
          type: object
          deprecated: true
          description: >-
            **Deprecated** -- Use `extensions.chunking` instead. Present when
            the legacy `chunking` input was used.
          properties:
            semantic:
              type: array
              items:
                type: string
            header:
              type: array
              items:
                type: string
            page:
              type: array
              items:
                type: string
            recursive:
              type: array
              items:
                type: string
        plan-info:
          type: object
          deprecated: true
          description: >-
            **Deprecated** -- Use `plan_info` (underscore) instead. Present when
            only legacy input parameters are used.
          properties:
            tier:
              type: string
            pages_used:
              type: integer
            note:
              type: string
        structured_output:
          type: object
          deprecated: true
          description: >-
            **Deprecated** -- Only present when the deprecated
            `structured_output` input parameter was used. Use the `/schema`
            endpoint instead.
          properties:
            values:
              type: object
              additionalProperties: true
            citations:
              type: object
              additionalProperties: true
        input_schema:
          type: object
          deprecated: true
          description: '**Deprecated** -- Echo of the schema that was applied.'
          additionalProperties: true
        schema_error:
          type: string
          deprecated: true
          description: '**Deprecated** -- Error message if schema processing failed.'
      additionalProperties: true
    AsyncSubmissionResponse:
      type: object
      description: >-
        Acknowledgement returned when a request is submitted for asynchronous
        processing. Poll GET /job/{job_id} to check status and retrieve results.
      required:
        - job_id
        - status
      properties:
        job_id:
          type: string
          description: Identifier assigned to the asynchronous job.
        status:
          type: string
          description: Initial status reported by the server.
          enum:
            - pending
            - processing
        message:
          type: string
          description: Human-readable description of the accepted job.
    ExtractSourceMultipart:
      type: object
      description: Document source definition for multipart/form-data requests.
      properties:
        file:
          type: string
          format: binary
          description: Document to upload directly. Required unless file_url is specified.
        file_url:
          type: string
          format: uri
          description: Public or pre-signed URL that Pulse will download and extract.
      required:
        - file
    ExtractOptions:
      type: object
      description: >-
        Common extraction options shared by synchronous and asynchronous
        endpoints.
      properties:
        model:
          type: string
          enum:
            - default
            - pulse-ultra-2
          description: >-
            Extraction model to use. `pulse-ultra-2` uses Pulse's
            vision-language model with built-in refinement, figure/chart
            extraction, and word-level bounding boxes. Omit or pass `default`
            for standard extraction.
        pages:
          type: string
          description: >-
            Page range filter (1-indexed, where page 1 is the first page).
            Supports segments such as `1-2` or mixed ranges like `1-2,5`.
          pattern: ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$
        figure_processing:
          type: object
          description: >-
            Settings that control how figures and embedded visuals are
            processed. Applies to both PDFs/images (figures detected from
            layout) and spreadsheets (charts and embedded images read directly
            from the workbook). Affects the markdown output and the
            `bounding_boxes.Images[]` array.
          properties:
            description:
              type: boolean
              default: false
              description: >-
                Generate descriptive captions for extracted visuals. When
                `true`, applies to both detected charts and non-chart images.
                Captions appear under `bounding_boxes.Images[].description`.
            show_images:
              type: boolean
              default: false
              description: >-
                Return image URLs for extracted visuals. When `true`, URLs
                appear under `bounding_boxes.Images[].image_url` (Pulse-hosted,
                served from `GET /results/{jobId}/images/{filename}`). Applies
                to both charts and non-chart images.
        spreadsheet:
          type: object
          description: >-
            Settings for Excel/spreadsheet extraction. Controls handling of
            hidden rows, columns, and sheets, whether numeric cells are rendered
            using their display format or underlying raw value, and optional
            trimming of empty phantom rows/columns past the last data-bearing
            cell. Applies to `.xlsx`, `.xlsm`, and `.xls` files. Accepts both
            camelCase and snake_case field names.
          properties:
            include_hidden_rows:
              type: boolean
              default: false
              description: Include rows that are hidden in the Excel workbook.
            include_hidden_cols:
              type: boolean
              default: false
              description: Include columns that are hidden in the Excel workbook.
            include_hidden_sheets:
              type: boolean
              default: false
              description: Include sheets that are hidden in the Excel workbook.
            use_raw_values:
              type: boolean
              default: false
              description: >-
                Emit the underlying numeric value for number cells instead of
                the Excel display-formatted text (e.g. `1201.67` rather than
                `$1,202` when the cell uses a rounded currency format).
                Percent-formatted cells and dates keep their display rendering.
                Does not apply to legacy `.xls` files.
            only_data_rows:
              type: boolean
              default: false
              description: >-
                When true, trim trailing empty rows past the last cell carrying
                a value or formula before parsing. Excel exports from claims
                systems and ERPs routinely declare a used range with hundreds of
                thousands of empty-but-styled phantom rows that inflate file
                size and exhaust parser memory; enabling this strips them out
                without touching any cell that actually has data. Surviving
                cells keep their original A1 coordinates so citations that
                reference a specific cell remain stable. Defaults to false.
            only_data_cols:
              type: boolean
              default: false
              description: >-
                When true, trim trailing empty columns past the last cell
                carrying a value or formula. Same rationale and
                coordinate-stability guarantee as `only_data_rows`. Defaults to
                false.
        extensions:
          type: object
          description: >-
            Settings that enable additional processing passes or alternate
            output formats. Each enabled extension produces a corresponding
            output field under `response.extensions.*`.
          properties:
            merge_tables:
              type: boolean
              default: false
              description: Merge tables that span multiple pages into a single table.
            footnote_references:
              type: boolean
              default: false
              description: Link footnote markers to their corresponding footnote text.
            chunking:
              type: object
              description: >-
                Chunking configuration. When provided, the document is split
                into chunks using the specified strategies. Results appear in
                `response.extensions.chunking`.
              properties:
                chunk_types:
                  type: array
                  items:
                    type: string
                    enum:
                      - semantic
                      - header
                      - page
                      - recursive
                  description: >-
                    List of chunking strategies to apply (e.g. `["semantic",
                    "header", "page", "recursive"]`).
                chunk_size:
                  type: integer
                  minimum: 1
                  description: Maximum characters per chunk.
            alt_outputs:
              type: object
              description: >-
                Alternate output format options. Each enabled format produces a
                corresponding field under `response.extensions.alt_outputs`.
              properties:
                wlbb:
                  type: boolean
                  default: false
                  description: >-
                    Enable word-level bounding boxes. Runs an additional OCR
                    model to derive bounding boxes for each word. Only applies
                    to PDFs. Results in `response.extensions.alt_outputs.wlbb`.
                return_html:
                  type: boolean
                  default: false
                  description: >-
                    Include an HTML representation of the document. When
                    enabled, `response.markdown` is still present and the HTML
                    is available at `response.extensions.alt_outputs.html`.
                return_xml:
                  type: boolean
                  default: false
                  description: >-
                    Include an XML representation of the document. Results in
                    `response.extensions.alt_outputs.xml`. (Work in progress.)
        storage:
          type: object
          description: >-
            Options for persisting extraction artifacts. When enabled (default),
            artifacts are saved to storage and a database record is created.
          properties:
            enabled:
              type: boolean
              description: >-
                Whether to persist extraction artifacts. Set to false for
                temporary extractions with no storage or database record.
              default: true
            folder_name:
              type: string
              description: >-
                Target folder name to save the extraction to. Creates the folder
                if it doesn't exist.
            folder_id:
              type: string
              format: uuid
              description: >-
                Target folder ID to save the extraction to. Takes precedence
                over folder_name if both are provided.
        async:
          type: boolean
          description: >-
            If true, returns immediately with a job_id for polling via GET
            /job/{jobId}. Otherwise processes synchronously.
          default: false
        refine:
          type: boolean
          default: false
          description: >-
            Run a full-page OCR and formatting correction pass after extraction.
            Improves accuracy on dense layouts, numerical values, and table
            structure. Adds ~1-2s per page. Overridden by `refine_options` if
            both are provided. Requires `model: pulse-ultra-2`.
        refine_options:
          type: object
          description: >-
            Granular refinement targets. Takes precedence over the boolean
            `refine` flag. Requires `model: pulse-ultra-2`.
          properties:
            tables:
              type: boolean
              default: false
              description: >-
                Fix table cell values, structure, and headers against the source
                image.
            text:
              type: boolean
              default: false
              description: >-
                Fix OCR errors, missing or extra content, and numerical accuracy
                (tables untouched).
            formatting:
              type: boolean
              default: false
              description: >-
                Add strikethrough, italic, bold, super/subscript, and LaTeX
                formatting (tables untouched).
        extract_figure:
          type: boolean
          default: false
          description: >-
            Convert charts and data visualizations into HTML `<table>` blocks,
            wrapped in `<figure-table>` tags inside `response.markdown`. Useful
            for financial decks, dashboards, and scientific charts. Requires
            `model: pulse-ultra-2`.
        figure_description:
          type: boolean
          default: false
          description: >-
            Generate a 1-2 paragraph natural-language description of each
            picture, wrapped in `<figure-description>` tags inside
            `response.markdown`. Combines well with `extract_figure`. Requires
            `model: pulse-ultra-2`.
        additional_prompt:
          type: string
          default: ''
          maxLength: 4000
          description: >-
            Extra context injected into the extraction prompt. Use to steer
            extraction toward a specific domain or attention focus. Requires
            `model: pulse-ultra-2`.
        custom_image_prompt:
          type: string
          default: ''
          maxLength: 2000
          description: >-
            Extra context appended to the prompt used by `figure_description`
            and `extract_figure`. Tunes image and chart interpretation for your
            domain. Requires `model: pulse-ultra-2`.
        custom_refine_prompt:
          type: string
          default: ''
          maxLength: 2000
          description: >-
            Extra context appended to the refinement prompt. Only applies when
            `refine: true` or `refine_options` is set. Requires `model:
            pulse-ultra-2`.
        chunking:
          type: string
          deprecated: true
          description: >-
            **Deprecated** -- Use `extensions.chunking.chunk_types` instead.
            Comma-separated list of chunking strategies.
        chunk_size:
          type: integer
          minimum: 1
          deprecated: true
          description: '**Deprecated** -- Use `extensions.chunking.chunk_size` instead.'
        show_images:
          type: boolean
          deprecated: true
          default: false
          description: '**Deprecated** -- Use `figure_processing.show_images` instead.'
        return_html:
          type: boolean
          deprecated: true
          default: false
          description: '**Deprecated** -- Use `extensions.alt_outputs.return_html` instead.'
    ExtractSourceJson:
      type: object
      description: Document source definition for JSON requests.
      properties:
        file_url:
          type: string
          format: uri
          description: Public or pre-signed URL that Pulse will download and extract.
      required:
        - file_url
    BoundingBoxes:
      type: object
      description: |
        Positional bounding-box data for text, titles, headers, footers,
        images, and tables. Used by the frontend for annotation overlays
        and by SDK consumers to access detected visuals. Keys not listed
        here round-trip unchanged for forward compatibility (e.g.
        `markdown_with_ids`, `defined_names`, `Header`, `Page Number`).
      properties:
        Images:
          type: array
          items:
            $ref: '#/components/schemas/BoundingBoxImage'
          description: |
            Detected or embedded visuals. `image_url` is populated when
            `figure_processing.show_images` was enabled.
        Tables:
          type: array
          items:
            $ref: '#/components/schemas/BoundingBoxTable'
          description: >-
            Detected tables. Each entry carries a `table_info` block plus
            optional `cell_data`.
        Text:
          type: array
          items:
            $ref: '#/components/schemas/BoundingBoxItem'
          description: |
            Body-text paragraphs and detected text regions. For
            spreadsheets, includes free-form text outside table regions.
        Title:
          type: array
          items:
            $ref: '#/components/schemas/BoundingBoxItem'
          description: |
            Detected title regions. Spreadsheets emit one `Sheet: <name>`
            title per processed sheet.
        Footer:
          type: array
          items:
            $ref: '#/components/schemas/BoundingBoxItem'
          description: >-
            Detected footer regions (e.g. spreadsheet "Totals" rows, PDF page
            footers).
        markdown_with_ids:
          type: string
          description: |
            Markdown variant with stable `data-bb-*-id` attributes in
            figure / table / text tags. Lets clients join rendered HTML
            back to bounding-box entries by id.
      additionalProperties: true
    PlanInfo:
      type: object
      description: |
        Cumulative billing snapshot for the calling organization. Returned
        by every endpoint that consumes credits (extract, schema, tables,
        split, form, and their batch / pipeline equivalents). Includes the
        in-flight request's contribution, so every response reflects
        post-request state.
      properties:
        tier:
          type: string
          description: Billing tier, e.g. `"trial"`, `"growth"`, `"pulse_ultra_2"`.
        total_credits_used:
          type: number
          format: float
          description: |
            Total credits consumed by the organization to date, including
            this request. The primary billing metric going forward.
        pages_used:
          type: integer
          minimum: 0
          description: |
            Total pages processed by the organization to date, including
            this request. Kept for backward compatibility with clients
            that haven't migrated to `total_credits_used`.
        note:
          type: string
          description: |
            Optional human-readable note about billing state for this
            response (e.g. trial credits remaining). Omitted when no
            note applies.
    ErrorResponse:
      type: object
      properties:
        error:
          type: object
          properties:
            code:
              type: string
              description: Error code (e.g., FILE_001, AUTH_002)
            message:
              type: string
              description: Human-readable error message
            details:
              type: object
              description: Additional error context
    BoundingBoxImage:
      type: object
      description: |
        Detected or embedded visual (chart or image) returned under
        `bounding_boxes.Images`. For PDFs/images, populated when figure
        detection is enabled. For spreadsheets, populated for embedded
        charts and images. `image_url` is set when
        `figure_processing.show_images` is enabled.
      required:
        - id
        - visual_type
      properties:
        id:
          type: string
          description: |
            Stable visual identifier (e.g. `excel_image_1_1`, `fig-3`).
            Use this to join with the `data-bb-image-id` attribute in the
            markdown output.
        content:
          type: string
          description: |
            Short caption for the visual (e.g. `Chart: Revenue`). Populated
            by the spreadsheet parser; PDF figures may leave this empty
            when no caption is detected.
        visual_type:
          allOf:
            - $ref: '#/components/schemas/VisualType'
          description: Drives whether figure_processing options apply to this entry.
        page_number:
          type: integer
          minimum: 1
          description: 1-indexed page or sheet index.
        bounding_box:
          type: array
          items:
            type: number
          description: |
            Document coordinate polygon when available. Spreadsheet visuals
            typically use an empty array and rely on `excel_range`.
        image_url:
          type: string
          description: |
            Pulse-hosted URL for the visual image bytes. Present when
            `figure_processing.show_images` was enabled. Spreadsheet
            visuals proxy through `GET /results/{jobId}/images/{filename}`;
            inline figure flows may use a data URI.
        description:
          type: string
          description: |
            Generated visual description. Present when
            `figure_processing.description` was enabled.
        classification:
          type: object
          description: Visual classification metadata when classification was run.
          additionalProperties: true
          properties:
            confidence:
              type: number
              minimum: 0
              maximum: 1
            model:
              type: string
            error:
              type: string
        sheet_name:
          type: string
          description: Spreadsheet-only sheet name.
        sheet_index:
          type: integer
          description: Spreadsheet-only parsed sheet index after hidden-sheet filtering.
        workbook_sheet_index:
          type: integer
          description: Spreadsheet-only original workbook sheet index.
        excel_range:
          type: string
          description: Spreadsheet-only anchor or covered cell range for the visual.
        chart_type:
          type: string
          description: Spreadsheet chart class name (e.g. `BarChart`, `LineChart`).
        chart_title:
          type: string
          description: Spreadsheet chart title when available.
        source_ranges:
          type: array
          items:
            type: string
          description: Spreadsheet chart source ranges (e.g. `["Charts!$B$1:$B$3"]`).
        render_error:
          type: string
          description: |
            Optional non-fatal rendering error for spreadsheet visuals.
            When set, the visual entry is still returned but `image_url`
            may be omitted.
        description_error:
          type: string
          description: Optional non-fatal description-generation error.
      additionalProperties: true
    BoundingBoxTable:
      type: object
      description: |
        Detected table region returned under `bounding_boxes.Tables`.
        Carries a structured `table_info` block plus optional `cell_data`.
        Kept loose with `additionalProperties: true` to round-trip future
        fields the server may add.
      properties:
        table_info:
          type: object
          additionalProperties: true
          properties:
            id:
              type: string
            dimensions:
              type: array
              items:
                type: integer
            excel_range:
              type: string
            sheet_name:
              type: string
            sheet_index:
              type: integer
            workbook_sheet_index:
              type: integer
            section_index:
              type: integer
            section_type:
              type: string
            section_name:
              type: string
            table_name:
              type: string
            layout_type:
              type: string
            is_chart:
              type: boolean
            chart_type:
              type: string
            chart_title:
              type: string
            source_ranges:
              type: array
              items:
                type: string
            location:
              type: object
              additionalProperties: true
        cell_data:
          type: array
          description: |
            Table cell payload when available. Shape varies by extraction
            path; SDK consumers should treat as opaque.
          items:
            type: object
            additionalProperties: true
      additionalProperties: true
    BoundingBoxItem:
      type: object
      description: |
        Common base shape used for `Text`, `Title`, and `Footer` entries.
        Spreadsheet rows may carry additional sheet-scoped fields such as
        `excel_range` and `sheet_name`; forward-compatible extras are
        accepted.
      properties:
        id:
          type: string
          description: Stable bounding-box identifier (e.g. `txt-1`, `excel_title_0`).
        content:
          type: string
          description: Plain-text content of the region.
        page_number:
          type: integer
          minimum: 1
          description: 1-indexed page or sheet number.
        bounding_box:
          type: array
          items:
            type: number
          description: |
            Document coordinate polygon when available. Spreadsheet visuals
            may use an empty array and `excel_range` instead.
        excel_range:
          type: string
          description: Spreadsheet-only anchor or covered cell range.
        sheet_name:
          type: string
          description: Spreadsheet-only sheet name.
        sheet_index:
          type: integer
          description: Spreadsheet-only parsed sheet index after hidden-sheet filtering.
        workbook_sheet_index:
          type: integer
          description: Spreadsheet-only original workbook sheet index.
      additionalProperties: true
    VisualType:
      type: string
      enum:
        - chart
        - image
      description: |
        Visual class of a bounding-box image. `chart` covers data
        visualizations (bar/line/pie etc., including spreadsheet chart
        objects). `image` covers non-chart embedded or detected visuals
        (logos, photos, screenshots).
  responses:
    BadRequest:
      description: Bad request - Invalid parameters
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    Unauthorized:
      description: Unauthorized - Invalid or missing API key
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    TooManyRequests:
      description: Rate limit exceeded
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
    InternalServerError:
      description: Internal server error
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/ErrorResponse'
  securitySchemes:
    ApiKeyAuth:
      type: apiKey
      in: header
      name: x-api-key
      description: API key for authentication

````