Extract File
The primary endpoint for the Pulse API. Parses uploaded documents or remote file URLs and returns rich markdown content with optional structured data extraction based on user-provided schemas and extraction options.
Set async: true to return immediately with a job ID that you poll via
GET /job/{jobId}. Otherwise the request is processed synchronously.
Both sync and async modes return HTTP 200. When async is true the response
body contains { job_id, status, message } instead of the full extraction result.
Overview
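Documents with 70 or more pages, or responses above the 5 MB inline threshold, return a one-time download link to https://api.runpulse.com/large_results/{job_id} instead of inlining the payload. See Large Document Response below.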
Set async: true to process asynchronously and poll for results via GET /job/{jobId}.
For multiple files, call /extract on each file in parallel.
Async Mode
Set async: true to return immediately with a job ID for polling.
Use GET /job/{job_id} to poll for completion.
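The polling flow above can be sketched roughly as follows. This is a minimal illustration, not an official client: `poll_job` and `fetch_job` are hypothetical names, `fetch_job` stands for whatever authenticated GET you already use, and the terminal status values `"completed"` and `"failed"` are assumptions, since this page does not list the possible values of `status`.

```python
import time
from typing import Any, Callable, Dict, Sequence

BASE_URL = "https://api.runpulse.com"


def job_url(job_id: str) -> str:
    """Build the polling URL for GET /job/{jobId}."""
    return f"{BASE_URL}/job/{job_id}"


def poll_job(
    fetch_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    done_statuses: Sequence[str] = ("completed",),  # assumed status value
    failed_statuses: Sequence[str] = ("failed",),   # assumed status value
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_job until the job reaches a terminal status or times out."""
    waited = 0.0
    while True:
        job = fetch_job(job_url(job_id))
        status = job.get("status")
        if status in done_statuses:
            return job
        if status in failed_statuses:
            raise RuntimeError(f"job {job_id} failed: {job.get('message')}")
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing the HTTP call and the sleep function in as arguments keeps the loop easy to test and lets you swap in your own retry or backoff policy.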
Request
Document Source
Provide the document using one of these methods:
| Field | Type | Description |
|---|---|---|
| file | binary | Document file to upload directly (multipart/form-data). |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |
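As an illustration of the two document sources, the sketch below builds a multipart/form-data body with only the standard library. `build_extract_form` is a hypothetical helper, not part of any Pulse SDK. The `file` and `file_url` field names come from the table above; rejecting requests that set both is a conservative client-side choice, not documented API behavior.

```python
import uuid
from typing import Dict, Optional, Tuple


def build_extract_form(
    file_bytes: Optional[bytes] = None,
    filename: str = "document.pdf",
    file_url: Optional[str] = None,
    fields: Optional[Dict[str, str]] = None,
    boundary: Optional[str] = None,
) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data request."""
    if (file_bytes is None) == (file_url is None):
        raise ValueError("provide exactly one of file_bytes or file_url")
    boundary = boundary or uuid.uuid4().hex
    text_fields = dict(fields or {})
    if file_url is not None:
        text_fields["file_url"] = file_url

    parts = []
    # Plain text fields, one part each.
    for name, value in text_fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # The binary upload, sent under the "file" field.
    if file_bytes is not None:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned body and content type can be passed to any HTTP client, for example `urllib.request.Request(url, data=body, headers={"Content-Type": content_type, ...})`, together with your API key header.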
Extraction Options
| Field | Type | Default | Description |
|---|---|---|---|
| model | string (enum) | default | Extraction model to use. One of default or pulse-ultra-2. pulse-ultra-2 uses Pulse’s vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. |
| pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
| figure_processing | object | - | Settings that control how figures in the document are processed. These affect the markdown output directly and do not produce additional output fields. See Figure Processing. |
| extensions | object | - | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under response.extensions.*. See Extensions. |
| spreadsheet | object | - | Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, sheets, and the automatic trimming of empty trailing rows/columns past the last data-bearing cell. Applies to .xlsx, .xlsm, and .xls files. See Spreadsheet Options. |
| storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
| async | boolean | false | If true, returns immediately with a job_id for polling via GET /job/{jobId}. |
| structured_output | object | - | ⚠️ Deprecated. Use the /schema endpoint after extraction instead. Still works for backward compatibility. |
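To make the pages syntax concrete, here is a small client-side check that expands a page-range string into page numbers. The regular expression is the pattern listed for this field in the Body section of this page; `expand_pages` is a hypothetical helper, and rejecting descending ranges such as 5-2 is an assumption about what the API would accept.

```python
import re
from typing import List

# Pattern documented for the pages field.
PAGES_PATTERN = re.compile(r"^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$")


def expand_pages(pages: str) -> List[int]:
    """Expand a string such as "1-2,5" into a list of 1-indexed page numbers."""
    if not PAGES_PATTERN.fullmatch(pages):
        raise ValueError(f"invalid pages value: {pages!r}")
    result: List[int] = []
    for segment in pages.split(","):
        start_text, _, end_text = segment.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if start < 1 or end < start:
            # Pages are 1-indexed; descending ranges are rejected here by assumption.
            raise ValueError(f"invalid page segment: {segment!r}")
        result.extend(range(start, end + 1))
    return result
```

For example, `expand_pages("1-2,5")` yields pages 1, 2, and 5.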
Figure Processing
Settings under figure_processing control how figures (images, charts, diagrams) and embedded visuals are processed. They apply to both PDFs/images (figures detected from layout) and spreadsheets (charts and embedded images read directly from the workbook), and affect the markdown output and the bounding_boxes.Images[] array.
| Field | Type | Default | Description |
|---|---|---|---|
| figure_processing.description | boolean | false | Generate descriptive captions for extracted visuals. Captions appear under bounding_boxes.Images[].description and inline in the markdown output. Applies to both detected charts and non-chart images. |
| figure_processing.show_images | boolean | false | Return image URLs for extracted visuals. URLs appear under bounding_boxes.Images[].image_url and resolve to a Pulse-hosted PNG/JPEG served from GET /results/{jobId}/images/{filename}. Applies to both detected charts and non-chart images. |
For spreadsheets, show_images: true collects every embedded chart and image in the workbook and emits one entry per visual under bounding_boxes.Images, with chart-specific fields like chart_type, chart_title, and source_ranges populated. See Bounding Boxes for the full field list.
Spreadsheet Options
Settings under spreadsheet control how Excel workbooks (.xlsx, .xlsm, .xls) are processed. By default, hidden rows, columns, and sheets are excluded from extraction output, and cell values are rendered the way Excel displays them. Phantom-cell trimming is opt-in.
| Field | Type | Default | Description |
|---|---|---|---|
| spreadsheet.include_hidden_rows | boolean | false | Include rows that are hidden in the Excel workbook. |
| spreadsheet.include_hidden_cols | boolean | false | Include columns that are hidden in the Excel workbook. |
| spreadsheet.include_hidden_sheets | boolean | false | Include sheets that are hidden in the Excel workbook. |
| spreadsheet.use_raw_values | boolean | false | Emit the underlying numeric value for number cells instead of the Excel display-formatted text, e.g. 1201.67 rather than $1,202 when the cell uses a rounded currency format. Useful when downstream processing needs exact amounts (cent-level precision) rather than what the workbook shows visually. Percent-formatted cells and dates keep their display rendering. Does not apply to legacy .xls files. |
| spreadsheet.only_data_rows | boolean | false | When true, trim trailing empty rows past the last cell carrying a value or formula. See Phantom-cell trimming below. |
| spreadsheet.only_data_cols | boolean | false | When true, trim trailing empty columns past the last cell carrying a value or formula. Same rationale as only_data_rows. |
Field names are accepted in both camelCase (includeHiddenRows, onlyDataRows) and snake_case (include_hidden_rows, only_data_rows) formats.
Phantom-cell trimming (only_data_rows / only_data_cols)
Excel files exported from claims systems, ERPs, and other automated pipelines routinely declare a “used range” that extends hundreds of thousands of rows past where the data actually ends. A typical case: a 57 MB workbook with only ~500 rows of real data, where the other ~1,000,000 rows are empty cells that exist only because they were once selected and styled. These phantom cells inflate file size by orders of magnitude and can exhaust parser memory on the extraction pipeline.
Set only_data_rows: true and only_data_cols: true to have Pulse scan each sheet once before parsing, find the largest row and column containing a value or formula, and ignore everything beyond that extent. Surviving cells keep their original A1 coordinates (e.g., a value at B7 in the source is still B7 in the output), so any citation or bounding box that references a specific cell remains stable. The trim only kicks in on large sheets (≥5 MB of XML per sheet), so small, well-formed workbooks pay no overhead either way.
Both flags default to false.
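The idea behind the trim can be illustrated with a toy model of a sheet as a dictionary keyed by (row, column). This is only a conceptual sketch under simplifying assumptions (values only, no formulas, no size threshold), not Pulse's implementation; `trim_phantom_cells` is a hypothetical name.

```python
from typing import Any, Dict, Tuple

Cell = Tuple[int, int]  # (row, column), both 1-indexed


def trim_phantom_cells(cells: Dict[Cell, Any]) -> Tuple[Dict[Cell, Any], int, int]:
    """Keep only cells inside the extent of the last data-bearing row and column.

    Returns the surviving cells plus the largest row and column that hold data.
    Surviving cells keep their original coordinates.
    """
    data_positions = [pos for pos, value in cells.items() if value not in (None, "")]
    if not data_positions:
        return {}, 0, 0
    max_row = max(row for row, _ in data_positions)
    max_col = max(col for _, col in data_positions)
    kept = {
        (row, col): value
        for (row, col), value in cells.items()
        if row <= max_row and col <= max_col
    }
    return kept, max_row, max_col
```

Because cells are filtered rather than renumbered, a value stored at row 7, column 2 (B7) is still found at the same key after trimming, which mirrors the coordinate stability described above.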
Pulse Ultra 2 Options
These options are available only when model: pulse-ultra-2 is set. Passing any of them with the default model returns a 400 error listing the offending fields.
| Field | Type | Default | Description |
|---|---|---|---|
| refine | boolean | false | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1–2s per page. Overridden by refine_options if both are provided. |
| refine_options | object | - | Granular refinement targets. Takes precedence over the boolean refine flag. See below. |
| refine_options.tables | boolean | false | Fix table cell values, structure, and headers against the source image. |
| refine_options.text | boolean | false | Fix OCR errors, missing or extra content, and numerical accuracy (tables untouched). |
| refine_options.formatting | boolean | false | Add strikethrough, italic, bold, super/subscript, and LaTeX formatting (tables untouched). |
| extract_figure | boolean | false | Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags. Useful for financial decks, dashboards, and scientific charts. |
| figure_description | boolean | false | Generate a 1–2 paragraph natural-language description of each picture, wrapped in <figure-description> tags. Combines well with extract_figure. |
| additional_prompt | string | "" | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Max 4000 characters. |
| custom_image_prompt | string | "" | Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation. Max 2000 characters. |
| custom_refine_prompt | string | "" | Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Max 2000 characters. |
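Since these options are rejected with a 400 error under the default model, it can help to check a request before sending it. The sketch below encodes the field list and length limits from the table above; `check_ultra_options` is a hypothetical client-side helper, and the exact server-side validation may differ.

```python
from typing import Any, Dict, List

# Fields documented as requiring model: pulse-ultra-2.
ULTRA_ONLY_FIELDS = {
    "refine",
    "refine_options",
    "extract_figure",
    "figure_description",
    "additional_prompt",
    "custom_image_prompt",
    "custom_refine_prompt",
}

# Documented maximum prompt lengths, in characters.
MAX_PROMPT_LENGTH = {
    "additional_prompt": 4000,
    "custom_image_prompt": 2000,
    "custom_refine_prompt": 2000,
}


def check_ultra_options(options: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the request options (empty if none)."""
    problems: List[str] = []
    if options.get("model", "default") != "pulse-ultra-2":
        offending = sorted(ULTRA_ONLY_FIELDS.intersection(options))
        if offending:
            problems.append("requires model pulse-ultra-2: " + ", ".join(offending))
    for field, limit in MAX_PROMPT_LENGTH.items():
        if len(options.get(field, "")) > limit:
            problems.append(f"{field} exceeds {limit} characters")
    return problems
```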
Markdown output additions
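When extract_figure or figure_description is enabled, figures in response.markdown include additional tags: <figure-table> wraps the HTML table produced by extract_figure, and <figure-description> wraps the description produced by figure_description.
When refine (or refine_options) is set, markdown content is post-processed page by page; output is cleaner but typically grows ~1.5–3x in size for dense documents. No new tags are introduced.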
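If you need the figure content separately from the surrounding markdown, one option is to pull the tagged blocks out with a regular expression. This sketch assumes the tags appear without attributes and are not nested, which this page does not confirm; `extract_figure_blocks` is a hypothetical helper.

```python
import re
from typing import Dict, List

FIGURE_TAGS = ("figure-table", "figure-description")


def extract_figure_blocks(markdown: str) -> Dict[str, List[str]]:
    """Collect the inner text of each <figure-table> and <figure-description> block."""
    blocks: Dict[str, List[str]] = {}
    for tag in FIGURE_TAGS:
        # Non-greedy match across newlines between the opening and closing tag.
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        blocks[tag] = [match.strip() for match in pattern.findall(markdown)]
    return blocks
```

A full HTML parser would be more robust if the tag format turns out to carry attributes or nesting.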
Extensions
Settings under extensions enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. For example, enabling extensions.chunking produces response.extensions.chunking, and enabling extensions.alt_outputs.return_html produces response.extensions.alt_outputs.html.
| Field | Type | Default | Description |
|---|---|---|---|
| extensions.footnote_references | boolean | false | Link footnote markers to their corresponding footnote text. |
| extensions.chunking | object | - | Chunking configuration. See below. |
| extensions.chunking.chunk_types | string[] | - | List of chunking strategies: semantic, header, page, recursive. |
| extensions.chunking.chunk_size | integer | - | Maximum characters per chunk. |
| extensions.alt_outputs | object | - | Alternate output formats. See below. |
| extensions.alt_outputs.wlbb | boolean | false | Enable word-level bounding boxes (PDF only). Results in response.extensions.alt_outputs.wlbb. |
| extensions.alt_outputs.return_html | boolean | false | Include HTML representation. response.markdown is still present; HTML is at response.extensions.alt_outputs.html. |
| extensions.alt_outputs.return_xml | boolean | false | Include XML representation (work in progress). |
pulse-ultra-2 Rate Limits
Requests made with model: pulse-ultra-2 are subject to dedicated rate limits, separate from standard extraction:
| Limit | Value |
|---|---|
| Per minute | 5 extractions |
| Per hour | 20 extractions |
| File size | 50 MB |
| Concurrent | 2 per API key |
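One way to stay under the per-minute and per-hour limits is a client-side sliding-window limiter. The sketch below is an illustration only: `SlidingWindowLimiter` is a hypothetical class, it does not model the concurrency or file-size limits, and the server remains the source of truth for rate limiting.

```python
import time
from collections import deque
from typing import Callable, Deque, Sequence, Tuple


class SlidingWindowLimiter:
    """Track request timestamps and report how long to wait before the next call."""

    def __init__(
        self,
        limits: Sequence[Tuple[float, int]] = ((60.0, 5), (3600.0, 20)),
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limits = limits  # (window in seconds, max calls in that window)
        self.clock = clock
        self.sent: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait until one more request fits inside every window."""
        now = self.clock()
        longest = max(window for window, _ in self.limits)
        # Forget timestamps that have left the longest window.
        while self.sent and now - self.sent[0] >= longest:
            self.sent.popleft()
        delay = 0.0
        for window, max_calls in self.limits:
            recent = [t for t in self.sent if now - t < window]
            if len(recent) >= max_calls:
                # Wait until the max_calls-th most recent request ages out.
                delay = max(delay, recent[-max_calls] + window - now)
        return delay

    def record(self) -> None:
        """Register that a request was just sent."""
        self.sent.append(self.clock())
```

A caller would sleep for `wait_time()` seconds, send the request, and then call `record()`.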
Storage Options
Control whether extractions are saved to your extraction library:
| Field | Type | Default | Description |
|---|---|---|---|
| storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
| storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn’t exist. |
| storage.folder_id | string (uuid) | - | Target folder ID to save the extraction to. Takes precedence over folder_name. |
Deprecated Fields
The following input fields are deprecated and will be removed in a future version. They are still accepted for backward compatibility.
| Field | Replacement |
|---|---|
| show_images | Use figure_processing.show_images |
| chunking | Use extensions.chunking.chunk_types (array instead of comma-separated string) |
| chunk_size | Use extensions.chunking.chunk_size |
| return_html | Use extensions.alt_outputs.return_html |
| structured_output | Use /schema endpoint after extraction. Pass extraction_id + schema_config. Accepts schema, schema_prompt, and effort. |
| schema | Use /schema endpoint after extraction |
| schema_prompt | Use /schema endpoint with schema_config.schema_prompt |
| custom_prompt | No replacement |
| thinking | No replacement |
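For the fields that have a direct replacement, the mapping in the table above can be applied mechanically. The sketch below is a hypothetical migration helper (`migrate_deprecated_fields` is not part of any SDK); it only covers show_images, chunking, chunk_size, and return_html, and leaves the schema-related fields alone because they move to a different endpoint.

```python
import copy
from typing import Any, Dict

DIRECTLY_REPLACED = {"show_images", "chunking", "chunk_size", "return_html"}


def migrate_deprecated_fields(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params with deprecated top-level fields moved to their new homes."""
    out = {k: copy.deepcopy(v) for k, v in params.items() if k not in DIRECTLY_REPLACED}
    if "show_images" in params:
        out.setdefault("figure_processing", {})["show_images"] = params["show_images"]
    if "chunking" in params:
        # The legacy field is a comma-separated string; the new one is an array.
        chunk_types = [s.strip() for s in params["chunking"].split(",") if s.strip()]
        out.setdefault("extensions", {}).setdefault("chunking", {})["chunk_types"] = chunk_types
    if "chunk_size" in params:
        out.setdefault("extensions", {}).setdefault("chunking", {})["chunk_size"] = params["chunk_size"]
    if "return_html" in params:
        out.setdefault("extensions", {}).setdefault("alt_outputs", {})["return_html"] = params["return_html"]
    return out
```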
Requests that use deprecated fields return a warnings array directing you to the updated field names. See the latest documentation for details.
Response
The response structure varies based on document size to optimize for different use cases.
Standard Response (Under 70 Pages)
For documents under 70 pages, results are returned directly in the response body.
Response Fields
| Field | Type | Description |
|---|---|---|
| markdown | string | Clean markdown content extracted from the document. Always present. |
| page_count | integer | Total number of pages processed. |
| extraction_id | string (uuid) | Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema. |
| extraction_url | string | URL to view the extraction in the Pulse Platform. Present when storage is enabled. |
| credits_used | number | Credits consumed by this request. Only present when the org has the credit billing system enabled. |
| plan_info | object | Billing tier and cumulative usage information for the calling org, including this request. Includes tier, total_credits_used (primary billing metric), pages_used (legacy), and an optional note. |
| bounding_boxes | object | Typed bounding-box data: Images, Tables, Text, Title, Footer, plus markdown_with_ids. See Bounding Boxes for the full field list including the chart/image fields under Images. |
| extensions | object | Output from enabled extensions. Only keys for enabled extensions are present. See below. |
| extensions.chunking | object | Chunk results by strategy (when extensions.chunking is enabled). |
| extensions.footnoteReferences | array | List of detected footnotes with their in-text references (when extensions.footnote_references is enabled). See Footnote References below. |
| extensions.altOutputs.wlbb | object | Word-level bounding boxes (when extensions.alt_outputs.wlbb is enabled). |
| extensions.altOutputs.html | string | HTML representation (when extensions.alt_outputs.return_html is enabled). |
| extensions.altOutputs.xml | string | XML representation (when extensions.alt_outputs.return_xml is enabled, WIP). |
| warnings | array | Non-fatal warnings generated during extraction, including deprecation notices for legacy input usage. |
Deprecated Response Fields
| Field | Replacement | Description |
|---|---|---|
| html | extensions.altOutputs.html | Present when legacy return_html input is used. |
| chunks | extensions.chunking | Present when legacy chunking input is used. |
| plan-info | plan_info | Present when only legacy inputs are used. |
| structured_output | Use /schema | Present when deprecated structured_output input was used. |
| input_schema | Use /schema | Echo of the applied schema (deprecated path only). |
| schema_error | Use /schema | Error message if schema processing failed (deprecated path only). |
Large Document Response (70+ Pages)
For documents with 70 or more pages, or any response payload above the 5 MB inline threshold, the API returns a one-time download link to /large_results/{job_id} instead of inlining the payload. This prevents timeout issues and keeps the immediate response small.
Large Document Response Fields
| Field | Type | Description |
|---|---|---|
| is_url | boolean | Always true for large document responses. Use this to detect URL-based responses. |
| url | string | One-time download link of the form https://api.runpulse.com/large_results/{job_id}. The link streams the complete extraction result the first time it is fetched and is then invalidated (subsequent reads return 410 Gone). It also expires 1 hour after the job completes. Authenticate the request with your x-api-key header. |
| plan_info | object | Billing information including pages used and plan tier. |
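A client can branch on is_url to handle both response shapes with one code path. The sketch below is a hedged illustration: `resolve_result`, `download`, and `persist` are hypothetical names, `download` stands for your own authenticated GET (sending the x-api-key header), and treating the downloaded payload as JSON is an assumption based on the link streaming "the complete extraction result".

```python
import json
from typing import Any, Callable, Dict


def resolve_result(
    response: Dict[str, Any],
    download: Callable[[str], bytes],
    persist: Callable[[bytes], None],
) -> Dict[str, Any]:
    """Return the full extraction result, following the one-time link if present."""
    if not response.get("is_url"):
        # Standard response: the result is already inline.
        return response
    raw = download(response["url"])
    # The link is invalidated after the first read, so store the bytes
    # before doing anything that might fail.
    persist(raw)
    return json.loads(raw)  # assumes the payload is JSON
```

Persisting the raw bytes before parsing means a parsing error never costs you the only copy of the result.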
Handling Large Document Responses
Because the link to /large_results/{job_id} is one-time use, persist the result to your own storage on first download. If you need to access the result later, enable storage.enabled and retrieve it from your extraction library on the Pulse Platform.
Example Usage
Basic Extraction
File Upload
Structured Data (Extract → Schema)
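Recommended two-step approach: call /extract first, then pass the returned extraction_id and a schema_config to the /schema endpoint.
Page Range and Chunking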
Footnote References
Enable extensions.footnote_references to detect footnote markers (e.g. *, †, 1) in body text and link them to the footnote explanation paragraphs at the bottom of the page. Each result item includes the marker symbol, the bounding-box text ID of the footnote, and the bounding-box text IDs of all body-text paragraphs that reference it.
Example Response
Footnote Reference Fields
| Field | Type | Description |
|---|---|---|
| symbol | string | The footnote marker symbol as detected in the document (e.g. *, †, ‡, 1, #). |
| footnoteTextId | string | The bounding-box text ID (e.g. txt-11) of the footnote explanation paragraph. Cross-reference with bounding_boxes.Footer to get the footnote’s content and position. |
| referenceTextIds | string[] | Bounding-box text IDs of body-text paragraphs that contain a reference to this footnote. Cross-reference with bounding_boxes.Text to get each paragraph’s content and position. |
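The cross-referencing described in the table can be sketched as a lookup from text IDs to bounding-box entries. The shape of those entries is not shown on this page, so the `"id"` and `"content"` keys below are assumptions (exposed as parameters so they can be adjusted), and `resolve_footnotes` is a hypothetical helper.

```python
from typing import Any, Dict, List


def resolve_footnotes(
    footnotes: List[Dict[str, Any]],
    bounding_boxes: Dict[str, List[Dict[str, Any]]],
    id_key: str = "id",            # assumed key holding the text ID, e.g. "txt-11"
    content_key: str = "content",  # assumed key holding the paragraph text
) -> List[Dict[str, Any]]:
    """Attach footnote text and referencing paragraphs to each footnote entry."""
    footer_by_id = {box[id_key]: box for box in bounding_boxes.get("Footer", [])}
    text_by_id = {box[id_key]: box for box in bounding_boxes.get("Text", [])}
    resolved = []
    for note in footnotes:
        footer = footer_by_id.get(note["footnoteTextId"], {})
        resolved.append(
            {
                "symbol": note["symbol"],
                "footnote": footer.get(content_key),
                "references": [
                    text_by_id[text_id].get(content_key)
                    for text_id in note["referenceTextIds"]
                    if text_id in text_by_id
                ],
            }
        )
    return resolved
```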
†/+ and ‡/#. Supported marker types include numbered (1, 2, 3), symbolic (*, †, ‡, §, #), and lettered (a, b, c) footnotes.
Excel Spreadsheet Options
Set spreadsheet.only_data_rows: true and spreadsheet.only_data_cols: true to have Pulse trim trailing empty “phantom” rows and columns before parsing. Surviving cells keep their original A1 coordinates, so any citation or bounding box that references a specific cell remains stable. Both flags default to false. See Spreadsheet Options for the full reference.
Excel Charts and Embedded Images
When you set figure_processing.show_images: true on an Excel workbook, every embedded chart and image is collected from the workbook directly and returned under bounding_boxes.Images[]. Each entry carries a Pulse-hosted image_url you can fetch via results.getImage (or any HTTP client with your API key) to get the raw PNG/JPEG bytes.
Example bounding_boxes.Images Entry
Disable Storage
Authorizations
API key for authentication
Body
Input schema for multipart/form-data requests (file upload or file_url).
| Field | Description |
|---|---|
| file | Document to upload directly. Required unless file_url is specified. |
| file_url | Public or pre-signed URL that Pulse will download and extract. |
| model | Extraction model to use. pulse-ultra-2 uses Pulse's vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. Omit or pass default for standard extraction. Allowed values: default, pulse-ultra-2. |
| pages | Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5. Pattern: ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$ |
| figure_processing | Settings that control how figures and embedded visuals are processed. Applies to both PDFs/images (figures detected from layout) and spreadsheets (charts and embedded images read directly from the workbook). Affects the markdown output and the bounding_boxes.Images[] array. |
| spreadsheet | Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, and sheets, whether numeric cells are rendered using their display format or underlying raw value, and optional trimming of empty phantom rows/columns past the last data-bearing cell. Applies to .xlsx, .xlsm, and .xls files. Accepts both camelCase and snake_case field names. |
| extensions | Settings that enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. |
| storage | Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created. |
| async | If true, returns immediately with a job_id for polling via GET /job/{jobId}. Otherwise processes synchronously. |
| refine | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1-2s per page. Overridden by refine_options if both are provided. Requires model: pulse-ultra-2. |
| refine_options | Granular refinement targets. Takes precedence over the boolean refine flag. Requires model: pulse-ultra-2. |
| extract_figure | Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags inside response.markdown. Useful for financial decks, dashboards, and scientific charts. Requires model: pulse-ultra-2. |
| figure_description | Generate a 1-2 paragraph natural-language description of each picture, wrapped in <figure-description> tags inside response.markdown. Combines well with extract_figure. Requires model: pulse-ultra-2. |
| additional_prompt | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Requires model: pulse-ultra-2. Maximum length: 4000. |
| custom_image_prompt | Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation for your domain. Requires model: pulse-ultra-2. Maximum length: 2000. |
| custom_refine_prompt | Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Requires model: pulse-ultra-2. Maximum length: 2000. |
| chunking | Deprecated. Use extensions.chunking.chunk_types instead. Comma-separated list of chunking strategies. |
| chunk_size | Deprecated. Use extensions.chunking.chunk_size instead. Required range: x >= 1. |
| show_images | Deprecated. Use figure_processing.show_images instead. |
| return_html | Deprecated. Use extensions.alt_outputs.return_html instead. |
Response
When async=false (default): full extraction result with markdown,
bounding boxes, chunks, etc.
When async=true: job submission acknowledgement with job_id.
Full extraction result returned by the synchronous /extract endpoint. Contains the extracted markdown, optional extensions output, bounding boxes, and storage metadata.
| Field | Description |
|---|---|
| markdown | Primary markdown content extracted from the document. Always present in the new format. |
| extensions | Output from enabled extensions. Each key corresponds to an extension that was enabled in the request under extensions.*. Only keys for enabled extensions are present. |
| bounding_boxes | Positional bounding-box data for text, titles, headers, footers, images, and tables. Images carries chart/image visuals (with image_url when figure_processing.show_images is enabled), Tables the detected tables, and Text/Title/Footer the paragraph/title/footer regions. Additional keys (e.g. markdown_with_ids, defined_names) round-trip without being typed. |
| extraction_id | Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema endpoints. |
| extraction_url | URL to view the extraction on the Pulse platform. Present when storage is enabled. |
| page_count | Number of pages processed. Required range: x >= 1. |
| plan_info | Billing tier and cumulative usage information. Includes total_credits_used (primary billing metric) and pages_used (legacy compatibility). |
| credits_used | Credits consumed by this request. Only present when the organization has the credit billing system enabled. |
| warnings | Non-fatal warnings generated during extraction. Includes deprecation notices when legacy input parameters are used, as well as processing warnings (e.g. word-level bounding box limitations). |
| html | Deprecated. Use extensions.alt_outputs.html instead. Present when the legacy return_html input was used. |
| chunks | Deprecated. Use extensions.chunking instead. Present when the legacy chunking input was used. |
| plan-info | Deprecated. Use plan_info (underscore) instead. Present when only legacy input parameters are used. |
| structured_output | Deprecated. Only present when the deprecated structured_output input parameter was used. Use the /schema endpoint instead. |
| input_schema | Deprecated. Echo of the schema that was applied. |
| schema_error | Deprecated. Error message if schema processing failed. |
