The primary endpoint for the Pulse API. Parses uploaded documents or remote file URLs and returns rich markdown content with optional structured data extraction based on user-provided schemas and extraction options.
Set async: true to return immediately with a job ID for polling via
GET /job/{jobId}. Otherwise processes synchronously.
Both sync and async modes return HTTP 200. When async is true the response
body contains { job_id, status, message } instead of the full extraction result.
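As a minimal sketch of how a client might branch on the two response shapes described above: the helper name `next_step` is hypothetical, and the check relies only on the documented fact that an async acknowledgement carries `job_id` rather than the full result.

```python
from typing import Any, Mapping, Optional, Tuple


def next_step(body: Mapping[str, Any]) -> Tuple[str, Optional[str]]:
    """Decide what to do with a 200 response from /extract (hypothetical helper)."""
    # Async acknowledgement: { job_id, status, message } and no markdown.
    if "job_id" in body and "markdown" not in body:
        return ("poll", f"/job/{body['job_id']}")
    # Otherwise treat the body as the final extraction result.
    return ("done", None)
```

Because both modes return HTTP 200, the status code alone cannot tell the two cases apart; the body has to be inspected.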
For large documents, the response returns a URL of the form https://api.runpulse.com/large_results/{job_id} instead of inlining the payload. See Large Document Response below.
Set async: true to process asynchronously and poll for results via GET /job/{jobId}. To process multiple files, call /extract on each file in parallel.

| Field | Type | Description |
|---|---|---|
| file | binary | Document file to upload directly (multipart/form-data). |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |

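The two input fields can be prepared along these lines. This is a sketch under stated assumptions: the POST target is assembled from the documented host and the /extract path, the helper name `build_extract_request` is hypothetical, and sending both `file` and `file_url` is rejected only because the documentation does not say which one would win.

```python
from typing import Dict, Optional

# ASSUMPTION: POST target built from the documented host and the /extract path.
EXTRACT_URL = "https://api.runpulse.com/extract"


def build_extract_request(
    api_key: str,
    file_path: Optional[str] = None,
    file_url: Optional[str] = None,
) -> Dict[str, object]:
    """Describe an /extract call without sending it (hypothetical helper)."""
    # Exactly one source is accepted here; precedence is undocumented.
    if bool(file_path) == bool(file_url):
        raise ValueError("provide exactly one of file_path or file_url")
    request: Dict[str, object] = {
        "url": EXTRACT_URL,
        "headers": {"x-api-key": api_key},  # header named in the docs
        "data": {},                          # multipart form fields
        "file_path": file_path,              # caller opens and uploads this
    }
    if file_url:
        request["data"] = {"file_url": file_url}
    return request
```

With an HTTP client such as requests, the result maps onto `requests.post(url, headers=..., data=..., files={"file": handle})`, where `handle` is the opened file when `file_path` is set.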
| Field | Type | Default | Description |
|---|---|---|---|
| model | string (enum) | default | Extraction model to use. One of default or pulse-ultra-2. pulse-ultra-2 uses Pulse’s vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. |
| pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
| figure_processing | object | - | Settings that control how figures in the document are processed. These affect the markdown output directly and do not produce additional output fields. See Figure Processing. |
| extensions | object | - | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under response.extensions.*. See Extensions. |
| spreadsheet | object | - | Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, and sheets. Only applies to .xlsx and .xls files. See Spreadsheet Options. |
| storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
| async | boolean | false | If true, returns immediately with a job_id for polling via GET /job/{jobId}. |
| structured_output | object | - | ⚠️ Deprecated. Use the /schema endpoint after extraction instead. Still works for backward compatibility. |

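The pages syntax can be checked client-side before a request is sent. The sketch below uses the pattern given for pages in the request schema; the helper name `expand_pages` is hypothetical, and rejecting descending ranges such as 5-2 is an assumption, since the documentation does not say how the API treats them.

```python
import re
from typing import List

# Pattern documented for the pages field.
PAGES_PATTERN = re.compile(r"^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$")


def expand_pages(pages: str) -> List[int]:
    """Expand a pages string such as "1-2,5" into 1-indexed page numbers."""
    if not PAGES_PATTERN.fullmatch(pages):
        raise ValueError(f"invalid pages value: {pages!r}")
    result: List[int] = []
    for segment in pages.split(","):
        start, _, end = segment.partition("-")
        first, last = int(start), int(end or start)
        # Pages are 1-indexed; descending ranges are rejected (assumption).
        if first < 1 or last < first:
            raise ValueError(f"invalid segment: {segment!r}")
        result.extend(range(first, last + 1))
    return result
```

For example, `expand_pages("1-2,5")` yields pages 1, 2, and 5, while a value containing spaces fails the pattern check.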
Figure Processing

Settings under figure_processing control how figures (images, charts, diagrams) in the document are processed. These settings affect the markdown output directly, for example by adding descriptive captions to figures or converting charts into markdown tables. They do not create additional output fields in the response.

| Field | Type | Default | Description |
|---|---|---|---|
| figure_processing.description | boolean | false | Generate descriptive captions for extracted figures. |
| figure_processing.show_images | boolean | false | Embed base64-encoded images inline in figure tags. Increases response size. |

Spreadsheet Options

Settings under spreadsheet control how Excel workbooks (.xlsx, .xls) are processed. By default, hidden rows, columns, and sheets are excluded from extraction output.

| Field | Type | Default | Description |
|---|---|---|---|
| spreadsheet.include_hidden_rows | boolean | false | Include rows that are hidden in the Excel workbook. |
| spreadsheet.include_hidden_cols | boolean | false | Include columns that are hidden in the Excel workbook. |
| spreadsheet.include_hidden_sheets | boolean | false | Include sheets that are hidden in the Excel workbook. |

These options accept both camelCase (includeHiddenRows) and snake_case (include_hidden_rows) formats.

The following options are available only when model: pulse-ultra-2 is set. Passing any of them with the default model returns a 400 error listing the offending fields.

| Field | Type | Default | Description |
|---|---|---|---|
| refine | boolean | false | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1–2s per page. Overridden by refine_options if both are provided. |
| refine_options | object | - | Granular refinement targets. Takes precedence over the boolean refine flag. See below. |
| refine_options.tables | boolean | false | Fix table cell values, structure, and headers against the source image. |
| refine_options.text | boolean | false | Fix OCR errors, missing or extra content, and numerical accuracy (tables untouched). |
| refine_options.formatting | boolean | false | Add strikethrough, italic, bold, super/subscript, and LaTeX formatting (tables untouched). |
| extract_figure | boolean | false | Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags. Useful for financial decks, dashboards, and scientific charts. |
| figure_description | boolean | false | Generate a 1–2 paragraph natural-language description of each picture, wrapped in <figure-description> tags. Combines well with extract_figure. |
| additional_prompt | string | "" | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Max 4000 characters. |
| custom_image_prompt | string | "" | Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation. Max 2000 characters. |
| custom_refine_prompt | string | "" | Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Max 2000 characters. |

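A client can catch the model restriction and the prompt length limits before the API does. The following is a sketch, not the server's validation logic: `check_ultra_options` is a hypothetical helper, and treating the mere presence of a field (even with a false value) as offending is an assumption.

```python
from typing import Any, List, Mapping

# Fields documented as requiring model: pulse-ultra-2.
ULTRA_ONLY_FIELDS = (
    "refine", "refine_options", "extract_figure", "figure_description",
    "additional_prompt", "custom_image_prompt", "custom_refine_prompt",
)
# Documented maximum lengths, in characters.
MAX_PROMPT_LENGTH = {
    "additional_prompt": 4000,
    "custom_image_prompt": 2000,
    "custom_refine_prompt": 2000,
}


def check_ultra_options(options: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means no issue found."""
    problems: List[str] = []
    if options.get("model", "default") != "pulse-ultra-2":
        # ASSUMPTION: any present field counts, regardless of its value.
        offending = [name for name in ULTRA_ONLY_FIELDS if name in options]
        if offending:
            problems.append("requires model pulse-ultra-2: " + ", ".join(offending))
    for name, limit in MAX_PROMPT_LENGTH.items():
        value = options.get(name) or ""
        if len(value) > limit:
            problems.append(f"{name} exceeds {limit} characters")
    return problems
```

Running this check locally avoids a round trip that would end in a 400 response.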
When extract_figure or figure_description is enabled, figures in response.markdown include additional tags: <figure-table> for converted charts and <figure-description> for natural-language descriptions.
When refine (or refine_options) is set, markdown content is post-processed page by page; output is cleaner but typically grows ~1.5–3x in size for dense documents. No new tags are introduced.
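Downstream code often needs the tagged figure content separately from the surrounding markdown. The sketch below pulls it out with regular expressions; it assumes the tags appear as plain paired tags without attributes (for example `<figure-table>...</figure-table>`), which the documentation implies but does not spell out. The helper name is hypothetical.

```python
import re
from typing import Dict, List

_TAGS = ("figure-table", "figure-description")


def extract_figure_blocks(markdown: str) -> Dict[str, List[str]]:
    """Collect the inner text of each figure tag found in the markdown."""
    blocks: Dict[str, List[str]] = {}
    for tag in _TAGS:
        # ASSUMPTION: paired tags with no attributes; DOTALL spans newlines.
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        blocks[tag] = [match.strip() for match in pattern.findall(markdown)]
    return blocks
```

The non-greedy match keeps adjacent figures separate when several appear on the same page.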
Extensions

Settings under extensions enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. For example, enabling extensions.chunking produces response.extensions.chunking, and enabling extensions.alt_outputs.return_html produces response.extensions.alt_outputs.html.

| Field | Type | Default | Description |
|---|---|---|---|
| extensions.footnote_references | boolean | false | Link footnote markers to their corresponding footnote text. |
| extensions.chunking | object | - | Chunking configuration. See below. |
| extensions.chunking.chunk_types | string[] | - | List of chunking strategies: semantic, header, page, recursive. |
| extensions.chunking.chunk_size | integer | - | Maximum characters per chunk. |
| extensions.alt_outputs | object | - | Alternate output formats. See below. |
| extensions.alt_outputs.wlbb | boolean | false | Enable word-level bounding boxes (PDF only). Results in response.extensions.alt_outputs.wlbb. |
| extensions.alt_outputs.return_html | boolean | false | Include HTML representation. response.markdown is still present; HTML is at response.extensions.alt_outputs.html. |
| extensions.alt_outputs.return_xml | boolean | false | Include XML representation (work in progress). |

pulse-ultra-2 Rate Limits

Requests with model: pulse-ultra-2 are subject to dedicated rate limits, separate from standard extraction:

| Limit | Value |
|---|---|
| Per minute | 5 extractions |
| Per hour | 20 extractions |
| File size | 50 MB |
| Concurrent | 2 per API key |
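
A client-side throttle can help stay under the per-minute and per-hour limits. The sketch below uses sliding windows, which is an assumption: the documentation does not say whether the API counts in sliding or fixed windows. The class name is hypothetical, and the file size and concurrency limits are not covered.

```python
import time
from collections import deque
from typing import Callable, Deque


class UltraRateLimiter:
    """Sliding-window check for 5 requests/minute and 20 requests/hour."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._sent: Deque[float] = deque()  # timestamps of recorded requests

    def wait_time(self) -> float:
        """Seconds to wait before another request fits both windows."""
        now = self._clock()
        # Drop timestamps older than one hour.
        while self._sent and now - self._sent[0] >= 3600:
            self._sent.popleft()
        wait = 0.0
        if len(self._sent) >= 20:
            # Wait until the 20th most recent request leaves the hour window.
            wait = max(wait, self._sent[-20] + 3600 - now)
        recent = [t for t in self._sent if now - t < 60]
        if len(recent) >= 5:
            # Wait until the 5th most recent request leaves the minute window.
            wait = max(wait, recent[-5] + 60 - now)
        return wait

    def record(self) -> None:
        """Register that a request was just sent."""
        self._sent.append(self._clock())
```

A caller would sleep for `wait_time()` seconds when it is positive, send the request, and then call `record()`. The injectable clock exists so the logic can be tested without real delays.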
Storage Options

| Field | Type | Default | Description |
|---|---|---|---|
| storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
| storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn’t exist. |
| storage.folder_id | string (uuid) | - | Target folder ID to save the extraction to. Takes precedence over folder_name. |

| Field | Replacement |
|---|---|
| show_images | Use figure_processing.show_images |
| chunking | Use extensions.chunking.chunk_types (array instead of comma-separated string) |
| chunk_size | Use extensions.chunking.chunk_size |
| return_html | Use extensions.alt_outputs.return_html |
| structured_output | Use /schema endpoint after extraction. Pass extraction_id + schema_config. Accepts schema, schema_prompt, and effort. |
| schema | Use /schema endpoint after extraction |
| schema_prompt | Use /schema endpoint with schema_config.schema_prompt |
| custom_prompt | No replacement |
| thinking | No replacement |

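The field-level replacements in the table can be applied mechanically when migrating an older integration. This sketch covers only the four fields with a direct replacement; the helper name is hypothetical, and schema-related fields and fields without a replacement are deliberately left out because they need a separate /schema call or have no equivalent.

```python
from typing import Any, Dict, Mapping


def migrate_legacy_options(legacy: Mapping[str, Any]) -> Dict[str, Any]:
    """Map deprecated top-level fields onto their documented replacements."""
    options: Dict[str, Any] = {}
    if "show_images" in legacy:
        options.setdefault("figure_processing", {})["show_images"] = legacy["show_images"]
    chunking: Dict[str, Any] = {}
    if "chunking" in legacy:
        # Legacy value is a comma-separated string; the new field is an array.
        chunking["chunk_types"] = [
            part.strip() for part in legacy["chunking"].split(",") if part.strip()
        ]
    if "chunk_size" in legacy:
        chunking["chunk_size"] = legacy["chunk_size"]
    if chunking:
        options.setdefault("extensions", {})["chunking"] = chunking
    if "return_html" in legacy:
        alt = options.setdefault("extensions", {}).setdefault("alt_outputs", {})
        alt["return_html"] = legacy["return_html"]
    return options
```

The returned dictionary contains only the migrated fields, so it can be merged into the rest of the request options.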
Requests that use deprecated fields return a warnings array directing you to the updated field names. See the latest documentation for details.

| Field | Type | Description |
|---|---|---|
| markdown | string | Clean markdown content extracted from the document. Always present. |
| page_count | integer | Total number of pages processed. |
| extraction_id | string (uuid) | Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema. |
| extraction_url | string | URL to view the extraction in the Pulse Platform. Present when storage is enabled. |
| plan_info | object | Billing information including pages used and plan tier. |
| bounding_boxes | object | Detailed bounding box data for document elements. See Bounding Boxes for details. |
| extensions | object | Output from enabled extensions. Only keys for enabled extensions are present. See below. |
| extensions.chunking | object | Chunk results by strategy (when extensions.chunking is enabled). |
| extensions.footnoteReferences | array | List of detected footnotes with their in-text references (when extensions.footnote_references is enabled). See Footnote References below. |
| extensions.altOutputs.wlbb | object | Word-level bounding boxes (when extensions.alt_outputs.wlbb is enabled). |
| extensions.altOutputs.html | string | HTML representation (when extensions.alt_outputs.return_html is enabled). |
| extensions.altOutputs.xml | string | XML representation (when extensions.alt_outputs.return_xml is enabled, WIP). |
| warnings | array | Non-fatal warnings generated during extraction, including deprecation notices for legacy input usage. |

| Field | Replacement | Description |
|---|---|---|
| html | extensions.altOutputs.html | Present when legacy return_html input is used. |
| chunks | extensions.chunking | Present when legacy chunking input is used. |
| plan-info | plan_info | Present when only legacy inputs are used. |
| structured_output | Use /schema | Present when deprecated structured_output input was used. |
| input_schema | Use /schema | Echo of the applied schema (deprecated path only). |
| schema_error | Use /schema | Error message if schema processing failed (deprecated path only). |

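The response tables list the HTML output under extensions.altOutputs.html, while other descriptions on this page refer to extensions.alt_outputs.html, and legacy requests produce a top-level html field. A defensive reader can check all three locations. This is a sketch with a hypothetical helper name, not a statement about which spelling the API actually emits.

```python
from typing import Any, Mapping, Optional


def get_html(response: Mapping[str, Any]) -> Optional[str]:
    """Return the HTML representation from whichever location holds it."""
    extensions = response.get("extensions") or {}
    # Check both spellings that appear in the documentation.
    for key in ("altOutputs", "alt_outputs"):
        alt = extensions.get(key) or {}
        if "html" in alt:
            return alt["html"]
    # Deprecated top-level field, present for legacy return_html input.
    return response.get("html")
```

The same lookup order could be reused for chunks (extensions.chunking, then the legacy chunks field).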
Large Document Response

For large documents, the API returns a URL to /large_results/{job_id} instead of inlining the payload. This prevents timeout issues and keeps the immediate response small.

| Field | Type | Description |
|---|---|---|
| is_url | boolean | Always true for large document responses. Use this to detect URL-based responses. |
| url | string | One-time download link of the form https://api.runpulse.com/large_results/{job_id}. The link streams the complete extraction result the first time it is fetched and is then invalidated (subsequent reads return 410 Gone). It also expires 1 hour after the job completes. Authenticate the request with your x-api-key header. |
| plan_info | object | Billing information including pages used and plan tier. |

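Since the link can be read only once, the download step is worth isolating. In the sketch below, `download` is a hypothetical callable supplied by the caller; it is expected to send a GET request with the x-api-key header and return the parsed result, and the assumption that the payload parses into a dictionary is not confirmed by the documentation.

```python
from typing import Any, Callable, Dict, Mapping


def resolve_result(
    body: Mapping[str, Any],
    download: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the full extraction result, following the URL when present."""
    if body.get("is_url") is True:
        # One-time link: call download exactly once and keep the result.
        return download(body["url"])
    return dict(body)
```

The caller should persist the returned value immediately, because a second fetch of the same URL returns 410 Gone.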
Because /large_results/{job_id} is one-time use, persist the result to your own storage on first download. If you need to access the result later, enable storage.enabled and retrieve it from your extraction library on the Pulse Platform.

Footnote References

Enable extensions.footnote_references to detect footnote markers (e.g. *, †, 1) in body text and link them to the footnote explanation paragraphs at the bottom of the page. Each result item includes the marker symbol, the bounding-box text ID of the footnote, and the bounding-box text IDs of all body-text paragraphs that reference it.

| Field | Type | Description |
|---|---|---|
| symbol | string | The footnote marker symbol as detected in the document (e.g. *, †, ‡, 1, #). |
| footnoteTextId | string | The bounding-box text ID (e.g. txt-11) of the footnote explanation paragraph. Cross-reference with bounding_boxes.Footer to get the footnote’s content and position. |
| referenceTextIds | string[] | Bounding-box text IDs of body-text paragraphs that contain a reference to this footnote. Cross-reference with bounding_boxes.Text to get each paragraph’s content and position. |

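The cross-referencing described in the table can be sketched as a simple join. The shape of the bounding-box entries is an assumption here: each entry in bounding_boxes.Footer and bounding_boxes.Text is taken to be an object with `id` and `content` keys, which this page does not confirm. The helper name is hypothetical.

```python
from typing import Any, Dict, List, Mapping


def link_footnotes(response: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Attach footnote text and referencing paragraphs to each footnote."""
    boxes = response.get("bounding_boxes") or {}
    # ASSUMPTION: entries are dicts with "id" and "content" keys.
    footers = {box["id"]: box.get("content") for box in boxes.get("Footer", [])}
    texts = {box["id"]: box.get("content") for box in boxes.get("Text", [])}
    notes = (response.get("extensions") or {}).get("footnoteReferences", [])
    return [
        {
            "symbol": note["symbol"],
            "footnote": footers.get(note["footnoteTextId"]),
            "references": [texts.get(text_id) for text_id in note["referenceTextIds"]],
        }
        for note in notes
    ]
```

IDs that cannot be resolved come back as None rather than raising, so a partial bounding-box payload still yields usable output.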
†/+ and ‡/#. Supported marker types include numbered (1, 2, 3), symbolic (*, †, ‡, §, #), and lettered (a, b, c) footnotes.

x-api-key (header): API key for authentication.

Input schema for multipart/form-data requests (file upload or file_url).

| Field | Description |
|---|---|
| file | Document to upload directly. Required unless file_url is specified. |
| file_url | Public or pre-signed URL that Pulse will download and extract. |
| model | Extraction model to use. pulse-ultra-2 uses Pulse's vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. Omit or pass default for standard extraction. Allowed values: default, pulse-ultra-2. |
| pages | Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5. Pattern: ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$ |
| figure_processing | Settings that control how figures in the document are processed. These affect the markdown output directly (e.g. figure descriptions, chart-to-table conversion, image embedding) and do not produce additional output fields in the response. |
| extensions | Settings that enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. |
| storage | Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created. |
| async | If true, returns immediately with a job_id for polling via GET /job/{jobId}. Otherwise processes synchronously. |
| refine | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1-2s per page. Overridden by refine_options if both are provided. Requires model: pulse-ultra-2. |
| refine_options | Granular refinement targets. Takes precedence over the boolean refine flag. Requires model: pulse-ultra-2. |
| extract_figure | Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags inside response.markdown. Useful for financial decks, dashboards, and scientific charts. Requires model: pulse-ultra-2. |
| figure_description | Generate a 1-2 paragraph natural-language description of each picture, wrapped in <figure-description> tags inside response.markdown. Combines well with extract_figure. Requires model: pulse-ultra-2. |
| additional_prompt | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Requires model: pulse-ultra-2. Maximum length: 4000. |
| custom_image_prompt | Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation for your domain. Requires model: pulse-ultra-2. Maximum length: 2000. |
| custom_refine_prompt | Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Requires model: pulse-ultra-2. Maximum length: 2000. |
| chunking | Deprecated. Use extensions.chunking.chunk_types instead. Comma-separated list of chunking strategies. |
| chunk_size | Deprecated. Use extensions.chunking.chunk_size instead. Required range: x >= 1. |
| show_images | Deprecated. Use figure_processing.show_images instead. |
| return_html | Deprecated. Use extensions.alt_outputs.return_html instead. |

When async=false (default): full extraction result with markdown,
bounding boxes, chunks, etc.
When async=true: job submission acknowledgement with job_id.
Full extraction result returned by the synchronous /extract endpoint. Contains the extracted markdown, optional extensions output, bounding boxes, and storage metadata.

| Field | Description |
|---|---|
| markdown | Primary markdown content extracted from the document. Always present in the new format. |
| extensions | Output from enabled extensions. Each key corresponds to an extension that was enabled in the request under extensions.*. Only keys for enabled extensions are present. |
| bounding_boxes | Positional bounding-box data for text, titles, headers, footers, images, and tables. |
| extraction_id | Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema endpoints. |
| extraction_url | URL to view the extraction on the Pulse platform. Present when storage is enabled. |
| page_count | Number of pages processed. Required range: x >= 1. |
| plan_info | Billing tier and usage information. |
| warnings | Non-fatal warnings generated during extraction. Includes deprecation notices when legacy input parameters are used, as well as processing warnings (e.g. word-level bounding box limitations). |
| html | Deprecated. Use extensions.alt_outputs.html instead. Present when the legacy return_html input was used. |
| chunks | Deprecated. Use extensions.chunking instead. Present when the legacy chunking input was used. |
| plan-info | Deprecated. Use plan_info (underscore) instead. Present when only legacy input parameters are used. |
| structured_output | Deprecated. Only present when the deprecated structured_output input parameter was used. Use the /schema endpoint instead. |
| input_schema | Deprecated. Echo of the schema that was applied. |
| schema_error | Deprecated. Error message if schema processing failed. |