The primary endpoint for the Pulse API. Parses uploaded documents or remote file URLs and returns rich markdown content with optional structured data extraction based on user-provided schemas and extraction options.
Set async: true to return immediately with a job ID for polling via
GET /job/{jobId}. Otherwise processes synchronously.
Both sync and async modes return HTTP 200. When async is true the response
body contains { job_id, status, message } instead of the full extraction result.
Set async: true to process asynchronously, then use GET /job/{job_id} to poll for completion. To handle several files, call /extract on each file in parallel.
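The two response shapes described above can be told apart by the presence of job_id. The following is a minimal sketch, not the official client: the fetch_job callable is a hypothetical stand-in for whatever performs GET /job/{job_id}, and the terminal status names "completed" and "failed" are assumptions, since this page does not list the status values.

```python
import time
from typing import Callable


def is_async_ack(body: dict) -> bool:
    """True when the body looks like the async acknowledgement
    ({ job_id, status, message }) rather than a full extraction result."""
    return "job_id" in body and "markdown" not in body


def wait_for_job(
    job_id: str,
    fetch_job: Callable[[str], dict],  # hypothetical: wraps GET /job/{job_id}
    interval_s: float = 2.0,
    max_attempts: int = 30,
    terminal: tuple = ("completed", "failed"),  # assumed status names
) -> dict:
    """Poll fetch_job(job_id) until the status is terminal or attempts run out."""
    for attempt in range(max_attempts):
        job = fetch_job(job_id)
        if job.get("status") in terminal:
            return job
        # Sleep between polls, but not after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

Passing the fetch function in as a parameter keeps the polling logic independent of any particular HTTP library.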
| Field | Type | Description |
|---|---|---|
| file | binary | Document file to upload directly (multipart/form-data). |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |
| Field | Type | Default | Description |
|---|---|---|---|
| pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
| figure_processing | object | - | Settings that control how figures in the document are processed. These affect the markdown output directly and do not produce additional output fields. See Figure Processing. |
| extensions | object | - | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under response.extensions.*. See Extensions. |
| storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
| async | boolean | false | If true, returns immediately with a job_id for polling via GET /job/{jobId}. |
| structured_output | object | - | ⚠️ Deprecated: use the /schema endpoint after extraction instead. Still works for backward compatibility. |
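A pages value can be checked locally before it is sent. The sketch below uses the pattern given for pages in the request schema on this page and expands a range string into page numbers. Rejecting page 0 and descending ranges such as 5-2 is an assumption on top of the pattern; the page does not say how the server treats them.

```python
import re

# Pattern for the pages field, as given in the request schema.
PAGES_PATTERN = re.compile(r"[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*")


def expand_pages(spec: str) -> list[int]:
    """Expand a pages string such as "1-2,5" into a list of 1-indexed pages."""
    if not PAGES_PATTERN.fullmatch(spec):
        raise ValueError(f"invalid pages value: {spec!r}")
    pages: list[int] = []
    for segment in spec.split(","):
        start, _, end = segment.partition("-")
        first = int(start)
        last = int(end) if end else first
        # Assumption: page 0 and descending ranges are treated as errors.
        if first < 1 or last < first:
            raise ValueError(f"invalid page segment: {segment!r}")
        pages.extend(range(first, last + 1))
    return pages
```

For example, expand_pages("1-2,5") yields pages 1, 2, and 5.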
The figure_processing settings control how figures (images, charts, diagrams) in the document are processed. These settings affect the markdown output directly, for example by adding descriptive captions to figures or converting charts into markdown tables. They do not create additional output fields in the response.
| Field | Type | Default | Description |
|---|---|---|---|
| figure_processing.description | boolean | false | Generate descriptive captions for extracted figures. |
| figure_processing.show_images | boolean | false | Embed base64-encoded images inline in figure tags. Increases response size. |
The extensions settings enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. For example, enabling extensions.chunking produces response.extensions.chunking, and enabling extensions.alt_outputs.return_html produces response.extensions.alt_outputs.html.
| Field | Type | Default | Description |
|---|---|---|---|
| extensions.merge_tables | boolean | false | Merge tables that span multiple pages into a single table. |
| extensions.footnote_references | boolean | false | Link footnote markers to their corresponding footnote text. |
| extensions.chunking | object | - | Chunking configuration. See below. |
| extensions.chunking.chunk_types | string[] | - | List of chunking strategies: semantic, header, page, recursive. |
| extensions.chunking.chunk_size | integer | - | Maximum characters per chunk. |
| extensions.alt_outputs | object | - | Alternate output formats. See below. |
| extensions.alt_outputs.wlbb | boolean | false | Enable word-level bounding boxes (PDF only). Results in response.extensions.alt_outputs.wlbb. |
| extensions.alt_outputs.return_html | boolean | false | Include HTML representation. response.markdown is still present; HTML is at response.extensions.alt_outputs.html. |
| extensions.alt_outputs.return_xml | boolean | false | Include XML representation (work in progress). |
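Because request flags and response keys do not always share a name (return_html becomes html, return_xml becomes xml), it can help to compute which response keys to expect. This is an illustrative sketch based only on the field tables on this page; the helper name expected_extension_paths is hypothetical, and paths are relative to the response body.

```python
def expected_extension_paths(extensions: dict) -> list[str]:
    """List the response paths that the given request extensions should produce."""
    paths: list[str] = []
    # These extensions keep the same name in the response.
    for name in ("merge_tables", "footnote_references", "chunking"):
        if extensions.get(name):
            paths.append(f"extensions.{name}")
    # alt_outputs flags map to differently named response keys.
    alt = extensions.get("alt_outputs") or {}
    renames = {"wlbb": "wlbb", "return_html": "html", "return_xml": "xml"}
    for flag, key in renames.items():
        if alt.get(flag):
            paths.append(f"extensions.alt_outputs.{key}")
    return paths
```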
| Field | Type | Default | Description |
|---|---|---|---|
| storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
| storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn’t exist. |
| storage.folder_id | string (uuid) | - | Target folder ID to save the extraction to. Takes precedence over folder_name. |
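The nested options in the tables above can be assembled as a plain dictionary. This is a rough sketch with a hypothetical helper name and argument names. How the nested objects are serialized into the multipart/form-data request is not described on this page, so the sketch stops at building the structure. Since folder_id takes precedence over folder_name, the helper sends only one of them.

```python
from typing import Optional


def build_options(
    *,
    pages: Optional[str] = None,
    describe_figures: bool = False,
    chunk_types: Optional[list] = None,
    chunk_size: Optional[int] = None,
    store: bool = True,
    folder_id: Optional[str] = None,
    folder_name: Optional[str] = None,
) -> dict:
    """Build the nested request options using the field names from the tables."""
    options: dict = {}
    if pages:
        options["pages"] = pages
    if describe_figures:
        options["figure_processing"] = {"description": True}
    if chunk_types:
        chunking: dict = {"chunk_types": list(chunk_types)}
        if chunk_size is not None:
            chunking["chunk_size"] = chunk_size
        options["extensions"] = {"chunking": chunking}
    storage: dict = {"enabled": store}
    # folder_id wins over folder_name, so only one is ever included.
    if folder_id:
        storage["folder_id"] = folder_id
    elif folder_name:
        storage["folder_name"] = folder_name
    options["storage"] = storage
    return options
```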
| Field | Replacement |
|---|---|
| extract_figure | No replacement |
| figure_description | Use figure_processing.description |
| show_images | Use figure_processing.show_images |
| chunking | Use extensions.chunking.chunk_types (array instead of comma-separated string) |
| chunk_size | Use extensions.chunking.chunk_size |
| return_html | Use extensions.alt_outputs.return_html |
| structured_output | Use /schema endpoint after extraction. Pass extraction_id + schema_config. Accepts schema, schema_prompt, and effort. |
| schema | Use /schema endpoint after extraction |
| schema_prompt | Use /schema endpoint with schema_config.schema_prompt |
| custom_prompt | No replacement |
| thinking | No replacement |
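The replacement table above can be applied mechanically to an existing set of request parameters. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and fields with no inline replacement (including those that move to the /schema endpoint) are returned in a separate list for manual follow-up rather than rewritten.

```python
import copy

# Legacy field -> path of its replacement, per the table above.
MOVED = {
    "figure_description": ("figure_processing", "description"),
    "show_images": ("figure_processing", "show_images"),
    "chunk_size": ("extensions", "chunking", "chunk_size"),
    "return_html": ("extensions", "alt_outputs", "return_html"),
}
# Legacy fields that have no replacement or that move to the /schema endpoint.
NO_INLINE_REPLACEMENT = {
    "extract_figure", "custom_prompt", "thinking",
    "structured_output", "schema", "schema_prompt",
}


def _set_path(target: dict, path: tuple, value) -> None:
    """Set a nested key, creating intermediate dictionaries as needed."""
    for part in path[:-1]:
        target = target.setdefault(part, {})
    target[path[-1]] = value


def migrate_legacy_params(legacy: dict) -> tuple:
    """Return (new_params, unmapped_legacy_keys)."""
    legacy_keys = set(MOVED) | NO_INLINE_REPLACEMENT | {"chunking"}
    # Copy non-legacy fields first so the input is never mutated.
    new = {k: copy.deepcopy(v) for k, v in legacy.items() if k not in legacy_keys}
    unmapped = []
    for key, value in legacy.items():
        if key == "chunking":
            # Comma-separated string becomes an array of strategies.
            types = [t.strip() for t in str(value).split(",") if t.strip()]
            _set_path(new, ("extensions", "chunking", "chunk_types"), types)
        elif key in MOVED:
            _set_path(new, MOVED[key], value)
        elif key in NO_INLINE_REPLACEMENT:
            unmapped.append(key)
    return new, unmapped
```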
Requests that use deprecated fields return a warnings array directing you to the updated field names. See the latest documentation for details.
| Field | Type | Description |
|---|---|---|
| markdown | string | Clean markdown content extracted from the document. Always present. |
| page_count | integer | Total number of pages processed. |
| extraction_id | string (uuid) | Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema. |
| extraction_url | string | URL to view the extraction in the Pulse Platform. Present when storage is enabled. |
| plan_info | object | Billing information including pages used and plan tier. |
| bounding_boxes | object | Detailed bounding box data for document elements. See Bounding Boxes for details. |
| extensions | object | Output from enabled extensions. Only keys for enabled extensions are present. See below. |
| extensions.chunking | object | Chunk results by strategy (when extensions.chunking is enabled). |
| extensions.merge_tables | object | Merge tables result (when extensions.merge_tables is enabled). |
| extensions.footnote_references | array | List of detected footnotes with their in-text references (when extensions.footnote_references is enabled). See Footnote References below. |
| extensions.alt_outputs.wlbb | object | Word-level bounding boxes (when extensions.alt_outputs.wlbb is enabled). |
| extensions.alt_outputs.html | string | HTML representation (when extensions.alt_outputs.return_html is enabled). |
| extensions.alt_outputs.xml | string | XML representation (when extensions.alt_outputs.return_xml is enabled, WIP). |
| warnings | array | Non-fatal warnings generated during extraction, including deprecation notices for legacy input usage. |
| Field | Replacement | Description |
|---|---|---|
| html | extensions.alt_outputs.html | Present when legacy return_html input is used. |
| chunks | extensions.chunking | Present when legacy chunking input is used. |
| plan-info | plan_info | Present when only legacy inputs are used. |
| structured_output | Use /schema | Present when deprecated structured_output input was used. |
| input_schema | Use /schema | Echo of the applied schema (deprecated path only). |
| schema_error | Use /schema | Error message if schema processing failed (deprecated path only). |
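Client code that must handle both old and new response shapes can read the current field first and fall back to the deprecated one. This is a small sketch using only the field names in the tables above; the helper names are hypothetical.

```python
from typing import Optional


def get_html(result: dict) -> Optional[str]:
    """Prefer extensions.alt_outputs.html, fall back to the legacy html field."""
    alt = (result.get("extensions") or {}).get("alt_outputs") or {}
    if "html" in alt:
        return alt["html"]
    return result.get("html")


def get_plan_info(result: dict) -> Optional[dict]:
    """Prefer plan_info, fall back to the legacy plan-info key."""
    if "plan_info" in result:
        return result["plan_info"]
    return result.get("plan-info")
```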
| Field | Type | Description |
|---|---|---|
| is_url | boolean | Always true for large document responses. Use this to detect URL-based responses. |
| url | string | Pre-signed S3 URL containing the complete extraction results. Expires after 24 hours. |
| plan_info | object | Billing information including pages used and plan tier. |
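A client can branch on is_url to handle both inline and URL-based responses through one code path. The sketch below assumes the content behind the pre-signed URL is JSON with the same shape as an inline result; this page does not state the format, so treat that as an assumption. The download parameter is a hypothetical hook that makes the function easy to test.

```python
import json
from typing import Callable
from urllib.request import urlopen


def _download_json(url: str) -> dict:
    """Fetch and decode the results. Assumes the URL serves JSON."""
    with urlopen(url) as resp:
        return json.load(resp)


def resolve_result(body: dict, download: Callable[[str], dict] = _download_json) -> dict:
    """Return the full extraction result, following the URL for large documents."""
    if body.get("is_url") is True:
        return download(body["url"])
    return body
```

Because the URL expires after 24 hours, fetching the results promptly avoids working with a stale link.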
Make sure storage.enabled is true and retrieve results from your extraction library.
Enable extensions.footnote_references to detect footnote markers (e.g. *, †, 1) in body text and link them to the footnote explanation paragraphs at the bottom of the page. Each result item includes the marker symbol, the bounding-box text ID of the footnote, and the bounding-box text IDs of all body-text paragraphs that reference it.
| Field | Type | Description |
|---|---|---|
| symbol | string | The footnote marker symbol as detected in the document (e.g. *, †, ‡, 1, #). |
| footnoteTextId | string | The bounding-box text ID (e.g. txt-11) of the footnote explanation paragraph. Cross-reference with bounding_boxes.Footer to get the footnote’s content and position. |
| referenceTextIds | string[] | Bounding-box text IDs of body-text paragraphs that contain a reference to this footnote. Cross-reference with bounding_boxes.Text to get each paragraph’s content and position. |
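The cross-referencing described in the table can be sketched as a lookup by text ID. The layout of bounding_boxes is not specified on this page, so the code assumes that bounding_boxes.Footer and bounding_boxes.Text are lists of objects carrying "id" and "content" keys. Those key names are hypothetical and should be checked against the Bounding Boxes documentation.

```python
def resolve_footnotes(result: dict) -> list:
    """Attach footnote and paragraph text to each footnote reference."""
    boxes = result.get("bounding_boxes") or {}
    # Assumed layout: each group is a list of {"id": ..., "content": ...} items.
    content_by_id = {}
    for group in ("Footer", "Text"):
        for item in boxes.get(group, []):
            content_by_id[item["id"]] = item.get("content")
    notes = (result.get("extensions") or {}).get("footnote_references", [])
    resolved = []
    for note in notes:
        resolved.append({
            "symbol": note["symbol"],
            "footnote": content_by_id.get(note["footnoteTextId"]),
            "references": [content_by_id.get(i) for i in note["referenceTextIds"]],
        })
    return resolved
```

IDs that cannot be found resolve to None rather than raising, so a partial bounding-box payload still produces a usable result.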
†/+ and ‡/#. Supported marker types include numbered (1, 2, 3), symbolic (*, †, ‡, §, #), and lettered (a, b, c) footnotes.
API key for authentication
Input schema for multipart/form-data requests (file upload or file_url).
Document to upload directly. Required unless file_url is specified.
Public or pre-signed URL that Pulse will download and extract.
Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5. Must match the pattern ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$
Settings that control how figures in the document are processed. These affect the markdown output directly (e.g. figure descriptions, chart-to-table conversion, image embedding) and do not produce additional output fields in the response.
Settings that enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*.
Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created.
If true, returns immediately with a job_id for polling via GET /job/{jobId}. Otherwise processes synchronously.
Deprecated -- Use extensions.chunking.chunk_types instead. Comma-separated list of chunking strategies.
Deprecated -- Use extensions.chunking.chunk_size instead. Must be >= 1.
Deprecated -- No replacement.
Deprecated -- Use figure_processing.description instead.
Deprecated -- Use figure_processing.show_images instead.
Deprecated -- Use extensions.alt_outputs.return_html instead.
When async=false (default): full extraction result with markdown,
bounding boxes, chunks, etc.
When async=true: job submission acknowledgement with job_id.
Full extraction result returned by the synchronous /extract endpoint. Contains the extracted markdown, optional extensions output, bounding boxes, and storage metadata.
Primary markdown content extracted from the document. Always present in the new format.
Output from enabled extensions. Each key corresponds to an extension that was enabled in the request under extensions.*. Only keys for enabled extensions are present.
Positional bounding-box data for text, titles, headers, footers, images, and tables.
Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema endpoints.
URL to view the extraction on the Pulse platform. Present when storage is enabled.
Number of pages processed. Must be >= 1.
Billing tier and usage information.
Non-fatal warnings generated during extraction. Includes deprecation notices when legacy input parameters are used, as well as processing warnings (e.g. word-level bounding box limitations).
Deprecated -- Use extensions.alt_outputs.html instead. Present when the legacy return_html input was used.
Deprecated -- Use extensions.chunking instead. Present when the legacy chunking input was used.
Deprecated -- Use plan_info (underscore) instead. Present when only legacy input parameters are used.
Deprecated -- Only present when the deprecated structured_output input parameter was used. Use the /schema endpoint instead.
Deprecated -- Echo of the schema that was applied.
Deprecated -- Error message if schema processing failed.