Extract File Async (Deprecated)
Deprecated: Use /extract with async: true instead.
Starts an asynchronous extraction job. The request mirrors the synchronous options but returns immediately with a job identifier that clients can poll for completion status.
Overview
The asynchronous extraction endpoint accepts the same input parameters as the synchronous /extract endpoint but returns immediately with a job identifier. Use this endpoint for:
- Large documents that may take longer to process
- Batch processing workflows
- Non-blocking integrations
Migration
Replace calls to /extract_async with calls to /extract and add async: true to the request.
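
The migration can be sketched as a rewrite of a request description. This is a minimal illustration only: the helper name `migrate_call` and the in-memory `(path, fields)` shape are hypothetical, and nothing is sent over the network. Only the endpoint paths and the `async` and `file_url` field names come from this page.

```python
def migrate_call(path: str, fields: dict) -> tuple[str, dict]:
    """Rewrite a deprecated /extract_async call as /extract with async=True.

    `path` and `fields` are a hypothetical in-memory description of a
    request; this helper performs no HTTP call.
    """
    if path.rstrip("/") != "/extract_async":
        # Not a deprecated call: return an unchanged copy.
        return path, dict(fields)
    migrated = dict(fields)
    migrated["async"] = True
    return "/extract", migrated


old_path = "/extract_async"
old_fields = {"file_url": "https://example.com/report.pdf"}
new_path, new_fields = migrate_call(old_path, old_fields)
```

All other request fields carry over unchanged, since both endpoints accept the same input parameters.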
Request
Document Source
Provide the document using one of these methods:
| Field | Type | Description |
|---|---|---|
file | binary | Document file to upload directly (multipart/form-data). |
file_url | string | Public or pre-signed URL that Pulse will download and extract. |
Extraction Options
| Field | Type | Default | Description |
|---|---|---|---|
model | string (enum) | default | Extraction model to use. One of default or pulse-ultra-2. pulse-ultra-2 uses Pulse’s vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. |
pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
figure_processing | object | - | Settings that control how figures in the document are processed. These affect the markdown output directly and do not produce additional output fields. See Figure Processing. |
extensions | object | - | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under response.extensions.*. See Extensions. |
spreadsheet | object | - | Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, and sheets. Applies to .xlsx, .xlsm, and .xls files. See Spreadsheet Options. |
storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
async | boolean | false | If true, returns immediately with a job_id for polling via GET /job/{jobId}. |
structured_output | object | - | ⚠️ Deprecated — Use the /schema endpoint after extraction instead. Still works for backward compatibility. |
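
As an illustration of the pages syntax, the sketch below validates a page range on the client and expands it into individual page numbers. The regular expression is the pattern this page lists for the pages field; the function name `expand_pages` is hypothetical, and rejecting page 0 or reversed ranges such as 5-3 is an assumption about server behavior rather than something this page states.

```python
import re

# Pattern listed for the `pages` field on this page.
PAGES_PATTERN = re.compile(r"^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$")


def expand_pages(spec: str) -> list[int]:
    """Expand a 1-indexed page spec such as "1-2,5" into page numbers."""
    if not PAGES_PATTERN.fullmatch(spec):
        raise ValueError(f"invalid page range: {spec!r}")
    pages: list[int] = []
    for segment in spec.split(","):
        start, _, end = segment.partition("-")
        first, last = int(start), int(end or start)
        if first < 1 or last < first:
            # Assumption: page 0 and reversed ranges are treated as invalid.
            raise ValueError(f"invalid segment: {segment!r}")
        pages.extend(range(first, last + 1))
    return pages
```

For example, `expand_pages("1-2,5")` yields pages 1, 2, and 5.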
Figure Processing
Settings under figure_processing control how figures (images, charts, diagrams) and embedded visuals are processed. Applies to both PDFs/images (figures detected from layout) and spreadsheets (charts and embedded images read directly from the workbook). Affects the markdown output and the bounding_boxes.Images[] array.
| Field | Type | Default | Description |
|---|---|---|---|
figure_processing.description | boolean | false | Generate descriptive captions for extracted visuals. Captions appear under bounding_boxes.Images[].description and inline in the markdown output. Applies to both detected charts and non-chart images. |
figure_processing.show_images | boolean | false | Return image URLs for extracted visuals. URLs appear under bounding_boxes.Images[].image_url and resolve to a Pulse-hosted PNG/JPEG served from GET /results/{jobId}/images/{filename}. Applies to both detected charts and non-chart images. |
For spreadsheets, setting show_images: true collects every embedded chart and image in the workbook and emits one entry per visual under bounding_boxes.Images, with chart-specific fields like chart_type, chart_title, and source_ranges populated. See Bounding Boxes for the full field list.
Spreadsheet Options
Settings under spreadsheet control how Excel workbooks (.xlsx, .xlsm, .xls) are processed. By default, hidden rows, columns, and sheets are excluded from extraction output, and cell values are rendered the way Excel displays them.
| Field | Type | Default | Description |
|---|---|---|---|
spreadsheet.include_hidden_rows | boolean | false | Include rows that are hidden in the Excel workbook. |
spreadsheet.include_hidden_cols | boolean | false | Include columns that are hidden in the Excel workbook. |
spreadsheet.include_hidden_sheets | boolean | false | Include sheets that are hidden in the Excel workbook. |
spreadsheet.use_raw_values | boolean | false | Emit the underlying numeric value for number cells instead of the Excel display-formatted text — e.g. 1201.67 rather than $1,202 when the cell uses a rounded currency format. Useful when downstream processing needs exact amounts (cent-level precision) rather than what the workbook shows visually. Percent-formatted cells and dates keep their display rendering. Does not apply to legacy .xls files. |
Field names are accepted in both camelCase (includeHiddenRows) and snake_case (include_hidden_rows) formats.
Pulse Ultra 2 Options
These options are available only when model: pulse-ultra-2 is set. Passing any of them with the default model returns a 400 error listing the offending fields.
| Field | Type | Default | Description |
|---|---|---|---|
refine | boolean | false | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1–2s per page. Overridden by refine_options if both are provided. |
refine_options | object | - | Granular refinement targets. Takes precedence over the boolean refine flag. See below. |
refine_options.tables | boolean | false | Fix table cell values, structure, and headers against the source image. |
refine_options.text | boolean | false | Fix OCR errors, missing or extra content, and numerical accuracy (tables untouched). |
refine_options.formatting | boolean | false | Add strikethrough, italic, bold, super/subscript, and LaTeX formatting (tables untouched). |
extract_figure | boolean | false | Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags. Useful for financial decks, dashboards, and scientific charts. |
figure_description | boolean | false | Generate a 1–2 paragraph natural-language description of each picture, wrapped in <figure-description> tags. Combines well with extract_figure. |
additional_prompt | string | "" | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Max 4000 characters. |
custom_image_prompt | string | "" | Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation. Max 2000 characters. |
custom_refine_prompt | string | "" | Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Max 2000 characters. |
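
A client can catch the 400 case before sending a request. The sketch below checks the two constraints stated in this section: the listed options require model: pulse-ultra-2, and the prompt fields have maximum lengths. The function name and the wording of the messages are hypothetical, and treating the mere presence of a field (even one set to false) as offending is an assumption.

```python
ULTRA_ONLY_FIELDS = (
    "refine", "refine_options", "extract_figure", "figure_description",
    "additional_prompt", "custom_image_prompt", "custom_refine_prompt",
)
MAX_PROMPT_LENGTH = {
    "additional_prompt": 4000,
    "custom_image_prompt": 2000,
    "custom_refine_prompt": 2000,
}


def check_ultra_options(options: dict) -> list[str]:
    """Return the problems found; an empty list means the options look valid."""
    problems = []
    if options.get("model", "default") != "pulse-ultra-2":
        # Assumption: any ultra-only field that is present counts as offending.
        offending = [name for name in ULTRA_ONLY_FIELDS if name in options]
        if offending:
            problems.append("requires model pulse-ultra-2: " + ", ".join(offending))
    for name, limit in MAX_PROMPT_LENGTH.items():
        if len(options.get(name, "")) > limit:
            problems.append(f"{name} exceeds {limit} characters")
    return problems
```

This check only mirrors the documented rules; the server response remains the authority on which fields are rejected.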
Markdown output additions
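When extract_figure or figure_description is enabled, figures in response.markdown include additional tags:
- <figure-table>: wraps the HTML <table> block generated by extract_figure.
- <figure-description>: wraps the natural-language description generated by figure_description.
When refine (or refine_options) is set, markdown content is post-processed page-by-page; output is cleaner but typically grows ~1.5–3x in size for dense documents. No new tags are introduced.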
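
Downstream code may want to pull the figure tags out of response.markdown. The sketch below does this with regular expressions, assuming the tags carry no attributes and are not nested; neither detail is confirmed on this page. The function name and the sample markdown are made up for illustration.

```python
import re

FIGURE_TABLE = re.compile(r"<figure-table>(.*?)</figure-table>", re.DOTALL)
FIGURE_DESCRIPTION = re.compile(
    r"<figure-description>(.*?)</figure-description>", re.DOTALL
)


def extract_figure_blocks(markdown: str) -> dict:
    """Collect the contents of figure tags found in a markdown string."""
    return {
        "tables": [m.strip() for m in FIGURE_TABLE.findall(markdown)],
        "descriptions": [m.strip() for m in FIGURE_DESCRIPTION.findall(markdown)],
    }


# Hypothetical markdown, not real API output.
sample = (
    "# Q3 results\n"
    "<figure-table><table><tr><td>Q3</td><td>42</td></tr></table></figure-table>\n"
    "<figure-description>Bar chart of quarterly revenue.</figure-description>\n"
)
blocks = extract_figure_blocks(sample)
```

A full HTML parser would be more robust if the tags turn out to carry attributes or nest.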
Extensions
Settings under extensions enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. For example, enabling extensions.chunking produces response.extensions.chunking, and enabling extensions.alt_outputs.return_html produces response.extensions.alt_outputs.html.
| Field | Type | Default | Description |
|---|---|---|---|
extensions.footnote_references | boolean | false | Link footnote markers to their corresponding footnote text. |
extensions.chunking | object | - | Chunking configuration. See below. |
extensions.chunking.chunk_types | string[] | - | List of chunking strategies: semantic, header, page, recursive. |
extensions.chunking.chunk_size | integer | - | Maximum characters per chunk. |
extensions.alt_outputs | object | - | Alternate output formats. See below. |
extensions.alt_outputs.wlbb | boolean | false | Enable word-level bounding boxes (PDF only). Results in response.extensions.alt_outputs.wlbb. |
extensions.alt_outputs.return_html | boolean | false | Include HTML representation. response.markdown is still present; HTML is at response.extensions.alt_outputs.html. |
extensions.alt_outputs.return_xml | boolean | false | Include XML representation (work in progress). |
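
The mapping from enabled extensions to response fields can be sketched as a lookup. Only the three output paths named in this section are covered; the output locations for footnote_references and return_xml are not stated here, so the sketch leaves them out. The function name is hypothetical.

```python
def expected_extension_paths(extensions: dict) -> list[str]:
    """List the documented response paths for the enabled extensions."""
    paths = []
    if extensions.get("chunking"):
        paths.append("response.extensions.chunking")
    alt_outputs = extensions.get("alt_outputs") or {}
    if alt_outputs.get("wlbb"):
        paths.append("response.extensions.alt_outputs.wlbb")
    if alt_outputs.get("return_html"):
        paths.append("response.extensions.alt_outputs.html")
    return paths


# Example request settings; the chunk_size value is arbitrary.
extensions = {
    "chunking": {"chunk_types": ["semantic", "page"], "chunk_size": 1000},
    "alt_outputs": {"return_html": True},
}
paths = expected_extension_paths(extensions)
```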
Storage Options
Control whether extractions are saved to your extraction library:
| Field | Type | Default | Description |
|---|---|---|---|
storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn’t exist. |
storage.folder_id | string (uuid) | - | Target folder ID to save the extraction to. Takes precedence over folder_name. |
Deprecated Fields
The following input fields are deprecated and will be removed in a future version. They are still accepted for backward compatibility.
| Field | Replacement |
|---|---|
show_images | Use figure_processing.show_images |
chunking | Use extensions.chunking.chunk_types (array instead of comma-separated string) |
chunk_size | Use extensions.chunking.chunk_size |
return_html | Use extensions.alt_outputs.return_html |
structured_output | Use /schema endpoint after extraction. Pass extraction_id + schema_config. Accepts schema, schema_prompt, and effort. |
schema | Use /schema endpoint after extraction |
schema_prompt | Use /schema endpoint with schema_config.schema_prompt |
custom_prompt | No replacement |
thinking | No replacement |
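
The field renames in the table can be applied mechanically. The sketch below covers the four deprecated fields that have a direct replacement inside the request; the schema-related fields move to a separate endpoint and are left untouched. The function name is hypothetical, and letting a deprecated value overwrite a replacement field that is already set is an assumption.

```python
import copy


def migrate_deprecated_fields(fields: dict) -> dict:
    """Return a copy of `fields` with deprecated names moved to their replacements."""
    out = copy.deepcopy(fields)
    if "show_images" in out:
        value = out.pop("show_images")
        out.setdefault("figure_processing", {})["show_images"] = value
    if "chunking" in out:
        # The deprecated form is a comma-separated string; the new form is an array.
        raw = out.pop("chunking")
        chunk_types = [part.strip() for part in raw.split(",") if part.strip()]
        chunking = out.setdefault("extensions", {}).setdefault("chunking", {})
        chunking["chunk_types"] = chunk_types
    if "chunk_size" in out:
        value = out.pop("chunk_size")
        out.setdefault("extensions", {}).setdefault("chunking", {})["chunk_size"] = value
    if "return_html" in out:
        value = out.pop("return_html")
        alt_outputs = out.setdefault("extensions", {}).setdefault("alt_outputs", {})
        alt_outputs["return_html"] = value
    return out
```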
Requests that use deprecated fields return a warnings array directing you to the updated field names. See the latest documentation for details.
Response
When you submit a document for async extraction, you’ll receive a response containing the job metadata.
Response Fields
| Field | Type | Description |
|---|---|---|
job_id | string | Unique identifier for the extraction job. Use this to poll for results with the Poll Job endpoint. |
status | string | Initial job status. Typically pending when first submitted. |
queuedAt | string | ISO 8601 timestamp indicating when the job was accepted. |
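
The job_id and status fields are enough to drive a polling loop. The sketch below keeps the HTTP call out of the loop: `fetch_status` is a caller-supplied function that wraps GET /job/{jobId} and returns the parsed job object. Only the pending status is documented on this page, so the other status name, the default interval, and the attempt limit are assumptions.

```python
import time
from typing import Callable

# Assumption: "pending" is documented here; "processing" is a hypothetical
# placeholder for any other in-progress state.
IN_PROGRESS = {"pending", "processing"}


def poll_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status(job_id) until the job leaves an in-progress state."""
    for _ in range(max_attempts):
        job = fetch_status(job_id)
        if job.get("status") not in IN_PROGRESS:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} still in progress after {max_attempts} polls")
```

Passing `sleep` as a parameter makes the loop easy to exercise in tests without real delays.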
Retrieving Results
After submitting an async extraction, poll the job status endpoint to retrieve results.
Example Usage
Submit Async Extraction
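
A submission can be sketched with the Python standard library. The request follows the migration guidance by posting to /extract with async set to true, as multipart/form-data. Several details are assumptions not given on this page: the base URL is a placeholder, the `x-api-key` header name is hypothetical, and encoding the boolean as the string "true" is a guess. The sketch builds the request without sending it.

```python
import urllib.request
import uuid


def build_multipart(fields: dict) -> tuple[bytes, str]:
    """Encode simple text fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}",
            f'Content-Disposition: form-data; name="{name}"',
            "",
            str(value),
        ]
    lines += [f"--{boundary}--", ""]
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_submit_request(base_url: str, api_key: str, file_url: str) -> urllib.request.Request:
    """Build (but do not send) an async extraction request."""
    body, content_type = build_multipart({"file_url": file_url, "async": "true"})
    return urllib.request.Request(
        url=f"{base_url}/extract",
        data=body,
        method="POST",
        # Assumption: the API key header name is hypothetical.
        headers={"Content-Type": content_type, "x-api-key": api_key},
    )


request = build_submit_request(
    "https://api.example.com",  # placeholder base URL
    "YOUR_API_KEY",
    "https://example.com/report.pdf",
)
```

Sending the request with `urllib.request.urlopen(request)` should return the job metadata described under Response Fields, including the job_id used for polling.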
With Structured Output
Cancel a Job
Authorizations
API key for authentication
Body
Input schema for multipart/form-data requests (file upload or file_url).
| Field | Type | Description |
|---|---|---|
| file | binary | Document to upload directly. Required unless file_url is specified. |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |
| model | string (enum) | Extraction model to use. One of default or pulse-ultra-2. pulse-ultra-2 uses Pulse's vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. Omit or pass default for standard extraction. |
| pages | string | Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5. Must match the pattern ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$. |
| figure_processing | object | Settings that control how figures and embedded visuals are processed. Applies to both PDFs/images (figures detected from layout) and spreadsheets (charts and embedded images read directly from the workbook). Affects the markdown output and the bounding_boxes.Images[] array. |
| spreadsheet | object | Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, and sheets, and whether numeric cells are rendered using their display format or underlying raw value. Applies to .xlsx, .xlsm, and .xls files. Accepts both camelCase and snake_case field names. |
| extensions | object | Settings that enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. |
| storage | object | Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created. |
| async | boolean | If true, returns immediately with a job_id for polling via GET /job/{jobId}. Otherwise processes synchronously. |
| refine | boolean | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1-2s per page. Overridden by refine_options if both are provided. Requires model: pulse-ultra-2. |
| refine_options | object | Granular refinement targets. Takes precedence over the boolean refine flag. Requires model: pulse-ultra-2. |
| extract_figure | boolean | Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags inside response.markdown. Useful for financial decks, dashboards, and scientific charts. Requires model: pulse-ultra-2. |
| figure_description | boolean | Generate a 1-2 paragraph natural-language description of each picture, wrapped in <figure-description> tags inside response.markdown. Combines well with extract_figure. Requires model: pulse-ultra-2. |
| additional_prompt | string | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Maximum length: 4000 characters. Requires model: pulse-ultra-2. |
| custom_image_prompt | string | Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation for your domain. Maximum length: 2000 characters. Requires model: pulse-ultra-2. |
| custom_refine_prompt | string | Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Maximum length: 2000 characters. Requires model: pulse-ultra-2. |
| chunking | string | Deprecated. Use extensions.chunking.chunk_types instead. Comma-separated list of chunking strategies. |
| chunk_size | integer | Deprecated. Use extensions.chunking.chunk_size instead. Must be at least 1. |
| show_images | boolean | Deprecated. Use figure_processing.show_images instead. |
| return_html | boolean | Deprecated. Use extensions.alt_outputs.return_html instead. |
Response
Asynchronous extraction job accepted
Acknowledgement returned when a request is submitted for asynchronous processing. Poll GET /job/{job_id} to check status and retrieve results.
