POST /extract_async
Python SDK
# /extract_async is deprecated; call client.extract(..., async_=True) instead
from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")
response = client.extract(
    file_url="https://example.com/document.pdf",
    async_=True
)
print(response.job_id)
{
  "job_id": "<string>",
  "message": "<string>"
}
Deprecated: This endpoint is deprecated. Use /extract with async: true instead.

Overview

The asynchronous extraction endpoint accepts the same input parameters as the synchronous /extract endpoint but returns immediately with a job identifier. Use this endpoint for:
  • Large documents that may take longer to process
  • Batch processing workflows
  • Non-blocking integrations

Migration

Replace calls to /extract_async with /extract and add async: true:
- POST /extract_async
- {"file_url": "https://example.com/doc.pdf"}

+ POST /extract
+ {"file_url": "https://example.com/doc.pdf", "async": true}
The response format is identical.
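The rewrite above can be applied mechanically wherever request paths and bodies are built in code. A minimal sketch, assuming requests are represented as a path string plus a dict body; `migrate_extract_async_call` is a hypothetical helper name, not part of the Pulse SDK:

```python
def migrate_extract_async_call(path: str, body: dict) -> tuple[str, dict]:
    """Rewrite a legacy /extract_async call as /extract with async enabled."""
    if path.rstrip("/") != "/extract_async":
        return path, body  # not a legacy call; leave it untouched
    new_body = dict(body)  # copy so the caller's dict is not mutated
    new_body["async"] = True
    return "/extract", new_body


path, body = migrate_extract_async_call(
    "/extract_async", {"file_url": "https://example.com/doc.pdf"}
)
print(path, body)
```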

Request

Document Source

Provide the document using one of these methods:
| Field | Type | Description |
| --- | --- | --- |
| file | binary | Document file to upload directly (multipart/form-data). |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |

Extraction Options

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string (enum) | default | Extraction model to use. One of default or pulse-ultra-2. pulse-ultra-2 uses Pulse’s vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. |
| pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
| figure_processing | object | - | Settings that control how figures in the document are processed. These affect the markdown output directly and do not produce additional output fields. See Figure Processing. |
| extensions | object | - | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under response.extensions.*. See Extensions. |
| spreadsheet | object | - | Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, and sheets. Applies to .xlsx, .xlsm, and .xls files. See Spreadsheet Options. |
| storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
| async | boolean | false | If true, returns immediately with a job_id for polling via GET /job/{jobId}. |
| structured_output | object | - | ⚠️ Deprecated. Use the /schema endpoint after extraction instead. Still accepted for backward compatibility. |
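A pages value can be checked locally before a request is sent. The sketch below uses the pattern documented for pages in the Body section (^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$) and expands a spec into explicit page numbers. The helper name `expand_pages` is hypothetical, and rejecting page 0 and descending ranges is an assumption; the source does not say how the API treats them.

```python
import re

# Pattern taken from the documented `pages` schema.
PAGES_PATTERN = re.compile(r"^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$")


def expand_pages(spec: str) -> list[int]:
    """Expand a spec such as "1-2,5" into [1, 2, 5]."""
    if not PAGES_PATTERN.fullmatch(spec):
        raise ValueError(f"invalid pages spec: {spec!r}")
    pages: list[int] = []
    for segment in spec.split(","):
        start, _, end = segment.partition("-")
        first = int(start)
        last = int(end) if end else first
        # Assumption: pages are 1-indexed and ranges must ascend.
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {segment!r}")
        pages.extend(range(first, last + 1))
    return pages


print(expand_pages("1-2,5"))
```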

Figure Processing

Settings under figure_processing control how figures (images, charts, diagrams) and embedded visuals are processed. Applies to both PDFs/images (figures detected from layout) and spreadsheets (charts and embedded images read directly from the workbook). Affects the markdown output and the bounding_boxes.Images[] array.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| figure_processing.description | boolean | false | Generate descriptive captions for extracted visuals. Captions appear under bounding_boxes.Images[].description and inline in the markdown output. Applies to both detected charts and non-chart images. |
| figure_processing.show_images | boolean | false | Return image URLs for extracted visuals. URLs appear under bounding_boxes.Images[].image_url and resolve to a Pulse-hosted PNG/JPEG served from GET /results/{jobId}/images/{filename}. Applies to both detected charts and non-chart images. |
For spreadsheets specifically, show_images: true collects every embedded chart and image in the workbook and emits one entry per visual under bounding_boxes.Images, with chart-specific fields like chart_type, chart_title, and source_ranges populated. See Bounding Boxes for the full field list.

Spreadsheet Options

Settings under spreadsheet control how Excel workbooks (.xlsx, .xlsm, .xls) are processed. By default, hidden rows, columns, and sheets are excluded from extraction output, and cell values are rendered the way Excel displays them.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| spreadsheet.include_hidden_rows | boolean | false | Include rows that are hidden in the Excel workbook. |
| spreadsheet.include_hidden_cols | boolean | false | Include columns that are hidden in the Excel workbook. |
| spreadsheet.include_hidden_sheets | boolean | false | Include sheets that are hidden in the Excel workbook. |
| spreadsheet.use_raw_values | boolean | false | Emit the underlying numeric value for number cells instead of the Excel display-formatted text, e.g. 1201.67 rather than $1,202 when the cell uses a rounded currency format. Useful when downstream processing needs exact amounts (cent-level precision) rather than what the workbook shows visually. Percent-formatted cells and dates keep their display rendering. Does not apply to legacy .xls files. |
These settings accept both camelCase (includeHiddenRows) and snake_case (include_hidden_rows) formats.
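Because both spellings are accepted, client code that merges options from several sources may want a single canonical form. A minimal sketch that normalizes keys to snake_case; the helper names are hypothetical and this is a client-side convenience, not something the API requires:

```python
import re


def to_snake_case(name: str) -> str:
    """Convert camelCase to snake_case; snake_case input passes through."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def normalize_spreadsheet_options(options: dict) -> dict:
    """Return a copy of the spreadsheet options with snake_case keys."""
    return {to_snake_case(key): value for key, value in options.items()}


print(normalize_spreadsheet_options(
    {"includeHiddenRows": True, "use_raw_values": False}
))
```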

Pulse Ultra 2 Options

These options are available only when model: pulse-ultra-2 is set. Passing any of them with the default model returns a 400 error listing the offending fields.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| refine | boolean | false | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1–2s per page. Overridden by refine_options if both are provided. |
| refine_options | object | - | Granular refinement targets. Takes precedence over the boolean refine flag. See below. |
| refine_options.tables | boolean | false | Fix table cell values, structure, and headers against the source image. |
| refine_options.text | boolean | false | Fix OCR errors, missing or extra content, and numerical accuracy (tables untouched). |
| refine_options.formatting | boolean | false | Add strikethrough, italic, bold, super/subscript, and LaTeX formatting (tables untouched). |
| extract_figure | boolean | false | Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags. Useful for financial decks, dashboards, and scientific charts. |
| figure_description | boolean | false | Generate a 1–2 paragraph natural-language description of each picture, wrapped in <figure-description> tags. Combines well with extract_figure. |
| additional_prompt | string | "" | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Max 4000 characters. |
| custom_image_prompt | string | "" | Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation. Max 2000 characters. |
| custom_refine_prompt | string | "" | Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Max 2000 characters. |
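Since these fields are rejected with a 400 under the default model, a local pre-flight check can catch the mistake before the request is sent. The sketch below mirrors the rules in the table above (model requirement and maximum prompt lengths). The function name and message wording are hypothetical; the server's actual error text is not documented here.

```python
ULTRA_ONLY_FIELDS = (
    "refine", "refine_options", "extract_figure", "figure_description",
    "additional_prompt", "custom_image_prompt", "custom_refine_prompt",
)
MAX_PROMPT_LENGTHS = {
    "additional_prompt": 4000,
    "custom_image_prompt": 2000,
    "custom_refine_prompt": 2000,
}


def check_ultra_options(body: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    if body.get("model", "default") != "pulse-ultra-2":
        offending = sorted(f for f in ULTRA_ONLY_FIELDS if f in body)
        if offending:
            problems.append("requires model pulse-ultra-2: " + ", ".join(offending))
    for field, limit in MAX_PROMPT_LENGTHS.items():
        # Assumes prompt values are strings, as the table specifies.
        if len(body.get(field, "")) > limit:
            problems.append(f"{field} exceeds {limit} characters")
    return problems


print(check_ultra_options({"model": "default", "refine": True}))
```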

Markdown output additions

When extract_figure or figure_description is enabled, figures in response.markdown include additional tags:
<figure data-page="1">
  <figure-table>...HTML table for the chart...</figure-table>
  <figure-description>...1–2 paragraph description...</figure-description>
</figure>
When refine (or refine_options) is set, markdown content is post-processed page-by-page; output is cleaner but typically grows ~1.5–3x in size for dense documents. No new tags are introduced.
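The figure tags can be pulled out of response.markdown with a small parser. This is a rough sketch that assumes figures appear exactly in the shape shown above (a figure element with a data-page attribute, containing optional figure-table and figure-description children); real output may carry other attributes or nesting, in which case a proper HTML parser would be safer. `extract_figures` is a hypothetical helper name.

```python
import re

FIGURE_RE = re.compile(r'<figure data-page="(\d+)">(.*?)</figure>', re.DOTALL)
TABLE_RE = re.compile(r"<figure-table>(.*?)</figure-table>", re.DOTALL)
DESCRIPTION_RE = re.compile(
    r"<figure-description>(.*?)</figure-description>", re.DOTALL
)


def extract_figures(markdown: str) -> list[dict]:
    """Collect page number, table HTML, and description for each figure."""
    figures = []
    for page, inner in FIGURE_RE.findall(markdown):
        table = TABLE_RE.search(inner)
        description = DESCRIPTION_RE.search(inner)
        figures.append({
            "page": int(page),
            "table_html": table.group(1).strip() if table else None,
            "description": description.group(1).strip() if description else None,
        })
    return figures


sample = """<figure data-page="1">
  <figure-table><table><tr><td>Q1</td></tr></table></figure-table>
  <figure-description>Quarterly revenue chart.</figure-description>
</figure>"""
print(extract_figures(sample))
```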

Extensions

Settings under extensions enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. For example, enabling extensions.chunking produces response.extensions.chunking, and enabling extensions.alt_outputs.return_html produces response.extensions.alt_outputs.html.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| extensions.footnote_references | boolean | false | Link footnote markers to their corresponding footnote text. |
| extensions.chunking | object | - | Chunking configuration. See below. |
| extensions.chunking.chunk_types | string[] | - | List of chunking strategies: semantic, header, page, recursive. |
| extensions.chunking.chunk_size | integer | - | Maximum characters per chunk. |
| extensions.alt_outputs | object | - | Alternate output formats. See below. |
| extensions.alt_outputs.wlbb | boolean | false | Enable word-level bounding boxes (PDF only). Results in response.extensions.alt_outputs.wlbb. |
| extensions.alt_outputs.return_html | boolean | false | Include HTML representation. response.markdown is still present; HTML is at response.extensions.alt_outputs.html. |
| extensions.alt_outputs.return_xml | boolean | false | Include XML representation (work in progress). |
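The request-to-response mapping described in this section can be written down as a lookup, which is handy when deciding which response fields to read. The sketch covers only the three output paths this page names explicitly; output locations for footnote_references and return_xml are not stated here, so they are left out. `expected_extension_outputs` is a hypothetical helper.

```python
def expected_extension_outputs(extensions: dict) -> list[str]:
    """List the documented response paths for the enabled extensions."""
    paths = []
    if extensions.get("chunking"):
        paths.append("response.extensions.chunking")
    alt_outputs = extensions.get("alt_outputs", {})
    if alt_outputs.get("wlbb"):
        paths.append("response.extensions.alt_outputs.wlbb")
    if alt_outputs.get("return_html"):
        paths.append("response.extensions.alt_outputs.html")
    return paths


request_extensions = {
    "chunking": {"chunk_types": ["semantic", "page"], "chunk_size": 1000},
    "alt_outputs": {"return_html": True},
}
print(expected_extension_outputs(request_extensions))
```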

Storage Options

Control whether extractions are saved to your extraction library:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
| storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn’t exist. |
| storage.folder_id | string (uuid) | - | Target folder ID to save the extraction to. Takes precedence over folder_name. |

Deprecated Fields

The following input fields are deprecated and will be removed in a future version. They are still accepted for backward compatibility.
| Field | Replacement |
| --- | --- |
| show_images | Use figure_processing.show_images |
| chunking | Use extensions.chunking.chunk_types (array instead of comma-separated string) |
| chunk_size | Use extensions.chunking.chunk_size |
| return_html | Use extensions.alt_outputs.return_html |
| structured_output | Use /schema endpoint after extraction. Pass extraction_id + schema_config. Accepts schema, schema_prompt, and effort. |
| schema | Use /schema endpoint after extraction |
| schema_prompt | Use /schema endpoint with schema_config.schema_prompt |
| custom_prompt | No replacement |
| thinking | No replacement |
When legacy input fields are used, the API returns a deprecation warning in the warnings array directing you to the updated field names. See the latest documentation for details.
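The field-for-field replacements in this section can be applied to an existing request body in one pass. A minimal sketch covering the four fields that have a direct in-request replacement; structured_output, schema, and schema_prompt are left untouched because their replacement is a separate /schema call. `migrate_legacy_fields` is a hypothetical helper name.

```python
import copy


def migrate_legacy_fields(body: dict) -> dict:
    """Return a copy of the body with legacy fields moved to their new homes."""
    new = copy.deepcopy(body)  # avoid mutating the caller's nested dicts
    if "show_images" in new:
        new.setdefault("figure_processing", {})["show_images"] = new.pop("show_images")
    if "chunking" in new:
        # Legacy value is a comma-separated string; the new field is an array.
        chunk_types = [s.strip() for s in new.pop("chunking").split(",") if s.strip()]
        chunking = new.setdefault("extensions", {}).setdefault("chunking", {})
        chunking["chunk_types"] = chunk_types
    if "chunk_size" in new:
        chunking = new.setdefault("extensions", {}).setdefault("chunking", {})
        chunking["chunk_size"] = new.pop("chunk_size")
    if "return_html" in new:
        alt_outputs = new.setdefault("extensions", {}).setdefault("alt_outputs", {})
        alt_outputs["return_html"] = new.pop("return_html")
    return new


legacy = {"file_url": "https://example.com/doc.pdf",
          "chunking": "semantic, page", "chunk_size": 500}
print(migrate_legacy_fields(legacy))
```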

Response

When you submit a document for async extraction, you’ll receive a response containing the job metadata:
{
  "job_id": "abc123-def456-ghi789",
  "status": "pending",
  "queuedAt": "2025-01-15T10:30:00Z"
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| job_id | string | Unique identifier for the extraction job. Use this to poll for results with the Poll Job endpoint. |
| status | string | Initial job status. Typically pending when first submitted. |
| queuedAt | string | ISO 8601 timestamp indicating when the job was accepted. |

Retrieving Results

After submitting an async extraction, poll the job status endpoint to retrieve results:
GET /job/{job_id}
The job status endpoint will return the extraction results once the job is completed. See the Poll Job documentation for details on the response structure.
For detailed information on the extraction output format (markdown, bounding boxes, chunks, etc.), see the Extract documentation.

Example Usage

Submit Async Extraction

import time
from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Submit async extraction.
# extract_async is deprecated; client.extract(..., async_=True) is the replacement.
submission = client.extract_async(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
)

print(f"Job ID: {submission.job_id}")
print(f"Status: {submission.status}")

# Poll for completion
job_id = submission.job_id
while True:
    job_status = client.jobs.get_job(job_id=job_id)
    print(f"Status: {job_status.status}")
    
    if job_status.status == "completed":
        print("Extraction complete!")
        print(f"Result: {job_status.result}")
        break
    elif job_status.status in ["failed", "canceled"]:
        print(f"Job ended: {job_status.status}")
        if job_status.error:
            print(f"Error: {job_status.error}")
        break
    
    time.sleep(2)
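The loop above polls every two seconds with no upper bound. For production use, a bounded variant with a timeout and exponential backoff may be preferable. The sketch below is SDK-independent: `wait_for_job` and its parameters are hypothetical, and it takes any zero-argument callable that returns the current status string. The terminal statuses are the ones the loop above checks for.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_job(
    fetch_status: Callable[[], str],
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off, capped at max_delay_s


# Demo with a fake status source so the sketch runs without the SDK.
fake_statuses = iter(["pending", "processing", "completed"])
print(wait_for_job(lambda: next(fake_statuses), sleep=lambda _: None))
```

With the SDK client from the example above, the callable would be something like `lambda: client.jobs.get_job(job_id=job_id).status`.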

With Structured Output

schema = {
    "type": "object",
    "properties": {
        "total": {"type": "number"},
        "vendor": {"type": "string"}
    }
}

# structured_output is deprecated; prefer the /schema endpoint after extraction.
submission = client.extract_async(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    structured_output={
        "schema": schema,
        "schema_prompt": "Extract the invoice total"
    }
)

Cancel a Job

# Cancel a running job
cancellation = client.jobs.cancel_job(job_id=job_id)
print(f"Canceled: {cancellation.message}")

# Verify cancellation
status = client.jobs.get_job(job_id=job_id)
print(f"Status: {status.status}")  # Should be "canceled"

Authorizations

x-api-key
string
header
required

API key for authentication
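For callers not using the SDK, the key goes in the x-api-key request header. A minimal sketch using only the standard library that builds (but does not send) a status request for GET /job/{job_id}. The base URL below is a placeholder, since this page does not state the API host, and `build_job_request` is a hypothetical helper.

```python
import urllib.request


def build_job_request(base_url: str, job_id: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET /job/{job_id} request without sending it."""
    url = f"{base_url.rstrip('/')}/job/{job_id}"
    return urllib.request.Request(url, headers={"x-api-key": api_key}, method="GET")


# Placeholder host: substitute the real API base URL.
request = build_job_request("https://api.example.invalid", "abc123", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```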

Body

Input schema for multipart/form-data requests (file upload or file_url).

file
file
required

Document to upload directly. Required unless file_url is specified.

file_url
string<uri>

Public or pre-signed URL that Pulse will download and extract.

model
enum<string>

Extraction model to use. pulse-ultra-2 uses Pulse's vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. Omit or pass default for standard extraction.

Available options: default, pulse-ultra-2
pages
string

Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5.

Pattern: ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$
figure_processing
object

Settings that control how figures and embedded visuals are processed. Applies to both PDFs/images (figures detected from layout) and spreadsheets (charts and embedded images read directly from the workbook). Affects the markdown output and the bounding_boxes.Images[] array.

spreadsheet
object

Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, and sheets, and whether numeric cells are rendered using their display format or underlying raw value. Applies to .xlsx, .xlsm, and .xls files. Accepts both camelCase and snake_case field names.

extensions
object

Settings that enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*.

storage
object

Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created.

async
boolean
default:false

If true, returns immediately with a job_id for polling via GET /job/{jobId}. Otherwise processes synchronously.

refine
boolean
default:false

Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1-2s per page. Overridden by refine_options if both are provided. Requires model: pulse-ultra-2.

refine_options
object

Granular refinement targets. Takes precedence over the boolean refine flag. Requires model: pulse-ultra-2.

extract_figure
boolean
default:false

Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags inside response.markdown. Useful for financial decks, dashboards, and scientific charts. Requires model: pulse-ultra-2.

figure_description
boolean
default:false

Generate a 1-2 paragraph natural-language description of each picture, wrapped in <figure-description> tags inside response.markdown. Combines well with extract_figure. Requires model: pulse-ultra-2.

additional_prompt
string
default:""

Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Requires model: pulse-ultra-2.

Maximum string length: 4000
custom_image_prompt
string
default:""

Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation for your domain. Requires model: pulse-ultra-2.

Maximum string length: 2000
custom_refine_prompt
string
default:""

Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Requires model: pulse-ultra-2.

Maximum string length: 2000
chunking
string
deprecated

Deprecated -- Use extensions.chunking.chunk_types instead. Comma-separated list of chunking strategies.

chunk_size
integer
deprecated

Deprecated -- Use extensions.chunking.chunk_size instead.

Required range: x >= 1
show_images
boolean
default:false
deprecated

Deprecated -- Use figure_processing.show_images instead.

return_html
boolean
default:false
deprecated

Deprecated -- Use extensions.alt_outputs.return_html instead.

Response

Asynchronous extraction job accepted

Acknowledgement returned when a request is submitted for asynchronous processing. Poll GET /job/{job_id} to check status and retrieve results.

job_id
string
required

Identifier assigned to the asynchronous job.

status
enum<string>
required

Initial status reported by the server.

Available options: pending, processing
message
string

Human-readable description of the accepted job.