POST /extract_async
Python SDK
# DEPRECATED — use client.extract(..., async_=True) instead
from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")
response = client.extract(
    file_url="https://example.com/document.pdf",
    async_=True
)
print(response.job_id)
{
  "job_id": "<string>",
  "status": "pending",
  "message": "<string>"
}
Deprecated: This endpoint is deprecated. Use /extract with async: true instead.

Overview

The asynchronous extraction endpoint accepts the same input parameters as the synchronous /extract endpoint but returns immediately with a job identifier. Use this endpoint for:
  • Large documents that may take longer to process
  • Batch processing workflows
  • Non-blocking integrations

Migration

Replace calls to /extract_async with /extract and add async: true:
- POST /extract_async
- {"file_url": "https://example.com/doc.pdf"}

+ POST /extract
+ {"file_url": "https://example.com/doc.pdf", "async": true}
The response format is identical.
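
The rewrite above is mechanical, so it can be applied in one place before requests are sent. The following is a minimal sketch, assuming requests are represented as a path plus a JSON-style dictionary; the helper name `migrate_async_request` is hypothetical and not part of the Pulse SDK.

```python
from typing import Any


def migrate_async_request(path: str, body: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Rewrite a legacy /extract_async call as /extract with async: true.

    Hypothetical helper for illustration; it is not part of the Pulse SDK.
    """
    if path.rstrip("/") != "/extract_async":
        # Not a legacy call: return an unchanged copy of the body.
        return path, dict(body)
    migrated = dict(body)      # copy so the caller's dictionary is not mutated
    migrated["async"] = True   # serialized as JSON true
    return "/extract", migrated


path, body = migrate_async_request(
    "/extract_async", {"file_url": "https://example.com/doc.pdf"}
)
print(path, body)
```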

Request

Document Source

Provide the document using one of these methods:
| Field | Type | Description |
|---|---|---|
| file | binary | Document file to upload directly (multipart/form-data). |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |

Extraction Options

| Field | Type | Default | Description |
|---|---|---|---|
| pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
| figure_processing | object | - | Settings that control how figures in the document are processed. These affect the markdown output directly and do not produce additional output fields. See Figure Processing. |
| extensions | object | - | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under response.extensions.*. See Extensions. |
| storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
| async | boolean | false | If true, returns immediately with a job_id for polling via GET /job/{jobId}. |
| structured_output | object | - | ⚠️ Deprecated. Use the /schema endpoint after extraction instead. Still works for backward compatibility. |
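
A pages value can be checked on the client before a request is sent. The sketch below validates the string against the pattern documented for the pages field in the Body section and expands it into page numbers. The function name `expand_pages` is hypothetical, and rejecting page 0 and descending ranges is an assumption drawn from the 1-indexed description, not confirmed server behavior.

```python
import re

# Pattern documented for the pages field in the Body section.
PAGES_PATTERN = re.compile(r"^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$")


def expand_pages(pages: str) -> list[int]:
    """Expand a pages filter such as "1-2,5" into a list of page numbers."""
    if not PAGES_PATTERN.fullmatch(pages):
        raise ValueError(f"invalid pages filter: {pages!r}")
    result: list[int] = []
    for segment in pages.split(","):
        start, _, end = segment.partition("-")
        first = int(start)
        last = int(end) if end else first
        # Assumption: pages are 1-indexed, so 0 and descending ranges are rejected.
        if first < 1 or last < first:
            raise ValueError(f"invalid page segment: {segment!r}")
        result.extend(range(first, last + 1))
    return result


print(expand_pages("1-2,5"))  # [1, 2, 5]
```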

Figure Processing

Settings under figure_processing control how figures (images, charts, diagrams) in the document are processed. These settings affect the markdown output directly — for example, adding descriptive captions to figures or converting charts into markdown tables. They do not create additional output fields in the response.
| Field | Type | Default | Description |
|---|---|---|---|
| figure_processing.description | boolean | false | Generate descriptive captions for extracted figures. |
| figure_processing.show_images | boolean | false | Embed base64-encoded images inline in figure tags. Increases response size. |

Extensions

Settings under extensions enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. For example, enabling extensions.chunking produces response.extensions.chunking, and enabling extensions.alt_outputs.return_html produces response.extensions.alt_outputs.html.
| Field | Type | Default | Description |
|---|---|---|---|
| extensions.merge_tables | boolean | false | Merge tables that span multiple pages into a single table. |
| extensions.footnote_references | boolean | false | Link footnote markers to their corresponding footnote text. |
| extensions.chunking | object | - | Chunking configuration. See below. |
| extensions.chunking.chunk_types | string[] | - | List of chunking strategies: semantic, header, page, recursive. |
| extensions.chunking.chunk_size | integer | - | Maximum characters per chunk. |
| extensions.alt_outputs | object | - | Alternate output formats. See below. |
| extensions.alt_outputs.wlbb | boolean | false | Enable word-level bounding boxes (PDF only). Results in response.extensions.alt_outputs.wlbb. |
| extensions.alt_outputs.return_html | boolean | false | Include HTML representation. response.markdown is still present; HTML is at response.extensions.alt_outputs.html. |
| extensions.alt_outputs.return_xml | boolean | false | Include XML representation (work in progress). |
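
To show how the nested extensions object fits together, here is a minimal sketch that builds it from a list of chunking strategies. The helper name `build_extensions` is hypothetical. The strategy names come from the table above; requiring chunk_size to be at least 1 is an assumption carried over from the range documented for the deprecated chunk_size field.

```python
from typing import Any

# Chunking strategies listed for extensions.chunking.chunk_types.
ALLOWED_CHUNK_TYPES = {"semantic", "header", "page", "recursive"}


def build_extensions(
    chunk_types: list[str], chunk_size: int, return_html: bool = False
) -> dict[str, Any]:
    """Build the extensions object for an extraction request."""
    unknown = sorted(set(chunk_types) - ALLOWED_CHUNK_TYPES)
    if unknown:
        raise ValueError(f"unknown chunk types: {unknown}")
    # Assumption: the new field shares the x >= 1 range of the deprecated one.
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    extensions: dict[str, Any] = {
        "chunking": {"chunk_types": list(chunk_types), "chunk_size": chunk_size}
    }
    if return_html:
        # HTML is then expected at response.extensions.alt_outputs.html.
        extensions["alt_outputs"] = {"return_html": True}
    return extensions


print(build_extensions(["semantic", "page"], chunk_size=1000, return_html=True))
```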

Storage Options

Control whether extractions are saved to your extraction library:
| Field | Type | Default | Description |
|---|---|---|---|
| storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
| storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn’t exist. |
| storage.folder_id | string (uuid) | - | Target folder ID to save the extraction to. Takes precedence over folder_name. |
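
Because folder_id takes precedence over folder_name, a client can avoid sending both. The sketch below builds a storage object that way; the helper name `build_storage_options` is hypothetical, and validating the UUID locally is a convenience of this example rather than documented API behavior.

```python
import uuid
from typing import Any, Optional


def build_storage_options(
    enabled: bool = True,
    folder_name: Optional[str] = None,
    folder_id: Optional[str] = None,
) -> dict[str, Any]:
    """Build the storage object, sending only the folder field that will be used."""
    storage: dict[str, Any] = {"enabled": enabled}
    if folder_id is not None:
        uuid.UUID(folder_id)  # raises ValueError if the ID is not a valid UUID
        storage["folder_id"] = folder_id  # takes precedence, so folder_name is omitted
    elif folder_name is not None:
        storage["folder_name"] = folder_name
    return storage


print(build_storage_options(enabled=False))  # temporary extraction
print(build_storage_options(folder_name="invoices"))
```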

Deprecated Fields

The following input fields are deprecated and will be removed in a future version. They are still accepted for backward compatibility.
| Field | Replacement |
|---|---|
| extract_figure | No replacement |
| figure_description | Use figure_processing.description |
| show_images | Use figure_processing.show_images |
| chunking | Use extensions.chunking.chunk_types (array instead of comma-separated string) |
| chunk_size | Use extensions.chunking.chunk_size |
| return_html | Use extensions.alt_outputs.return_html |
| structured_output | Use /schema endpoint after extraction. Pass extraction_id + schema_config. Accepts schema, schema_prompt, and effort. |
| schema | Use /schema endpoint after extraction |
| schema_prompt | Use /schema endpoint with schema_config.schema_prompt |
| custom_prompt | No replacement |
| thinking | No replacement |
When legacy input fields are used, the API returns a deprecation warning in the warnings array directing you to the updated field names. See the latest documentation for details.
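
The one-to-one replacements in the table can be applied programmatically when updating an existing integration. The following is a sketch under stated assumptions: the helper name `migrate_legacy_fields` is hypothetical, an explicitly set new-style value is assumed to win over its legacy counterpart, and fields without a direct replacement (extract_figure, custom_prompt, thinking, and the schema-related fields) are left untouched for manual review.

```python
from typing import Any


def migrate_legacy_fields(params: dict[str, Any]) -> dict[str, Any]:
    """Move deprecated top-level fields to their documented nested replacements."""
    out = dict(params)  # shallow copy; the caller's dictionary is not modified
    figure = dict(out.get("figure_processing") or {})
    extensions = dict(out.get("extensions") or {})
    chunking = dict(extensions.get("chunking") or {})
    alt_outputs = dict(extensions.get("alt_outputs") or {})

    # setdefault keeps a new-style value if one is already present (assumption).
    if "figure_description" in out:
        figure.setdefault("description", out.pop("figure_description"))
    if "show_images" in out:
        figure.setdefault("show_images", out.pop("show_images"))
    if "chunking" in out:
        # Legacy value is a comma-separated string; the replacement is an array.
        legacy = out.pop("chunking")
        types = [part.strip() for part in legacy.split(",") if part.strip()]
        chunking.setdefault("chunk_types", types)
    if "chunk_size" in out:
        chunking.setdefault("chunk_size", out.pop("chunk_size"))
    if "return_html" in out:
        alt_outputs.setdefault("return_html", out.pop("return_html"))

    if chunking:
        extensions["chunking"] = chunking
    if alt_outputs:
        extensions["alt_outputs"] = alt_outputs
    if figure:
        out["figure_processing"] = figure
    if extensions:
        out["extensions"] = extensions
    return out


print(migrate_legacy_fields({"file_url": "https://example.com/doc.pdf",
                             "chunking": "semantic, page", "chunk_size": 500}))
```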

Response

When you submit a document for async extraction, you’ll receive a response containing the job metadata:
{
  "job_id": "abc123-def456-ghi789",
  "status": "pending",
  "queuedAt": "2025-01-15T10:30:00Z"
}

Response Fields

| Field | Type | Description |
|---|---|---|
| job_id | string | Unique identifier for the extraction job. Use this to poll for results with the Poll Job endpoint. |
| status | string | Initial job status. Typically pending when first submitted. |
| queuedAt | string | ISO 8601 timestamp indicating when the job was accepted. |
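
When the raw JSON is handled directly instead of through the SDK, the fields above can be parsed into a typed object. This is a minimal sketch; the `JobSubmission` class and `parse_submission` function are hypothetical names, and queuedAt is treated as optional as a defensive assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass
class JobSubmission:
    job_id: str
    status: str
    queued_at: Optional[datetime]


def parse_submission(payload: dict[str, Any]) -> JobSubmission:
    """Parse the async submission response into a typed object."""
    raw = payload.get("queuedAt")
    queued_at = None
    if raw:
        # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
        # so it is rewritten as an explicit UTC offset first.
        queued_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return JobSubmission(
        job_id=payload["job_id"], status=payload["status"], queued_at=queued_at
    )


submission = parse_submission(
    {"job_id": "abc123-def456-ghi789", "status": "pending",
     "queuedAt": "2025-01-15T10:30:00Z"}
)
print(submission.job_id, submission.queued_at)
```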

Retrieving Results

After submitting an async extraction, poll the job status endpoint to retrieve results:
GET /job/{job_id}
The job status endpoint will return the extraction results once the job is completed. See the Poll Job documentation for details on the response structure.
For detailed information on the extraction output format (markdown, bounding boxes, chunks, etc.), see the Extract documentation.
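
A polling loop is usually safer with a timeout and a growing delay between requests. The sketch below is one way to do that. The function name `wait_for_job` and its parameters are hypothetical. The terminal statuses (completed, failed, canceled) are taken from the SDK example on this page, and the timeout and delay values are arbitrary defaults, not documented limits.

```python
import time
from typing import Any, Callable

# Terminal statuses as used in the SDK example on this page.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_job(
    fetch_status: Callable[[], Any],
    timeout: float = 300.0,
    initial_delay: float = 1.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Poll until the job reaches a terminal status or the timeout elapses.

    fetch_status is any zero-argument callable returning an object with a
    .status attribute.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        job = fetch_status()
        if job.status in TERMINAL_STATUSES:
            return job
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"job still {job.status!r} after {timeout} seconds")
        sleep(min(delay, remaining))       # never sleep past the deadline
        delay = min(delay * 2, max_delay)  # exponential backoff with a cap
```

With the SDK client from the examples on this page, a call could look like `wait_for_job(lambda: client.jobs.get_job(job_id=job_id))`. The sleep and clock parameters exist so the loop can be tested without real waiting.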

Example Usage

Submit Async Extraction

import time
from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Submit async extraction
submission = client.extract_async(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
)

print(f"Job ID: {submission.job_id}")
print(f"Status: {submission.status}")

# Poll for completion
job_id = submission.job_id
while True:
    job_status = client.jobs.get_job(job_id=job_id)
    print(f"Status: {job_status.status}")
    
    if job_status.status == "completed":
        print("Extraction complete!")
        print(f"Result: {job_status.result}")
        break
    elif job_status.status in ["failed", "canceled"]:
        print(f"Job ended: {job_status.status}")
        if job_status.error:
            print(f"Error: {job_status.error}")
        break
    
    time.sleep(2)

With Structured Output

schema = {
    "type": "object",
    "properties": {
        "total": {"type": "number"},
        "vendor": {"type": "string"}
    }
}

submission = client.extract_async(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    structured_output={
        "schema": schema,
        "schema_prompt": "Extract the invoice total"
    }
)

Cancel a Job

# Cancel a running job
cancellation = client.jobs.cancel_job(job_id=job_id)
print(f"Cancelled: {cancellation.message}")

# Verify cancellation
status = client.jobs.get_job(job_id=job_id)
print(f"Status: {status.status}")  # Should be "canceled"

Authorizations

| Name | Type | In | Required | Description |
|---|---|---|---|---|
| x-api-key | string | header | yes | API key for authentication |
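
For callers that do not use the SDK, the API key goes in the x-api-key request header. The sketch below only constructs the request with the standard library and does not send it. The base URL `https://api.example.invalid` is a placeholder, not the real Pulse host, and sending a JSON body is an assumption based on the Migration example; the Body section below describes the multipart/form-data schema.

```python
import json
import urllib.request


def build_extract_request(base_url: str, api_key: str, file_url: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated async extraction request."""
    # Assumption: a JSON body is accepted, as in the Migration example.
    body = json.dumps({"file_url": file_url, "async": True}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/extract",
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


request = build_extract_request(
    "https://api.example.invalid",  # placeholder host, replace with the real base URL
    "YOUR_API_KEY",
    "https://example.com/document.pdf",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here because the real host is not stated on this page.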

Body

Input schema for multipart/form-data requests (file upload or file_url).

| Field | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document to upload directly. Required unless file_url is specified. |
| file_url | string<uri> | - | Public or pre-signed URL that Pulse will download and extract. |
| pages | string | - | Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5. Pattern: ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$ |
| figure_processing | object | - | Settings that control how figures in the document are processed. These affect the markdown output directly (e.g. figure descriptions, chart-to-table conversion, image embedding) and do not produce additional output fields in the response. |
| extensions | object | - | Settings that enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. |
| storage | object | - | Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created. |
| async | boolean | false | If true, returns immediately with a job_id for polling via GET /job/{jobId}. Otherwise processes synchronously. |
| chunking | string | - | Deprecated. Use extensions.chunking.chunk_types instead. Comma-separated list of chunking strategies. |
| chunk_size | integer | - | Deprecated. Use extensions.chunking.chunk_size instead. Required range: x >= 1. |
| extract_figure | boolean | false | Deprecated. No replacement. |
| figure_description | boolean | false | Deprecated. Use figure_processing.description instead. |
| show_images | boolean | false | Deprecated. Use figure_processing.show_images instead. |
| return_html | boolean | false | Deprecated. Use extensions.alt_outputs.return_html instead. |

Response

Asynchronous extraction job accepted

Acknowledgement returned when a request is submitted for asynchronous processing. Poll GET /job/{job_id} to check status and retrieve results.

| Field | Type | Required | Description |
|---|---|---|---|
| job_id | string | yes | Identifier assigned to the asynchronous job. |
| status | enum<string> | yes | Initial status reported by the server. Available options: pending, processing. |
| message | string | no | Human-readable description of the accepted job. |