POST /extract_async
Extract Document Async
curl --request POST \
  --url https://api.runpulse.com/extract_async \
  --header 'Content-Type: multipart/form-data' \
  --header 'x-api-key: <api-key>' \
  --form file='@example-file' \
  --form 'file_url=<string>' \
  --form 'structured_output={
  "schema": {},
  "schema_prompt": "<string>"
}' \
  --form 'pages=<string>' \
  --form 'chunking=<string>' \
  --form chunk_size=2 \
  --form extract_figure=false \
  --form figure_description=false \
  --form return_html=false \
  --form 'storage={
  "enabled": true,
  "folder_name": "<string>"
}' \
  --form 'schema={}' \
  --form 'schema_prompt=<string>' \
  --form 'experimental_schema={}' \
  --form 'custom_prompt=<string>' \
  --form thinking=false
{
  "job_id": "<string>",
  "status": "pending",
  "message": "Job queued for processing",
  "queuedAt": "2023-11-07T05:31:56Z"
}

Overview

The asynchronous extraction endpoint accepts the same input parameters as the synchronous /extract endpoint but returns immediately with a job identifier. Use this endpoint for:
  • Large documents that may take longer to process
  • Batch processing workflows
  • Non-blocking integrations

Request

Document Source

Provide the document using one of these methods:
| Field | Type | Description |
| --- | --- | --- |
| file | binary | Document file to upload directly (multipart/form-data). |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |
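As a rough illustration of choosing between the two document sources, the helper below decides which multipart part should carry the document. The function name `build_document_source` and its return shape are hypothetical, and rejecting requests that supply both sources is an assumption; the reference only says that `file` is required unless `file_url` is specified.

```python
from pathlib import Path
from typing import Optional, Union


def build_document_source(
    file_path: Optional[Union[str, Path]] = None,
    file_url: Optional[str] = None,
) -> dict:
    """Pick the multipart part that carries the document (hypothetical helper).

    Assumption: exactly one of file_path or file_url should be supplied.
    """
    if (file_path is None) == (file_url is None):
        raise ValueError("Provide exactly one of file_path or file_url")
    if file_url is not None:
        # Remote document: sent as an ordinary form field.
        return {"data": {"file_url": file_url}, "file_path": None}
    # Local document: the caller opens this path and sends it as the "file" part.
    return {"data": {}, "file_path": Path(file_path)}


source = build_document_source(file_url="https://example.com/statement.pdf")
print(source["data"])
```

With an HTTP client such as `requests`, the `data` mapping would typically go in the `data=` argument and an opened file handle in `files={"file": handle}`, alongside the `x-api-key` header.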

Extraction Options

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| structured_output | object | - | Recommended method for schema-guided extraction. Contains schema (JSON schema) and optional schema_prompt (natural language instructions). |
| pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
| chunking | string | - | Comma-separated list of chunking strategies (e.g., semantic,header,page,recursive). |
| chunk_size | integer | - | Maximum characters per chunk when chunking is enabled. |
| extract_figure | boolean | false | Enable figure extraction in results. |
| figure_description | boolean | false | Generate descriptive captions for extracted figures. |
| return_html | boolean | false | Include HTML representation alongside markdown in the response. |
| storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
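To make the pages syntax concrete, here is a small client-side sketch that expands a value such as 1-2,5 into explicit page numbers before it is sent. The function `parse_pages` is hypothetical and only mirrors the format described above; how the service treats duplicates, ordering, or out-of-range pages is not documented here and is not modeled.

```python
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Expand a 1-indexed page spec like "1-2,5" into a list of page numbers."""
    pages: List[int] = []
    for segment in spec.split(","):
        segment = segment.strip()
        if not segment:
            raise ValueError(f"Empty segment in page spec: {spec!r}")
        if "-" in segment:
            start_text, end_text = segment.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(segment)
        # Pages are 1-indexed, and a range must not run backwards.
        if start < 1 or end < start:
            raise ValueError(f"Invalid page segment: {segment!r}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_pages("1-2,5"))  # [1, 2, 5]
```

Running a check like this before submission can catch typos such as 0-3 or 5-2 without waiting for a queued job to fail.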

Storage Options

Control whether extractions are saved to your extraction library:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
| storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn’t exist. |
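Because the endpoint takes multipart/form-data, object-valued options such as storage travel as a single form field; the curl sample sends them as JSON text and booleans as lowercase true/false. The sketch below shows one way to flatten Python values into that shape. The helper `to_form_fields` is hypothetical, and the encoding rules are inferred from the curl sample rather than from a stated specification.

```python
import json
from typing import Any, Dict


def to_form_fields(options: Dict[str, Any]) -> Dict[str, str]:
    """Convert Python option values into multipart form strings (hypothetical helper)."""
    fields: Dict[str, str] = {}
    for name, value in options.items():
        if value is None:
            continue  # Omit unset options entirely.
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase text, as in the curl sample.
            fields[name] = "true" if value else "false"
        elif isinstance(value, (dict, list)):
            # Nested objects such as storage are sent as one JSON-encoded field.
            fields[name] = json.dumps(value)
        else:
            fields[name] = str(value)
    return fields


fields = to_form_fields(
    {
        "storage": {"enabled": False, "folder_name": "invoices"},
        "extract_figure": True,
    }
)
print(fields["storage"])
```

Here storage.enabled is set to false to request a temporary extraction, and the folder name "invoices" is only a placeholder.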

Deprecated Fields

The following fields are deprecated and will be removed in a future version:
| Field | Replacement |
| --- | --- |
| schema | Use structured_output.schema instead |
| schema_prompt | Use structured_output.schema_prompt instead |
| experimental_schema | Use structured_output.schema instead |
| custom_prompt | No replacement |
| thinking | No replacement |
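For callers still passing the deprecated fields, a migration along the lines of the table above might look like the following. The function `migrate_deprecated` is hypothetical. Two choices in it are assumptions rather than documented behavior: an existing structured_output value takes precedence over deprecated fields (and schema over experimental_schema), and the fields with no replacement are simply dropped.

```python
from typing import Any, Dict


def migrate_deprecated(params: Dict[str, Any]) -> Dict[str, Any]:
    """Fold deprecated request fields into structured_output (hypothetical helper)."""
    migrated = dict(params)
    structured = dict(migrated.pop("structured_output", None) or {})

    # schema and experimental_schema both map to structured_output.schema.
    for old_name in ("schema", "experimental_schema"):
        value = migrated.pop(old_name, None)
        if value is not None:
            structured.setdefault("schema", value)

    prompt = migrated.pop("schema_prompt", None)
    if prompt is not None:
        structured.setdefault("schema_prompt", prompt)

    # Assumption: fields with no replacement are dropped rather than forwarded.
    for removed in ("custom_prompt", "thinking"):
        migrated.pop(removed, None)

    if structured:
        migrated["structured_output"] = structured
    return migrated


print(migrate_deprecated({"schema": {"type": "object"}, "thinking": True}))
```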

Response

When you submit a document for async extraction, you’ll receive a response containing the job metadata:
{
  "job_id": "abc123-def456-ghi789",
  "status": "pending",
  "queuedAt": "2025-01-15T10:30:00Z"
}

Response Fields

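| Field | Type | Description |
| --- | --- | --- |
| job_id | string | Unique identifier for the extraction job. Use this to poll for results with the Poll Job endpoint. |
| status | string | Initial job status. Typically pending when first submitted. |
| message | string | Optional human-readable status message, for example "Job queued for processing". |
| queuedAt | string | ISO 8601 timestamp indicating when the job was accepted. |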
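When calling the endpoint over raw HTTP rather than through the SDK, the response body can be mapped onto a small typed record. The sketch below is one possible shape: the `JobSubmission` dataclass and `parse_submission` function are hypothetical names, and only job_id and status are treated as mandatory, matching the fields marked required in the response schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class JobSubmission:
    job_id: str
    status: str
    queued_at: Optional[datetime] = None
    message: Optional[str] = None


def parse_submission(payload: dict) -> JobSubmission:
    """Build a JobSubmission from the decoded JSON response (hypothetical helper)."""
    raw = payload.get("queuedAt")
    queued_at = None
    if raw:
        # Older Python versions do not accept a trailing "Z" in fromisoformat,
        # so it is rewritten as an explicit UTC offset first.
        queued_at = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return JobSubmission(
        job_id=payload["job_id"],
        status=payload["status"],
        queued_at=queued_at,
        message=payload.get("message"),
    )


submission = parse_submission(
    {
        "job_id": "abc123-def456-ghi789",
        "status": "pending",
        "queuedAt": "2025-01-15T10:30:00Z",
    }
)
print(submission.job_id, submission.queued_at)
```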

Retrieving Results

After submitting an async extraction, poll the job status endpoint to retrieve results:
GET /job/{job_id}
The job status endpoint will return the extraction results once the job is completed. See the Poll Job documentation for details on the response structure.
For detailed information on the extraction output format (markdown, bounding boxes, chunks, etc.), see the Extract documentation.
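A polling loop generally benefits from an overall timeout and a growing delay between requests. The following sketch shows that pattern independently of any particular client: `wait_for_job` is a hypothetical helper, `fetch_status` stands in for whatever call retrieves GET /job/{job_id}, and the assumption that the job payload is a dict with a status key is illustrative only. The terminal statuses are taken from the status values listed in this reference.

```python
import time
from typing import Callable

# Statuses after which the job will not change again.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    timeout: float = 300.0,
    initial_delay: float = 1.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        job = fetch_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"Job still {job['status']!r} after {timeout} seconds")
        sleep(delay)
        # Double the wait each round, capped at max_delay.
        delay = min(delay * 2, max_delay)


# Simulated status sequence in place of real HTTP calls.
responses = iter(
    [{"status": "pending"}, {"status": "processing"}, {"status": "completed"}]
)
final = wait_for_job(lambda: next(responses), sleep=lambda seconds: None)
print(final["status"])
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real waiting; in production code the default `time.sleep` applies.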

Example Usage

Submit Async Extraction

import time
from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Submit async extraction
submission = client.extract_async(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    extract_figure=True
)

print(f"Job ID: {submission.job_id}")
print(f"Status: {submission.status}")

# Poll for completion
job_id = submission.job_id
while True:
    job_status = client.jobs.get_job(job_id=job_id)
    print(f"Status: {job_status.status}")
    
    if job_status.status == "completed":
        print("Extraction complete!")
        print(f"Result: {job_status.result}")
        break
    elif job_status.status in ["failed", "canceled"]:
        print(f"Job ended: {job_status.status}")
        if job_status.error:
            print(f"Error: {job_status.error}")
        break
    
    time.sleep(2)

With Structured Output

schema = {
    "type": "object",
    "properties": {
        "total": {"type": "number"},
        "vendor": {"type": "string"}
    }
}

submission = client.extract_async(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    structured_output={
        "schema": schema,
        "schema_prompt": "Extract the invoice total and vendor name"
    }
)

Cancel a Job

# Cancel a running job
cancellation = client.jobs.cancel_job(job_id=job_id)
print(f"Canceled: {cancellation.message}")

# Verify cancellation
status = client.jobs.get_job(job_id=job_id)
print(f"Status: {status.status}")  # Should be "canceled"

Authorizations

x-api-key
string
header
required

API key for authentication

Body

Input schema for multipart/form-data requests (file upload or file_url).

file
file
required

Document to upload directly. Required unless file_url is specified.

file_url
string<uri>

Public or pre-signed URL that Pulse will download and extract.

structured_output
object

Recommended method for schema-guided extraction. Contains the schema and optional prompt in a single object.

pages
string

Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5.

chunking
string

Comma-separated list of chunking strategies to apply (for example semantic,header,page,recursive).

chunk_size
integer

Override for maximum characters per chunk when chunking is enabled.

Required range: x >= 1
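
The chunking and chunk_size constraints can be checked on the client before a job is queued. The sketch below joins strategy names into the comma-separated form and enforces the x >= 1 range; `build_chunking_fields` is a hypothetical helper. It deliberately does not validate strategy names, since the names given in this reference are presented as examples rather than an exhaustive list.

```python
from typing import Dict, Iterable, Optional


def build_chunking_fields(
    strategies: Iterable[str],
    chunk_size: Optional[int] = None,
) -> Dict[str, str]:
    """Build the chunking and chunk_size form fields (hypothetical helper)."""
    names = [name.strip() for name in strategies if name.strip()]
    if not names:
        raise ValueError("At least one chunking strategy is required")
    fields = {"chunking": ",".join(names)}
    if chunk_size is not None:
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(chunk_size, bool) or not isinstance(chunk_size, int):
            raise ValueError("chunk_size must be an integer")
        if chunk_size < 1:
            raise ValueError("chunk_size must be >= 1")
        fields["chunk_size"] = str(chunk_size)
    return fields


print(build_chunking_fields(["semantic", "page"], chunk_size=500))
```
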
extract_figure
boolean
default:false

Toggle to enable figure extraction in results.

figure_description
boolean
default:false

Toggle to generate descriptive captions for extracted figures.

return_html
boolean
default:false

Whether to include HTML representation alongside markdown in the response.

storage
object

Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created.

schema
deprecated

⚠️ DEPRECATED - Use structured_output.schema instead. JSON schema describing structured data to extract.

schema_prompt
string
deprecated

⚠️ DEPRECATED - Use structured_output.schema_prompt instead. Natural language prompt for schema-guided extraction.

experimental_schema
deprecated

⚠️ DEPRECATED - Use structured_output.schema instead. Experimental schema definition.

custom_prompt
string
deprecated

⚠️ DEPRECATED - No replacement. Custom instructions that augment the default extraction behaviour.

thinking
boolean
default:false
deprecated

⚠️ DEPRECATED - No replacement. Enables expanded rationale output for debugging.

Response

Asynchronous extraction job accepted

Metadata describing the enqueued asynchronous extraction job.

job_id
string
required

Identifier assigned to the asynchronous extraction job.

status
enum<string>
required

Initial status reported by the extractor.

Available options: pending, processing, completed, failed, canceled
message
string

Human-readable status message.

Example:

"Job queued for processing"

queuedAt
string<date-time>

Timestamp indicating when the job was accepted.