Extract Document

POST /extract
curl --request POST \
  --url https://api.runpulse.com/extract \
  --header 'Content-Type: multipart/form-data' \
  --header 'x-api-key: <api-key>' \
  --form file='@example-file' \
  --form 'file_url=<string>' \
  --form 'structured_output={
  "schema": {},
  "schema_prompt": "<string>"
}' \
  --form 'pages=<string>' \
  --form 'chunking=<string>' \
  --form chunk_size=2 \
  --form extract_figure=false \
  --form figure_description=false \
  --form return_html=false \
  --form 'storage={
  "enabled": true,
  "folder_name": "<string>"
}' \
  --form 'schema={}' \
  --form 'schema_prompt=<string>' \
  --form 'experimental_schema={}' \
  --form 'custom_prompt=<string>' \
  --form thinking=false
{
  "content": "<string>",
  "html": "<string>",
  "job_id": "<string>",
  "warnings": [
    "<string>"
  ],
  "metadata": {},
  "schema_data": {}
}

Overview

Synchronously extract content from documents. Best for files under 50 pages. Returns markdown- or HTML-formatted content, with optional structured data extraction. For documents of 70 or more pages, results are returned via an S3 URL rather than inline.
For large documents or batch processing workflows, consider using the /extract_async endpoint to avoid timeout issues.
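
The guidance above can be expressed as a small routing rule. This is a minimal sketch, not part of the Pulse SDK: the function name `choose_endpoint` is hypothetical, the caller is assumed to know the page count in advance, and the 50-page cutoff is only the "best for" figure quoted above, not a limit enforced by the API.

```python
SYNC_PAGE_GUIDELINE = 50  # "best for files under 50 pages": a guideline, not an API limit


def choose_endpoint(page_count: int) -> str:
    """Return the endpoint path suggested by the guidance for a page count."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Small documents suit the synchronous endpoint; larger ones risk timeouts.
    return "/extract" if page_count < SYNC_PAGE_GUIDELINE else "/extract_async"
```

A caller might use the returned path to decide which request to issue before uploading anything.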

Request

Document Source

Provide the document using one of these methods:
| Field | Type | Description |
| --- | --- | --- |
| file | binary | Document file to upload directly (multipart/form-data). |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |

Extraction Options

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| structured_output | object | - | Recommended method for schema-guided extraction. Contains schema (JSON schema) and optional schema_prompt (natural language instructions). |
| pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
| chunking | string | - | Comma-separated list of chunking strategies (e.g., semantic,header,page,recursive). |
| chunk_size | integer | - | Maximum characters per chunk when chunking is enabled. |
| extract_figure | boolean | false | Enable figure extraction in results. |
| figure_description | boolean | false | Generate descriptive captions for extracted figures. |
| return_html | boolean | false | Include HTML representation alongside markdown in the response. |
| storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
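
When calling the endpoint without the SDK, these options have to be encoded as multipart form fields. The sketch below shows one way to do that, following the curl example at the top of this page: object fields are sent as JSON text and booleans as lowercase `true`/`false`. The helper name `build_form_fields` is hypothetical, and the encoding is inferred from that curl example rather than from a formal specification.

```python
import json
from typing import Any, Dict, Iterable, Optional, Union


def build_form_fields(
    file_url: Optional[str] = None,
    structured_output: Optional[Dict[str, Any]] = None,
    pages: Optional[str] = None,
    chunking: Optional[Union[str, Iterable[str]]] = None,
    chunk_size: Optional[int] = None,
    extract_figure: bool = False,
    figure_description: bool = False,
    return_html: bool = False,
    storage: Optional[Dict[str, Any]] = None,
) -> Dict[str, str]:
    """Encode extraction options as string form fields (hypothetical helper)."""
    fields: Dict[str, str] = {}
    if file_url is not None:
        fields["file_url"] = file_url
    # Object-valued options are sent as JSON text, as in the curl example.
    for name, value in (("structured_output", structured_output), ("storage", storage)):
        if value is not None:
            fields[name] = json.dumps(value)
    if pages is not None:
        fields["pages"] = pages
    if chunking is not None:
        # Accept either a ready-made string or a list of strategy names.
        fields["chunking"] = chunking if isinstance(chunking, str) else ",".join(chunking)
    if chunk_size is not None:
        if chunk_size < 1:
            raise ValueError("chunk_size must be >= 1")
        fields["chunk_size"] = str(chunk_size)
    # Booleans are written as lowercase true/false, matching the curl example.
    for name, flag in (
        ("extract_figure", extract_figure),
        ("figure_description", figure_description),
        ("return_html", return_html),
    ):
        fields[name] = "true" if flag else "false"
    return fields
```

The resulting dictionary could be passed as the form data of any HTTP client, alongside the `x-api-key` header.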

Storage Options

Control whether extractions are saved to your extraction library:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
| storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn’t exist. |

Deprecated Fields

The following fields are deprecated and will be removed in a future version:
| Field | Replacement |
| --- | --- |
| schema | Use structured_output.schema instead |
| schema_prompt | Use structured_output.schema_prompt instead |
| experimental_schema | Use structured_output.schema instead |
| custom_prompt | No replacement |
| thinking | No replacement |
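
Existing call sites that still pass deprecated fields can be migrated mechanically using the table above. The following is a minimal sketch: `migrate_deprecated` is a hypothetical helper, and it assumes that values already present in `structured_output` should win over deprecated ones, and that `schema` takes priority over `experimental_schema` when both are supplied.

```python
from typing import Any, Dict, List, Tuple


def migrate_deprecated(params: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (migrated params, notes) without mutating the input."""
    migrated = dict(params)
    notes: List[str] = []
    structured = dict(migrated.get("structured_output") or {})

    # schema and experimental_schema both map to structured_output.schema.
    for old in ("schema", "experimental_schema"):
        if old in migrated:
            value = migrated.pop(old)
            structured.setdefault("schema", value)  # first value seen wins
            notes.append(f"{old} -> structured_output.schema")

    if "schema_prompt" in migrated:
        structured.setdefault("schema_prompt", migrated.pop("schema_prompt"))
        notes.append("schema_prompt -> structured_output.schema_prompt")

    # These fields have no replacement, so they are dropped.
    for old in ("custom_prompt", "thinking"):
        if old in migrated:
            migrated.pop(old)
            notes.append(f"{old} dropped (no replacement)")

    if structured:
        migrated["structured_output"] = structured
    return migrated, notes
```

The returned notes can be logged so that each dropped or moved field is visible during the migration.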

Response

The response structure depends on document size: documents under 70 pages return results inline, while documents of 70 or more pages return a URL pointing to the results.

Standard Response (Under 70 Pages)

For documents under 70 pages, results are returned directly in the response body:
{
  "markdown": "# Document Title\n\nExtracted content...",
  "page_count": 15,
  "job_id": "abc123-def456-ghi789",
  "plan-info": {
    "pages_used": 15,
    "tier": "standard",
    "note": "Pulse Ultra"
  },
  "bounding_boxes": {
    "Title": [...],
    "Paragraphs": [...],
    "Tables": [...],
    "Images": [...],
    "markdown_with_ids": "<p data-bb-text-id=\"txt-1\">..."
  },
  "extraction_url": "https://platform.runpulse.com/dashboard/extractions/abc123"
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| markdown | string | Clean markdown content extracted from the document. |
| page_count | integer | Total number of pages processed. |
| job_id | string | Unique identifier for the extraction job. |
| plan-info | object | Billing information including pages used and plan tier. |
| bounding_boxes | object | Detailed bounding box data for document elements. See Bounding Boxes for details. |
| extraction_url | string | URL to view the extraction in the Pulse Platform (when storage is enabled). |
| html | string | HTML representation of the document (when return_html is true). |
| structured_output | object | Extracted data matching your schema (when structured_output is provided). |
| chunks | array | Document chunks (when chunking is enabled). |
| figures | array | Extracted figures (when extract_figure is true). |
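
The endpoint summary at the top of this page names the text and structured fields `content` and `schema_data`, while the table above names them `markdown` and `structured_output`. Which names a given response carries is not confirmed here, so a client working with raw JSON may want to accept either. The sketch below does that; `first_present` and `normalize_result` are hypothetical helpers, not SDK functions.

```python
from typing import Any, Dict, Optional


def first_present(payload: Dict[str, Any], *keys: str) -> Optional[Any]:
    """Return the value of the first key found in payload, else None."""
    for key in keys:
        if key in payload:
            return payload[key]
    return None


def normalize_result(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Map either naming variant onto one shape (hypothetical helper)."""
    return {
        "markdown": first_present(payload, "markdown", "content"),
        "structured": first_present(payload, "structured_output", "schema_data"),
        "html": payload.get("html"),
        "job_id": payload.get("job_id"),
    }
```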

Large Document Response (70+ Pages)

For documents with 70 or more pages, the API returns a URL to fetch the complete results. This prevents timeout issues and reduces response payload size.
{
  "is_url": true,
  "url": "https://pulse-studio-api.s3.amazonaws.com/results/abc123.json",
  "plan-info": {
    "pages_used": 150,
    "tier": "standard"
  }
}

Large Document Response Fields

| Field | Type | Description |
| --- | --- | --- |
| is_url | boolean | Always true for large document responses. Use this to detect URL-based responses. |
| url | string | Pre-signed S3 URL containing the complete extraction results. Expires after 24 hours. |
| plan-info | object | Billing information including pages used and plan tier. |

Handling Large Document Responses

import requests
from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# The SDK handles large document responses automatically
response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
)

# If the response contains is_url, fetch the full result from S3
if getattr(response, "is_url", False):
    result = requests.get(response.url, timeout=60)
    result.raise_for_status()  # surface an expired or invalid URL early
    full_result = result.json()
    print(full_result["markdown"])
else:
    print(response.content)

The S3 URL expires after 24 hours. If you need to access results after this period, ensure storage.enabled is true and retrieve results from your extraction library.
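
If result URLs are queued for later download, it can help to check the 24-hour window before fetching. The sketch below is illustrative only: `url_expired` is a hypothetical helper, it assumes the clock starts when the response is received (the exact start of the window is not stated here), and both timestamps must be timezone-aware.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

URL_LIFETIME = timedelta(hours=24)  # expiry period stated above


def url_expired(received_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True once 24 hours have passed since the response was received."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - received_at >= URL_LIFETIME
```

When this returns True, the stored extraction (if storage was enabled) is the fallback described above.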

Example Usage

Basic Extraction

from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# Extract from URL
response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    extract_figure=True,
    return_html=True
)

print(f"Content: {response.content}")
print(f"Job ID: {response.job_id}")

File Upload

# Upload and extract a local file
with open("document.pdf", "rb") as f:
    response = client.extract(
        file=f,
        extract_figure=True
    )

Structured Output

schema = {
    "type": "object",
    "properties": {
        "total": {"type": "number"},
        "vendor": {"type": "string"}
    }
}

response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    structured_output={
        "schema": schema,
        "schema_prompt": "Extract invoice total and vendor"
    }
)

Page Range and Chunking

response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    pages="1-5,10",  # 1-indexed
    chunking="semantic,page",
    chunk_size=1000
)
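
To check a `pages` value before sending it, the filter string can be expanded locally. This is a minimal sketch of the documented syntax (comma-separated segments, each a single page or a `start-end` range, 1-indexed); `expand_pages` is a hypothetical helper, and treating ranges as inclusive at both ends is an assumption.

```python
from typing import List


def expand_pages(spec: str) -> List[int]:
    """Expand a page filter such as "1-5,10" into sorted 1-indexed page numbers."""
    pages = set()
    for segment in spec.split(","):
        segment = segment.strip()
        if not segment:
            raise ValueError(f"empty segment in {spec!r}")
        start_text, sep, end_text = segment.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid segment {segment!r}")
        # Assumption: ranges are inclusive at both ends.
        pages.update(range(start, end + 1))
    return sorted(pages)
```

For the example above, `expand_pages("1-5,10")` yields pages 1 through 5 plus page 10.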

Disable Storage

response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    storage={"enabled": False}
)

Authorizations

x-api-key
string
header
required

API key for authentication
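
For raw HTTP calls, the key is sent in the `x-api-key` header. The sketch below reads it from the environment instead of hard-coding it; the variable name `PULSE_API_KEY` and the helper `auth_headers` are assumptions made for illustration, not names defined by Pulse.

```python
import os
from typing import Dict, Mapping


def auth_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the authentication header from an environment mapping."""
    api_key = env.get("PULSE_API_KEY", "").strip()  # hypothetical variable name
    if not api_key:
        raise RuntimeError("PULSE_API_KEY is not set")
    return {"x-api-key": api_key}
```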

Body

Input schema for multipart/form-data requests (file upload or file_url).

file
file
required

Document to upload directly. Required unless file_url is specified.

file_url
string<uri>

Public or pre-signed URL that Pulse will download and extract.

structured_output
object

Recommended method for schema-guided extraction. Contains the schema and optional prompt in a single object.

pages
string

Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5.

chunking
string

Comma-separated list of chunking strategies to apply (for example semantic,header,page,recursive).

chunk_size
integer

Override for maximum characters per chunk when chunking is enabled.

Required range: x >= 1

extract_figure
boolean
default:false

Toggle to enable figure extraction in results.

figure_description
boolean
default:false

Toggle to generate descriptive captions for extracted figures.

return_html
boolean
default:false

Whether to include HTML representation alongside markdown in the response.

storage
object

Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created.

schema
deprecated

⚠️ DEPRECATED - Use structured_output.schema instead. JSON schema describing structured data to extract.

schema_prompt
string
deprecated

⚠️ DEPRECATED - Use structured_output.schema_prompt instead. Natural language prompt for schema-guided extraction.

experimental_schema
deprecated

⚠️ DEPRECATED - Use structured_output.schema instead. Experimental schema definition.

custom_prompt
string
deprecated

⚠️ DEPRECATED - No replacement. Custom instructions that augment the default extraction behaviour.

thinking
boolean
default:false
deprecated

⚠️ DEPRECATED - No replacement. Enables expanded rationale output for debugging.

Response

Synchronous extraction result

High-level structure returned by the synchronous extract API.

content
string

Primary markdown content extracted from the document.

html
string

Optional HTML representation when return_html is true.

job_id
string

Identifier assigned to the extraction job.

warnings
string[]

Non-fatal warnings generated during extraction.

metadata
object

Additional metadata supplied by the backend.

schema_data
object

Extracted structured data if schema was provided.
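
When working with the raw JSON rather than the SDK's response object, the fields listed above can be given a typed shape. This is a sketch under stated assumptions: `ExtractResult` is a hypothetical class that mirrors only the six fields in this section, and any other keys in the payload are ignored.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional


@dataclass
class ExtractResult:
    """Typed view of the synchronous response fields listed above (hypothetical)."""

    content: Optional[str] = None
    html: Optional[str] = None
    job_id: Optional[str] = None
    warnings: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)
    schema_data: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, payload: Dict[str, Any]) -> "ExtractResult":
        """Build a result from decoded JSON, ignoring unknown or null keys."""
        known = {f.name for f in fields(cls)}
        kwargs = {k: v for k, v in payload.items() if k in known and v is not None}
        return cls(**kwargs)
```

Non-fatal warnings can then be inspected with a plain loop over `result.warnings` before the content is used.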