POST /extract

Python SDK
from pulse import Pulse
from pulse.types import (
    ExtractRequestFigureProcessing,
    ExtractRequestExtensions,
    ExtractRequestExtensionsChunking,
    ExtractRequestExtensionsAltOutputs,
)

client = Pulse(api_key="YOUR_API_KEY")

# Basic extraction from URL
response = client.extract(
    file_url="https://example.com/document.pdf"
)
print(response.markdown)
print(response.extraction_id)

# With figure processing and extensions
response = client.extract(
    file_url="https://example.com/document.pdf",
    figure_processing=ExtractRequestFigureProcessing(
        description=True,
    ),
    extensions=ExtractRequestExtensions(
        chunking=ExtractRequestExtensionsChunking(
            chunk_types=["semantic"],
            chunk_size=1000,
        ),
        alt_outputs=ExtractRequestExtensionsAltOutputs(
            return_html=True,
        ),
    ),
)
print(response.extensions.chunking)
print(response.extensions.alt_outputs.html)

# Async extraction
response = client.extract(
    file_url="https://example.com/document.pdf",
    async_=True
)
print(response.job_id)  # poll via client.jobs.get_job(job_id=...)

Example response (200):
{
  "markdown": "<string>",
  "extensions": {
    "chunking": {
      "semantic": [
        "<string>"
      ],
      "header": [
        "<string>"
      ],
      "page": [
        "<string>"
      ],
      "recursive": [
        "<string>"
      ]
    },
    "merge_tables": {},
    "footnote_references": [
      {
        "symbol": "<string>",
        "footnoteTextId": "<string>",
        "referenceTextIds": [
          "<string>"
        ]
      }
    ],
    "alt_outputs": {
      "wlbb": {
        "words": [
          {
            "id": "<string>",
            "text": "<string>",
            "page_number": 2,
            "bounding_box": [
              123
            ],
            "average_word_confidence": 123
          }
        ],
        "error": "<string>"
      },
      "html": "<string>",
      "xml": "<string>"
    }
  },
  "bounding_boxes": {},
  "extraction_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "extraction_url": "<string>",
  "page_count": 2,
  "plan_info": {
    "tier": "<string>",
    "pages_used": 123,
    "note": "<string>"
  },
  "warnings": [
    "<string>"
  ],
  "html": "<string>",
  "chunks": {
    "semantic": [
      "<string>"
    ],
    "header": [
      "<string>"
    ],
    "page": [
      "<string>"
    ],
    "recursive": [
      "<string>"
    ]
  },
  "plan-info": {
    "tier": "<string>",
    "pages_used": 123,
    "note": "<string>"
  },
  "structured_output": {
    "values": {},
    "citations": {}
  },
  "input_schema": {},
  "schema_error": "<string>"
}

Overview

Pipeline Step 1 — Extract is always the first step. After extraction, you can optionally split the document into topics, apply schema extraction to get structured data, or use tables for span-aware table extraction.
Extract content from documents. Returns markdown or HTML formatted content with optional structured data extraction. For documents over 70 pages, results are returned via S3 URL.
For large documents or batch processing workflows, set async: true to process asynchronously and poll for results via GET /job/{job_id}.
To process many files at once, use Batch Extract. It accepts an S3 prefix, local directory, or list of URLs and runs /extract on each file in parallel.

Async Mode

Set async: true to return immediately with a job ID for polling:
{
  "file_url": "https://example.com/document.pdf",
  "async": true
}
Async Response (200):
{
  "job_id": "abc123-def456",
  "status": "pending",
  "message": "Document processing started"
}
Use GET /job/{job_id} to poll for completion.
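A typical polling loop can be sketched as below. The helper takes any status callable (e.g. lambda: client.jobs.get_job(job_id=job_id)); the terminal status names other than "pending" are assumptions here, so check the jobs reference for the exact values:

```python
import time

def poll_until_done(get_status, interval=2.0, timeout=300.0):
    """Call get_status() until the job leaves the 'pending' state or
    the timeout elapses. get_status returns a dict with a 'status' key."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_status()
        if job.get("status") != "pending":
            return job
        time.sleep(interval)
    raise TimeoutError("job did not finish within the timeout")
```

A fixed interval keeps the sketch simple; exponential backoff is a common refinement for long-running documents.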

Request

Document Source

Provide the document using one of these methods:
| Field | Type | Description |
| --- | --- | --- |
| file | binary | Document file to upload directly (multipart/form-data). |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |

Extraction Options

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
| figure_processing | object | - | Settings that control how figures in the document are processed. These affect the markdown output directly and do not produce additional output fields. See Figure Processing. |
| extensions | object | - | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under response.extensions.*. See Extensions. |
| storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
| async | boolean | false | If true, returns immediately with a job_id for polling via GET /job/{job_id}. |
| structured_output | object | - | ⚠️ Deprecated. Use the /schema endpoint after extraction instead. Still works for backward compatibility. |
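Since an invalid pages filter fails only at request time, it can be convenient to validate the value locally first. A minimal sketch, using the regex pattern published for the pages field in the request schema:

```python
import re

# Pattern from the request schema for the `pages` field
# (1-indexed segments like "3" or "1-2", comma-separated).
PAGES_PATTERN = re.compile(r"^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$")

def validate_pages(pages: str) -> str:
    """Raise ValueError if `pages` is not a valid page range filter."""
    if not PAGES_PATTERN.fullmatch(pages):
        raise ValueError(f"invalid pages filter: {pages!r}")
    return pages
```

For example, validate_pages("1-2,5") passes, while "pages 1-2" or an empty string raises before any API call is made.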

Figure Processing

Settings under figure_processing control how figures (images, charts, diagrams) in the document are processed. These settings affect the markdown output directly — for example, adding descriptive captions to figures or converting charts into markdown tables. They do not create additional output fields in the response.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| figure_processing.description | boolean | false | Generate descriptive captions for extracted figures. |
| figure_processing.show_images | boolean | false | Embed base64-encoded images inline in figure tags. Increases response size. |

Extensions

Settings under extensions enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. For example, enabling extensions.chunking produces response.extensions.chunking, and enabling extensions.alt_outputs.return_html produces response.extensions.alt_outputs.html.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| extensions.merge_tables | boolean | false | Merge tables that span multiple pages into a single table. |
| extensions.footnote_references | boolean | false | Link footnote markers to their corresponding footnote text. |
| extensions.chunking | object | - | Chunking configuration. See below. |
| extensions.chunking.chunk_types | string[] | - | List of chunking strategies: semantic, header, page, recursive. |
| extensions.chunking.chunk_size | integer | - | Maximum characters per chunk. |
| extensions.alt_outputs | object | - | Alternate output formats. See below. |
| extensions.alt_outputs.wlbb | boolean | false | Enable word-level bounding boxes (PDF only). Results in response.extensions.alt_outputs.wlbb. |
| extensions.alt_outputs.return_html | boolean | false | Include HTML representation. response.markdown is still present; HTML is at response.extensions.alt_outputs.html. |
| extensions.alt_outputs.return_xml | boolean | false | Include XML representation (work in progress). |

Storage Options

Control whether extractions are saved to your extraction library:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
| storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn't exist. |
| storage.folder_id | string (uuid) | - | Target folder ID to save the extraction to. Takes precedence over folder_name. |

Deprecated Fields

The following input fields are deprecated and will be removed in a future version. They are still accepted for backward compatibility.
| Field | Replacement |
| --- | --- |
| extract_figure | No replacement |
| figure_description | Use figure_processing.description |
| show_images | Use figure_processing.show_images |
| chunking | Use extensions.chunking.chunk_types (array instead of comma-separated string) |
| chunk_size | Use extensions.chunking.chunk_size |
| return_html | Use extensions.alt_outputs.return_html |
| structured_output | Use /schema endpoint after extraction. Pass extraction_id + schema_config. Accepts schema, schema_prompt, and effort. |
| schema | Use /schema endpoint after extraction |
| schema_prompt | Use /schema endpoint with schema_config.schema_prompt |
| custom_prompt | No replacement |
| thinking | No replacement |
When legacy input fields are used, the API returns a deprecation warning in the warnings array directing you to the updated field names. See the latest documentation for details.
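One way to surface those deprecation notices in logs or CI is to filter the warnings array after each call. The exact wording of the notices is not specified here, so this sketch matches on the substring "deprecat":

```python
def deprecation_warnings(warnings):
    """Filter an extract response's `warnings` list down to entries that
    look like deprecation notices (substring match, case-insensitive)."""
    return [w for w in warnings if "deprecat" in w.lower()]

# Usage (assuming `response` is an extract result):
#   for w in deprecation_warnings(response.warnings):
#       print("migrate:", w)
```

Failing a CI job when this list is non-empty is a simple way to catch legacy parameter usage before the fields are removed.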

Response

The response structure varies based on document size to optimize for different use cases.

Standard Response (Under 70 Pages)

For documents under 70 pages, results are returned directly in the response body:
{
  "markdown": "# Document Title\n\nExtracted content...",
  "page_count": 15,
  "extraction_id": "abc123-def456-ghi789",
  "extraction_url": "https://platform.runpulse.com/dashboard/extractions/abc123",
  "plan_info": {
    "pages_used": 15,
    "tier": "standard",
    "note": "Pulse Ultra"
  },
  "bounding_boxes": {
    "Title": [],
    "Text": [],
    "Tables": [],
    "Images": [],
    "markdown_with_ids": "<p data-bb-text-id=\"txt-1\">..."
  },
  "extensions": {
    "chunking": {
      "semantic": ["chunk 1...", "chunk 2..."],
      "header": ["section 1...", "section 2..."]
    },
    "alt_outputs": {
      "html": "<html>...</html>"
    }
  },
  "warnings": []
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| markdown | string | Clean markdown content extracted from the document. Always present. |
| page_count | integer | Total number of pages processed. |
| extraction_id | string (uuid) | Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema. |
| extraction_url | string | URL to view the extraction in the Pulse Platform. Present when storage is enabled. |
| plan_info | object | Billing information including pages used and plan tier. |
| bounding_boxes | object | Detailed bounding box data for document elements. See Bounding Boxes for details. |
| extensions | object | Output from enabled extensions. Only keys for enabled extensions are present. See below. |
| extensions.chunking | object | Chunk results by strategy (when extensions.chunking is enabled). |
| extensions.merge_tables | object | Merge tables result (when extensions.merge_tables is enabled). |
| extensions.footnote_references | array | List of detected footnotes with their in-text references (when extensions.footnote_references is enabled). See Footnote References below. |
| extensions.alt_outputs.wlbb | object | Word-level bounding boxes (when extensions.alt_outputs.wlbb is enabled). |
| extensions.alt_outputs.html | string | HTML representation (when extensions.alt_outputs.return_html is enabled). |
| extensions.alt_outputs.xml | string | XML representation (when extensions.alt_outputs.return_xml is enabled, WIP). |
| warnings | array | Non-fatal warnings generated during extraction, including deprecation notices for legacy input usage. |

Deprecated Response Fields

| Field | Replacement | Description |
| --- | --- | --- |
| html | extensions.alt_outputs.html | Present when legacy return_html input is used. |
| chunks | extensions.chunking | Present when legacy chunking input is used. |
| plan-info | plan_info | Present when only legacy inputs are used. |
| structured_output | Use /schema | Present when deprecated structured_output input was used. |
| input_schema | Use /schema | Echo of the applied schema (deprecated path only). |
| schema_error | Use /schema | Error message if schema processing failed (deprecated path only). |
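Code migrating from the legacy parameters may receive either response shape during the transition. A hedged sketch of a compatibility accessor for the HTML output, working on the raw response dict (the new location first, the deprecated top-level html as a fallback):

```python
def get_html(result: dict):
    """Return the HTML output from an extract response dict, preferring
    the new extensions.alt_outputs.html location and falling back to
    the deprecated top-level `html` field for legacy responses."""
    new = result.get("extensions", {}).get("alt_outputs", {}).get("html")
    return new if new is not None else result.get("html")
```

The same pattern applies to chunks vs extensions.chunking and plan-info vs plan_info.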

Large Document Response (70+ Pages)

For documents with 70 or more pages, the API returns a URL to fetch the complete results. This prevents timeout issues and reduces response payload size.
{
  "is_url": true,
  "url": "https://pulse-studio-api.s3.amazonaws.com/results/abc123.json",
  "plan_info": {
    "pages_used": 150,
    "tier": "standard"
  }
}

Large Document Response Fields

| Field | Type | Description |
| --- | --- | --- |
| is_url | boolean | Always true for large document responses. Use this to detect URL-based responses. |
| url | string | Pre-signed S3 URL containing the complete extraction results. Expires after 24 hours. |
| plan_info | object | Billing information including pages used and plan tier. |

Handling Large Document Responses

from pulse import Pulse

client = Pulse(api_key="YOUR_API_KEY")

# The SDK handles large document responses automatically
response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
)

# If the response contains is_url, fetch from S3
if hasattr(response, 'is_url') and response.is_url:
    import requests
    full_result = requests.get(response.url).json()
    print(full_result["markdown"])
else:
    print(response.markdown)
The S3 URL expires after 24 hours. If you need to access results after this period, ensure storage.enabled is true and retrieve results from your extraction library.
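When working with the raw response dicts rather than the SDK objects, the branch above can be factored into a small resolver. This is a sketch; the fetch callable is an assumption (e.g. lambda u: requests.get(u).json()) so the helper itself stays free of network dependencies:

```python
def resolve_extraction(result: dict, fetch=None):
    """Return the full extraction dict, following the pre-signed S3 URL
    for 70+ page documents. `fetch` is any callable url -> dict."""
    if result.get("is_url"):
        if fetch is None:
            raise ValueError("large-document response requires a fetch callable")
        return fetch(result["url"])
    return result
```

Small documents pass through unchanged; large-document responses are replaced by the downloaded JSON body.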

Example Usage

Basic Extraction

from pulse import Pulse
from pulse.types import (
    ExtractRequestFigureProcessing,
    ExtractRequestExtensions,
    ExtractRequestExtensionsAltOutputs,
)

client = Pulse(api_key="YOUR_API_KEY")

# Extract from URL with figure processing and HTML output
response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    figure_processing=ExtractRequestFigureProcessing(
        description=True,
    ),
    extensions=ExtractRequestExtensions(
        alt_outputs=ExtractRequestExtensionsAltOutputs(
            return_html=True,
        ),
    ),
)

print(f"Markdown: {response.markdown}")
print(f"HTML: {response.extensions.alt_outputs.html}")
print(f"Extraction ID: {response.extraction_id}")

File Upload

from pulse.types import ExtractRequestFigureProcessing

# Upload and extract a local file
with open("document.pdf", "rb") as f:
    response = client.extract(
        file=f,
        figure_processing=ExtractRequestFigureProcessing(
            description=True,
        ),
    )

Structured Data (Extract → Schema)

The structured_output parameter on /extract is deprecated. Use the /schema endpoint after extraction instead. This gives you better control, re-runnability, and support for split-mode schemas.
Recommended two-step approach:
# Step 1: Extract the document
response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
)

extraction_id = response.extraction_id

# Step 2: Apply schema separately
schema_result = client.schema(
    extraction_id=extraction_id,
    schema_config={
        "input_schema": {
            "type": "object",
            "properties": {
                "total": {"type": "number"},
                "vendor": {"type": "string"}
            }
        },
        "schema_prompt": "Extract invoice total and vendor"
    }
)

print(schema_result.schema_output)

Page Range and Chunking

from pulse.types import (
    ExtractRequestExtensions,
    ExtractRequestExtensionsChunking,
)

response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    pages="1-5,10",  # 1-indexed
    extensions=ExtractRequestExtensions(
        chunking=ExtractRequestExtensionsChunking(
            chunk_types=["semantic", "page"],
            chunk_size=1000,
        ),
    ),
)

# Chunk data is in extensions.chunking
print(response.extensions.chunking.semantic)
print(response.extensions.chunking.page)

Footnote References

Enable extensions.footnote_references to detect footnote markers (e.g. *, †, 1) in body text and link them to the footnote explanation paragraphs at the bottom of the page. Each result item includes the marker symbol, the bounding-box text ID of the footnote, and the bounding-box text IDs of all body-text paragraphs that reference it.
from pulse.types import ExtractRequestExtensions

response = client.extract(
    file_url="https://example.com/research-paper.pdf",
    extensions=ExtractRequestExtensions(
        footnote_references=True,
    ),
)

# Footnote links are in extensions.footnote_references
for ref in response.extensions.footnote_references:
    print(f"Marker: {ref.symbol}")
    print(f"  Footnote: {ref.footnote_text_id}")
    print(f"  Referenced by: {ref.reference_text_ids}")

Example Response

{
  "markdown": "...",
  "bounding_boxes": { ... },
  "extensions": {
    "footnote_references": [
      {
        "symbol": "*",
        "footnoteTextId": "txt-11",
        "referenceTextIds": ["txt-4", "txt-5", "txt-6", "txt-7", "txt-8"]
      },
      {
        "symbol": "†",
        "footnoteTextId": "txt-12",
        "referenceTextIds": ["txt-8"]
      },
      {
        "symbol": "4",
        "footnoteTextId": "txt-48",
        "referenceTextIds": ["txt-45"]
      }
    ]
  }
}

Footnote Reference Fields

| Field | Type | Description |
| --- | --- | --- |
| symbol | string | The footnote marker symbol as detected in the document (e.g. *, †, ‡, 1, #). |
| footnoteTextId | string | The bounding-box text ID (e.g. txt-11) of the footnote explanation paragraph. Cross-reference with bounding_boxes.Footer to get the footnote's content and position. |
| referenceTextIds | string[] | Bounding-box text IDs of body-text paragraphs that contain a reference to this footnote. Cross-reference with bounding_boxes.Text to get each paragraph's content and position. |
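Cross-referencing those text IDs against bounding_boxes means scanning each category list for a matching element. A minimal sketch of a lookup builder; the assumption that each bounding-box element carries an "id" key (like "txt-11") is ours, so check it against your actual bounding_boxes payload:

```python
def index_by_text_id(bounding_boxes: dict) -> dict:
    """Build a text-id -> element lookup across all bounding-box
    categories (Text, Footer, ...). Skips non-list values such as
    the markdown_with_ids string."""
    lookup = {}
    for elements in bounding_boxes.values():
        if not isinstance(elements, list):
            continue
        for el in elements:
            if isinstance(el, dict) and "id" in el:
                lookup[el["id"]] = el
    return lookup

# Usage: lookup.get(ref["footnoteTextId"]) returns the footnote's
# element (content and position), or None if the ID is unknown.
```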
Footnote reference detection uses Azure Document Intelligence for paragraph classification, supplemented by PyMuPDF native text extraction for accurate symbol identification. This handles common OCR confusion between visually similar symbols like †/+ and ‡/#. Supported marker types include numbered (1, 2, 3), symbolic (*, †, ‡, §, #), and lettered (a, b, c) footnotes.

Disable Storage

response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    storage={"enabled": False}
)

Authorizations

x-api-key
string
header
required

API key for authentication

Body

Input schema for multipart/form-data requests (file upload or file_url).

file
file
required

Document to upload directly. Required unless file_url is specified.

file_url
string<uri>

Public or pre-signed URL that Pulse will download and extract.

pages
string

Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5.

Pattern: ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$
figure_processing
object

Settings that control how figures in the document are processed. These affect the markdown output directly (e.g. figure descriptions, chart-to-table conversion, image embedding) and do not produce additional output fields in the response.

extensions
object

Settings that enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*.

storage
object

Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created.

async
boolean
default:false

If true, returns immediately with a job_id for polling via GET /job/{jobId}. Otherwise processes synchronously.

chunking
string
deprecated

Deprecated -- Use extensions.chunking.chunk_types instead. Comma-separated list of chunking strategies.

chunk_size
integer
deprecated

Deprecated -- Use extensions.chunking.chunk_size instead.

Required range: x >= 1
extract_figure
boolean
default:false
deprecated

Deprecated -- No replacement.

figure_description
boolean
default:false
deprecated

Deprecated -- Use figure_processing.description instead.

show_images
boolean
default:false
deprecated

Deprecated -- Use figure_processing.show_images instead.

return_html
boolean
default:false
deprecated

Deprecated -- Use extensions.alt_outputs.return_html instead.

Response

When async=false (default): full extraction result with markdown, bounding boxes, chunks, etc. When async=true: job submission acknowledgement with job_id.

Full extraction result returned by the synchronous /extract endpoint. Contains the extracted markdown, optional extensions output, bounding boxes, and storage metadata.

markdown
string

Primary markdown content extracted from the document. Always present in the new format.

extensions
object

Output from enabled extensions. Each key corresponds to an extension that was enabled in the request under extensions.*. Only keys for enabled extensions are present.

bounding_boxes
object

Positional bounding-box data for text, titles, headers, footers, images, and tables.

extraction_id
string<uuid>

Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema endpoints.

extraction_url
string

URL to view the extraction on the Pulse platform. Present when storage is enabled.

page_count
integer

Number of pages processed.

Required range: x >= 1
plan_info
object

Billing tier and usage information.

warnings
string[]

Non-fatal warnings generated during extraction. Includes deprecation notices when legacy input parameters are used, as well as processing warnings (e.g. word-level bounding box limitations).

html
string
deprecated

Deprecated -- Use extensions.alt_outputs.html instead. Present when the legacy return_html input was used.

chunks
object
deprecated

Deprecated -- Use extensions.chunking instead. Present when the legacy chunking input was used.

plan-info
object
deprecated

Deprecated -- Use plan_info (underscore) instead. Present when only legacy input parameters are used.

structured_output
object
deprecated

Deprecated -- Only present when the deprecated structured_output input parameter was used. Use the /schema endpoint instead.

input_schema
object
deprecated

Deprecated -- Echo of the schema that was applied.

schema_error
string
deprecated

Deprecated -- Error message if schema processing failed.