POST /extract
Python SDK
from pulse import Pulse
from pulse.types import (
    ExtractRequestFigureProcessing,
    ExtractRequestExtensions,
    ExtractRequestExtensionsChunking,
    ExtractRequestExtensionsAltOutputs,
)

client = Pulse(api_key="YOUR_API_KEY")

# Basic extraction from URL
response = client.extract(
    file_url="https://example.com/document.pdf"
)
print(response.markdown)
print(response.extraction_id)

# With figure processing and extensions
response = client.extract(
    file_url="https://example.com/document.pdf",
    figure_processing=ExtractRequestFigureProcessing(
        description=True,
    ),
    extensions=ExtractRequestExtensions(
        chunking=ExtractRequestExtensionsChunking(
            chunk_types=["semantic"],
            chunk_size=1000,
        ),
        alt_outputs=ExtractRequestExtensionsAltOutputs(
            return_html=True,
        ),
    ),
)
print(response.extensions.chunking)
print(response.extensions.alt_outputs.html)

# Async extraction
response = client.extract(
    file_url="https://example.com/document.pdf",
    async_=True
)
print(response.job_id)  # poll via client.jobs.get_job(job_id=...)
{
  "markdown": "<string>",
  "extensions": {
    "chunking": {
      "semantic": [
        "<string>"
      ],
      "header": [
        "<string>"
      ],
      "page": [
        "<string>"
      ],
      "recursive": [
        "<string>"
      ]
    },
    "merge_tables": {},
    "footnote_references": [
      {
        "symbol": "<string>",
        "footnoteTextId": "<string>",
        "referenceTextIds": [
          "<string>"
        ]
      }
    ],
    "alt_outputs": {
      "wlbb": {
        "words": [
          {
            "id": "<string>",
            "text": "<string>",
            "page_number": 2,
            "bounding_box": [
              123
            ],
            "average_word_confidence": 123
          }
        ],
        "error": "<string>"
      },
      "html": "<string>",
      "xml": "<string>"
    }
  },
  "bounding_boxes": {
    "Images": [
      {
        "id": "<string>",
        "content": "<string>",
        "page_number": 2,
        "bounding_box": [
          123
        ],
        "image_url": "<string>",
        "description": "<string>",
        "classification": {
          "confidence": 0.5,
          "model": "<string>",
          "error": "<string>"
        },
        "sheet_name": "<string>",
        "sheet_index": 123,
        "workbook_sheet_index": 123,
        "excel_range": "<string>",
        "chart_type": "<string>",
        "chart_title": "<string>",
        "source_ranges": [
          "<string>"
        ],
        "render_error": "<string>",
        "description_error": "<string>"
      }
    ],
    "Tables": [
      {
        "table_info": {
          "id": "<string>",
          "dimensions": [
            123
          ],
          "excel_range": "<string>",
          "sheet_name": "<string>",
          "sheet_index": 123,
          "workbook_sheet_index": 123,
          "section_index": 123,
          "section_type": "<string>",
          "section_name": "<string>",
          "table_name": "<string>",
          "layout_type": "<string>",
          "is_chart": true,
          "chart_type": "<string>",
          "chart_title": "<string>",
          "source_ranges": [
            "<string>"
          ],
          "location": {}
        },
        "cell_data": [
          {}
        ]
      }
    ],
    "Text": [
      {
        "id": "<string>",
        "content": "<string>",
        "page_number": 2,
        "bounding_box": [
          123
        ],
        "excel_range": "<string>",
        "sheet_name": "<string>",
        "sheet_index": 123,
        "workbook_sheet_index": 123
      }
    ],
    "Title": [
      {
        "id": "<string>",
        "content": "<string>",
        "page_number": 2,
        "bounding_box": [
          123
        ],
        "excel_range": "<string>",
        "sheet_name": "<string>",
        "sheet_index": 123,
        "workbook_sheet_index": 123
      }
    ],
    "Footer": [
      {
        "id": "<string>",
        "content": "<string>",
        "page_number": 2,
        "bounding_box": [
          123
        ],
        "excel_range": "<string>",
        "sheet_name": "<string>",
        "sheet_index": 123,
        "workbook_sheet_index": 123
      }
    ],
    "markdown_with_ids": "<string>"
  },
  "extraction_id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "extraction_url": "<string>",
  "page_count": 2,
  "plan_info": {
    "tier": "<string>",
    "total_credits_used": 123,
    "pages_used": 1,
    "note": "<string>"
  },
  "credits_used": 123,
  "warnings": [
    "<string>"
  ],
  "html": "<string>",
  "chunks": {
    "semantic": [
      "<string>"
    ],
    "header": [
      "<string>"
    ],
    "page": [
      "<string>"
    ],
    "recursive": [
      "<string>"
    ]
  },
  "plan-info": {
    "tier": "<string>",
    "pages_used": 123,
    "note": "<string>"
  },
  "structured_output": {
    "values": {},
    "citations": {}
  },
  "input_schema": {},
  "schema_error": "<string>"
}

Overview

Pipeline Step 1 — Extract is always the first step. After extraction, you can optionally split the document into topics, apply schema extraction to get structured data, or use tables for span-aware table extraction.
Extract content from documents. Returns markdown or HTML formatted content with optional structured data extraction. For large results (typically documents over 70 pages, or any response above 5 MB), the API returns a one-time download link at https://api.runpulse.com/large_results/{job_id} instead of inlining the payload. See Large Document Response below.
For large documents or batch processing workflows, set async: true to process asynchronously and poll for results via GET /job/{jobId}.
To process many files at once, use Batch Extract. It accepts an S3 prefix, local directory, or list of URLs and runs /extract on each file in parallel.

Async Mode

Set async: true to return immediately with a job ID for polling:
{
  "file_url": "https://example.com/document.pdf",
  "async": true
}
Async Response (200):
{
  "job_id": "abc123-def456",
  "status": "pending",
  "message": "Document processing started"
}
Use GET /job/{job_id} to poll for completion.
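The polling step can be sketched as a small loop with backoff. This is a minimal illustration, not part of the Pulse SDK: poll_job, its parameters, and the terminal status names are assumptions (the documentation above only shows "pending"). fetch_job stands for whatever call returns the parsed body of GET /job/{job_id}.

```python
import time
from typing import Callable, Dict

# Assumed terminal states; only "pending" is shown in the docs above.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def poll_job(
    fetch_job: Callable[[], Dict],
    interval_s: float = 2.0,
    max_interval_s: float = 30.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Call fetch_job() until it reports a terminal status or the timeout elapses."""
    deadline = clock() + timeout_s
    delay = interval_s
    while True:
        job = fetch_job()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() + delay > deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout_s}s")
        sleep(delay)
        # Back off gently so long-running jobs are not polled every two seconds.
        delay = min(delay * 1.5, max_interval_s)
```

The sleep and clock arguments are injectable so the loop can be exercised without real waiting. In practice fetch_job would wrap an HTTP GET or the SDK's job lookup and convert the result to a dict.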

Request

Document Source

Provide the document using one of these methods:
| Field | Type | Description |
| --- | --- | --- |
| file | binary | Document file to upload directly (multipart/form-data). |
| file_url | string | Public or pre-signed URL that Pulse will download and extract. |

Extraction Options

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string (enum) | default | Extraction model to use. One of default or pulse-ultra-2. pulse-ultra-2 uses Pulse’s vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. |
| pages | string | - | Page range filter (1-indexed). Supports segments like 1-2 or mixed ranges like 1-2,5. Page 1 is the first page. |
| figure_processing | object | - | Settings that control how figures in the document are processed. These affect the markdown output directly and do not produce additional output fields. See Figure Processing. |
| extensions | object | - | Settings that enable additional processing or alternate output formats. Each enabled extension produces a corresponding result under response.extensions.*. See Extensions. |
| spreadsheet | object | - | Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, sheets, and the automatic trimming of empty trailing rows/columns past the last data-bearing cell. Applies to .xlsx, .xlsm, and .xls files. See Spreadsheet Options. |
| storage | object | - | Options for persisting extraction artifacts. See Storage Options. |
| async | boolean | false | If true, returns immediately with a job_id for polling via GET /job/{jobId}. |
| structured_output | object | - | ⚠️ Deprecated. Use the /schema endpoint after extraction instead. Still works for backward compatibility. |
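A pages value can be checked locally before sending a request. The sketch below uses the pattern listed for pages in the Body reference on this page; the helper name expand_pages and the extra checks (page numbers start at 1, ranges ascend) are assumptions about sensible client-side validation, not documented server behavior.

```python
import re
from typing import List

# Pattern copied from the Body reference for the pages field.
PAGES_PATTERN = re.compile(r"^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$")


def expand_pages(spec: str) -> List[int]:
    """Expand a spec such as "1-2,5" into a sorted list of 1-indexed pages."""
    if not PAGES_PATTERN.fullmatch(spec):
        raise ValueError(f"invalid pages spec: {spec!r}")
    pages = set()
    for segment in spec.split(","):
        start, _, end = segment.partition("-")
        first = int(start)
        last = int(end) if end else first
        # Assumed rules: pages are 1-indexed and ranges must not run backwards.
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {segment!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)
```

For example, expand_pages("1-5,10") yields pages 1 through 5 plus page 10, while a spec containing spaces is rejected because the pattern does not allow them.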

Figure Processing

Settings under figure_processing control how figures (images, charts, diagrams) and embedded visuals are processed. Applies to both PDFs/images (figures detected from layout) and spreadsheets (charts and embedded images read directly from the workbook). Affects the markdown output and the bounding_boxes.Images[] array.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| figure_processing.description | boolean | false | Generate descriptive captions for extracted visuals. Captions appear under bounding_boxes.Images[].description and inline in the markdown output. Applies to both detected charts and non-chart images. |
| figure_processing.show_images | boolean | false | Return image URLs for extracted visuals. URLs appear under bounding_boxes.Images[].image_url and resolve to a Pulse-hosted PNG/JPEG served from GET /results/{jobId}/images/{filename}. Applies to both detected charts and non-chart images. |
For spreadsheets specifically, show_images: true collects every embedded chart and image in the workbook and emits one entry per visual under bounding_boxes.Images, with chart-specific fields like chart_type, chart_title, and source_ranges populated. See Bounding Boxes for the full field list.

Spreadsheet Options

Settings under spreadsheet control how Excel workbooks (.xlsx, .xlsm, .xls) are processed. By default, hidden rows, columns, and sheets are excluded from extraction output, and cell values are rendered the way Excel displays them. Phantom-cell trimming is opt-in.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| spreadsheet.include_hidden_rows | boolean | false | Include rows that are hidden in the Excel workbook. |
| spreadsheet.include_hidden_cols | boolean | false | Include columns that are hidden in the Excel workbook. |
| spreadsheet.include_hidden_sheets | boolean | false | Include sheets that are hidden in the Excel workbook. |
| spreadsheet.use_raw_values | boolean | false | Emit the underlying numeric value for number cells instead of the Excel display-formatted text, e.g. 1201.67 rather than $1,202 when the cell uses a rounded currency format. Useful when downstream processing needs exact amounts (cent-level precision) rather than what the workbook shows visually. Percent-formatted cells and dates keep their display rendering. Does not apply to legacy .xls files. |
| spreadsheet.only_data_rows | boolean | false | When true, trim trailing empty rows past the last cell carrying a value or formula. See Phantom-cell trimming below. |
| spreadsheet.only_data_cols | boolean | false | When true, trim trailing empty columns past the last cell carrying a value or formula. Same rationale as only_data_rows. |
These settings accept both camelCase (includeHiddenRows, onlyDataRows) and snake_case (include_hidden_rows, only_data_rows) formats.

Phantom-cell trimming (only_data_rows / only_data_cols)

Excel files exported from claims systems, ERPs, and other automated pipelines routinely declare a “used range” that extends hundreds of thousands of rows past where the data actually ends. A typical case: a 57 MB workbook with only ~500 rows of real data, where the other ~1,000,000 rows are empty cells that exist only because they were once selected and styled. These phantom cells inflate file size by orders of magnitude and can exhaust parser memory on the extraction pipeline. Set only_data_rows: true and only_data_cols: true to have Pulse scan each sheet once before parsing, find the largest row and column containing a value or formula, and ignore everything beyond that extent. Surviving cells keep their original A1 coordinates (e.g., a value at B7 in the source is still B7 in the output), so any citation or bounding box that references a specific cell remains stable. The trim only kicks in on large sheets (≥5 MB of XML per sheet), so small, well-formed workbooks pay no overhead either way. Both flags default to false.
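The trimming idea described above can be illustrated on a toy sheet. This is only a sketch of the concept under simple assumptions (a sheet modeled as a dict keyed by 1-indexed (row, column), empty cells stored as None or ""); it is not Pulse's implementation, and the function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[int, int]  # (row, column), both 1-indexed


def data_extent(cells: Dict[Cell, Optional[object]]) -> Tuple[int, int]:
    """Return (max_row, max_col) over cells that hold a value or formula."""
    max_row = max_col = 0
    for (row, col), value in cells.items():
        if value is None or value == "":
            continue  # styled-but-empty phantom cell
        max_row = max(max_row, row)
        max_col = max(max_col, col)
    return max_row, max_col


def trim_phantom_cells(cells: Dict[Cell, Optional[object]]) -> Dict[Cell, Optional[object]]:
    """Drop cells beyond the data extent; surviving cells keep their coordinates."""
    max_row, max_col = data_extent(cells)
    return {
        (row, col): value
        for (row, col), value in cells.items()
        if row <= max_row and col <= max_col
    }
```

Because surviving keys are never renumbered, a value stored at (7, 2), i.e. B7, is still addressed as (7, 2) after trimming, which mirrors the stable-coordinate guarantee described above.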

Pulse Ultra 2 Options

These options are available only when model: pulse-ultra-2 is set. Passing any of them with the default model returns a 400 error listing the offending fields.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| refine | boolean | false | Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1–2s per page. Overridden by refine_options if both are provided. |
| refine_options | object | - | Granular refinement targets. Takes precedence over the boolean refine flag. See below. |
| refine_options.tables | boolean | false | Fix table cell values, structure, and headers against the source image. |
| refine_options.text | boolean | false | Fix OCR errors, missing or extra content, and numerical accuracy (tables untouched). |
| refine_options.formatting | boolean | false | Add strikethrough, italic, bold, super/subscript, and LaTeX formatting (tables untouched). |
| extract_figure | boolean | false | Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags. Useful for financial decks, dashboards, and scientific charts. |
| figure_description | boolean | false | Generate a 1–2 paragraph natural-language description of each picture, wrapped in <figure-description> tags. Combines well with extract_figure. |
| additional_prompt | string | "" | Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Max 4000 characters. |
| custom_image_prompt | string | "" | Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation. Max 2000 characters. |
| custom_refine_prompt | string | "" | Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Max 2000 characters. |
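Since these options are rejected with the default model, a pre-flight check on the request parameters can catch mistakes before a call is made. The sketch below is a hypothetical client-side helper, not part of the SDK. It assumes that the mere presence of an Ultra-only key counts as "passing" it, which the documentation does not spell out.

```python
from typing import Any, Dict, List

ULTRA_ONLY_FIELDS = (
    "refine", "refine_options", "extract_figure", "figure_description",
    "additional_prompt", "custom_image_prompt", "custom_refine_prompt",
)
# Maximum prompt lengths as listed in the table above.
MAX_PROMPT_LENGTH = {
    "additional_prompt": 4000,
    "custom_image_prompt": 2000,
    "custom_refine_prompt": 2000,
}


def check_ultra_options(params: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    problems = []
    if params.get("model", "default") != "pulse-ultra-2":
        offending = [name for name in ULTRA_ONLY_FIELDS if name in params]
        if offending:
            problems.append("requires model pulse-ultra-2: " + ", ".join(offending))
    for name, limit in MAX_PROMPT_LENGTH.items():
        if len(params.get(name, "")) > limit:
            problems.append(f"{name} exceeds {limit} characters")
    if "custom_refine_prompt" in params and not (
        params.get("refine") or params.get("refine_options")
    ):
        problems.append("custom_refine_prompt has no effect without refine or refine_options")
    return problems
```

Returning a list rather than raising on the first issue makes it possible to report every problem at once, similar to how the 400 response is described as listing all offending fields.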

Markdown output additions

When extract_figure or figure_description is enabled, figures in response.markdown include additional tags:
<figure data-page="1">
  <figure-table>...HTML table for the chart...</figure-table>
  <figure-description>...1–2 paragraph description...</figure-description>
</figure>
When refine (or refine_options) is set, markdown content is post-processed page-by-page; output is cleaner but typically grows ~1.5–3x in size for dense documents. No new tags are introduced.
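To work with the figure tags programmatically, the blocks can be pulled out of response.markdown with a few regular expressions. This is a rough sketch that assumes the tags appear exactly as in the example above (a single data-page attribute, no nested figure elements); extract_figures is a hypothetical helper name, and a real HTML parser would be more robust.

```python
import re
from typing import Dict, List

FIGURE_RE = re.compile(r'<figure data-page="(\d+)">(.*?)</figure>', re.DOTALL)
TABLE_RE = re.compile(r"<figure-table>(.*?)</figure-table>", re.DOTALL)
DESC_RE = re.compile(r"<figure-description>(.*?)</figure-description>", re.DOTALL)


def extract_figures(markdown: str) -> List[Dict[str, object]]:
    """Collect page number, chart table HTML, and description for each figure."""
    figures = []
    for match in FIGURE_RE.finditer(markdown):
        body = match.group(2)
        table = TABLE_RE.search(body)
        description = DESC_RE.search(body)
        figures.append({
            "page": int(match.group(1)),
            # Either tag may be absent, depending on which option was enabled.
            "table_html": table.group(1).strip() if table else None,
            "description": description.group(1).strip() if description else None,
        })
    return figures
```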

Extensions

Settings under extensions enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*. For example, enabling extensions.chunking produces response.extensions.chunking, and enabling extensions.alt_outputs.return_html produces response.extensions.alt_outputs.html.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| extensions.footnote_references | boolean | false | Link footnote markers to their corresponding footnote text. |
| extensions.chunking | object | - | Chunking configuration. See below. |
| extensions.chunking.chunk_types | string[] | - | List of chunking strategies: semantic, header, page, recursive. |
| extensions.chunking.chunk_size | integer | - | Maximum characters per chunk. |
| extensions.alt_outputs | object | - | Alternate output formats. See below. |
| extensions.alt_outputs.wlbb | boolean | false | Enable word-level bounding boxes (PDF only). Results in response.extensions.alt_outputs.wlbb. |
| extensions.alt_outputs.return_html | boolean | false | Include HTML representation. response.markdown is still present; HTML is at response.extensions.alt_outputs.html. |
| extensions.alt_outputs.return_xml | boolean | false | Include XML representation (work in progress). |

pulse-ultra-2 Rate Limits

Requests made with model: pulse-ultra-2 are subject to dedicated rate limits, separate from standard extraction:
| Limit | Value |
| --- | --- |
| Per minute | 5 extractions |
| Per hour | 20 extractions |
| File size | 50 MB |
| Concurrent | 2 per API key |
The concurrent limit is the one that most commonly applies in practice — long-running extractions held open while new requests arrive will trip it first.
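A client can stay under the per-minute and per-hour limits by tracking its own request start times. The class below is a minimal sliding-window sketch using the values from the table above. The class name and interface are hypothetical, and the server's exact accounting (for example, whether failed requests count) is not documented here.

```python
import time
from collections import deque
from typing import Callable, Deque, Sequence, Tuple


class SlidingWindowLimiter:
    """Track request start times and report how long to wait before the next one."""

    def __init__(
        self,
        limits: Sequence[Tuple[int, float]] = ((5, 60.0), (20, 3600.0)),
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limits = limits  # (max_calls, window_seconds) pairs
        self._clock = clock
        self._starts: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait so that one more request fits inside every window."""
        now = self._clock()
        longest = max(window for _, window in self._limits)
        while self._starts and now - self._starts[0] >= longest:
            self._starts.popleft()  # older than every window, safe to forget
        wait = 0.0
        for max_calls, window in self._limits:
            recent = [t for t in self._starts if now - t < window]
            if len(recent) >= max_calls:
                # A slot opens once this call ages out of the window.
                wait = max(wait, recent[-max_calls] + window - now)
        return wait

    def record(self) -> None:
        """Note that a request is being sent now."""
        self._starts.append(self._clock())
```

Before each pulse-ultra-2 request, sleep for wait_time() and then call record(). The concurrent limit of 2 is a separate concern and could be handled with something like threading.BoundedSemaphore(2) around the request.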

Storage Options

Control whether extractions are saved to your extraction library:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| storage.enabled | boolean | true | Whether to persist extraction artifacts. Set to false for temporary extractions. |
| storage.folder_name | string | - | Target folder name to save the extraction to. Creates the folder if it doesn’t exist. |
| storage.folder_id | string (uuid) | - | Target folder ID to save the extraction to. Takes precedence over folder_name. |

Deprecated Fields

The following input fields are deprecated and will be removed in a future version. They are still accepted for backward compatibility.
| Field | Replacement |
| --- | --- |
| show_images | Use figure_processing.show_images |
| chunking | Use extensions.chunking.chunk_types (array instead of comma-separated string) |
| chunk_size | Use extensions.chunking.chunk_size |
| return_html | Use extensions.alt_outputs.return_html |
| structured_output | Use /schema endpoint after extraction. Pass extraction_id + schema_config. Accepts schema, schema_prompt, and effort. |
| schema | Use /schema endpoint after extraction |
| schema_prompt | Use /schema endpoint with schema_config.schema_prompt |
| custom_prompt | No replacement |
| thinking | No replacement |
When legacy input fields are used, the API returns a deprecation warning in the warnings array directing you to the updated field names. See the latest documentation for details.
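For the four fields that were simply renamed, a request body can be migrated mechanically. The function below is an illustrative sketch that operates on a plain dict of request parameters. The helper name is hypothetical, and it deliberately leaves the schema-related and no-replacement fields untouched, since those need a manual decision.

```python
import copy
from typing import Any, Dict


def migrate_legacy_fields(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params with renamed legacy fields moved to their new homes."""
    out = copy.deepcopy(params)

    def nested(*path: str) -> Dict[str, Any]:
        node = out
        for key in path:
            node = node.setdefault(key, {})
        return node

    # setdefault keeps an explicitly provided new-style value if both are present.
    if "show_images" in out:
        nested("figure_processing").setdefault("show_images", out.pop("show_images"))
    if "chunking" in out:
        # Legacy value is a comma-separated string; the new field is an array.
        legacy = out.pop("chunking")
        chunk_types = [part.strip() for part in legacy.split(",") if part.strip()]
        nested("extensions", "chunking").setdefault("chunk_types", chunk_types)
    if "chunk_size" in out:
        nested("extensions", "chunking").setdefault("chunk_size", out.pop("chunk_size"))
    if "return_html" in out:
        nested("extensions", "alt_outputs").setdefault("return_html", out.pop("return_html"))
    return out
```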

Response

The response structure varies based on document size to optimize for different use cases.

Standard Response (Under 70 Pages)

For documents under 70 pages, results are returned directly in the response body:
{
  "markdown": "# Document Title\n\nExtracted content...",
  "page_count": 15,
  "extraction_id": "abc123-def456-ghi789",
  "extraction_url": "https://platform.runpulse.com/dashboard/extractions/abc123",
  "credits_used": 1.0,
  "plan_info": {
    "tier": "growth",
    "pages_used": 15,
    "total_credits_used": 49.5,
    "note": "Pulse Ultra"
  },
  "bounding_boxes": {
    "Title": [],
    "Text": [],
    "Tables": [],
    "Images": [
      {
        "id": "excel_image_1_1",
        "visual_type": "chart",
        "image_url": "https://api.runpulse.com/results/abc123-def456-ghi789/images/excel_image_1_1.png",
        "chart_type": "BarChart",
        "chart_title": "Revenue",
        "excel_range": "D2",
        "sheet_name": "Charts"
      }
    ],
    "markdown_with_ids": "<p data-bb-text-id=\"txt-1\">..."
  },
  "extensions": {
    "chunking": {
      "semantic": ["chunk 1...", "chunk 2..."],
      "header": ["section 1...", "section 2..."]
    },
    "altOutputs": {
      "html": "<html>...</html>"
    }
  },
  "warnings": []
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| markdown | string | Clean markdown content extracted from the document. Always present. |
| page_count | integer | Total number of pages processed. |
| extraction_id | string (uuid) | Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema. |
| extraction_url | string | URL to view the extraction in the Pulse Platform. Present when storage is enabled. |
| credits_used | number | Credits consumed by this request. Only present when the org has the credit billing system enabled. |
| plan_info | object | Billing tier and cumulative usage information for the calling org, including this request. Includes tier, total_credits_used (primary billing metric), pages_used (legacy), and an optional note. |
| bounding_boxes | object | Typed bounding-box data: Images, Tables, Text, Title, Footer, plus markdown_with_ids. See Bounding Boxes for the full field list including the chart/image fields under Images. |
| extensions | object | Output from enabled extensions. Only keys for enabled extensions are present. See below. |
| extensions.chunking | object | Chunk results by strategy (when extensions.chunking is enabled). |
| extensions.footnoteReferences | array | List of detected footnotes with their in-text references (when extensions.footnote_references is enabled). See Footnote References below. |
| extensions.altOutputs.wlbb | object | Word-level bounding boxes (when extensions.alt_outputs.wlbb is enabled). |
| extensions.altOutputs.html | string | HTML representation (when extensions.alt_outputs.return_html is enabled). |
| extensions.altOutputs.xml | string | XML representation (when extensions.alt_outputs.return_xml is enabled, WIP). |
| warnings | array | Non-fatal warnings generated during extraction, including deprecation notices for legacy input usage. |

Deprecated Response Fields

| Field | Replacement | Description |
| --- | --- | --- |
| html | extensions.altOutputs.html | Present when legacy return_html input is used. |
| chunks | extensions.chunking | Present when legacy chunking input is used. |
| plan-info | plan_info | Present when only legacy inputs are used. |
| structured_output | Use /schema | Present when deprecated structured_output input was used. |
| input_schema | Use /schema | Echo of the applied schema (deprecated path only). |
| schema_error | Use /schema | Error message if schema processing failed (deprecated path only). |

Large Document Response (70+ Pages)

For documents with 70 or more pages — or any response payload above the 5 MB inline threshold — the API returns a one-time download link to /large_results/{job_id} instead of inlining the payload. This prevents timeout issues and keeps the immediate response small.
{
  "is_url": true,
  "url": "https://api.runpulse.com/large_results/abc123-def456-ghi789",
  "plan_info": {
    "tier": "growth",
    "pages_used": 150,
    "total_credits_used": 312.0
  }
}

Large Document Response Fields

| Field | Type | Description |
| --- | --- | --- |
| is_url | boolean | Always true for large document responses. Use this to detect URL-based responses. |
| url | string | One-time download link of the form https://api.runpulse.com/large_results/{job_id}. The link streams the complete extraction result the first time it is fetched and is then invalidated (subsequent reads return 410 Gone). It also expires 1 hour after the job completes. Authenticate the request with your x-api-key header. |
| plan_info | object | Billing information including pages used and plan tier. |
/large_results/{job_id} links are single-use and expire 1 hour after the job completes. Download and persist the payload immediately — do not pass the URL through queues or share it across workers.

Handling Large Document Responses

import requests
from pulse import Pulse

API_KEY = "YOUR_API_KEY"
client = Pulse(api_key=API_KEY)

response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
)

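# Large results arrive as a one-time download link instead of inline content.
if getattr(response, "is_url", False):
    download = requests.get(
        response.url,
        headers={"x-api-key": API_KEY},
        timeout=60,
    )
    # Fail loudly on errors such as 410 Gone (link already used).
    download.raise_for_status()
    full_result = download.json()
    print(full_result["markdown"])
else:
    print(response.markdown)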
Because /large_results/{job_id} is one-time use, persist the result to your own storage on first download. If you need to access the result later, enable storage.enabled and retrieve it from your extraction library on the Pulse Platform.

Example Usage

Basic Extraction

from pulse import Pulse
from pulse.types import (
    ExtractRequestFigureProcessing,
    ExtractRequestExtensions,
    ExtractRequestExtensionsAltOutputs,
)

client = Pulse(api_key="YOUR_API_KEY")

# Extract from URL with figure processing and HTML output
response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    figure_processing=ExtractRequestFigureProcessing(
        description=True,
    ),
    extensions=ExtractRequestExtensions(
        alt_outputs=ExtractRequestExtensionsAltOutputs(
            return_html=True,
        ),
    ),
)

print(f"Markdown: {response.markdown}")
print(f"HTML: {response.extensions.alt_outputs.html}")
print(f"Extraction ID: {response.extraction_id}")

File Upload

from pulse import Pulse
from pulse.types import ExtractRequestFigureProcessing

client = Pulse(api_key="YOUR_API_KEY")

# Upload and extract a local file
with open("document.pdf", "rb") as f:
    response = client.extract(
        file=f,
        figure_processing=ExtractRequestFigureProcessing(
            description=True,
        ),
    )

Structured Data (Extract → Schema)

The structured_output parameter on /extract is deprecated. Use the /schema endpoint after extraction instead. This gives you better control, re-runnability, and support for split-mode schemas.
Recommended two-step approach:
# Step 1: Extract the document
response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf"
)

extraction_id = response.extraction_id

# Step 2: Apply schema separately
schema_result = client.schema(
    extraction_id=extraction_id,
    schema_config={
        "input_schema": {
            "type": "object",
            "properties": {
                "total": {"type": "number"},
                "vendor": {"type": "string"}
            }
        },
        "schema_prompt": "Extract invoice total and vendor"
    }
)

print(schema_result.schema_output)

Page Range and Chunking

from pulse.types import (
    ExtractRequestExtensions,
    ExtractRequestExtensionsChunking,
)

response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    pages="1-5,10",  # 1-indexed
    extensions=ExtractRequestExtensions(
        chunking=ExtractRequestExtensionsChunking(
            chunk_types=["semantic", "page"],
            chunk_size=1000,
        ),
    ),
)

# Chunk data is in extensions.chunking
print(response.extensions.chunking.semantic)
print(response.extensions.chunking.page)

Footnote References

Enable extensions.footnote_references to detect footnote markers (e.g. *, †, 1) in body text and link them to the footnote explanation paragraphs at the bottom of the page. Each result item includes the marker symbol, the bounding-box text ID of the footnote, and the bounding-box text IDs of all body-text paragraphs that reference it.
from pulse.types import ExtractRequestExtensions

response = client.extract(
    file_url="https://example.com/research-paper.pdf",
    extensions=ExtractRequestExtensions(
        footnote_references=True,
    ),
)

# Footnote links are in extensions.footnote_references
for ref in response.extensions.footnote_references:
    print(f"Marker: {ref.symbol}")
    print(f"  Footnote: {ref.footnote_text_id}")
    print(f"  Referenced by: {ref.reference_text_ids}")

Example Response

{
  "markdown": "...",
  "bounding_boxes": { ... },
  "extensions": {
    "footnoteReferences": [
      {
        "symbol": "*",
        "footnoteTextId": "txt-11",
        "referenceTextIds": ["txt-4", "txt-5", "txt-6", "txt-7", "txt-8"]
      },
      {
        "symbol": "†",
        "footnoteTextId": "txt-12",
        "referenceTextIds": ["txt-8"]
      },
      {
        "symbol": "4",
        "footnoteTextId": "txt-48",
        "referenceTextIds": ["txt-45"]
      }
    ]
  }
}

Footnote Reference Fields

| Field | Type | Description |
| --- | --- | --- |
| symbol | string | The footnote marker symbol as detected in the document (e.g. *, †, ‡, 1, #). |
| footnoteTextId | string | The bounding-box text ID (e.g. txt-11) of the footnote explanation paragraph. Cross-reference with bounding_boxes.Footer to get the footnote’s content and position. |
| referenceTextIds | string[] | Bounding-box text IDs of body-text paragraphs that contain a reference to this footnote. Cross-reference with bounding_boxes.Text to get each paragraph’s content and position. |

Footnote reference detection uses Azure Document Intelligence for paragraph classification, supplemented by PyMuPDF native text extraction for accurate symbol identification. This handles common OCR confusion between visually similar symbols like †/+ and ‡/#. Supported marker types include numbered (1, 2, 3), symbolic (*, †, ‡, §, #), and lettered (a, b, c) footnotes.
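The cross-referencing described in the field table can be sketched against the raw JSON response. The helper below is hypothetical and works on a plain dict. It accepts both footnoteReferences and footnote_references as the key name because both spellings appear in the examples on this page, and it assumes each bounding-box entry carries id and content as shown in the response schema.

```python
from typing import Any, Dict, List


def resolve_footnotes(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Join each footnote reference with the text of the boxes it points to."""
    boxes = response.get("bounding_boxes", {})
    by_id: Dict[str, Dict[str, Any]] = {}
    for kind in ("Text", "Footer"):
        for item in boxes.get(kind, []):
            by_id[item["id"]] = item

    extensions = response.get("extensions", {})
    refs = extensions.get("footnoteReferences") or extensions.get("footnote_references") or []

    resolved = []
    for ref in refs:
        footnote = by_id.get(ref["footnoteTextId"], {})
        resolved.append({
            "symbol": ref["symbol"],
            "footnote": footnote.get("content"),
            # Skip IDs that are missing from the bounding boxes rather than failing.
            "referenced_by": [
                by_id[text_id]["content"]
                for text_id in ref["referenceTextIds"]
                if text_id in by_id
            ],
        })
    return resolved
```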

Excel Spreadsheet Options

from pulse import Pulse
from pulse.types import ExtractRequestSpreadsheet

client = Pulse(api_key="YOUR_API_KEY")

# Extract from Excel with hidden rows and columns included
with open("financials.xlsx", "rb") as f:
    response = client.extract(
        file=f,
        spreadsheet=ExtractRequestSpreadsheet(
            include_hidden_rows=True,
            include_hidden_cols=True,
            include_hidden_sheets=False,
        ),
    )

print(response.markdown)
Workbooks exported from claims systems, ERPs, and other automated pipelines often declare a “used range” that extends hundreds of thousands of rows past where the data actually ends. Set spreadsheet.only_data_rows: true and spreadsheet.only_data_cols: true to have Pulse trim those trailing empty “phantom” rows and columns before parsing. Surviving cells keep their original A1 coordinates, so any citation or bounding box that references a specific cell remains stable. Both flags default to false. See Spreadsheet Options for the full reference.

Excel Charts and Embedded Images

When you set figure_processing.show_images: true on an Excel workbook, every embedded chart and image is collected from the workbook directly and returned under bounding_boxes.Images[]. Each entry carries a Pulse-hosted image_url you can fetch via client.results.get_image (or any HTTP client with your API key) to get the raw PNG/JPEG bytes.
import re
from pulse import Pulse
from pulse.types import ExtractRequestFigureProcessing

client = Pulse(api_key="YOUR_API_KEY")

# 1) Extract the workbook with show_images enabled.
with open("financials.xlsx", "rb") as f:
    response = client.extract(
        file=f,
        figure_processing=ExtractRequestFigureProcessing(
            show_images=True,
            description=False,
        ),
    )

# 2) Walk the typed Images array.
images = response.bounding_boxes.images or []
for img in images:
    print(f"{img.id}: {img.visual_type} '{img.chart_title}' @ {img.excel_range}")
    print(f"    url: {img.image_url}")

# 3) Fetch the bytes for the first chart, if the workbook has any visuals.
if images:
    m = re.search(r"/results/([^/]+)/images/([^/?#]+)", images[0].image_url)
    if m is None:
        raise ValueError(f"Unexpected image_url format: {images[0].image_url}")
    job_id, filename = m.group(1), m.group(2)

    chunks = list(client.results.get_image(job_id=job_id, filename=filename))
    with open("chart.png", "wb") as f:
        f.write(b"".join(chunks))

Example bounding_boxes.Images Entry

{
  "id": "excel_image_1_1",
  "visual_type": "chart",
  "page_number": 1,
  "bounding_box": [],
  "image_url": "https://api.runpulse.com/results/13e3e75f-.../images/excel_image_1_1.png",
  "sheet_name": "Charts",
  "excel_range": "D2",
  "chart_type": "BarChart",
  "chart_title": "Revenue",
  "source_ranges": ["'Charts'!$A$2:$A$5", "'Charts'!$B$2:$B$5"],
  "description": "Bar chart showing revenue by quarter."
}
See Bounding Boxes — Images Array for the full field reference and Get Result Image for the auth requirement on image_url.

Disable Storage

response = client.extract(
    file_url="https://www.impact-bank.com/user/file/dummy_statement.pdf",
    storage={"enabled": False}
)

Authorizations

x-api-key (string, header, required)

API key for authentication

Body

Input schema for multipart/form-data requests (file upload or file_url).

file
file
required

Document to upload directly. Required unless file_url is specified.

file_url
string<uri>

Public or pre-signed URL that Pulse will download and extract.

model
enum<string>

Extraction model to use. pulse-ultra-2 uses Pulse's vision-language model with built-in refinement, figure/chart extraction, and word-level bounding boxes. Omit or pass default for standard extraction.

Available options:
default,
pulse-ultra-2
pages
string

Page range filter (1-indexed, where page 1 is the first page). Supports segments such as 1-2 or mixed ranges like 1-2,5.

Pattern: ^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$
figure_processing
object

Settings that control how figures and embedded visuals are processed. Applies to both PDFs/images (figures detected from layout) and spreadsheets (charts and embedded images read directly from the workbook). Affects the markdown output and the bounding_boxes.Images[] array.

spreadsheet
object

Settings for Excel/spreadsheet extraction. Controls handling of hidden rows, columns, and sheets, whether numeric cells are rendered using their display format or underlying raw value, and optional trimming of empty phantom rows/columns past the last data-bearing cell. Applies to .xlsx, .xlsm, and .xls files. Accepts both camelCase and snake_case field names.

extensions
object

Settings that enable additional processing passes or alternate output formats. Each enabled extension produces a corresponding output field under response.extensions.*.

storage
object

Options for persisting extraction artifacts. When enabled (default), artifacts are saved to storage and a database record is created.

async
boolean
default:false

If true, returns immediately with a job_id for polling via GET /job/{jobId}. Otherwise processes synchronously.

refine
boolean
default:false

Run a full-page OCR and formatting correction pass after extraction. Improves accuracy on dense layouts, numerical values, and table structure. Adds ~1-2s per page. Overridden by refine_options if both are provided. Requires model: pulse-ultra-2.

refine_options
object

Granular refinement targets. Takes precedence over the boolean refine flag. Requires model: pulse-ultra-2.

extract_figure
boolean
default:false

Convert charts and data visualizations into HTML <table> blocks, wrapped in <figure-table> tags inside response.markdown. Useful for financial decks, dashboards, and scientific charts. Requires model: pulse-ultra-2.

figure_description
boolean
default:false

Generate a 1-2 paragraph natural-language description of each picture, wrapped in <figure-description> tags inside response.markdown. Combines well with extract_figure. Requires model: pulse-ultra-2.
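
Because both features embed their output in response.markdown, downstream code may want to pull the tagged blocks out. A minimal sketch, assuming the tags appear as plain `<figure-table>...</figure-table>` and `<figure-description>...</figure-description>` pairs without attributes (the exact serialization is not specified here); `extract_tagged_blocks` is a hypothetical helper.

```python
import re


def extract_tagged_blocks(markdown: str, tag: str) -> list[str]:
    """Return the inner text of every <tag>...</tag> pair found in markdown."""
    escaped = re.escape(tag)
    # DOTALL lets a block span multiple lines; the match is non-greedy.
    pattern = re.compile(rf"<{escaped}>(.*?)</{escaped}>", re.DOTALL)
    return [block.strip() for block in pattern.findall(markdown)]


# Hand-written sample, not real API output.
sample = (
    "Quarterly results\n"
    "<figure-table>\n<table><tr><td>Q1</td><td>10</td></tr></table>\n</figure-table>\n"
    "<figure-description>A bar chart of quarterly revenue.</figure-description>\n"
)
tables = extract_tagged_blocks(sample, "figure-table")
descriptions = extract_tagged_blocks(sample, "figure-description")
print(tables, descriptions)
```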

additional_prompt
string
default:""

Extra context injected into the extraction prompt. Use to steer extraction toward a specific domain or attention focus. Requires model: pulse-ultra-2.

Maximum string length: 4000
custom_image_prompt
string
default:""

Extra context appended to the prompt used by figure_description and extract_figure. Tunes image and chart interpretation for your domain. Requires model: pulse-ultra-2.

Maximum string length: 2000
custom_refine_prompt
string
default:""

Extra context appended to the refinement prompt. Only applies when refine: true or refine_options is set. Requires model: pulse-ultra-2.

Maximum string length: 2000
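
The constraints on the pulse-ultra-2 options above (model requirement, maximum prompt lengths, and the refine dependency of custom_refine_prompt) can be checked before a request is sent. This is a sketch of a client-side check under the rules stated on this page; `check_ultra_options` is a hypothetical helper, and how the server responds to violations (error versus silently ignoring the option) is not documented here.

```python
ULTRA_MODEL = "pulse-ultra-2"

# Options documented as requiring model: pulse-ultra-2.
ULTRA_ONLY_FIELDS = (
    "refine",
    "refine_options",
    "extract_figure",
    "figure_description",
    "additional_prompt",
    "custom_image_prompt",
    "custom_refine_prompt",
)

# Documented maximum string lengths.
MAX_LENGTHS = {
    "additional_prompt": 4000,
    "custom_image_prompt": 2000,
    "custom_refine_prompt": 2000,
}


def check_ultra_options(params: dict) -> list[str]:
    """Return human-readable problems found in a request parameter dict."""
    problems = []
    used = [name for name in ULTRA_ONLY_FIELDS if params.get(name)]
    if used and params.get("model") != ULTRA_MODEL:
        problems.append(f"{', '.join(used)} require model={ULTRA_MODEL!r}")
    for name, limit in MAX_LENGTHS.items():
        if len(params.get(name) or "") > limit:
            problems.append(f"{name} exceeds {limit} characters")
    refining = params.get("refine") or params.get("refine_options")
    if params.get("custom_refine_prompt") and not refining:
        problems.append("custom_refine_prompt only applies when refine or refine_options is set")
    return problems


print(check_ultra_options({"refine": True}))
```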
chunking
string
deprecated

Deprecated -- Use extensions.chunking.chunk_types instead. Comma-separated list of chunking strategies.

chunk_size
integer
deprecated

Deprecated -- Use extensions.chunking.chunk_size instead.

Required range: x >= 1
show_images
boolean
default:false
deprecated

Deprecated -- Use figure_processing.show_images instead.

return_html
boolean
default:false
deprecated

Deprecated -- Use extensions.alt_outputs.return_html instead.
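
The four deprecated parameters map one-to-one onto nested replacements, so existing request payloads can be rewritten mechanically. A minimal sketch working on plain dicts; `migrate_legacy_params` is a hypothetical helper, and splitting the legacy chunking string on commas follows the "comma-separated list" wording above.

```python
LEGACY_KEYS = ("chunking", "chunk_size", "show_images", "return_html")


def migrate_legacy_params(params: dict) -> dict:
    """Return a copy of params with deprecated flat keys moved to their nested form."""
    migrated = {k: v for k, v in params.items() if k not in LEGACY_KEYS}
    extensions = dict(migrated.get("extensions") or {})

    chunking = dict(extensions.get("chunking") or {})
    if "chunking" in params:
        # Legacy value is a comma-separated string; the new field is a list.
        chunking["chunk_types"] = [s.strip() for s in params["chunking"].split(",") if s.strip()]
    if "chunk_size" in params:
        chunking["chunk_size"] = params["chunk_size"]
    if chunking:
        extensions["chunking"] = chunking

    if "return_html" in params:
        alt_outputs = dict(extensions.get("alt_outputs") or {})
        alt_outputs["return_html"] = params["return_html"]
        extensions["alt_outputs"] = alt_outputs
    if extensions:
        migrated["extensions"] = extensions

    if "show_images" in params:
        figure_processing = dict(migrated.get("figure_processing") or {})
        figure_processing["show_images"] = params["show_images"]
        migrated["figure_processing"] = figure_processing
    return migrated


legacy = {"file_url": "https://example.com/document.pdf", "chunking": "semantic, page", "chunk_size": 1000, "return_html": True}
print(migrate_legacy_params(legacy))
```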

Response

When async=false (default): full extraction result with markdown, bounding boxes, chunks, etc. When async=true: job submission acknowledgement with job_id.

Full extraction result returned by the synchronous /extract endpoint. Contains the extracted markdown, optional extensions output, bounding boxes, and storage metadata.

markdown
string

Primary markdown content extracted from the document. Always present in the current (non-legacy) response format.

extensions
object

Output from enabled extensions. Each key corresponds to an extension that was enabled in the request under extensions.*. Only keys for enabled extensions are present.

bounding_boxes
object

Positional bounding-box data for text, titles, headers, footers, images, and tables. Images carries chart and image visuals (with image_url when figure_processing.show_images is enabled), Tables carries the detected tables, and Text, Title, and Footer carry the paragraph, title, and footer regions. Additional keys (e.g. markdown_with_ids, defined_names) are passed through untyped.

extraction_id
string<uuid>

Persisted extraction ID. Present when storage is enabled (default). Use with /split and /schema endpoints.

extraction_url
string

URL to view the extraction on the Pulse platform. Present when storage is enabled.

page_count
integer

Number of pages processed.

Required range: x >= 1
plan_info
object

Billing tier and cumulative usage information. Includes total_credits_used (primary billing metric) and pages_used (legacy compatibility).

credits_used
number<float> | null

Credits consumed by this request. Only present when the organization has the credit billing system enabled.

warnings
string[]

Non-fatal warnings generated during extraction. Includes deprecation notices when legacy input parameters are used, as well as processing warnings (e.g. word-level bounding box limitations).

html
string
deprecated

Deprecated -- Use extensions.alt_outputs.html instead. Present when the legacy return_html input was used.

chunks
object
deprecated

Deprecated -- Use extensions.chunking instead. Present when the legacy chunking input was used.

plan-info
object
deprecated

Deprecated -- Use plan_info (underscore) instead. Present when only legacy input parameters are used.
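
Clients that must handle both current and legacy responses can read the new field first and fall back to its deprecated counterpart. The sketch below works on the raw JSON body as a dict; `normalize_response` is a hypothetical helper, and the sample payload is hand-written for illustration.

```python
def normalize_response(body: dict) -> dict:
    """Prefer current response fields, falling back to deprecated ones."""
    extensions = body.get("extensions") or {}
    alt_outputs = extensions.get("alt_outputs") or {}
    return {
        "markdown": body.get("markdown"),
        # extensions.alt_outputs.html replaces the top-level html field.
        "html": alt_outputs.get("html", body.get("html")),
        # extensions.chunking replaces the top-level chunks field.
        "chunking": extensions.get("chunking", body.get("chunks")),
        # plan_info (underscore) replaces plan-info (hyphen).
        "plan_info": body.get("plan_info", body.get("plan-info")),
        "warnings": body.get("warnings") or [],
    }


legacy_body = {
    "markdown": "# Title",
    "html": "<h1>Title</h1>",
    "chunks": {"page": ["# Title"]},
    "plan-info": {"pages_used": 1},
}
normalized = normalize_response(legacy_body)
print(normalized["plan_info"])
```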

structured_output
object
deprecated

Deprecated -- Only present when the deprecated structured_output input parameter was used. Use the /schema endpoint instead.

input_schema
object
deprecated

Deprecated -- Echo of the schema that was applied.

schema_error
string
deprecated

Deprecated -- Error message if schema processing failed.