The Pulse MCP server exposes eight tools, identical on the hosted endpoint and the local uvx pulse-mcp server except where called out below. This page covers how they behave, the full reference for each, and worked end-to-end examples. If you haven’t connected a client yet, start with Connecting a client.

How the tools behave

A few behaviors are shared across the tools. Understanding them up front explains most of what an agent will do.

Document inputs

Documents are referenced by URL: a public or pre-signed link the engine can fetch. The hosted server does not read files from your local disk, so a tool like extract takes a file_url, not a file upload. The local server closes that gap: there, extract also accepts a file_path. The server process reads the file from disk and uploads it out-of-band, so the document’s bytes never pass through the model or the chat context. Files can be up to 50 MB, and the supported types are pdf, png, jpg, jpeg, bmp, tiff, docx, pptx, xlsx, xlsm, csv, txt, and html. On the hosted server, host the file somewhere reachable (e.g. an S3 pre-signed URL) and pass that URL. If you need direct file uploads in your own code, use the SDKs or REST API instead, which support multipart upload.
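If you script around the local server, a pre-flight check can catch an oversized or unsupported file before the agent calls extract. The sketch below is illustrative only: check_local_document is a hypothetical helper, not part of Pulse, and it assumes that "50 MB" means 50 × 1024 × 1024 bytes and that the limits above apply to file_path inputs.

```python
from pathlib import Path

# Limits copied from the text above. Treating "50 MB" as 50 MiB is an assumption.
MAX_BYTES = 50 * 1024 * 1024
SUPPORTED_TYPES = {
    "pdf", "png", "jpg", "jpeg", "bmp", "tiff",
    "docx", "pptx", "xlsx", "xlsm", "csv", "txt", "html",
}


def check_local_document(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable.

    Hypothetical helper: the server performs its own validation.
    """
    p = Path(path).expanduser()  # "~" works in file_path, so expand it here too
    if not p.is_file():
        return [f"{p} is not a readable file"]

    problems = []
    ext = p.suffix.lower().lstrip(".")
    if ext not in SUPPORTED_TYPES:
        problems.append(f"unsupported type: {ext or '(none)'}")
    size = p.stat().st_size
    if size > MAX_BYTES:
        problems.append(f"file is larger than 50 MB ({size} bytes)")
    return problems
```

An empty list means the path can be handed to extract as file_path; anything else is worth surfacing to the user first.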

Chaining with extraction_id

You extract a document once. extract returns an extraction_id, and the downstream tools (apply_schema, split_document, extract_tables) take that id so they operate on the already-parsed document instead of re-fetching and re-parsing it.
extract("https://.../report.pdf")          →  extraction_id: ext_abc
apply_schema(extraction_id="ext_abc", ...)  →  structured JSON
split_document(extraction_id="ext_abc", ...) →  split_id: spl_xyz

Asynchronous jobs and get_job

Extraction, schema, split, and table operations run asynchronously. Each tool submits the job and inline-polls it for up to ~60 seconds:
  • If it finishes in time, the tool returns the completed result directly.
  • If it’s still running, the tool returns a stub: { "status": "processing", "job_id": "...", "poll_with": "get_job" }.
When you get a stub, call get_job with that job_id to fetch the status and result. batch_extract is always asynchronous and returns a batch_job_id to poll the same way.
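The stub-then-poll flow can be sketched as a small loop. This is a minimal illustration rather than Pulse client code: wait_for_result is a hypothetical name, get_job stands in for whatever function your client uses to invoke the get_job tool, and the polling interval and retry cap are arbitrary choices.

```python
import time
from typing import Any, Callable


def wait_for_result(
    first: dict[str, Any],
    get_job: Callable[[str], dict[str, Any]],
    interval: float = 2.0,
    max_polls: int = 30,
) -> dict[str, Any]:
    """Follow a processing stub until the job leaves the "processing" state.

    `first` is whatever the original tool call returned. If it is already a
    completed result, it is returned unchanged.
    """
    result = first
    job_id = first.get("job_id")  # reuse the original id on every poll
    polls = 0
    while result.get("status") == "processing":
        if polls >= max_polls:
            raise TimeoutError(f"job {job_id} still processing after {max_polls} polls")
        time.sleep(interval)
        result = get_job(job_id)
        polls += 1
    return result
```

The same loop would cover batch_extract if you pass the batch_job_id as the job_id, since the text above says it is polled the same way.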

Large results

To stay under MCP clients’ tool-result size limits, results larger than the inline budget (~350 KB) are offloaded to a download link and returned as a stub: { "is_url": true, "url": "https://..." }. The agent cannot fetch that link itself: by design, agents can’t follow tool-output URLs. To read an offloaded result, paste the URL back into the chat (which makes it user-provided and fetchable) or open it in your browser. For very large outputs, prefer the SDK/API, which streams results without this limit.
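Code that consumes tool results therefore has three shapes to distinguish: a processing stub, an offloaded-result stub, and an inline result. A minimal sketch of that dispatch, assuming the stub fields are exactly as shown on this page (classify_result is a hypothetical helper):

```python
from typing import Any


def classify_result(result: dict[str, Any]) -> str:
    """Return "offloaded", "processing", or "inline" for a tool result.

    Field names follow the stubs shown in this page; anything that matches
    neither stub is treated as an inline result.
    """
    if result.get("is_url") is True and "url" in result:
        return "offloaded"  # too large to inline; the url must be opened by the user
    if result.get("status") == "processing" and "job_id" in result:
        return "processing"  # poll with get_job
    return "inline"
```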

Page ranges

Wherever a tool accepts pages, use 1-indexed numbers and ranges, comma-separated. For example, "1-2,5" for pages 1, 2, and 5.
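To make the format concrete, here is a small parser that expands a pages string into explicit page numbers. It is a sketch for checking input on the client side; parse_pages is a hypothetical helper, and how the server itself treats whitespace, overlapping ranges, or reversed ranges is not documented here, so the strictness below is an assumption.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a 1-indexed, comma-separated page spec into sorted page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError("empty segment in page spec")
        start_text, dash, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if dash else start  # a bare number is a one-page range
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_pages("1-2,5"))  # [1, 2, 5]
```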

Tool reference

extract

Parse a document into markdown, with optional HTML, figure processing, and chunking. Provide the document as a file_url, or on the local server, a file_path.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file_url | string | One input | Public or pre-signed URL of the PDF / document. |
| file_path | string | One input | Local server only. Path to a local file (~ works), read by the server and uploaded out-of-band. See document inputs. |
| pages | string | No | Page range, 1-indexed, e.g. "1-2,5". |
| model | string | No | Model override: "default" or "pulse-ultra-2". |
| return_html | boolean | No | Also return an HTML representation of the document. |
| footnote_references | boolean | No | Link footnote markers to their footnote text. |
| figure_descriptions | boolean | No | Generate descriptive captions for figures and visuals. |
| show_images | boolean | No | Return image URLs for extracted visuals. |
| chunk_types | string[] | No | Chunking strategies: any of "semantic", "header", "page", "recursive". |
| chunk_size | integer | No | Max characters per chunk. |
Provide exactly one of file_url or file_path.
Returns the completed extraction (including extraction_id and markdown), or a { status, job_id } stub to poll with get_job if it’s still running.
For structured JSON, run extract first, then call apply_schema on the returned extraction_id. Schema extraction directly on extract is deprecated.
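If you assemble extract arguments in your own code, the "exactly one input" rule is easy to enforce before the call goes out. The sketch below is an assumption-laden illustration: build_extract_args is a hypothetical helper, the option names are copied from the table above, and the server remains the authority on what it accepts.

```python
from typing import Any, Optional

# Option names and chunk types copied from the extract parameter table.
EXTRACT_OPTIONS = {
    "pages", "model", "return_html", "footnote_references",
    "figure_descriptions", "show_images", "chunk_types", "chunk_size",
}
CHUNK_TYPES = {"semantic", "header", "page", "recursive"}


def build_extract_args(
    file_url: Optional[str] = None,
    file_path: Optional[str] = None,
    **options: Any,
) -> dict[str, Any]:
    """Build an arguments dict for extract, rejecting obviously invalid input."""
    # Exactly one of the two inputs must be set.
    if (file_url is None) == (file_path is None):
        raise ValueError("provide exactly one of file_url or file_path")

    unknown = set(options) - EXTRACT_OPTIONS
    if unknown:
        raise ValueError(f"unknown extract options: {sorted(unknown)}")

    bad_chunks = set(options.get("chunk_types", [])) - CHUNK_TYPES
    if bad_chunks:
        raise ValueError(f"unknown chunk types: {sorted(bad_chunks)}")

    source = {"file_url": file_url} if file_url is not None else {"file_path": file_path}
    return {**source, **options}
```

For example, build_extract_args(file_url="https://example.com/report.pdf", pages="1-2,5") yields a dict with just those two keys, while passing both file_url and file_path raises ValueError.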

apply_schema

Apply a JSON schema to a prior extraction (or split) to get structured output with citations. Provide exactly one target:
  • extraction_id: a single extraction, or
  • extraction_ids: combine several extractions, or
  • split_id: per-topic schemas (pass split_schema_config).
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| extraction_id | string | One target | Apply the schema to a single extraction. |
| extraction_ids | string[] | One target | Combine several extractions under one schema. |
| split_id | string | One target | Apply per-topic schemas to a split (with split_schema_config). |
| schema | object | No | The JSON schema to extract against. |
| schema_prompt | string | No | Natural-language guidance to accompany the schema. |
| schema_config_id | string | No | A saved schema config (alternative to schema). |
| split_schema_config | object | No | Per-topic schema mapping when using split_id. |
| effort | boolean | No | Spend extra effort for harder extractions. |
| pages | string | No | Limit to a page range, 1-indexed. |
Returns { schema_id, schema_output: { values, citations } } for single/multi extraction, or { schema_id, results: { <topic>: { values, citations } } } for split mode, or a { status, job_id } stub to poll with get_job.
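Because the completed result has two shapes, downstream code may want to normalize them. The following sketch maps either shape to a dict of values keyed by topic. values_by_topic and the "document" placeholder key are hypothetical choices, and the field names are taken from the return description above.

```python
from typing import Any


def values_by_topic(result: dict[str, Any], default_topic: str = "document") -> dict[str, Any]:
    """Normalize a completed apply_schema result to {topic: values}.

    Single and multi extraction results have no topic, so they are stored
    under `default_topic` (an arbitrary placeholder key).
    """
    if "schema_output" in result:
        return {default_topic: result["schema_output"]["values"]}
    if "results" in result:
        return {topic: output["values"] for topic, output in result["results"].items()}
    raise ValueError("not a completed apply_schema result (perhaps a processing stub)")
```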

generate_schema

AI-generate or refine a JSON extraction schema. Useful when you want the agent to design a schema before calling apply_schema.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | No* | Natural-language description of the fields you want. |
| current_schema | object | No* | An existing schema to refine. |
*Provide a prompt, a current_schema, or both.
Returns { schema: <JSON schema object> }, ready to pass to apply_schema.

split_document

Split a prior extraction into topic-based page ranges. Run extract first; there is no file input here.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| extraction_id | string | Yes | ID of a completed extraction to split. |
| topics | object[] | No* | Topic objects, each { "name": string, "description": string }. |
| split_config_id | string | No* | A saved split config (alternative to topics). |
*Provide either topics or a split_config_id.
Returns { split_id, split_output: { splits: { <topic>: [page numbers] } } }, or a { status, job_id } stub to poll with get_job. Feed the split_id into apply_schema for per-topic structured extraction.
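If you would rather work with a topic's pages directly, for instance to pass them to a tool's pages parameter, the page-number list can be compressed into a range string. This is a sketch under an assumption: that the page numbers in split_output are 1-indexed like the pages parameter, which this page does not state explicitly. to_page_spec is a hypothetical helper.

```python
def to_page_spec(pages: list[int]) -> str:
    """Compress page numbers into a comma-separated range string."""
    ordered = sorted(set(pages))
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next page is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


print(to_page_spec([12, 13]))   # 12-13
print(to_page_spec([1, 2, 5]))  # 1-2,5
```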

extract_tables

Pull tables out of a completed extraction.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| extraction_id | string | Yes | ID of a completed extraction. |
| merge | boolean | No | Merge tables that span multiple pages. |
| table_format | string | No | Output format: "html" (default) or "markdown". |
| charts_to_tables | boolean | No | Also convert detected charts into tables. |
Returns { tables_id, tables_output: { tables: [...] } }, or a { status, job_id } stub to poll with get_job.

batch_extract

Extract many documents in one asynchronous batch.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| output | object | Yes | Destination for results: { "s3_prefix": "..." } or { "local_path": "..." } (see output destinations). |
| file_urls | string[] | No* | List of document URLs to process. |
| input_s3_prefix | string | No* | An S3 prefix to enumerate for input files. |
| extract_options | object | No | Per-file extraction options (same shape as extract). |
*Provide file_urls, input_s3_prefix, or both.
Returns { batch_job_id, status, total_files }. Poll progress with get_job using the batch_job_id.
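The input rule here differs from extract: at least one source is required, and both may be given together. A minimal sketch of that check follows. build_batch_args is a hypothetical helper, and the requirement that output carry exactly one of s3_prefix or local_path is an assumption read from the table above, not a documented constraint.

```python
from typing import Any, Optional


def build_batch_args(
    output: dict[str, str],
    file_urls: Optional[list[str]] = None,
    input_s3_prefix: Optional[str] = None,
    extract_options: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build an arguments dict for batch_extract."""
    if not file_urls and not input_s3_prefix:
        raise ValueError("provide file_urls, input_s3_prefix, or both")
    # Assumption: the destination is given by exactly one of these two keys.
    if set(output) not in ({"s3_prefix"}, {"local_path"}):
        raise ValueError("output must contain exactly one of s3_prefix or local_path")

    args: dict[str, Any] = {"output": output}
    if file_urls:
        args["file_urls"] = list(file_urls)
    if input_s3_prefix:
        args["input_s3_prefix"] = input_s3_prefix
    if extract_options:
        args["extract_options"] = extract_options
    return args
```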

run_pipeline

Run a saved multi-step Pulse pipeline on a document. Pipelines are authored in the Pulse Platform; obtain the pipeline_id there.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| pipeline_id | string | Yes | ID of a saved pipeline (created in the dashboard). |
| file_url | string | Yes | Public / pre-signed URL of the document. |
| overrides | object | No | Per-step overrides for this run. |
| run_async | boolean | No | Run asynchronously and return a job_id to poll. |
Returns { execution_id, status, results } when synchronous (default), or { job_id, status } when run_async is true; poll with get_job.

get_job

Poll any previously submitted asynchronous job for its status and result.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| job_id | string | Yes | The job_id (or batch_job_id) returned by another tool. |
Returns the job’s status and, when complete, its result, or a { is_url, url } stub for large results.

Usage patterns

These show the sequence of tool calls an agent makes. In practice you express the goal in natural language and the agent chooses the tools, but seeing the chain makes the behavior predictable.

Parse a document

“Extract the text from https://example.com/report.pdf”
extract(file_url="https://example.com/report.pdf")
  → { extraction_id: "ext_abc", markdown: "# Report ..." }

Parse a local file (local server)

“Extract the text from ~/Contracts/msa.pdf”
extract(file_path="~/Contracts/msa.pdf")
  → { extraction_id: "ext_abc", markdown: "# Master Service Agreement ..." }
The server reads the file and uploads it out-of-band; the document’s bytes never enter the chat context. On the hosted server, pass a file_url instead.

Extract structured data with a schema

“Pull the invoice number, vendor, and total from this invoice: <url>”
extract(file_url="https://example.com/invoice.pdf")
  → extraction_id: "ext_abc"

apply_schema(
  extraction_id="ext_abc",
  schema={
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string" },
      "vendor_name":    { "type": "string" },
      "total":          { "type": "number" }
    }
  }
)
  → { schema_id: "...", schema_output: { values: {...}, citations: {...} } }
Let the agent design the schema first with generate_schema if you describe the fields in prose rather than handing it a schema.

Split a multi-section document, then extract per topic

“This filing has a balance sheet and an income statement. Pull the line items from each”
extract(file_url="https://example.com/10k.pdf")
  → extraction_id: "ext_abc"

split_document(
  extraction_id="ext_abc",
  topics=[
    { "name": "balance_sheet",    "description": "The consolidated balance sheet" },
    { "name": "income_statement", "description": "The consolidated income statement" }
  ]
)
  → split_id: "spl_xyz", splits: { balance_sheet: [12,13], income_statement: [14,15] }

apply_schema(split_id="spl_xyz", split_schema_config={ ... per-topic schemas ... })
  → { results: { balance_sheet: {...}, income_statement: {...} } }

Extract tables

“Get every table from this report as markdown”
extract(file_url="https://example.com/report.pdf")  → extraction_id: "ext_abc"
extract_tables(extraction_id="ext_abc", table_format="markdown", merge=true)
  → { tables_id: "...", tables_output: { tables: [...] } }

Process many documents

“Extract all the PDFs under this S3 prefix and write results back to my bucket”
batch_extract(
  input_s3_prefix="s3://my-bucket/incoming/",
  output={ "s3_prefix": "s3://my-bucket/results/" }
)
  → { batch_job_id: "...", status: "processing", total_files: 128 }

get_job(job_id="<batch_job_id>")  → progress, then results

Handle a long-running job

When a tool returns a processing stub, poll it:
extract(file_url="https://example.com/huge.pdf")
  → { status: "processing", job_id: "job_123", poll_with: "get_job" }

get_job(job_id="job_123")
  → { status: "completed", result: {...} }   (or another stub, poll again)

Next steps

Connect a client

Configuration for Codex, Claude Desktop, Claude Code, and VS Code.

API Reference

The same operations as REST endpoints and SDK methods.