Tools & workflows

The Pulse MCP server exposes eight tools, identical on the hosted endpoint and the local uvx pulse-mcp server except where called out below. This page covers how they behave, the full reference for each, and worked end-to-end examples. If you haven’t connected a client yet, start with Connecting a client.

How the tools behave

A few behaviors are shared across the tools. Understanding them up front explains most of what an agent will do.

Document inputs

Documents are referenced by URL: a public or pre-signed link the engine can fetch. The hosted server does not read files from your local disk, so a tool like extract takes a file_url, not a file upload. On the hosted server, host the file somewhere reachable (e.g. behind an S3 pre-signed URL) and pass that URL.

The local server closes that gap: there, extract also accepts a file_path. The server process reads the file from disk and uploads it out-of-band, so the document’s bytes never pass through the model or the chat context. Files can be up to 50 MB, and the supported types are pdf, png, jpg, jpeg, bmp, tiff, docx, pptx, xlsx, xlsm, csv, txt, and html.

If you need direct file uploads in your own code, use the SDKs or REST API instead, which support multipart upload.
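As a rough pre-flight check before passing a file_path, the size and type limits above can be encoded in a few lines of Python. This is only a sketch: check_local_document is a hypothetical helper, not part of the Pulse server, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption the page does not confirm.

```python
from pathlib import Path

# Limits copied from the text above. Treating "50 MB" as 50 * 1024 * 1024
# bytes is an assumption; the server may count differently.
MAX_BYTES = 50 * 1024 * 1024
SUPPORTED_TYPES = {
    "pdf", "png", "jpg", "jpeg", "bmp", "tiff",
    "docx", "pptx", "xlsx", "xlsm", "csv", "txt", "html",
}


def check_local_document(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    p = Path(path).expanduser()  # "~" is expanded, mirroring the file_path note
    if not p.is_file():
        return [f"{p} is not a readable file"]
    problems = []
    ext = p.suffix.lower().lstrip(".")
    if ext not in SUPPORTED_TYPES:
        problems.append(f"unsupported type: {ext or '(none)'}")
    if p.stat().st_size > MAX_BYTES:
        problems.append(f"file is larger than {MAX_BYTES} bytes")
    return problems
```

An empty list suggests the file is worth handing to extract; the server remains the authority on what it accepts.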

Chaining with `extraction_id`

You extract a document once. extract returns an extraction_id, and the downstream tools (apply_schema, split_document, extract_tables) take that id so they operate on the already-parsed document instead of re-fetching and re-parsing it.

extract(file_url="https://platform.runpulse.com/api/examples/637e5678-30b1-45fa-acc4-877f2d636419/pdf")  →  extraction_id: "ext_abc"
apply_schema(extraction_id="ext_abc", ...)    →  structured JSON
split_document(extraction_id="ext_abc", ...)  →  split_id: "spl_xyz"
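The same chain can be sketched in Python. Everything here is illustrative: call_tool is a hypothetical stand-in for however your MCP client invokes a tool (tool name and arguments in, result dict out), and the sketch assumes each call comes back with a completed result.

```python
from typing import Callable

# Hypothetical signature: (tool_name, arguments) -> result dict.
ToolCall = Callable[[str, dict], dict]


def extract_once_then_chain(call_tool: ToolCall, file_url: str, schema: dict) -> dict:
    """Parse the document a single time and reuse its extraction_id downstream."""
    extraction = call_tool("extract", {"file_url": file_url})
    extraction_id = extraction["extraction_id"]

    # Both downstream calls take the id, so the document is not parsed again.
    structured = call_tool(
        "apply_schema", {"extraction_id": extraction_id, "schema": schema}
    )
    tables = call_tool("extract_tables", {"extraction_id": extraction_id})
    return {"extraction_id": extraction_id, "structured": structured, "tables": tables}
```

The point of the shape is that extract appears exactly once, however many downstream tools consume the id.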

Asynchronous jobs and `get_job`

Extraction, schema, split, and table operations run asynchronously. Each tool submits the job and inline-polls it for up to ~60 seconds:

If it finishes in time, the tool returns the completed result directly.
If it’s still running, the tool returns a stub: { "status": "processing", "job_id": "...", "poll_with": "get_job" }.

When you get a stub, call get_job with that job_id to fetch the status and result. batch_extract is always asynchronous and returns a batch_job_id to poll the same way.
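A polling loop over these stubs might look like the sketch below. The get_job argument is a hypothetical callable wrapping the get_job tool, and the interval and attempt limit are arbitrary choices, not documented defaults.

```python
import time
from typing import Callable


def wait_for_result(
    first: dict,
    get_job: Callable[[str], dict],
    interval: float = 2.0,
    max_polls: int = 30,
) -> dict:
    """Follow processing stubs until something other than a stub comes back."""
    # batch_extract reports its id as batch_job_id; other tools use job_id.
    job_id = first.get("job_id") or first.get("batch_job_id")
    result = first
    polls = 0
    while job_id and result.get("status") == "processing":
        if polls >= max_polls:
            raise TimeoutError(f"job {job_id} still processing after {max_polls} polls")
        time.sleep(interval)
        result = get_job(job_id)
        polls += 1
    return result
```

A result that was already complete passes straight through, so the same helper can wrap every tool call.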

Large results

To stay under MCP clients’ tool-result size limits, results larger than the inline budget (~350 KB) are offloaded to a download link and returned as a stub: { "is_url": true, "url": "https://..." }. The agent cannot fetch that link itself: by design, agents can’t follow tool-output URLs. To read an offloaded result, paste the URL back into the chat (which makes it user-provided and fetchable) or open it in your browser. For very large outputs, prefer the SDK/API, which streams results without this limit.
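If you would rather script the download than open the link in a browser, something like the following could be run on your own machine, outside the agent. It is a sketch under one notable assumption: that the offloaded link serves the result as JSON, which this page does not state. The helper names are hypothetical.

```python
import json
import urllib.request


def is_offloaded(result: dict) -> bool:
    """True when the tool returned a download stub instead of an inline result."""
    return result.get("is_url") is True and isinstance(result.get("url"), str)


def load_result(result: dict, timeout: float = 30.0) -> dict:
    """Return an inline result unchanged, or download an offloaded one.

    Assumes the link serves JSON; adjust the parsing if it does not.
    """
    if not is_offloaded(result):
        return result
    with urllib.request.urlopen(result["url"], timeout=timeout) as resp:
        return json.load(resp)
```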

Page ranges

Wherever a tool accepts pages, pass a comma-separated list of 1-indexed page numbers and ranges. For example, "1-2,5" selects pages 1, 2, and 5.
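To make the format concrete, here is a small parser for such strings. It is an illustration of the notation, not the server's own parsing code, and parse_pages is a hypothetical name.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-2,5" into explicit 1-indexed page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
        else:
            start = end = int(part)
        # Pages are 1-indexed, and a range must not run backwards.
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_pages("1-2,5"))  # [1, 2, 5]
```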

Tool reference

`extract`

Parse a document into markdown, with optional HTML, figure processing, and chunking. Provide the document as a file_url, or on the local server, a file_path.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `file_url` | string | One input | Public or pre-signed URL of the PDF / document. |
| `file_path` | string | One input | Local server only. Path to a local file (`~` works), read by the server and uploaded out-of-band. See document inputs. |
| `pages` | string | No | Page range, 1-indexed, e.g. `"1-2,5"`. |
| `model` | string | No | Model override: `"default"` or `"pulse-ultra-2"`. |
| `return_html` | boolean | No | Also return an HTML representation of the document. |
| `footnote_references` | boolean | No | Link footnote markers to their footnote text. |
| `figure_descriptions` | boolean | No | Generate descriptive captions for figures and visuals. |
| `show_images` | boolean | No | Return image URLs for extracted visuals. |
| `only_data_rows` | boolean | No | Excel only: trim trailing empty rows past the last data-bearing cell. |
| `only_data_cols` | boolean | No | Excel only: trim trailing empty columns past the last data-bearing cell. |
| `chunk_types` | string[] | No | Chunking strategies: any of `"semantic"`, `"header"`, `"page"`, `"recursive"`. |
| `chunk_size` | integer | No | Max characters per chunk. |

Provide exactly one of file_url or file_path.

Returns the completed extraction (including extraction_id and markdown), or a { status, job_id } stub to poll with get_job if it’s still running.

For structured JSON, run extract first, then call apply_schema on the returned extraction_id. Schema extraction directly on extract is deprecated.
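The input rule and the enumerated values in the parameter list lend themselves to a client-side check before the call goes out. The sketch below is one hedged way to do that; build_extract_args is a hypothetical helper, and the server remains the authority on what it accepts.

```python
# Allowed values copied from the parameter table above.
CHUNK_TYPES = {"semantic", "header", "page", "recursive"}
MODELS = {"default", "pulse-ultra-2"}


def build_extract_args(file_url=None, file_path=None, **options) -> dict:
    """Assemble arguments for extract, enforcing exactly one document input."""
    if (file_url is None) == (file_path is None):
        raise ValueError("provide exactly one of file_url or file_path")
    unknown = set(options.get("chunk_types") or []) - CHUNK_TYPES
    if unknown:
        raise ValueError(f"unknown chunk_types: {sorted(unknown)}")
    model = options.get("model")
    if model is not None and model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    source = {"file_url": file_url} if file_url is not None else {"file_path": file_path}
    # Drop options left as None so only explicit settings are sent.
    return {**source, **{k: v for k, v in options.items() if v is not None}}
```

For example, build_extract_args(file_url="https://example.com/a.pdf", pages="1-2,5") yields a dict with just those two keys.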

`apply_schema`

Apply a JSON schema to a prior extraction (or split) to get structured output with citations. Provide exactly one target:

extraction_id: a single extraction, or
extraction_ids: combine several extractions, or
split_id: per-topic schemas (pass split_schema_config).

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `extraction_id` | string | One target | Apply the schema to a single extraction. |
| `extraction_ids` | string[] | One target | Combine several extractions under one schema. |
| `split_id` | string | One target | Apply per-topic schemas to a split (with `split_schema_config`). |
| `schema` | object | No | The JSON schema to extract against. |
| `schema_prompt` | string | No | Natural-language guidance to accompany the schema. |
| `schema_config_id` | string | No | A saved schema config (alternative to `schema`). |
| `split_schema_config` | object | No | Per-topic schema mapping when using `split_id`. |
| `effort` | boolean | No | Spend extra effort for harder extractions. |
| `pages` | string | No | Limit to a page range, 1-indexed. |

Returns { schema_id, schema_output: { values, citations } } for single/multi extraction, or { schema_id, results: { <topic>: { values, citations } } } for split mode, or a { status, job_id } stub to poll with get_job.
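Because the two completed shapes differ, code that consumes apply_schema results may want to normalize them first. The following sketch does that under the shapes described above; outputs_by_topic is a hypothetical helper, and using None as the key for single or multi extraction results is just a convention chosen here.

```python
def outputs_by_topic(result: dict) -> dict:
    """Map topic name -> {values, citations}.

    Single and multi extraction results have no topic, so they are stored
    under the key None.
    """
    if "schema_output" in result:
        return {None: result["schema_output"]}
    if "results" in result:
        return dict(result["results"])
    # Anything else (for example a processing stub) is not a finished result.
    raise ValueError("not a completed apply_schema result")


single = {"schema_id": "sch_1", "schema_output": {"values": {"total": 10}, "citations": {}}}
for topic, output in outputs_by_topic(single).items():
    print(topic, output["values"])  # None {'total': 10}
```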

`generate_schema`

AI-generate or refine a JSON extraction schema. Useful when you want the agent to design a schema before calling apply_schema.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `prompt` | string | No* | Natural-language description of the fields you want. |
| `current_schema` | object | No* | An existing schema to refine. |

*Provide a prompt, a current_schema, or both.

Returns { schema: <JSON schema object> }, ready to pass to apply_schema.

`split_document`

Split a prior extraction into topic-based page ranges. Run extract first; there is no file input here.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `extraction_id` | string | Yes | ID of a completed extraction to split. |
| `topics` | object[] | No* | Topic objects, each `{ "name": string, "description": string }`. |
| `split_config_id` | string | No* | A saved split config (alternative to `topics`). |

*Provide either topics or a split_config_id.

Returns { split_id, split_output: { splits: { <topic>: [page numbers] } } }, or a { status, job_id } stub to poll with get_job. Feed the split_id into apply_schema for per-topic structured extraction.
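The split output lists page numbers, while the pages parameter on other tools takes a range string. If you want to restrict a later call to one topic's pages, a conversion like the one below may help. It assumes the page numbers in splits use the same 1-indexed numbering as pages, which this page implies but does not state; format_pages is a hypothetical name.

```python
def format_pages(pages: list[int]) -> str:
    """Collapse page numbers into a compact range string, e.g. [12, 13, 15] -> "12-13,15"."""
    ordered = sorted(set(pages))
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next page is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


splits = {"balance_sheet": [12, 13], "income_statement": [14, 15, 17]}
print({topic: format_pages(p) for topic, p in splits.items()})
# {'balance_sheet': '12-13', 'income_statement': '14-15,17'}
```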

`extract_tables`

Pull tables out of a completed extraction.

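| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `extraction_id` | string | Yes | ID of a completed extraction. |
| `merge` | boolean | No | Merge tables that span multiple pages. |
| `table_format` | string | No | Output format: `"html"` (default) or `"markdown"`. |
| `charts_to_tables` | boolean | No | Also convert detected charts into tables. |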
`charts_to_tables`	boolean	No	Also convert detected charts into tables.

Returns { tables_id, tables_output: { tables: [...] } }, or a { status, job_id } stub to poll with get_job.

`batch_extract`

Extract many documents in one asynchronous batch.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `output` | object | Yes | Destination for results: `{ "s3_prefix": "..." }` or `{ "local_path": "..." }` (see output destinations). |
| `file_urls` | string[] | No* | List of document URLs to process. |
| `input_s3_prefix` | string | No* | An S3 prefix to enumerate for input files. |
| `extract_options` | object | No | Per-file extraction options (same shape as `extract`). |

*Provide file_urls, input_s3_prefix, or both.

Returns { batch_job_id, status, total_files }. Poll progress with get_job using the batch_job_id.
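The input and output rules for batch_extract can be checked before submitting, as in this sketch. build_batch_args is a hypothetical helper, and treating output as needing exactly one of s3_prefix or local_path is an assumption read from the "or" in the parameter description.

```python
def build_batch_args(output: dict, file_urls=None, input_s3_prefix=None,
                     extract_options=None) -> dict:
    """Assemble arguments for batch_extract and check the documented rules."""
    destinations = [key for key in ("s3_prefix", "local_path") if output.get(key)]
    # Assumption: exactly one destination key is expected.
    if len(destinations) != 1:
        raise ValueError("output needs exactly one of s3_prefix or local_path")
    if not file_urls and not input_s3_prefix:
        raise ValueError("provide file_urls, input_s3_prefix, or both")

    args = {"output": {destinations[0]: output[destinations[0]]}}
    if file_urls:
        args["file_urls"] = list(file_urls)
    if input_s3_prefix:
        args["input_s3_prefix"] = input_s3_prefix
    if extract_options:
        args["extract_options"] = dict(extract_options)
    return args
```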

`run_pipeline`

Run a saved multi-step Pulse pipeline on a document. Pipelines are authored in the Pulse Platform; obtain the pipeline_id there.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `pipeline_id` | string | Yes | ID of a saved pipeline (created in the dashboard). |
| `file_url` | string | Yes | Public / pre-signed URL of the document. |
| `overrides` | object | No | Per-step overrides for this run. |
| `run_async` | boolean | No | Run asynchronously and return a `job_id` to poll. |

Returns { execution_id, status, results } when synchronous (default), or { job_id, status } when run_async is true; poll with get_job.

`get_job`

Poll any previously submitted asynchronous job for its status and result.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `job_id` | string | Yes | The `job_id` (or `batch_job_id`) returned by another tool. |

Returns the job’s status and, when complete, its result, or a { is_url, url } stub for large results.

Usage patterns

These show the sequence of tool calls an agent makes. In practice you express the goal in natural language and the agent chooses the tools, but seeing the chain makes the behavior predictable.

Parse a document

“Extract the text from the Bank Statement sample document”

extract(file_url="https://platform.runpulse.com/api/examples/637e5678-30b1-45fa-acc4-877f2d636419/pdf")
  → { extraction_id: "ext_abc", markdown: "# Report ..." }

Parse a local file (local server)

“Extract the text from ~/Contracts/msa.pdf”

extract(file_path="~/Contracts/msa.pdf")
  → { extraction_id: "ext_abc", markdown: "# Master Service Agreement ..." }

The server reads the file and uploads it out-of-band; the document’s bytes never enter the chat context. On the hosted server, pass a file_url instead.

Extract structured data with a schema

“Pull the invoice number, vendor, and total from this invoice: <url>”

extract(file_url="https://example.com/invoice.pdf")
  → extraction_id: "ext_abc"

apply_schema(
  extraction_id="ext_abc",
  schema={
    "type": "object",
    "properties": {
      "invoice_number": { "type": "string" },
      "vendor_name":    { "type": "string" },
      "total":          { "type": "number" }
    }
  }
)
  → { schema_id: "...", schema_output: { values: {...}, citations: {...} } }

If you describe the fields in prose rather than supplying a schema, the agent can design one first with generate_schema and then pass it to apply_schema.

Split a multi-section document, then extract per topic

“This filing has a balance sheet and an income statement. Pull the line items from each”

extract(file_url="https://example.com/10k.pdf")
  → extraction_id: "ext_abc"

split_document(
  extraction_id="ext_abc",
  topics=[
    { "name": "balance_sheet",    "description": "The consolidated balance sheet" },
    { "name": "income_statement", "description": "The consolidated income statement" }
  ]
)
  → { split_id: "spl_xyz", split_output: { splits: { balance_sheet: [12,13], income_statement: [14,15] } } }

apply_schema(split_id="spl_xyz", split_schema_config={ ... per-topic schemas ... })
  → { schema_id: "...", results: { balance_sheet: {...}, income_statement: {...} } }

Extract tables

“Get every table from this report as markdown”

extract(file_url="https://platform.runpulse.com/api/examples/637e5678-30b1-45fa-acc4-877f2d636419/pdf")  → extraction_id: "ext_abc"
extract_tables(extraction_id="ext_abc", table_format="markdown", merge=true)
  → { tables_id: "...", tables_output: { tables: [...] } }

Process many documents

“Extract all the PDFs under this S3 prefix and write results back to my bucket”

batch_extract(
  input_s3_prefix="s3://my-bucket/incoming/",
  output={ "s3_prefix": "s3://my-bucket/results/" }
)
  → { batch_job_id: "...", status: "processing", total_files: 128 }

get_job(job_id="<batch_job_id>")  → progress, then results

Handle a long-running job

When a tool returns a processing stub, poll it:

extract(file_url="https://example.com/huge.pdf")
  → { status: "processing", job_id: "job_123", poll_with: "get_job" }

get_job(job_id="job_123")
  → { status: "completed", result: {...} }   (or another stub, poll again)

Next steps

Connect a client

API Reference