The Extract API allows you to extract specific structured data from documents using natural language instructions or schemas.

Basic Usage

from reducto import Reducto

client = Reducto()

response = client.extract.run(
    input="https://example.com/invoice.pdf",
    instructions={
        "schema": {
            "invoice_number": "string",
            "date": "string",
            "total_amount": "number",
            "line_items": "array"
        }
    }
)
print(response)

Method Signature

client.extract.run(
    input: str,
    instructions: dict | None = None,
    parsing: ParseOptions | None = None,
    settings: dict | None = None,
    async_: ConfigV3AsyncConfig | None = None
) -> ExtractRunResponse

Parameters

input
string or list of strings
required
The URL of the document to extract from. You can provide:
  • A publicly available URL
  • A presigned S3 URL
  • A reducto:// prefixed URL from the /upload endpoint
  • A jobid:// prefixed URL from a previous parse invocation
  • A list of URLs (for multi-document pipelines, V3 API only)
instructions
object
Instructions for data extraction, given as either:
  • A schema object defining the structure to extract (passed under the "schema" key)
  • Natural language instructions describing what to extract (passed under the "prompt" key)
parsing
ParseOptions
Configuration options for parsing the document. Ignored when input is a jobid:// URL, because that document has already been parsed.
settings
object
Settings to control the extraction process.
async_
ConfigV3AsyncConfig
Configuration for asynchronous processing. When provided, the call returns immediately with a job ID instead of waiting for the extraction result.
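
The single-string forms accepted by input can be told apart by their URL scheme. The sketch below is only an illustration: classify_input is a hypothetical helper, not part of the SDK, and it cannot distinguish a public URL from a presigned S3 URL because both use http(s).

```python
from urllib.parse import urlparse


def classify_input(value: str) -> str:
    """Return a rough label for the kind of input reference.

    Hypothetical helper for illustration; not part of the Reducto SDK.
    """
    scheme = urlparse(value).scheme.lower()
    if scheme == "reducto":
        return "uploaded file"        # returned by the /upload endpoint
    if scheme == "jobid":
        return "previous parse job"   # parsing options are ignored for these
    if scheme in ("http", "https"):
        return "remote URL"           # public or presigned URL
    raise ValueError(f"Unrecognized input reference: {value!r}")
```

A check like this can catch a bare file path or a typo in the prefix before any request is sent.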

Schema-Based Extraction

Define a schema to extract specific fields:
from reducto import Reducto

client = Reducto()

# Define extraction schema
schema = {
    "company_name": "string",
    "revenue": "number",
    "employees": "number",
    "founded_date": "string",
    "headquarters": {
        "city": "string",
        "country": "string"
    }
}

response = client.extract.run(
    input="https://example.com/company-report.pdf",
    instructions={"schema": schema}
)

# Access extracted data
print(response.data)
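
Extracted values may not always match the requested types, so it can help to check the result against the schema before using it. The following is a minimal sketch, assuming the extracted data is available as a plain nested dict that mirrors the schema; find_mismatches is a hypothetical helper, not an SDK function.

```python
def find_mismatches(schema: dict, data: dict, prefix: str = "") -> list[str]:
    """Compare extracted data with a schema written in the shorthand above.

    Hypothetical helper; assumes `data` is a plain dict shaped like `schema`.
    """
    expected = {"string": str, "number": (int, float), "array": list}
    problems = []
    for key, spec in schema.items():
        path = f"{prefix}{key}"
        value = data.get(key)
        if value is None:
            problems.append(f"{path}: missing")
        elif isinstance(spec, dict):
            # Nested object: recurse with a dotted path prefix.
            if isinstance(value, dict):
                problems.extend(find_mismatches(spec, value, f"{path}."))
            else:
                problems.append(f"{path}: expected object")
        elif spec not in expected:
            problems.append(f"{path}: unknown type {spec!r}")
        elif isinstance(value, bool) or not isinstance(value, expected[spec]):
            # bool is a subclass of int, so reject it explicitly.
            problems.append(f"{path}: expected {spec}")
    return problems
```

An empty list means every field in the schema was present with the expected type.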

Natural Language Instructions

Use natural language to describe what to extract:
from reducto import Reducto

client = Reducto()

response = client.extract.run(
    input="https://example.com/contract.pdf",
    instructions={
        "prompt": "Extract all parties involved in the contract, the contract start date, end date, and key obligations for each party."
    }
)

print(response.data)

Extract with Custom Parsing

Combine extraction with custom parsing options:
from reducto import Reducto

client = Reducto()

response = client.extract.run(
    input="https://example.com/document.pdf",
    instructions={
        "schema": {
            "section_titles": "array",
            "key_figures": "array"
        }
    },
    parsing={
        "enhance": {
            "summarize_figures": True
        },
        "formatting": {
            "add_page_markers": True
        }
    }
)

Async Job Processing

For large documents or batch processing, use async jobs:
from reducto import Reducto

client = Reducto()

# Start an async extraction job
job = client.extract.run_job(
    input="https://example.com/large-document.pdf",
    instructions={
        "schema": {
            "field1": "string",
            "field2": "number"
        }
    },
    async_={
        "webhook": {"url": "https://example.com/webhook"}
    }
)

print(f"Job ID: {job.job_id}")

# Fetch the job's current state; repeat this call until the job has finished
result = client.job.get(job.job_id)
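
A single client.job.get call returns only the job's state at that moment, so polling needs a loop with a delay and a timeout. The sketch below is generic: poll_until is a hypothetical helper, and the caller supplies both the fetch function and the completion check, since the exact status field and values returned by the API are not shown here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval: float = 2.0,
    timeout: float = 300.0,
) -> T:
    """Call `fetch` repeatedly until `is_done(result)` is true.

    Hypothetical helper; raises TimeoutError if the deadline would pass
    before the next attempt.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() + interval > deadline:
            raise TimeoutError("job did not finish before the timeout")
        time.sleep(interval)
```

With the job above, fetch could be lambda: client.job.get(job.job_id), and is_done would inspect whichever status field the response exposes. When a webhook is configured, polling serves only as a fallback.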

Reusing Parsed Documents

Extract from a document that was previously parsed:
from reducto import Reducto

client = Reducto()

# First parse the document
parse_response = client.parse.run(
    input="https://example.com/document.pdf"
)

# Then extract using the job ID (no re-parsing needed)
extract_response = client.extract.run(
    input=f"jobid://{parse_response.job_id}",
    instructions={
        "schema": {"key_data": "string"}
    }
)
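
Because a jobid:// reference skips re-parsing, one parse can back several extractions with different schemas. A minimal sketch, using only the client.parse.run and client.extract.run calls shown above; extract_many is a hypothetical wrapper, not part of the SDK.

```python
def extract_many(client, document_url: str, schemas: dict) -> dict:
    """Parse a document once, then run one extraction per named schema.

    Hypothetical wrapper; `client` is expected to be a Reducto client.
    """
    parse_response = client.parse.run(input=document_url)
    source = f"jobid://{parse_response.job_id}"
    return {
        name: client.extract.run(input=source, instructions={"schema": schema})
        for name, schema in schemas.items()
    }
```

The returned dict maps each schema name to its extraction response, and the document is parsed only once regardless of how many schemas are supplied.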

Complex Schema Example

from reducto import Reducto

client = Reducto()

# Extract structured data from financial statements
schema = {
    "company_info": {
        "name": "string",
        "ticker": "string",
        "fiscal_year": "string"
    },
    "financial_metrics": {
        "revenue": "number",
        "net_income": "number",
        "eps": "number",
        "operating_expenses": "number"
    },
    "balance_sheet": {
        "total_assets": "number",
        "total_liabilities": "number",
        "shareholders_equity": "number"
    },
    "key_risks": "array"
}

response = client.extract.run(
    input="https://example.com/10k-filing.pdf",
    instructions={"schema": schema}
)

print(response.data)
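
Nested results like the one requested above are often easier to export or compare once flattened into dotted keys. The sketch below assumes the extracted data is a plain nested dict mirroring the schema; flatten is a hypothetical helper, not part of the SDK.

```python
def flatten(data: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single level with dotted keys.

    Hypothetical helper; lists and scalar values are kept as-is.
    """
    flat = {}
    for key, value in data.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{path}."))
        else:
            flat[path] = value
    return flat
```

For example, a "company_info" object with a "ticker" field becomes the single key "company_info.ticker", which maps directly to a spreadsheet column.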