Structured Data Extraction - Reducto Python SDK

The Extract API extracts structured data from documents, guided either by a schema or by natural language instructions.

Basic Usage

Sync

from reducto import Reducto

client = Reducto()

response = client.extract.run(
    input="https://example.com/invoice.pdf",
    instructions={
        "schema": {
            "invoice_number": "string",
            "date": "string",
            "total_amount": "number",
            "line_items": "array"
        }
    }
)
print(response)

Async

import asyncio
from reducto import AsyncReducto

client = AsyncReducto()

async def main():
    response = await client.extract.run(
        input="https://example.com/invoice.pdf",
        instructions={
            "schema": {
                "invoice_number": "string",
                "date": "string",
                "total_amount": "number",
                "line_items": "array"
            }
        }
    )
    print(response)

asyncio.run(main())
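
When several documents need the same extraction, the async client makes it possible to run the calls concurrently. The sketch below shows one way to cap that concurrency with asyncio. `extract_many` and its `extract` callable are hypothetical names for illustration, not part of the SDK, and the commented usage assumes the `client.extract.run` call shown above.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def extract_many(
    extract: Callable[[str], Awaitable[Any]],
    urls: list[str],
    max_concurrency: int = 4,
) -> list[Any]:
    """Call `extract` for every URL, running at most `max_concurrency` at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> Any:
        # The semaphore blocks here once max_concurrency calls are in flight.
        async with semaphore:
            return await extract(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


# Hypothetical usage with the AsyncReducto client from the example above:
#   results = await extract_many(
#       lambda url: client.extract.run(input=url, instructions={"schema": schema}),
#       ["https://example.com/a.pdf", "https://example.com/b.pdf"],
#   )
```

Passing the extraction call in as a callable keeps the helper independent of the SDK, so it can be exercised with a stub coroutine.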

Method Signature

client.extract.run(
    input: str,
    instructions: dict | None = None,
    parsing: ParseOptions | None = None,
    settings: dict | None = None,
    async_: ConfigV3AsyncConfig | None = None
) -> ExtractRunResponse

Parameters

input

string

required

The URL of the document to extract from. You can provide:

A publicly available URL
A presigned S3 URL
A reducto:// prefixed URL from the /upload endpoint
A jobid:// prefixed URL from a previous parse invocation
A list of URLs (for multi-document pipelines, V3 API only)
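
The accepted forms can be told apart by their prefix. The following is a minimal sketch of a client-side check, not an SDK function: `describe_input` and the labels it returns are invented for illustration, and accepting plain `http://` is an assumption. A presigned S3 URL looks like any other HTTPS URL here, so both fall under the same label.

```python
from typing import Union


def describe_input(value: Union[str, list]) -> str:
    """Return a label for the kind of `input` value, or raise ValueError."""
    if isinstance(value, list):
        if not value or not all(isinstance(item, str) for item in value):
            raise ValueError("a list input must contain one or more URL strings")
        for item in value:
            describe_input(item)  # each element must be a valid single input
        return "url_list"
    if value.startswith("reducto://"):
        return "uploaded_file"  # returned by the /upload endpoint
    if value.startswith("jobid://"):
        return "parse_job"  # refers to a previous parse invocation
    if value.startswith(("http://", "https://")):
        return "remote_url"  # public or presigned URL
    raise ValueError(f"unrecognized input: {value!r}")
```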

instructions

object

Instructions for data extraction. Provide either:

A schema object, under the "schema" key, defining the structure to extract
Natural language instructions, under the "prompt" key, describing what to extract
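
As a small illustration of the two forms, the helper below assembles the `instructions` dictionary. `build_instructions` is a hypothetical convenience function, and the `schema` and `prompt` keys are taken from the examples on this page. Whether both may be sent together is not stated here, so the sketch does not forbid it.

```python
from typing import Any, Optional


def build_instructions(
    schema: Optional[dict] = None,
    prompt: Optional[str] = None,
) -> dict[str, Any]:
    """Build an `instructions` dict from a schema, a prompt, or both."""
    instructions: dict[str, Any] = {}
    if schema is not None:
        instructions["schema"] = schema
    if prompt is not None:
        instructions["prompt"] = prompt
    if not instructions:
        raise ValueError("provide a schema or a prompt")
    return instructions
```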

parsing

ParseOptions

Configuration options for parsing the document. Ignored when input is a jobid:// URL, because that document has already been parsed.
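
Because parsing options have no effect on a jobid:// input, it can help to flag that case before sending the request. The sketch below assembles keyword arguments for `client.extract.run` and warns when the options would be dropped. `extract_kwargs` is a hypothetical helper, and emitting a warning instead of raising is a design choice made here for illustration.

```python
import warnings
from typing import Any, Optional


def extract_kwargs(
    input_url: str,
    instructions: dict,
    parsing: Optional[dict] = None,
) -> dict[str, Any]:
    """Assemble keyword arguments, omitting parsing options that would be ignored."""
    kwargs: dict[str, Any] = {"input": input_url, "instructions": instructions}
    if parsing is None:
        return kwargs
    if input_url.startswith("jobid://"):
        # The document was parsed earlier, so these options would have no effect.
        warnings.warn("parsing options are ignored for jobid:// inputs", stacklevel=2)
        return kwargs
    kwargs["parsing"] = parsing
    return kwargs
```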

settings

object

Settings to control the extraction process.

async_

ConfigV3AsyncConfig

Configuration for asynchronous processing. When provided, the call returns immediately with a job ID instead of waiting for the result.

Schema-Based Extraction

Define a schema to extract specific fields:

Sync

from reducto import Reducto

client = Reducto()

# Define extraction schema
schema = {
    "company_name": "string",
    "revenue": "number",
    "employees": "number",
    "founded_date": "string",
    "headquarters": {
        "city": "string",
        "country": "string"
    }
}

response = client.extract.run(
    input="https://example.com/company-report.pdf",
    instructions={"schema": schema}
)

# Access extracted data
print(response.data)

Async

import asyncio
from reducto import AsyncReducto

client = AsyncReducto()

async def main():
    # Define extraction schema
    schema = {
        "company_name": "string",
        "revenue": "number",
        "employees": "number",
        "founded_date": "string",
        "headquarters": {
            "city": "string",
            "country": "string"
        }
    }

    response = await client.extract.run(
        input="https://example.com/company-report.pdf",
        instructions={"schema": schema}
    )

    # Access extracted data
    print(response.data)

asyncio.run(main())
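
Once a result comes back, it can be worth checking it against the schema that was requested. The sketch below compares a record with the simplified type names used on this page ("string", "number", "array", and nested objects). It assumes the extracted record is a plain dict shaped like the schema, which this page does not confirm; `find_mismatches` is a hypothetical helper, not an SDK function.

```python
from typing import Any


def _matches(value: Any, type_name: str) -> bool:
    """Check one value against a simplified type name."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "array":
        return isinstance(value, list)
    return True  # unknown type names are not checked


def find_mismatches(data: dict, schema: dict, path: str = "") -> list[str]:
    """Return one message per field that is missing or has the wrong type."""
    problems: list[str] = []
    for key, expected in schema.items():
        where = f"{path}.{key}" if path else key
        value = data.get(key)
        if value is None:
            problems.append(f"{where}: missing")
        elif isinstance(expected, dict):
            if isinstance(value, dict):
                problems.extend(find_mismatches(value, expected, where))
            else:
                problems.append(f"{where}: expected object, got {type(value).__name__}")
        elif not _matches(value, expected):
            problems.append(f"{where}: expected {expected}, got {type(value).__name__}")
    return problems
```

An empty list means every field in the schema was present with the expected type.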

Natural Language Instructions

Use natural language to describe what to extract:

from reducto import Reducto

client = Reducto()

response = client.extract.run(
    input="https://example.com/contract.pdf",
    instructions={
        "prompt": "Extract all parties involved in the contract, the contract start date, end date, and key obligations for each party."
    }
)

print(response.data)

Extract with Custom Parsing

Combine extraction with custom parsing options:

from reducto import Reducto

client = Reducto()

response = client.extract.run(
    input="https://example.com/document.pdf",
    instructions={
        "schema": {
            "section_titles": "array",
            "key_figures": "array"
        }
    },
    parsing={
        "enhance": {
            "summarize_figures": True
        },
        "formatting": {
            "add_page_markers": True
        }
    }
)

Async Job Processing

For large documents or batch processing, use async jobs:

from reducto import Reducto

client = Reducto()

# Start an async extraction job
job = client.extract.run_job(
    input="https://example.com/large-document.pdf",
    instructions={
        "schema": {
            "field1": "string",
            "field2": "number"
        }
    },
    async_={
        "webhook": {"url": "https://example.com/webhook"}
    }
)

print(f"Job ID: {job.job_id}")

# Fetch the job's current state; repeat this call until the job has finished
result = client.job.get(job.job_id)
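
A single `client.job.get` call returns the job's state at that moment, so polling needs a loop with a delay and a deadline. The sketch below is one way to write it. `wait_for_job` is a hypothetical helper, and because this page does not document the shape of the job object, the caller supplies the `is_done` predicate. The status values in the commented usage are assumptions.

```python
import time
from typing import Any, Callable


def wait_for_job(
    fetch: Callable[[str], Any],
    job_id: str,
    is_done: Callable[[Any], bool],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `fetch(job_id)` until `is_done(job)` is true or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if is_done(job):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout} seconds")
        sleep(interval)


# Hypothetical usage; the `status` attribute and its values are assumptions:
#   result = wait_for_job(
#       client.job.get,
#       job.job_id,
#       is_done=lambda j: j.status in ("Completed", "Failed"),
#   )
```

When a webhook is configured as in the example above, polling becomes a fallback, not the main way to learn that the job has finished.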

Reusing Parsed Documents

Extract from a document that was previously parsed:

from reducto import Reducto

client = Reducto()

# First parse the document
parse_response = client.parse.run(
    input="https://example.com/document.pdf"
)

# Then extract using the job ID (no re-parsing needed)
extract_response = client.extract.run(
    input=f"jobid://{parse_response.job_id}",
    instructions={
        "schema": {"key_data": "string"}
    }
)
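
The same parse job can feed several extractions, which avoids re-parsing the document for each schema. The sketch below runs one extraction per named schema against a single job ID. `extract_all` is a hypothetical helper; it relies only on the `client.extract.run(input=..., instructions=...)` call shown above.

```python
from typing import Any


def extract_all(client: Any, parse_job_id: str, schemas: dict[str, dict]) -> dict[str, Any]:
    """Run one extraction per schema, all reading from the same parse job."""
    source = f"jobid://{parse_job_id}"
    results: dict[str, Any] = {}
    for name, schema in schemas.items():
        results[name] = client.extract.run(
            input=source,
            instructions={"schema": schema},
        )
    return results


# Hypothetical usage, continuing from the parse call above:
#   results = extract_all(client, parse_response.job_id, {
#       "summary": {"key_data": "string"},
#       "totals": {"total_amount": "number"},
#   })
```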

Complex Schema Example

from reducto import Reducto

client = Reducto()

# Extract structured data from financial statements
schema = {
    "company_info": {
        "name": "string",
        "ticker": "string",
        "fiscal_year": "string"
    },
    "financial_metrics": {
        "revenue": "number",
        "net_income": "number",
        "eps": "number",
        "operating_expenses": "number"
    },
    "balance_sheet": {
        "total_assets": "number",
        "total_liabilities": "number",
        "shareholders_equity": "number"
    },
    "key_risks": "array"
}

response = client.extract.run(
    input="https://example.com/10k-filing.pdf",
    instructions={"schema": schema}
)

print(response.data)
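
Nested results like this one are often easier to export, for example to a spreadsheet row, after flattening. The sketch below turns a nested record into dotted keys. It assumes the extracted record is a plain nested dict mirroring the schema, which this page does not confirm, and `flatten` is a hypothetical helper.

```python
from typing import Any


def flatten(record: dict, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value  # lists and scalars are kept as-is
    return flat
```

For example, a record containing `{"company_info": {"ticker": "ACM"}}` yields the key `company_info.ticker`.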
