DocExtract API Documentation

Extract structured data from invoices, receipts, and bank statements with our powerful document processing API.

Node.js

npm install docextract

Python

pip install docextract

Go

go get docextract

Ruby

gem install docextract

Authentication

All API requests require authentication using an API key. Include your API key in the X-API-Key header with every request.

HTTP Header

X-API-Key: your_api_key_here

Never expose your API key in client-side code. Always make API requests from your backend server.

You can generate API keys from your Dashboard. Each key can have specific scopes and IP restrictions.
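As a minimal sketch of the server-side pattern described above, the helper below reads the key from an environment variable and builds the X-API-Key header. The variable name DOCEXTRACT_API_KEY matches the SDK examples in this documentation; the function name build_auth_headers is hypothetical and not part of any SDK.

```python
import os


def build_auth_headers(env=None):
    """Return the authentication header for a DocExtract request.

    Reads the key from DOCEXTRACT_API_KEY so it is never hard-coded.
    `build_auth_headers` is a hypothetical helper, not an SDK function.
    """
    env = os.environ if env is None else env
    api_key = env.get("DOCEXTRACT_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DOCEXTRACT_API_KEY is not set")
    return {"X-API-Key": api_key}
```

Pass the returned dictionary as the headers of whichever HTTP client your backend uses.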

Quickstart

Extract data from a document in just a few lines of code:

JavaScript

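const { DocExtract } = require('docextract');

const client = new DocExtract({ apiKey: 'your_api_key' });

// CommonJS modules do not support top-level await, so wrap the call.
async function main() {
  const result = await client.extract({
    file: './invoice.pdf',
    type: 'auto'
  });

  console.log(result.extractedData);
  // { vendor: 'Acme Corp', total: 1250.00, date: '2026-01-10', ... }
}

main().catch(console.error);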

Python

from docextract import DocExtract

client = DocExtract("your_api_key")

result = client.extract(
    file="./invoice.pdf",
    type="auto"
)

print(result.extracted_data)
# {'vendor': 'Acme Corp', 'total': 1250.00, 'date': '2026-01-10', ...}

cURL

curl -X POST https://api.docextract.io/v1/extract/auto \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf"

Auto Extract

POST /v1/extract/auto

Automatically detect document type and extract structured data.

Request Parameters

Parameter	Required	Type	Description
file	required	file	The document file (PDF, PNG, JPG, TIFF)
include_raw_text	optional	boolean	Include raw OCR text in response (default: false)
webhook_url	optional	string	URL to receive async processing results

Response

200 OK

{
  "success": true,
  "documentType": "invoice",
  "detectedType": "invoice",
  "confidence": {
    "overall": 0.94,
    "fields": {
      "vendor": 0.98,
      "total": 0.95
    }
  },
  "extractedData": {
    "vendor": "Acme Corporation",
    "invoiceNumber": "INV-2026-001",
    "date": "2026-01-10",
    "dueDate": "2026-02-10",
    "subtotal": 1150.00,
    "tax": 100.00,
    "total": 1250.00,
    "currency": "USD",
    "lineItems": [...]
  },
  "processingTimeMs": 1234
}
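The per-field confidence scores in this response can be used to decide which values need manual review. The sketch below works on the parsed JSON shown above; the function name low_confidence_fields and the threshold values are illustrative assumptions, not API features.

```python
def low_confidence_fields(response, threshold=0.9):
    """Return the names of fields scored below `threshold`, sorted by name.

    `threshold` is an arbitrary example value; pick one that suits your
    own tolerance for extraction errors.
    """
    field_scores = response.get("confidence", {}).get("fields", {})
    return sorted(name for name, score in field_scores.items() if score < threshold)


# Trimmed copy of the sample response above.
response = {
    "success": True,
    "confidence": {"overall": 0.94, "fields": {"vendor": 0.98, "total": 0.95}},
}

print(low_confidence_fields(response, threshold=0.96))  # ['total']
```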

Extract Invoice

POST /v1/extract/invoice

Extract structured data specifically from invoices.

Extracted Fields

Field	Type	Description
vendor	string	Vendor/supplier name
invoiceNumber	string	Invoice reference number
date	string	Invoice date (ISO 8601)
dueDate	string	Payment due date
subtotal	number	Subtotal before tax
tax	number	Tax amount
total	number	Total amount
currency	string	Currency code (USD, EUR, etc.)
lineItems	array	List of line items
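
One way to sanity-check an extracted invoice is to confirm that subtotal plus tax matches total, as it does in the sample response. This is a sketch under that assumption: real invoices with discounts or shipping may not satisfy it, and totals_are_consistent is a hypothetical helper.

```python
from decimal import Decimal


def totals_are_consistent(invoice, tolerance=Decimal("0.01")):
    """Check that subtotal + tax equals total within `tolerance`.

    Values go through str() before Decimal() to avoid binary
    floating-point rounding when comparing currency amounts.
    """
    subtotal = Decimal(str(invoice["subtotal"]))
    tax = Decimal(str(invoice["tax"]))
    total = Decimal(str(invoice["total"]))
    return abs(subtotal + tax - total) <= tolerance


invoice = {"subtotal": 1150.00, "tax": 100.00, "total": 1250.00}
print(totals_are_consistent(invoice))  # True
```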

Extract Receipt

POST /v1/extract/receipt

Extract structured data from receipts and purchase records.

Extracted Fields

Field	Type	Description
merchant	string	Store/merchant name
date	string	Transaction date
time	string	Transaction time
items	array	Purchased items
subtotal	number	Subtotal amount
tax	number	Tax amount
total	number	Total amount
paymentMethod	string	Payment method used

Extract Bank Statement

POST /v1/extract/bank-statement

Extract transaction data from bank statements.

Extracted Fields

Field	Type	Description
bankName	string	Financial institution name
accountNumber	string	Account number (masked)
statementPeriod	object	Start and end dates
openingBalance	number	Starting balance
closingBalance	number	Ending balance
transactions	array	List of transactions
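
A common check on an extracted statement is that the opening balance plus the net of all transactions equals the closing balance. The sketch below assumes each entry in transactions carries a signed amount field (negative for debits). This documentation does not specify the transaction shape, so treat that field name as hypothetical and adjust it to the actual response.

```python
from decimal import Decimal


def statement_balances(statement, tolerance=Decimal("0.01")):
    """Check openingBalance + sum(transaction amounts) == closingBalance.

    Assumes a signed `amount` key on each transaction (an assumption,
    not a documented field).
    """
    opening = Decimal(str(statement["openingBalance"]))
    closing = Decimal(str(statement["closingBalance"]))
    net = sum(
        (Decimal(str(txn["amount"])) for txn in statement["transactions"]),
        Decimal("0"),
    )
    return abs(opening + net - closing) <= tolerance


statement = {
    "openingBalance": 1000.00,
    "closingBalance": 1249.50,
    "transactions": [{"amount": -250.50}, {"amount": 500.00}],
}
print(statement_balances(statement))  # True
```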

Detect Document Type

POST /v1/detect

Detect document type without extracting data.

200 OK

{
  "type": "invoice",
  "confidence": 0.96,
  "signals": ["invoice_number", "due_date", "line_items"]
}
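
A detection result can be used to pick a type-specific endpoint before extracting. In the sketch below, only the "invoice" type string appears in this documentation; the "receipt" and "bank_statement" strings, the 0.8 cutoff, and the function name are assumptions. Anything unrecognized or low-confidence falls back to the auto endpoint.

```python
# Type strings other than "invoice" are assumed, not documented.
EXTRACT_PATHS = {
    "invoice": "/v1/extract/invoice",
    "receipt": "/v1/extract/receipt",
    "bank_statement": "/v1/extract/bank-statement",
}

AUTO_PATH = "/v1/extract/auto"


def choose_extract_path(detection, min_confidence=0.8):
    """Map a /v1/detect response to an extraction endpoint path.

    Falls back to the auto endpoint when the type is unknown or the
    confidence is below `min_confidence` (an arbitrary example cutoff).
    """
    path = EXTRACT_PATHS.get(detection.get("type"))
    if path is None or detection.get("confidence", 0.0) < min_confidence:
        return AUTO_PATH
    return path


print(choose_extract_path({"type": "invoice", "confidence": 0.96}))
# /v1/extract/invoice
```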

Node.js SDK

Full-featured SDK with TypeScript support and async handling.

Installation

npm install docextract

JavaScript

const { DocExtract } = require('docextract');
const fs = require('fs');

// Initialize client
const client = new DocExtract({
  apiKey: process.env.DOCEXTRACT_API_KEY
});

// Extract from file path
const result = await client.extract({
  file: './documents/invoice.pdf',
  type: 'auto'
});

// Extract from buffer
const buffer = fs.readFileSync('./invoice.pdf');
const result2 = await client.extract({
  file: buffer,
  filename: 'invoice.pdf',
  type: 'invoice'
});

// Extract from URL
const result3 = await client.extractFromUrl({
  url: 'https://example.com/invoice.pdf',
  type: 'auto'
});

// Batch processing
const results = await client.extractBatch({
  files: ['./inv1.pdf', './inv2.pdf', './inv3.pdf'],
  type: 'invoice',
  concurrency: 3
});

console.log(result.extractedData);

TypeScript Support

TypeScript

import { DocExtract, InvoiceData, ExtractionResult } from 'docextract';

const client = new DocExtract({ apiKey: process.env.DOCEXTRACT_API_KEY! });

const result: ExtractionResult<InvoiceData> = await client.extract({
  file: './invoice.pdf',
  type: 'invoice'
});

// Fully typed extracted data
console.log(result.extractedData.vendor);
console.log(result.extractedData.total);
console.log(result.extractedData.lineItems);

Python SDK

Pythonic SDK with sync and async support.

Installation

pip install docextract

Python

import os
from docextract import DocExtract

# Initialize client
client = DocExtract(api_key=os.environ["DOCEXTRACT_API_KEY"])

# Extract from file path
result = client.extract(
    file="./documents/invoice.pdf",
    type="auto"
)

# Extract from bytes
with open("invoice.pdf", "rb") as f:
    result = client.extract(
        file=f.read(),
        filename="invoice.pdf",
        type="invoice"
    )

# Extract from URL
result = client.extract_from_url(
    url="https://example.com/invoice.pdf",
    type="auto"
)

print(result.extracted_data)

Async Support

Python (Async)

import asyncio
import os
from docextract import AsyncDocExtract

async def process_documents():
    client = AsyncDocExtract(api_key=os.environ["DOCEXTRACT_API_KEY"])

    # Process multiple documents concurrently
    tasks = [
        client.extract(file="inv1.pdf"),
        client.extract(file="inv2.pdf"),
        client.extract(file="inv3.pdf")
    ]

    results = await asyncio.gather(*tasks)

    for result in results:
        print(result.extracted_data)

asyncio.run(process_documents())

cURL Examples

Auto Extract from File

curl -X POST https://api.docextract.io/v1/extract/auto \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf"

Extract with Options

curl -X POST https://api.docextract.io/v1/extract/invoice \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf" \
  -F "include_raw_text=true"

Extract from URL

curl -X POST https://api.docextract.io/v1/extract/auto \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/invoice.pdf"}'

Detect Document Type

curl -X POST https://api.docextract.io/v1/detect \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf"

Error Handling

The API uses standard HTTP status codes and returns detailed error information.

Code	Error	Description
400	VALIDATION_ERROR	Invalid request parameters
401	INVALID_API_KEY	Invalid or missing API key
403	FORBIDDEN	Insufficient permissions
404	NOT_FOUND	Resource doesn't exist
413	FILE_TOO_LARGE	File exceeds the size limit for your plan (see Rate Limits)
415	UNSUPPORTED_FORMAT	File format not supported
422	EXTRACTION_FAILED	Document couldn't be processed
422	UNREADABLE_PDF	PDF text extraction failed
422	LOW_QUALITY_IMAGE	Image quality too low for OCR
429	RATE_LIMITED	Too many requests
500	SERVER_ERROR	Internal server error

Error Response

{
  "success": false,
  "error": {
    "code": "UNREADABLE_PDF",
    "message": "Could not extract text from PDF",
    "details": {
      "suggestion": "Upload a higher quality image or native PDF"
    }
  }
}
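
If you call the HTTP API directly rather than through an SDK, the error envelope above can be turned into an exception. The sketch below is one way to do that; ApiError and error_from_response are hypothetical names, and treating 429 and 500 as retryable is a design assumption rather than documented behavior.

```python
# Status codes this sketch treats as worth retrying (an assumption).
RETRYABLE_STATUS_CODES = {429, 500}


class ApiError(Exception):
    """Hypothetical exception mirroring the documented error envelope."""

    def __init__(self, status_code, code, message, suggestion=None):
        super().__init__(f"{code}: {message}")
        self.status_code = status_code
        self.code = code
        self.message = message
        self.suggestion = suggestion

    @property
    def retryable(self):
        return self.status_code in RETRYABLE_STATUS_CODES


def error_from_response(status_code, body):
    """Build an ApiError from an HTTP status and the parsed JSON body."""
    error = body.get("error") or {}
    details = error.get("details") or {}
    return ApiError(
        status_code=status_code,
        code=error.get("code", "UNKNOWN"),
        message=error.get("message", ""),
        suggestion=details.get("suggestion"),
    )
```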

SDK Error Handling

JavaScript

import { DocExtract, DocExtractError } from 'docextract';

const client = new DocExtract({ apiKey: process.env.DOCEXTRACT_API_KEY });

try {
  const result = await client.extract({ file: './doc.pdf' });
  console.log(result.extractedData);
} catch (error) {
  if (error instanceof DocExtractError) {
    console.log(error.code);       // 'UNREADABLE_PDF'
    console.log(error.message);    // 'Could not extract text'
    console.log(error.statusCode); // 422
    console.log(error.suggestion); // 'Upload a higher quality...'
  } else {
    throw error; // Not an API error, so let it propagate
  }
}

Python

import os
from docextract import DocExtract, DocExtractError

client = DocExtract(api_key=os.environ["DOCEXTRACT_API_KEY"])

try:
    result = client.extract(file="doc.pdf")
except DocExtractError as e:
    print(e.code)        # 'UNREADABLE_PDF'
    print(e.message)     # 'Could not extract text'
    print(e.status_code) # 422
    print(e.suggestion)  # 'Upload a higher quality...'

Rate Limits

Rate limits vary by plan:

Plan	Requests/min	Requests/day	File Size
Free	10	100	5 MB
Starter	60	5,000	10 MB
Pro	300	50,000	25 MB
Enterprise	Custom	Custom	Custom

Rate limit headers are included in all responses:
X-RateLimit-Limit - Maximum requests allowed
X-RateLimit-Remaining - Requests remaining
X-RateLimit-Reset - Unix timestamp when limit resets
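
These headers are enough to pause before the next request instead of running into a 429. The sketch below assumes the header values arrive as strings in a plain dictionary with the exact casing shown above; seconds_until_reset is a hypothetical helper.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before sending the next request.

    Returns 0.0 while requests remain; otherwise the time left until
    the Unix timestamp in X-RateLimit-Reset.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers["X-RateLimit-Reset"])
    return max(0.0, reset_at - now)


headers = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1000",
}
print(seconds_until_reset(headers, now=990))  # 10.0
```

Passing now explicitly keeps the function easy to test; in production, omit it to use the current time.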

Webhooks

Receive async notifications when documents are processed.

Configuring Webhooks

Pass a webhook_url parameter with your extraction request:

Request with Webhook

curl -X POST https://api.docextract.io/v1/extract/auto \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf" \
  -F "webhook_url=https://yourapp.com/webhooks/docextract"

Webhook Payload

{
  "event": "extraction.completed",
  "timestamp": "2026-01-11T10:30:00Z",
  "data": {
    "id": "ext_abc123",
    "documentType": "invoice",
    "extractedData": { ... },
    "confidence": {
      "overall": 0.94
    },
    "processingTimeMs": 1234
  }
}

Verifying Webhooks

All webhook requests include an X-DocExtract-Signature header for verification.

Node.js Verification

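const crypto = require('crypto');

// Pass the raw request body as payload; re-serialized JSON may differ byte for byte.
function verifyWebhook(payload, signature, secret) {
  if (typeof signature !== 'string') {
    return false; // Header missing
  }
  const expected = crypto
    .createHmac('sha256', secret)
    .update(payload)
    .digest('hex');
  const received = Buffer.from(signature);
  const computed = Buffer.from(expected);
  // timingSafeEqual throws on length mismatch, so reject those up front.
  if (received.length !== computed.length) {
    return false;
  }
  return crypto.timingSafeEqual(received, computed);
}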
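
Python Verification

The same check can be sketched in Python with the standard library. Like the Node.js example, it assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body. This documentation does not state where the signing secret comes from, so the secret argument is a placeholder.

```python
import hashlib
import hmac


def verify_webhook(payload: bytes, signature: str, secret: str) -> bool:
    """Return True if `signature` matches the HMAC-SHA256 of `payload`.

    `payload` must be the raw request body bytes. compare_digest runs
    in constant time and accepts inputs of different lengths.
    """
    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8"))
```

Read the signature from the X-DocExtract-Signature header and reject the request if the function returns False.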

Webhooks are retried up to 3 times with exponential backoff if your endpoint returns a non-2xx status code.
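
Because deliveries can be retried, the same event may reach your endpoint more than once, so handlers are best written to be idempotent. The sketch below deduplicates on data.id from the payload shown earlier. It assumes that id stays the same across retries, which this documentation does not confirm, and uses an in-memory set where a real service would use persistent storage. WebhookProcessor is a hypothetical name.

```python
import json


class WebhookProcessor:
    """Processes each extraction.completed event at most once."""

    def __init__(self):
        self._seen_ids = set()  # Use a database or cache in production
        self.completed = []

    def handle(self, raw_body):
        """Return 'processed', 'duplicate', or 'ignored' for one delivery."""
        payload = json.loads(raw_body)
        if payload.get("event") != "extraction.completed":
            return "ignored"
        extraction_id = payload["data"]["id"]
        if extraction_id in self._seen_ids:
            return "duplicate"
        self._seen_ids.add(extraction_id)
        self.completed.append(payload["data"])
        return "processed"


processor = WebhookProcessor()
body = json.dumps({"event": "extraction.completed", "data": {"id": "ext_abc123"}})
print(processor.handle(body))  # processed
print(processor.handle(body))  # duplicate
```

Respond with a 2xx status for duplicates as well, so the delivery is not retried again.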