DocExtract API Documentation
Extract structured data from invoices, receipts, and bank statements with our powerful document processing API.
Authentication
All API requests require authentication using an API key. Include your API key in the X-API-Key header with every request.
X-API-Key: your_api_key_here
You can generate API keys from your Dashboard. Each key can have specific scopes and IP restrictions.
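For readers calling the API without an SDK, here is a minimal sketch of attaching the key using only Python's standard library. The request is built but not sent. `build_request` is a hypothetical helper, not part of DocExtract, and the base URL is taken from the cURL examples in this document.

```python
import urllib.request

API_BASE = "https://api.docextract.io/v1"  # base URL as used in the cURL examples


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Return an unsent POST request carrying the X-API-Key header.

    build_request is a hypothetical helper, not part of the DocExtract SDK.
    """
    request = urllib.request.Request(f"{API_BASE}{path}", method="POST")
    request.add_header("X-API-Key", api_key)
    return request


request = build_request("/extract/auto", "your_api_key_here")
# urllib stores the header name as "X-api-key"; HTTP header names are case-insensitive.
print(request.full_url, request.get_header("X-api-key"))
```

In real code the key would normally come from an environment variable such as `DOCEXTRACT_API_KEY` rather than a literal string.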
Quickstart
Extract data from a document in just a few lines of code:
Node.js:

const { DocExtract } = require('docextract');

const client = new DocExtract({ apiKey: 'your_api_key' });

async function main() {
  const result = await client.extract({
    file: './invoice.pdf',
    type: 'auto'
  });

  console.log(result.extractedData);
  // { vendor: 'Acme Corp', total: 1250.00, date: '2026-01-10', ... }
}

main().catch(console.error);

Python:

from docextract import DocExtract

client = DocExtract("your_api_key")

result = client.extract(
    file="./invoice.pdf",
    type="auto"
)

print(result.extracted_data)
# {'vendor': 'Acme Corp', 'total': 1250.00, 'date': '2026-01-10', ...}

cURL:

curl -X POST https://api.docextract.io/v1/extract/auto \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf"
Auto Extract
Automatically detect document type and extract structured data.
Request Parameters
| Parameter | Type | Description |
|---|---|---|
| file (required) | file | The document file (PDF, PNG, JPG, TIFF) |
| include_raw_text (optional) | boolean | Include raw OCR text in response (default: false) |
| webhook_url (optional) | string | URL to receive async processing results |
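A client-side check before uploading can catch an unsupported format or an oversized file without spending a request. The sketch below is illustrative only: `check_upload` is a hypothetical helper, accepting the `.jpeg` and `.tif` spellings is an assumption, and the size limit is passed in by the caller because it depends on the plan.

```python
from pathlib import Path

# Extensions for the formats listed in the parameter table (PDF, PNG, JPG, TIFF).
# Accepting the .jpeg and .tif spellings is an assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def check_upload(filename: str, size_bytes: int, max_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means the file looks uploadable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > max_bytes:
        problems.append(f"file is {size_bytes} bytes, limit is {max_bytes}")
    return problems


# Whether the API counts a megabyte as 1024 * 1024 or 1,000,000 bytes is not documented.
ten_mb = 10 * 1024 * 1024
print(check_upload("invoice.pdf", 200_000, ten_mb))
print(check_upload("notes.docx", 200_000, ten_mb))
```

The server remains the authority on what it accepts. A check like this only avoids the most obvious rejected uploads.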
Response
{
"success": true,
"documentType": "invoice",
"detectedType": "invoice",
"confidence": {
"overall": 0.94,
"fields": {
"vendor": 0.98,
"total": 0.95
}
},
"extractedData": {
"vendor": "Acme Corporation",
"invoiceNumber": "INV-2026-001",
"date": "2026-01-10",
"dueDate": "2026-02-10",
"subtotal": 1150.00,
"tax": 100.00,
"total": 1250.00,
"currency": "USD",
"lineItems": [...]
},
"processingTimeMs": 1234
}
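One way to use the per-field confidence scores is to flag fields for manual review. The following is a minimal sketch against a trimmed copy of the sample response. `fields_needing_review` and the 0.96 threshold are illustrative choices, not documented recommendations.

```python
import json

# Trimmed copy of the sample response above.
SAMPLE_RESPONSE = json.loads("""
{
  "success": true,
  "documentType": "invoice",
  "confidence": {"overall": 0.94, "fields": {"vendor": 0.98, "total": 0.95}},
  "extractedData": {"vendor": "Acme Corporation", "total": 1250.00}
}
""")


def fields_needing_review(response: dict, threshold: float = 0.96) -> list[str]:
    """Return names of extracted fields whose confidence is below the threshold.

    The 0.96 default is an illustrative value, not a documented recommendation.
    """
    field_scores = response.get("confidence", {}).get("fields", {})
    return sorted(name for name, score in field_scores.items() if score < threshold)


print(fields_needing_review(SAMPLE_RESPONSE))  # ['total']
```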
Extract Invoice
Extract structured data specifically from invoices.
Extracted Fields
| Field | Type | Description |
|---|---|---|
| vendor | string | Vendor/supplier name |
| invoiceNumber | string | Invoice reference number |
| date | string | Invoice date (ISO 8601) |
| dueDate | string | Payment due date |
| subtotal | number | Subtotal before tax |
| tax | number | Tax amount |
| total | number | Total amount |
| currency | string | Currency code (USD, EUR, etc.) |
| lineItems | array | List of line items |
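Because `subtotal`, `tax`, and `total` are extracted independently, a quick arithmetic check can catch OCR misreads. The sketch below assumes the simple relationship subtotal + tax = total, which may not hold for invoices with discounts or shipping charges. `totals_are_consistent` is a hypothetical helper.

```python
from decimal import Decimal


def totals_are_consistent(invoice: dict, tolerance: str = "0.01") -> bool:
    """Check that subtotal + tax matches total within a small tolerance."""
    # Convert via str() so float values such as 1150.0 become exact decimals.
    subtotal = Decimal(str(invoice["subtotal"]))
    tax = Decimal(str(invoice["tax"]))
    total = Decimal(str(invoice["total"]))
    return abs(subtotal + tax - total) <= Decimal(tolerance)


sample = {"subtotal": 1150.00, "tax": 100.00, "total": 1250.00}
print(totals_are_consistent(sample))  # True
```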
Extract Receipt
Extract structured data from receipts and purchase records.
Extracted Fields
| Field | Type | Description |
|---|---|---|
| merchant | string | Store/merchant name |
| date | string | Transaction date |
| time | string | Transaction time |
| items | array | Purchased items |
| subtotal | number | Subtotal amount |
| tax | number | Tax amount |
| total | number | Total amount |
| paymentMethod | string | Payment method used |
Extract Bank Statement
Extract transaction data from bank statements.
Extracted Fields
| Field | Type | Description |
|---|---|---|
| bankName | string | Financial institution name |
| accountNumber | string | Account number (masked) |
| statementPeriod | object | Start and end dates |
| openingBalance | number | Starting balance |
| closingBalance | number | Ending balance |
| transactions | array | List of transactions |
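The balances and transactions can be cross-checked against each other. This document does not specify the shape of a transaction, so the sketch below assumes each entry is an object with a signed `amount` (credits positive, debits negative). Treat that field name as hypothetical and adapt it to the actual response.

```python
from decimal import Decimal


def reconcile(statement: dict) -> Decimal:
    """Return closingBalance - (openingBalance + sum of transaction amounts).

    Assumes each transaction is a dict with a signed "amount" (credits positive,
    debits negative); the transaction shape is not specified in this document.
    """
    opening = Decimal(str(statement["openingBalance"]))
    closing = Decimal(str(statement["closingBalance"]))
    movement = sum(
        (Decimal(str(tx["amount"])) for tx in statement["transactions"]),
        Decimal("0"),
    )
    return closing - (opening + movement)


statement = {
    "openingBalance": 1000.00,
    "closingBalance": 1175.50,
    "transactions": [{"amount": 250.00}, {"amount": -74.50}],
}
print(reconcile(statement))  # a zero difference means the statement reconciles
```

A non-zero result suggests a missed or misread transaction and is a reasonable trigger for manual review.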
Detect Document Type
Detect document type without extracting data.
{
"type": "invoice",
"confidence": 0.96,
"signals": ["invoice_number", "due_date", "line_items"]
}
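A detection result can be used to decide which extraction endpoint to call next. The sketch below only maps `invoice` to a type-specific endpoint, because that is the only typed path confirmed by the cURL examples in this document. `choose_endpoint` and the 0.9 threshold are illustrative assumptions.

```python
# Endpoint paths confirmed by the cURL examples in this document.
AUTO_ENDPOINT = "/v1/extract/auto"
TYPED_ENDPOINTS = {"invoice": "/v1/extract/invoice"}


def choose_endpoint(detection: dict, min_confidence: float = 0.9) -> str:
    """Pick a type-specific endpoint when detection is confident, else fall back to auto.

    min_confidence=0.9 is an illustrative threshold, not a documented value.
    """
    doc_type = detection.get("type")
    confident = detection.get("confidence", 0.0) >= min_confidence
    if confident and doc_type in TYPED_ENDPOINTS:
        return TYPED_ENDPOINTS[doc_type]
    return AUTO_ENDPOINT


detection = {
    "type": "invoice",
    "confidence": 0.96,
    "signals": ["invoice_number", "due_date", "line_items"],
}
print(choose_endpoint(detection))  # /v1/extract/invoice
```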
Node.js SDK
Full-featured SDK with TypeScript support and async handling.
npm install docextract
const { DocExtract } = require('docextract');
const fs = require('fs');

// Initialize client
const client = new DocExtract({
  apiKey: process.env.DOCEXTRACT_API_KEY
});

// await is not available at the top level of a CommonJS module,
// so the calls are wrapped in an async function
async function main() {
  // Extract from file path
  const result = await client.extract({
    file: './documents/invoice.pdf',
    type: 'auto'
  });

  // Extract from buffer
  const buffer = fs.readFileSync('./invoice.pdf');
  const result2 = await client.extract({
    file: buffer,
    filename: 'invoice.pdf',
    type: 'invoice'
  });

  // Extract from URL
  const result3 = await client.extractFromUrl({
    url: 'https://example.com/invoice.pdf',
    type: 'auto'
  });

  // Batch processing
  const results = await client.extractBatch({
    files: ['./inv1.pdf', './inv2.pdf', './inv3.pdf'],
    type: 'invoice',
    concurrency: 3
  });

  console.log(result.extractedData);
}

main().catch(console.error);
TypeScript Support
import { DocExtract, InvoiceData, ExtractionResult } from 'docextract';

const client = new DocExtract({
  apiKey: process.env.DOCEXTRACT_API_KEY!
});

const result: ExtractionResult<InvoiceData> = await client.extract({
  file: './invoice.pdf',
  type: 'invoice'
});

// Fully typed extracted data
console.log(result.extractedData.vendor);
console.log(result.extractedData.total);
console.log(result.extractedData.lineItems);
Python SDK
Pythonic SDK with sync and async support.
pip install docextract
import os
from docextract import DocExtract

# Initialize client
client = DocExtract(api_key=os.environ["DOCEXTRACT_API_KEY"])

# Extract from file path
result = client.extract(
    file="./documents/invoice.pdf",
    type="auto"
)

# Extract from bytes
with open("invoice.pdf", "rb") as f:
    result = client.extract(
        file=f.read(),
        filename="invoice.pdf",
        type="invoice"
    )

# Extract from URL
result = client.extract_from_url(
    url="https://example.com/invoice.pdf",
    type="auto"
)

print(result.extracted_data)
Async Support
import asyncio
import os

from docextract import AsyncDocExtract

async def process_documents():
    client = AsyncDocExtract(api_key=os.environ["DOCEXTRACT_API_KEY"])

    # Process multiple documents concurrently
    tasks = [
        client.extract(file="inv1.pdf"),
        client.extract(file="inv2.pdf"),
        client.extract(file="inv3.pdf")
    ]
    results = await asyncio.gather(*tasks)

    for result in results:
        print(result.extracted_data)

asyncio.run(process_documents())
cURL Examples
Auto extract from a file:

curl -X POST https://api.docextract.io/v1/extract/auto \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf"

Extract an invoice and include the raw OCR text:

curl -X POST https://api.docextract.io/v1/extract/invoice \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf" \
  -F "include_raw_text=true"

Auto extract from a URL:

curl -X POST https://api.docextract.io/v1/extract/auto \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/invoice.pdf"}'

Detect the document type:

curl -X POST https://api.docextract.io/v1/detect \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf"
Error Handling
The API uses standard HTTP status codes and returns detailed error information.
| Code | Error | Description |
|---|---|---|
| 400 | VALIDATION_ERROR | Invalid request parameters |
| 401 | INVALID_API_KEY | Invalid or missing API key |
| 403 | FORBIDDEN | Insufficient permissions |
| 404 | NOT_FOUND | Resource doesn't exist |
| 413 | FILE_TOO_LARGE | File exceeds the size limit for your plan (see Rate Limits) |
| 415 | UNSUPPORTED_FORMAT | File format not supported |
| 422 | EXTRACTION_FAILED | Document couldn't be processed |
| 422 | UNREADABLE_PDF | PDF text extraction failed |
| 422 | LOW_QUALITY_IMAGE | Image quality too low for OCR |
| 429 | RATE_LIMITED | Too many requests |
| 500 | SERVER_ERROR | Internal server error |
{
"success": false,
"error": {
"code": "UNREADABLE_PDF",
"message": "Could not extract text from PDF",
"details": {
"suggestion": "Upload a higher quality image or native PDF"
}
}
}
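When calling the HTTP API directly, it can help to flatten the error body into the few fields a caller acts on. The sketch below parses the sample error above. Treating 429 and 500 as retryable is an assumption, since this document does not say which errors are safe to retry, and `summarize_error` is a hypothetical helper.

```python
import json

# Treating rate limiting and server errors as retryable is an assumption;
# the document does not state which errors are safe to retry.
RETRYABLE_STATUS_CODES = {429, 500}


def summarize_error(status_code: int, body: str) -> dict:
    """Flatten an error response into the fields a caller usually needs."""
    error = json.loads(body).get("error", {})
    return {
        "code": error.get("code"),
        "message": error.get("message"),
        "suggestion": error.get("details", {}).get("suggestion"),
        "retryable": status_code in RETRYABLE_STATUS_CODES,
    }


body = """
{"success": false,
 "error": {"code": "UNREADABLE_PDF",
           "message": "Could not extract text from PDF",
           "details": {"suggestion": "Upload a higher quality image or native PDF"}}}
"""
print(summarize_error(422, body))
```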
SDK Error Handling
Node.js:

import { DocExtract, DocExtractError } from 'docextract';

const client = new DocExtract({
  apiKey: process.env.DOCEXTRACT_API_KEY
});

try {
  const result = await client.extract({ file: './doc.pdf' });
} catch (error) {
  if (error instanceof DocExtractError) {
    console.log(error.code);       // 'UNREADABLE_PDF'
    console.log(error.message);    // 'Could not extract text from PDF'
    console.log(error.statusCode); // 422
    console.log(error.suggestion); // 'Upload a higher quality...'
  }
}

Python:

import os
from docextract import DocExtract, DocExtractError

client = DocExtract(api_key=os.environ["DOCEXTRACT_API_KEY"])

try:
    result = client.extract(file="doc.pdf")
except DocExtractError as e:
    print(e.code)         # 'UNREADABLE_PDF'
    print(e.message)      # 'Could not extract text from PDF'
    print(e.status_code)  # 422
    print(e.suggestion)   # 'Upload a higher quality...'
Rate Limits
Rate limits vary by plan:
| Plan | Requests/min | Requests/day | File Size |
|---|---|---|---|
| Free | 10 | 100 | 5 MB |
| Starter | 60 | 5,000 | 10 MB |
| Pro | 300 | 50,000 | 25 MB |
| Enterprise | Custom | Custom | Custom |
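API responses include the following rate limit headers:
X-RateLimit-Limit - Maximum number of requests allowed in the current window
X-RateLimit-Remaining - Requests remaining in the current window
X-RateLimit-Reset - Unix timestamp at which the limit resets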
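These headers are enough to compute how long a client should pause after exhausting its quota. The following is a minimal sketch. `seconds_until_reset` is a hypothetical helper, the header values shown are made up for illustration, and the lookup assumes the headers arrive in a plain dict with the exact casing shown above.

```python
import time
from typing import Optional


def seconds_until_reset(headers: dict, now: Optional[float] = None) -> float:
    """Return how long to wait before retrying, based on the rate limit headers.

    Returns 0.0 while requests remain or when the reset time has already passed.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "0"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)


# Illustrative header values, not captured from a real response.
headers = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1767225660",
}
print(seconds_until_reset(headers, now=1767225600.0))  # 60.0
```

Passing `now` explicitly keeps the function easy to test. In production code it would normally be left at its default.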
Webhooks
Receive async notifications when documents are processed.
Configuring Webhooks
Pass a webhook_url parameter with your extraction request:
curl -X POST https://api.docextract.io/v1/extract/auto \
  -H "X-API-Key: your_api_key" \
  -F "file=@invoice.pdf" \
  -F "webhook_url=https://yourapp.com/webhooks/docextract"
Webhook Payload
{
"event": "extraction.completed",
"timestamp": "2026-01-11T10:30:00Z",
"data": {
"id": "ext_abc123",
"documentType": "invoice",
"extractedData": { ... },
"confidence": {
"overall": 0.94
},
"processingTimeMs": 1234
}
}
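A receiving endpoint typically parses the body, checks the event name, and then reads the fields it needs. The sketch below does that for the payload shown above. `summarize_webhook` is a hypothetical helper, and since `extraction.completed` is the only event named in this document, anything else is simply ignored.

```python
import json


def summarize_webhook(raw_body: str) -> str:
    """Return a one-line summary for completed extractions, or "ignored" otherwise."""
    payload = json.loads(raw_body)
    # Only "extraction.completed" appears in this document; other event names are unknown.
    if payload.get("event") != "extraction.completed":
        return "ignored"
    data = payload["data"]
    overall = data["confidence"]["overall"]
    return f'{data["id"]}: {data["documentType"]} (confidence {overall})'


# Sample payload from above; extractedData is left empty because its fields are elided there.
raw_body = json.dumps({
    "event": "extraction.completed",
    "timestamp": "2026-01-11T10:30:00Z",
    "data": {
        "id": "ext_abc123",
        "documentType": "invoice",
        "extractedData": {},
        "confidence": {"overall": 0.94},
        "processingTimeMs": 1234,
    },
})
print(summarize_webhook(raw_body))  # ext_abc123: invoice (confidence 0.94)
```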
Verifying Webhooks
All webhook requests include an X-DocExtract-Signature header for verification.
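const crypto = require('crypto');

function verifyWebhook(payload, signature, secret) {
  // A missing header can never match
  if (typeof signature !== 'string') {
    return false;
  }

  const expected = crypto
    .createHmac('sha256', secret)
    .update(payload)
    .digest('hex');

  const signatureBuffer = Buffer.from(signature);
  const expectedBuffer = Buffer.from(expected);

  // timingSafeEqual throws if the buffers differ in length, so check that first
  if (signatureBuffer.length !== expectedBuffer.length) {
    return false;
  }

  return crypto.timingSafeEqual(signatureBuffer, expectedBuffer);
}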
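For Python services, an equivalent check can be sketched with the standard library. As in the Node.js example, it assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body. The secret value and the sample body below are made up for illustration.

```python
import hashlib
import hmac


def verify_webhook(payload: bytes, signature: str, secret: str) -> bool:
    """Return True when signature is the hex HMAC-SHA256 of payload under secret."""
    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time; comparing bytes also avoids a
    # TypeError if the header contains non-ASCII characters.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


# Hypothetical secret and body, used only to demonstrate the round trip.
secret = "example_webhook_secret"
body = b'{"event": "extraction.completed"}'
good_signature = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()

print(verify_webhook(body, good_signature, secret))      # True
print(verify_webhook(body + b" ", good_signature, secret))  # False
```

Verify against the raw bytes of the request body rather than a re-serialized JSON object, since re-serialization can change whitespace or key order and invalidate the signature.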