The problem
The company’s value proposition was the breadth and timeliness of its financial data coverage. But its data pipeline was entirely manual: analysts would open PDFs, find the relevant tables and figures, and type the values into a spreadsheet. Twelve people processed 500 documents per week this way.
The data was always slightly stale, errors crept in under time pressure, and the team had no capacity to expand coverage to new document types.
What we built
The pipeline has four components:
Document classifier: Identifies the document type (fund factsheet, earnings release, regulatory filing, etc.) and routes it to the appropriate extraction template. This matters because the same data point — say, net asset value — appears in different positions and formats depending on document type.
Layout-aware parser: Uses a vision model to understand document structure, not just raw text. Tables, multi-column layouts, and embedded charts are handled correctly without brittle coordinate-based scraping.
Extraction model: Fine-tuned on 2,000 annotated examples from the client’s own document corpus. The model extracts structured data into a typed schema with per-field confidence scores.
Validation layer: Any extraction below the confidence threshold is queued for human review. This queue typically contains 3–5% of all fields — the genuinely ambiguous cases that even a human analyst would pause on.
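To make the routing and validation steps concrete, here is a minimal Python sketch of how a classified document might be matched to a template and how low-confidence fields might be split off for review. It is an illustration under stated assumptions, not the client's implementation: the template names, the `ExtractedField` type, the function names, and the 0.90 threshold are all hypothetical, since the actual threshold and schema are not given here.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real value is not stated in this write-up.
REVIEW_THRESHOLD = 0.90

# Hypothetical mapping from classifier output to extraction template name.
TEMPLATES = {
    "fund_factsheet": "factsheet_template",
    "earnings_release": "earnings_template",
    "regulatory_filing": "filing_template",
}


@dataclass(frozen=True)
class ExtractedField:
    """One value pulled from a document, with the model's confidence."""
    name: str
    value: str
    confidence: float  # assumed to be a score in [0, 1]


def select_template(doc_type: str) -> str:
    """Route a classified document to its extraction template."""
    if doc_type not in TEMPLATES:
        raise ValueError(f"no extraction template for document type: {doc_type!r}")
    return TEMPLATES[doc_type]


def split_for_review(fields, threshold=REVIEW_THRESHOLD):
    """Return (accepted, review_queue), splitting on the confidence threshold."""
    accepted, review_queue = [], []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            review_queue.append(field)
    return accepted, review_queue


# Illustrative values only; these are not real extraction results.
fields = [
    ExtractedField("net_asset_value", "104.27", 0.98),
    ExtractedField("share_class", "A", 0.62),
    ExtractedField("reporting_date", "2024-03-31", 0.95),
]
template = select_template("fund_factsheet")
accepted, review_queue = split_for_review(fields)
```

In this sketch, only `share_class` falls below the threshold and would go to a human reviewer; the other two fields pass straight through. Keeping the threshold as a single parameter makes it easy to trade review workload against error risk.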
Results
- Processing time: ~25 minutes per document → under 40 seconds
- Throughput: 500 documents/week → 500 documents/day with the same infrastructure
- Human review queue: ~3% of extracted fields (down from 100%)
- Data latency: reports available in the analytics platform within 2 hours of publication vs. next-day previously
- Team redeployed from extraction to analysis — first new product feature shipped in 6 weeks