Structured Document Extraction
Build a tool to extract and structure documents from user input
Overview
This example demonstrates how to build a structured document extraction tool using the GAIK toolkit. The tool takes unstructured user input (such as scanned documents or text) and converts it into a structured, queryable format.
Use Case
Scenario: You have a collection of unstructured documents (invoices, reports, forms) that need to be digitized and organized into a structured database.
Solution: Use GAIK's Parser and Knowledge Capture components to automatically extract and structure the information.
Benefits:
- Automated data entry
- Consistent data structure
- Reduced manual processing time
- Improved data accuracy
Implementation
Step 1: Setup
First, install the toolkit, then import and initialize the required components:
import { Parser, KnowledgeCapture } from 'gaik-toolkit';
// Initialize components
const parser = new Parser();
const capture = new KnowledgeCapture({
  entityTypes: ['name', 'date', 'amount', 'reference'],
  includeRelationships: true
});
Step 2: Parse the Document
Parse the input document to extract raw content:
async function parseDocument(filePath) {
  try {
    const result = await parser.parse(filePath);
    return result;
  } catch (error) {
    console.error('Parsing failed:', error);
    throw error;
  }
}
Step 3: Structure the Data
Use Knowledge Capture to structure the extracted content:
async function structureDocument(parseResult) {
  const knowledge = await capture.captureFromDocument(parseResult);
  // Organize entities by type
  const structured = {
    metadata: parseResult.metadata,
    entities: knowledge.entities,
    relationships: knowledge.relationships,
    summary: knowledge.summary
  };
  return structured;
}
Step 4: Create the Extraction Tool
Combine the steps into a complete extraction tool:
// Exported so that other modules can import it from './extraction-tool'
export class DocumentExtractionTool {
  constructor() {
    this.parser = new Parser();
    this.capture = new KnowledgeCapture({
      entityTypes: ['name', 'date', 'amount', 'reference', 'description'],
      includeRelationships: true
    });
  }
  async extract(filePath) {
    // Parse document
    console.log('Parsing document...');
    const parsed = await this.parser.parse(filePath);
    // Structure content
    console.log('Structuring content...');
    const knowledge = await this.capture.captureFromDocument(parsed);
    // Format output
    return this.formatOutput(parsed, knowledge);
  }
  formatOutput(parsed, knowledge) {
    return {
      document: {
        filename: parsed.metadata.filename,
        pages: parsed.pages?.length || 1,
        processedAt: new Date().toISOString()
      },
      extracted: {
        entities: this.groupEntitiesByType(knowledge.entities),
        relationships: knowledge.relationships,
        concepts: knowledge.concepts
      },
      raw: {
        content: parsed.content,
        metadata: parsed.metadata
      }
    };
  }
  // Defaults to an empty list so a result without entities does not throw
  groupEntitiesByType(entities = []) {
    return entities.reduce((acc, entity) => {
      if (!acc[entity.type]) {
        acc[entity.type] = [];
      }
      acc[entity.type].push({
        value: entity.value,
        confidence: entity.confidence,
        position: entity.position
      });
      return acc;
    }, {});
  }
}
Complete Example
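The extracted.entities object logged in the example that follows is produced by groupEntitiesByType from Step 4. To show its shape without running the parser, here is a minimal standalone sketch of the same reduce logic applied to a hand-written entity list. The sample entities are made up for illustration, and the field names (type, value, confidence, position) are assumed to match what Knowledge Capture returns.

```javascript
// Standalone copy of the grouping logic from Step 4, so it runs without GAIK.
function groupEntitiesByType(entities) {
  return entities.reduce((acc, entity) => {
    if (!acc[entity.type]) {
      acc[entity.type] = [];
    }
    acc[entity.type].push({
      value: entity.value,
      confidence: entity.confidence,
      position: entity.position
    });
    return acc;
  }, {});
}

// Hypothetical entities; their shape is an assumption about Knowledge Capture's output.
const sampleEntities = [
  { type: 'date', value: '2024-03-15', confidence: 0.97, position: 12 },
  { type: 'amount', value: '$1,234.56', confidence: 0.99, position: 48 },
  { type: 'date', value: '2024-04-15', confidence: 0.91, position: 80 }
];

const grouped = groupEntitiesByType(sampleEntities);
console.log(Object.keys(grouped)); // [ 'date', 'amount' ]
console.log(grouped.date.length);  // 2
```

Each key of the grouped object is an entity type, and each value is the list of matches for that type in the order they were encountered.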
Here's a complete working example:
import { DocumentExtractionTool } from './extraction-tool';
async function main() {
  const tool = new DocumentExtractionTool();
  try {
    // Extract from invoice
    const invoice = await tool.extract('invoice.pdf');
    console.log('Invoice data:', invoice.extracted.entities);
    // Extract from report
    const report = await tool.extract('report.docx');
    console.log('Report data:', report.extracted.entities);
    // Save structured data
    saveToDatabase(invoice);
    saveToDatabase(report);
  } catch (error) {
    console.error('Extraction failed:', error);
  }
}
function saveToDatabase(data) {
  // Save to your database
  console.log('Saving to database:', data.document.filename);
  // Implementation depends on your database
}
main();
Advanced Usage
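The saveToDatabase placeholder in the complete example leaves persistence open. One simple option, sketched here under stated assumptions, is to append each result to a JSON Lines file with Node's built-in fs module. The function names saveToJsonLines and loadJsonLines and the file name extracted.jsonl are hypothetical and not part of GAIK. The sketch uses require so that it runs as a plain Node script; in an ES module the same calls work with import.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical stand-in for saveToDatabase: appends one JSON object per line.
function saveToJsonLines(data, filePath) {
  const record = {
    filename: data.document.filename,
    processedAt: data.document.processedAt,
    entities: data.extracted.entities
  };
  fs.appendFileSync(filePath, JSON.stringify(record) + '\n', 'utf8');
}

// Reads every stored record back into an array.
function loadJsonLines(filePath) {
  return fs
    .readFileSync(filePath, 'utf8')
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line));
}

// Usage with a hand-written result in the shape returned by formatOutput.
const storeDir = fs.mkdtempSync(path.join(os.tmpdir(), 'extract-'));
const storePath = path.join(storeDir, 'extracted.jsonl');

saveToJsonLines(
  {
    document: { filename: 'invoice.pdf', pages: 1, processedAt: new Date().toISOString() },
    extracted: { entities: { amount: [{ value: '$1,234.56', confidence: 0.99 }] } }
  },
  storePath
);

const stored = loadJsonLines(storePath);
console.log(stored.length, stored[0].filename);
```

An append-only file keeps the sketch free of external dependencies. A real deployment would more likely write to a database, in which case the save function would become async and its calls would need to be awaited.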
Custom Entity Extraction
Define custom entity types for your specific use case:
const capture = new KnowledgeCapture({
  entityTypes: [
    'invoice_number',
    'customer_name',
    'billing_address',
    'line_item',
    'total_amount',
    'due_date'
  ],
  customRules: {
    invoice_number: /INV-\d{6}/,
    total_amount: /Total:\s*\$[\d,]+\.\d{2}/
  }
});
Batch Processing
Process multiple documents concurrently. Promise.all starts every extraction at once and rejects as soon as any single extraction fails:
async function batchExtract(filePaths) {
  const tool = new DocumentExtractionTool();
  const results = await Promise.all(
    filePaths.map(path => tool.extract(path))
  );
  return results;
}
// Process folder of documents
// (top-level await works only inside an ES module)
const documents = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];
const extracted = await batchExtract(documents);
Validation and Quality Checks
Add validation to ensure data quality:
class DocumentExtractionTool {
  // ... previous methods ...
  async extractWithValidation(filePath) {
    const result = await this.extract(filePath);
    // Validate extracted data
    const validation = this.validate(result);
    return {
      ...result,
      validation: validation,
      isValid: validation.errors.length === 0
    };
  }
  validate(result) {
    const errors = [];
    // Check required entities
    if (!result.extracted.entities.date) {
      errors.push('Missing required entity: date');
    }
    // Check confidence scores
    Object.values(result.extracted.entities).forEach(entityGroup => {
      entityGroup.forEach(entity => {
        if (entity.confidence < 0.7) {
          errors.push(`Low confidence for: ${entity.value}`);
        }
      });
    });
    return {
      errors: errors,
      warnings: [],
      timestamp: new Date().toISOString()
    };
  }
}
Output Format
Example output structure:
{
  "document": {
    "filename": "invoice_2024_001.pdf",
    "pages": 2,
    "processedAt": "2024-03-15T10:30:00Z"
  },
  "extracted": {
    "entities": {
      "invoice_number": [
        { "value": "INV-123456", "confidence": 0.98 }
      ],
      "customer_name": [
        { "value": "Acme Corp", "confidence": 0.95 }
      ],
      "total_amount": [
        { "value": "$1,234.56", "confidence": 0.99 }
      ],
      "due_date": [
        { "value": "2024-04-15", "confidence": 0.97 }
      ]
    },
    "relationships": [
      {
        "source": "Acme Corp",
        "type": "owes",
        "target": "$1,234.56"
      }
    ]
  }
}
Next Steps
- Integrate with your database system
- Add custom validation rules for your document types
- Explore Knowledge Extraction for querying extracted data
- Check out the Semantic Video Search example
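As a starting point for the custom validation rules mentioned above, here is a minimal sketch of a per-document-type rule table applied to a result shaped like the Output Format example. The names requiredByDocType and validateForType, the document types, and the lists of required entities are all hypothetical and not part of GAIK. Unlike the validate method shown earlier, this sketch reports low-confidence values as warnings, not errors, which is a design choice and not a requirement.

```javascript
// Hypothetical per-document-type rules; none of these names come from GAIK.
const requiredByDocType = {
  invoice: ['invoice_number', 'total_amount', 'due_date'],
  report: ['date']
};

// Checks a result against the rules for one document type.
// Missing required entities become errors; low-confidence values become warnings.
function validateForType(result, docType, minConfidence = 0.7) {
  const entities = result.extracted.entities;
  const errors = [];
  const warnings = [];

  for (const type of requiredByDocType[docType] ?? []) {
    if (!entities[type] || entities[type].length === 0) {
      errors.push(`Missing required entity: ${type}`);
    }
  }

  for (const [type, group] of Object.entries(entities)) {
    for (const entity of group) {
      if (entity.confidence < minConfidence) {
        warnings.push(`Low confidence for ${type}: ${entity.value}`);
      }
    }
  }

  return { errors, warnings, isValid: errors.length === 0 };
}

// Hand-written sample: due_date is missing and total_amount has low confidence.
const sample = {
  extracted: {
    entities: {
      invoice_number: [{ value: 'INV-123456', confidence: 0.98 }],
      total_amount: [{ value: '$1,234.56', confidence: 0.62 }]
    }
  }
};

const check = validateForType(sample, 'invoice');
console.log(check);
```

Keeping the rules in a plain object makes it straightforward to add a new document type without touching the validation function itself.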