Structured Document Extraction

Extract and structure information from unstructured documents

Overview

This example demonstrates how to build a structured document extraction tool using the GAIK toolkit. The tool takes unstructured input (such as scanned documents or plain text) and converts it into a structured, queryable format.

Use Case

Scenario: You have a collection of unstructured documents (invoices, reports, forms) that need to be digitized and organized into a structured database.

Solution: Use GAIK's Parser and Knowledge Capture components to automatically extract and structure the information.

Benefits:

Automated data entry
Consistent data structure
Reduced manual processing time
Improved data accuracy

Implementation

Step 1: Setup

First, import the required components. This assumes the gaik-toolkit package is already installed in your project:

import { Parser, KnowledgeCapture } from 'gaik-toolkit';

// Initialize components
const parser = new Parser();
const capture = new KnowledgeCapture({
  entityTypes: ['name', 'date', 'amount', 'reference'],
  includeRelationships: true
});

Step 2: Parse the Document

Parse the input document to extract raw content:

async function parseDocument(filePath) {
  try {
    const result = await parser.parse(filePath);
    return result;
  } catch (error) {
    console.error('Parsing failed:', error);
    throw error;
  }
}

Step 3: Structure the Data

Use Knowledge Capture to structure the extracted content:

async function structureDocument(parseResult) {
  const knowledge = await capture.captureFromDocument(parseResult);

  // Organize entities by type
  const structured = {
    metadata: parseResult.metadata,
    entities: knowledge.entities,
    relationships: knowledge.relationships,
    summary: knowledge.summary
  };

  return structured;
}

Step 4: Create the Extraction Tool

Combine the steps into a complete extraction tool:

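import { Parser, KnowledgeCapture } from 'gaik-toolkit';

// Exported so other modules can import the tool
export class DocumentExtractionTool {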
  constructor() {
    this.parser = new Parser();
    this.capture = new KnowledgeCapture({
      entityTypes: ['name', 'date', 'amount', 'reference', 'description'],
      includeRelationships: true
    });
  }

  async extract(filePath) {
    // Parse document
    console.log('Parsing document...');
    const parsed = await this.parser.parse(filePath);

    // Structure content
    console.log('Structuring content...');
    const knowledge = await this.capture.captureFromDocument(parsed);

    // Format output
    return this.formatOutput(parsed, knowledge);
  }

  formatOutput(parsed, knowledge) {
    return {
      document: {
        filename: parsed.metadata.filename,
        pages: parsed.pages?.length || 1,
        processedAt: new Date().toISOString()
      },
      extracted: {
        entities: this.groupEntitiesByType(knowledge.entities),
        relationships: knowledge.relationships,
        concepts: knowledge.concepts
      },
      raw: {
        content: parsed.content,
        metadata: parsed.metadata
      }
    };
  }

  groupEntitiesByType(entities) {
    return entities.reduce((acc, entity) => {
      if (!acc[entity.type]) {
        acc[entity.type] = [];
      }
      acc[entity.type].push({
        value: entity.value,
        confidence: entity.confidence,
        position: entity.position
      });
      return acc;
    }, {});
  }
}
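To make the grouping step concrete, here is a small standalone sketch of the same reduce-based logic run on sample data. The entity shape (type, value, confidence, position) is taken from the class above; the sample values and the numeric position field are made up for illustration and do not come from GAIK.

```javascript
// Sample entities in the shape the class above assumes.
// The values (and the numeric position) are invented for this example.
const sampleEntities = [
  { type: 'name', value: 'Acme Corp', confidence: 0.95, position: 12 },
  { type: 'date', value: '2024-04-15', confidence: 0.97, position: 48 },
  { type: 'name', value: 'Jane Doe', confidence: 0.81, position: 90 }
];

// Same grouping logic as groupEntitiesByType, written as a plain function.
function groupEntitiesByType(entities) {
  return entities.reduce((acc, entity) => {
    if (!acc[entity.type]) {
      acc[entity.type] = [];
    }
    acc[entity.type].push({
      value: entity.value,
      confidence: entity.confidence,
      position: entity.position
    });
    return acc;
  }, {});
}

const grouped = groupEntitiesByType(sampleEntities);
console.log(Object.keys(grouped)); // [ 'name', 'date' ]
console.log(grouped.name.length);  // 2
```

Each key in the result holds every entity of that type in document order, which is why the validation and output sections later treat each entity group as an array.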

Complete Example

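Here's a complete example, assuming the class from Step 4 is exported from ./extraction-tool. The saveToDatabase function is a placeholder: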

import { DocumentExtractionTool } from './extraction-tool';

async function main() {
  const tool = new DocumentExtractionTool();

  try {
    // Extract from invoice
    const invoice = await tool.extract('invoice.pdf');
    console.log('Invoice data:', invoice.extracted.entities);

    // Extract from report
    const report = await tool.extract('report.docx');
    console.log('Report data:', report.extracted.entities);

    // Save structured data (awaited in case your save function is async)
    await saveToDatabase(invoice);
    await saveToDatabase(report);
  } catch (error) {
    console.error('Extraction failed:', error);
  }
}

// Placeholder: replace the body with calls to your database client
async function saveToDatabase(data) {
  console.log('Saving to database:', data.document.filename);
}

main();
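Before saving, it can help to flatten the nested result into one row per entity, since that maps more directly onto a relational table or a CSV export. The sketch below assumes the result shape produced by formatOutput in Step 4; the function name toEntityRows and the row layout are hypothetical choices, not part of GAIK.

```javascript
// Flatten one extraction result into one row per entity.
// toEntityRows is a hypothetical helper; field names follow formatOutput.
function toEntityRows(result) {
  const rows = [];
  for (const [type, group] of Object.entries(result.extracted.entities)) {
    for (const entity of group) {
      rows.push({
        filename: result.document.filename,
        processedAt: result.document.processedAt,
        type,
        value: entity.value,
        confidence: entity.confidence
      });
    }
  }
  return rows;
}

// Sample result using values from the Output Format section of this guide.
const sampleResult = {
  document: {
    filename: 'invoice_2024_001.pdf',
    processedAt: '2024-03-15T10:30:00Z'
  },
  extracted: {
    entities: {
      invoice_number: [{ value: 'INV-123456', confidence: 0.98 }],
      total_amount: [{ value: '$1,234.56', confidence: 0.99 }]
    }
  }
};

const rows = toEntityRows(sampleResult);
console.log(rows.length);  // 2
console.log(rows[0].type); // invoice_number
```

How the rows are then written (parameterized inserts, bulk load, and so on) depends on your database client.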

Advanced Usage

Custom Entity Extraction

Define custom entity types for your specific use case:

const capture = new KnowledgeCapture({
  entityTypes: [
    'invoice_number',
    'customer_name',
    'billing_address',
    'line_item',
    'total_amount',
    'due_date'
  ],
  customRules: {
    invoice_number: /INV-\d{6}/,
    total_amount: /Total:\s*\$[\d,]+\.\d{2}/
  }
});
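It is worth checking what these patterns actually match before wiring them into the tool. The sketch below runs the two regular expressions from customRules against a sample string using plain JavaScript. How GAIK applies customRules internally is not shown in this guide, so applyRules is a hypothetical helper and the sample text is invented.

```javascript
// The two patterns from the customRules example above.
const rules = {
  invoice_number: /INV-\d{6}/,
  total_amount: /Total:\s*\$[\d,]+\.\d{2}/
};

// Invented sample text for illustration.
const sampleText =
  'Invoice INV-123456 for Acme Corp. Total: $1,234.56 due 2024-04-15.';

// Hypothetical helper: return the first match per rule, or null.
function applyRules(text, ruleSet) {
  const matches = {};
  for (const [type, pattern] of Object.entries(ruleSet)) {
    const match = text.match(pattern);
    matches[type] = match ? match[0] : null;
  }
  return matches;
}

const found = applyRules(sampleText, rules);
console.log(found.invoice_number); // INV-123456
console.log(found.total_amount);   // Total: $1,234.56

// A capture group isolates the amount from its "Total:" label.
const amountOnly = sampleText.match(/Total:\s*(\$[\d,]+\.\d{2})/);
console.log(amountOnly[1]); // $1,234.56
```

Two things stand out: the total_amount pattern captures the "Total:" label along with the number, and INV-\d{6} also matches the first six digits of a longer number. Adding a capture group or a trailing \b may be appropriate for your documents.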

Batch Processing

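Process multiple documents concurrently. Note that Promise.all rejects as soon as any single extraction fails, so one bad file aborts the whole batch: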

async function batchExtract(filePaths) {
  const tool = new DocumentExtractionTool();

  const results = await Promise.all(
    filePaths.map(path => tool.extract(path))
  );

  return results;
}

// Process a list of documents (top-level await requires an ES module)
const documents = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];
const extracted = await batchExtract(documents);
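If one unreadable file should not discard the results for the rest of the batch, Promise.allSettled is a possible alternative to Promise.all. The sketch below takes the extractor as a parameter so it runs without GAIK; batchExtractSettled and fakeExtract are hypothetical names, and fakeExtract is a stub standing in for tool.extract.

```javascript
// Run every extraction and report successes and failures separately.
// extractFn stands in for tool.extract: any async (path) => result works.
async function batchExtractSettled(filePaths, extractFn) {
  const settled = await Promise.allSettled(
    filePaths.map(path => extractFn(path))
  );

  const succeeded = [];
  const failed = [];
  settled.forEach((outcome, index) => {
    if (outcome.status === 'fulfilled') {
      succeeded.push({ path: filePaths[index], result: outcome.value });
    } else {
      failed.push({ path: filePaths[index], error: outcome.reason });
    }
  });

  return { succeeded, failed };
}

// Hypothetical stub extractor for demonstration: fails for one file.
async function fakeExtract(path) {
  if (path === 'doc2.pdf') {
    throw new Error('unreadable file');
  }
  return { document: { filename: path } };
}

batchExtractSettled(['doc1.pdf', 'doc2.pdf', 'doc3.pdf'], fakeExtract)
  .then(({ succeeded, failed }) => {
    console.log(succeeded.length, failed.length); // 2 1
  });
```

With a real tool, the call would look like batchExtractSettled(paths, path => tool.extract(path)). Both versions start all extractions at once, so very large batches may still need to be split into chunks.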

Validation and Quality Checks

Add validation to ensure data quality:

class DocumentExtractionTool {
  // ... previous methods ...

  async extractWithValidation(filePath) {
    const result = await this.extract(filePath);

    // Validate extracted data
    const validation = this.validate(result);

    return {
      ...result,
      validation: validation,
      isValid: validation.errors.length === 0
    };
  }

  validate(result) {
    const errors = [];

    // Check required entities
    if (!result.extracted.entities.date) {
      errors.push('Missing required entity: date');
    }

    // Check confidence scores
    Object.values(result.extracted.entities).forEach(entityGroup => {
      entityGroup.forEach(entity => {
        if (entity.confidence < 0.7) {
          errors.push(`Low confidence for: ${entity.value}`);
        }
      });
    });

    return {
      errors: errors,
      warnings: [],
      timestamp: new Date().toISOString()
    };
  }
}
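The validate method above reports low-confidence entities as errors and leaves warnings empty. One possible variation, sketched below as a standalone function, keeps missing required entities as errors but records low confidence as warnings, with the required types and threshold passed in as options. The name validateEntities and its options are hypothetical and not part of GAIK.

```javascript
// Hypothetical variant: missing required types are errors,
// low or absent confidence scores are warnings.
function validateEntities(entities, { required = [], minConfidence = 0.7 } = {}) {
  const errors = [];
  const warnings = [];

  for (const type of required) {
    if (!entities[type] || entities[type].length === 0) {
      errors.push(`Missing required entity: ${type}`);
    }
  }

  for (const [type, group] of Object.entries(entities)) {
    for (const entity of group) {
      if (typeof entity.confidence !== 'number') {
        warnings.push(`No confidence score for ${type}: ${entity.value}`);
      } else if (entity.confidence < minConfidence) {
        warnings.push(
          `Low confidence (${entity.confidence}) for ${type}: ${entity.value}`
        );
      }
    }
  }

  return { errors, warnings, isValid: errors.length === 0 };
}

const report = validateEntities(
  { name: [{ value: 'Acme Corp', confidence: 0.55 }] },
  { required: ['date', 'name'] }
);
console.log(report.errors);   // [ 'Missing required entity: date' ]
console.log(report.warnings); // [ 'Low confidence (0.55) for name: Acme Corp' ]
console.log(report.isValid);  // false
```

Whether low confidence should block a record or only flag it for review is a policy decision for your pipeline; the split here is just one option.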

Output Format

Example output structure (abbreviated: the position, concepts, and raw fields produced by formatOutput are omitted):

{
  "document": {
    "filename": "invoice_2024_001.pdf",
    "pages": 2,
    "processedAt": "2024-03-15T10:30:00Z"
  },
  "extracted": {
    "entities": {
      "invoice_number": [
        { "value": "INV-123456", "confidence": 0.98 }
      ],
      "customer_name": [
        { "value": "Acme Corp", "confidence": 0.95 }
      ],
      "total_amount": [
        { "value": "$1,234.56", "confidence": 0.99 }
      ],
      "due_date": [
        { "value": "2024-04-15", "confidence": 0.97 }
      ]
    },
    "relationships": [
      {
        "source": "Acme Corp",
        "type": "owes",
        "target": "$1,234.56"
      }
    ]
  }
}
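Extracted values arrive as strings, so downstream code usually needs to pick one candidate per entity type and normalize it. The sketch below reads the structure shown above; bestValue and parseAmount are hypothetical helpers, and the amount parsing assumes US-style formatting like the "$1,234.56" in the example.

```javascript
// Abbreviated output object using values from the example above.
const output = {
  extracted: {
    entities: {
      total_amount: [{ value: '$1,234.56', confidence: 0.99 }],
      due_date: [{ value: '2024-04-15', confidence: 0.97 }]
    }
  }
};

// Hypothetical helper: highest-confidence value for a type, or null.
function bestValue(result, type) {
  const group = result.extracted.entities[type] || [];
  if (group.length === 0) {
    return null;
  }
  const best = group.reduce((a, b) => (b.confidence > a.confidence ? b : a));
  return best.value;
}

// Hypothetical helper: turn "$1,234.56" into 1234.56, or null if unparseable.
function parseAmount(text) {
  if (typeof text !== 'string') {
    return null;
  }
  const cleaned = text.replace(/[$,]/g, '').trim();
  if (cleaned === '') {
    return null;
  }
  const amount = Number(cleaned);
  return Number.isNaN(amount) ? null : amount;
}

console.log(parseAmount(bestValue(output, 'total_amount'))); // 1234.56
console.log(bestValue(output, 'due_date'));                  // 2024-04-15
console.log(bestValue(output, 'invoice_number'));            // null
```

For money that will be summed or compared, storing integer cents or using a decimal type in your database is generally safer than floating-point numbers.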

Next Steps

Integrate with your database system
Add custom validation rules for your document types
Explore Knowledge Extraction for querying extracted data
Check out the Semantic Video Search example
