
Structured Document Extraction

Build a tool to extract and structure documents from user input

Document Extraction Tool

Extract and structure information from unstructured documents

Overview

This example demonstrates how to build a structured document extraction tool using the GAIK toolkit. The tool takes unstructured input (such as scanned documents or plain text) and converts it into a structured, queryable format.

Use Case

Scenario: You have a collection of unstructured documents (invoices, reports, forms) that need to be digitized and organized into a structured database.

Solution: Use GAIK's Parser and Knowledge Capture components to automatically extract and structure the information.

Benefits:

  • Automated data entry
  • Consistent data structure
  • Reduced manual processing time
  • Improved data accuracy

Implementation

Step 1: Setup

First, import the required components (this assumes the gaik-toolkit package is already installed in your project):

import { Parser, KnowledgeCapture } from 'gaik-toolkit';

// Initialize components
const parser = new Parser();
const capture = new KnowledgeCapture({
  entityTypes: ['name', 'date', 'amount', 'reference'],
  includeRelationships: true
});

Step 2: Parse the Document

Parse the input document to extract raw content:

async function parseDocument(filePath) {
  try {
    const result = await parser.parse(filePath);
    return result;
  } catch (error) {
    console.error('Parsing failed:', error);
    throw error;
  }
}

Step 3: Structure the Data

Use Knowledge Capture to structure the extracted content:

async function structureDocument(parseResult) {
  const knowledge = await capture.captureFromDocument(parseResult);

  // Organize entities by type
  const structured = {
    metadata: parseResult.metadata,
    entities: knowledge.entities,
    relationships: knowledge.relationships,
    summary: knowledge.summary
  };

  return structured;
}

Step 4: Create the Extraction Tool

Combine the steps into a complete extraction tool:

class DocumentExtractionTool {
  constructor() {
    this.parser = new Parser();
    this.capture = new KnowledgeCapture({
      entityTypes: ['name', 'date', 'amount', 'reference', 'description'],
      includeRelationships: true
    });
  }

  async extract(filePath) {
    // Parse document
    console.log('Parsing document...');
    const parsed = await this.parser.parse(filePath);

    // Structure content
    console.log('Structuring content...');
    const knowledge = await this.capture.captureFromDocument(parsed);

    // Format output
    return this.formatOutput(parsed, knowledge);
  }

  formatOutput(parsed, knowledge) {
    return {
      document: {
        filename: parsed.metadata.filename,
        pages: parsed.pages?.length || 1,
        processedAt: new Date().toISOString()
      },
      extracted: {
        entities: this.groupEntitiesByType(knowledge.entities),
        relationships: knowledge.relationships,
        concepts: knowledge.concepts
      },
      raw: {
        content: parsed.content,
        metadata: parsed.metadata
      }
    };
  }

  groupEntitiesByType(entities) {
    return entities.reduce((acc, entity) => {
      if (!acc[entity.type]) {
        acc[entity.type] = [];
      }
      acc[entity.type].push({
        value: entity.value,
        confidence: entity.confidence,
        position: entity.position
      });
      return acc;
    }, {});
  }
}
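
To make the shape of the grouped entities concrete, here is a minimal sketch that runs the same grouping logic as groupEntitiesByType on its own, without the toolkit. The sample entities are hypothetical; only the field names (type, value, confidence, position) are taken from the class above, and the numeric position values are an assumption about what that field holds.

```javascript
// Standalone copy of the grouping logic so it can run without gaik-toolkit.
function groupEntitiesByType(entities) {
  return entities.reduce((acc, entity) => {
    if (!acc[entity.type]) {
      acc[entity.type] = [];
    }
    acc[entity.type].push({
      value: entity.value,
      confidence: entity.confidence,
      position: entity.position
    });
    return acc;
  }, {});
}

// Hypothetical sample entities; real values would come from KnowledgeCapture.
const sampleEntities = [
  { type: 'name', value: 'Acme Corp', confidence: 0.95, position: 12 },
  { type: 'date', value: '2024-04-15', confidence: 0.97, position: 48 },
  { type: 'date', value: '2024-03-15', confidence: 0.91, position: 80 }
];

const grouped = groupEntitiesByType(sampleEntities);
console.log(Object.keys(grouped)); // [ 'name', 'date' ]
console.log(grouped.date.length);  // 2
```

Each key in the result is an entity type, and each value is an array, so a type that appears several times in a document keeps every occurrence.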

Complete Example

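Here's a complete example. It assumes the DocumentExtractionTool class from Step 4 is exported from ./extraction-tool, and it leaves saveToDatabase as a stub: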

import { DocumentExtractionTool } from './extraction-tool';

async function main() {
  const tool = new DocumentExtractionTool();

  try {
    // Extract from invoice
    const invoice = await tool.extract('invoice.pdf');
    console.log('Invoice data:', invoice.extracted.entities);

    // Extract from report
    const report = await tool.extract('report.docx');
    console.log('Report data:', report.extracted.entities);

    // Save structured data
    saveToDatabase(invoice);
    saveToDatabase(report);
  } catch (error) {
    console.error('Extraction failed:', error);
  }
}

function saveToDatabase(data) {
  // Save to your database
  console.log('Saving to database:', data.document.filename);
  // Implementation depends on your database
}

main();

Advanced Usage

Custom Entity Extraction

Define custom entity types for your specific use case:

const capture = new KnowledgeCapture({
  entityTypes: [
    'invoice_number',
    'customer_name',
    'billing_address',
    'line_item',
    'total_amount',
    'due_date'
  ],
  customRules: {
    invoice_number: /INV-\d{6}/,
    total_amount: /Total:\s*\$[\d,]+\.\d{2}/
  }
});
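
Before passing patterns to customRules, it can help to check them against sample text in plain JavaScript. The sketch below does only that: it does not call the toolkit, and how GAIK applies customRules internally is not shown here. The sample text and the firstMatches helper are hypothetical.

```javascript
// The same patterns as in the customRules example above.
const rules = {
  invoice_number: /INV-\d{6}/,
  total_amount: /Total:\s*\$[\d,]+\.\d{2}/
};

// Hypothetical invoice text used only for this check.
const sampleText = 'Invoice INV-123456\nCustomer: Acme Corp\nTotal: $1,234.56';

// Return the first match for each named pattern, or null if none is found.
function firstMatches(text, patterns) {
  const found = {};
  for (const [name, pattern] of Object.entries(patterns)) {
    const match = text.match(pattern);
    found[name] = match ? match[0] : null;
  }
  return found;
}

const matches = firstMatches(sampleText, rules);
console.log(matches.invoice_number); // INV-123456
console.log(matches.total_amount);   // Total: $1,234.56
```

Two details are worth noticing. The total_amount pattern matches the "Total:" label along with the amount, so a capture group would be needed to isolate the number. The invoice_number pattern also matches the first six digits of a longer number such as INV-1234567; adding \b after \d{6} would prevent that.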

Batch Processing

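Process multiple documents concurrently. Note that Promise.all starts every extraction at once and rejects as soon as any single extraction fails: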

async function batchExtract(filePaths) {
  const tool = new DocumentExtractionTool();

  const results = await Promise.all(
    filePaths.map(path => tool.extract(path))
  );

  return results;
}

// Process a list of documents (top-level await requires an ES module)
const documents = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];
const extracted = await batchExtract(documents);
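
For larger batches, one possible variation is to cap the number of extractions in flight and to record failures per file instead of aborting the whole batch. The sketch below is not part of the toolkit: batchExtractSettled, its limit parameter, and the fakeExtract stub are hypothetical names used for illustration. In real use, extractFn would be something like (path) => tool.extract(path).

```javascript
// Run extractFn over filePaths with at most `limit` extractions in flight.
// Each result records the path and either the value or the failure reason.
async function batchExtractSettled(filePaths, extractFn, limit = 3) {
  const results = new Array(filePaths.length);
  let next = 0; // index of the next path to claim

  async function worker() {
    while (next < filePaths.length) {
      const index = next++;
      const path = filePaths[index];
      try {
        results[index] = { path, status: 'fulfilled', value: await extractFn(path) };
      } catch (error) {
        results[index] = { path, status: 'rejected', reason: error };
      }
    }
  }

  // Start up to `limit` workers; each pulls paths until none remain.
  const workerCount = Math.min(limit, filePaths.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Stub standing in for tool.extract; it fails for one path on purpose.
const fakeExtract = async (path) => {
  if (path === 'broken.pdf') throw new Error('unreadable file');
  return { document: { filename: path } };
};

batchExtractSettled(['doc1.pdf', 'broken.pdf', 'doc3.pdf'], fakeExtract, 2)
  .then((results) => {
    for (const r of results) console.log(r.path, r.status);
  });
```

Results are returned in the same order as the input paths, so a caller can retry only the entries whose status is 'rejected'.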

Validation and Quality Checks

Add validation to ensure data quality:

class ValidatingExtractionTool extends DocumentExtractionTool {
  // Inherits extract(), formatOutput(), and groupEntitiesByType() from Step 4

  async extractWithValidation(filePath) {
    const result = await this.extract(filePath);

    // Validate extracted data
    const validation = this.validate(result);

    return {
      ...result,
      validation: validation,
      isValid: validation.errors.length === 0
    };
  }

  validate(result) {
    const errors = [];

    // Check required entities
    if (!result.extracted.entities.date) {
      errors.push('Missing required entity: date');
    }

    // Check confidence scores
    Object.values(result.extracted.entities).forEach(entityGroup => {
      entityGroup.forEach(entity => {
        if (entity.confidence < 0.7) {
          errors.push(`Low confidence for: ${entity.value}`);
        }
      });
    });

    return {
      errors: errors,
      warnings: [],
      timestamp: new Date().toISOString()
    };
  }
}
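
The validate method above records low-confidence values as errors and always returns an empty warnings array. One possible variation, sketched here as a standalone function, treats missing required entities as errors and low confidence as warnings, so a document can remain valid while still being flagged for review. The function name validateExtraction and its requiredTypes and minConfidence options are hypothetical; the defaults mirror the values used above.

```javascript
// Validate a result shaped like the output of formatOutput().
function validateExtraction(result, options = {}) {
  const { requiredTypes = ['date'], minConfidence = 0.7 } = options;
  const entities = result.extracted.entities;
  const errors = [];
  const warnings = [];

  // A missing or empty required entity type is an error.
  for (const type of requiredTypes) {
    if (!entities[type] || entities[type].length === 0) {
      errors.push(`Missing required entity: ${type}`);
    }
  }

  // A low confidence score is a warning, not a failure.
  for (const [type, group] of Object.entries(entities)) {
    for (const entity of group) {
      if (typeof entity.confidence === 'number' && entity.confidence < minConfidence) {
        warnings.push(`Low confidence (${entity.confidence}) for ${type}: ${entity.value}`);
      }
    }
  }

  return { errors, warnings, isValid: errors.length === 0 };
}

// Hypothetical extraction result with one low-confidence value.
const sample = {
  extracted: {
    entities: {
      invoice_number: [{ value: 'INV-123456', confidence: 0.98 }],
      due_date: [{ value: '2024-04-15', confidence: 0.55 }]
    }
  }
};

const report = validateExtraction(sample, { requiredTypes: ['invoice_number', 'due_date'] });
console.log(report.isValid);  // true
console.log(report.warnings); // [ 'Low confidence (0.55) for due_date: 2024-04-15' ]
```

Making requiredTypes configurable also matters when custom entity types are in use: a document extracted with due_date rather than date would otherwise always fail the hard-coded date check.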

Output Format

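Example output structure (abridged: the position field on each entity and the concepts and raw sections produced by formatOutput are omitted):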

{
  "document": {
    "filename": "invoice_2024_001.pdf",
    "pages": 2,
    "processedAt": "2024-03-15T10:30:00Z"
  },
  "extracted": {
    "entities": {
      "invoice_number": [
        { "value": "INV-123456", "confidence": 0.98 }
      ],
      "customer_name": [
        { "value": "Acme Corp", "confidence": 0.95 }
      ],
      "total_amount": [
        { "value": "$1,234.56", "confidence": 0.99 }
      ],
      "due_date": [
        { "value": "2024-04-15", "confidence": 0.97 }
      ]
    },
    "relationships": [
      {
        "source": "Acme Corp",
        "type": "owes",
        "target": "$1,234.56"
      }
    ]
  }
}
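
Since the stated goal is a structured database, it may help to see how this nested output could be flattened into one row per entity before saving. The sketch below is one possible approach, not a toolkit feature: toEntityRows and the row field names are hypothetical, and the input is a shortened version of the example output above.

```javascript
// Flatten the nested output into one row per extracted entity.
function toEntityRows(output) {
  const rows = [];
  for (const [type, group] of Object.entries(output.extracted.entities)) {
    for (const entity of group) {
      rows.push({
        filename: output.document.filename,
        type,
        value: entity.value,
        confidence: entity.confidence
      });
    }
  }
  return rows;
}

// Shortened version of the example output shown above.
const exampleOutput = {
  document: { filename: 'invoice_2024_001.pdf', pages: 2 },
  extracted: {
    entities: {
      invoice_number: [{ value: 'INV-123456', confidence: 0.98 }],
      total_amount: [{ value: '$1,234.56', confidence: 0.99 }]
    },
    relationships: []
  }
};

const rows = toEntityRows(exampleOutput);
console.log(rows.length);  // 2
console.log(rows[0].type); // invoice_number
```

Rows in this shape map directly onto a single table keyed by filename and entity type, which keeps the schema stable when new entity types are added.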

Next Steps