GAIK Toolkit

Parser

Document parsing and processing component

Overview

The Parser component parses documents in several file formats through a single interface. For each document it extracts the text content, the metadata, and structural information.

Features

  • Multi-format Support: PDF, DOCX, TXT, and more
  • Automatic Encoding Detection: Handles various text encodings automatically
  • Metadata Extraction: Extracts document metadata and properties
  • Structure Preservation: Maintains document structure during parsing
  • Error Handling: Rejects with a coded error (for example UNSUPPORTED_FORMAT or CORRUPTED_FILE) when a file is unsupported, corrupted, or invalid
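The page does not describe how encoding detection works internally. As a rough illustration of one common approach, the sketch below checks a buffer for a leading byte order mark (BOM). The function name detectEncodingFromBom is hypothetical and is not part of the documented Parser API.

```javascript
// Hypothetical helper, not part of the documented Parser API.
// Returns an encoding name based on a leading byte order mark (BOM),
// or the fallback when no BOM is present.
// Limitation: a UTF-32LE BOM (FF FE 00 00) would be reported as 'utf-16le'.
function detectEncodingFromBom(buffer, fallback = 'utf-8') {
  if (buffer.length >= 3 && buffer[0] === 0xef && buffer[1] === 0xbb && buffer[2] === 0xbf) {
    return 'utf-8';
  }
  if (buffer.length >= 2 && buffer[0] === 0xff && buffer[1] === 0xfe) {
    return 'utf-16le';
  }
  if (buffer.length >= 2 && buffer[0] === 0xfe && buffer[1] === 0xff) {
    return 'utf-16be';
  }
  return fallback;
}
```

Files without a BOM need heuristics or an explicit encoding option; this sketch simply returns the fallback in that case.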

Basic Usage

import { Parser } from 'gaik-toolkit';

// Create a parser instance
const parser = new Parser();

// Parse a document
const result = await parser.parse('path/to/document.pdf');

console.log(result.content);
console.log(result.metadata);

API Reference

Parser

Main parser class for document processing.

Constructor

const parser = new Parser(options);

Options:

  • format (string): Specific format to parse (auto-detected if not provided)
  • encoding (string): Text encoding (default: 'utf-8')
  • preserveFormatting (boolean): Preserve original formatting (default: true)
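To make the defaults concrete, here is a minimal sketch of how caller options could be merged over the defaults listed above. The names DEFAULT_PARSER_OPTIONS and resolveOptions are hypothetical; the actual constructor may resolve options differently.

```javascript
// Defaults as listed in the options above. `format` stays undefined so that
// a parser could fall back to auto-detection.
const DEFAULT_PARSER_OPTIONS = Object.freeze({
  format: undefined,
  encoding: 'utf-8',
  preserveFormatting: true,
});

// Hypothetical helper: copies the defaults, then applies every caller option
// that is not undefined, so { encoding: undefined } keeps the default.
function resolveOptions(options = {}) {
  const resolved = { ...DEFAULT_PARSER_OPTIONS };
  for (const [key, value] of Object.entries(options)) {
    if (value !== undefined) {
      resolved[key] = value;
    }
  }
  return resolved;
}
```

With this sketch, resolveOptions({ preserveFormatting: false }) keeps encoding at 'utf-8' and leaves format undefined.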

Methods

parse(filePath: string): Promise<ParseResult>

Parses a document from the given file path.

Parameters:

  • filePath: Path to the document to parse

Returns: Promise resolving to ParseResult object

Example:

const result = await parser.parse('document.pdf');

parseBuffer(buffer: Buffer, format?: string): Promise<ParseResult>

Parses a document from a buffer.

Parameters:

  • buffer: Document content as Buffer
  • format: Optional format hint

Returns: Promise resolving to ParseResult object

ParseResult

Result object returned by parse methods.

Properties:

  • content (string): Extracted text content
  • metadata (object): Document metadata
  • structure (object): Document structure information
  • pages (array): Page-by-page content (for multi-page documents)
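The property list above can be written down as a JSDoc type and consumed as in the following sketch. The typedef is inferred from the list only: the element type of pages and the keys inside metadata and structure are not specified on this page. The helper summarizeResult is hypothetical.

```javascript
/**
 * Assumed shape, inferred from the property list above.
 * @typedef {Object} ParseResult
 * @property {string} content    Extracted text content
 * @property {Object} metadata   Document metadata (keys not specified here)
 * @property {Object} structure  Document structure information
 * @property {Array} [pages]     Page-by-page content, if the document has pages
 */

// Hypothetical helper: builds a small summary from a ParseResult-like object.
function summarizeResult(result) {
  const text = result.content ?? '';
  const trimmed = text.trim();
  return {
    characters: text.length,
    words: trimmed === '' ? 0 : trimmed.split(/\s+/).length,
    pageCount: Array.isArray(result.pages) ? result.pages.length : 0,
    metadataKeys: Object.keys(result.metadata ?? {}),
  };
}
```

The helper tolerates a missing pages array, since the list above describes pages only for multi-page documents.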

Advanced Examples

Parsing with Custom Options

const parser = new Parser({
  preserveFormatting: true,
  encoding: 'utf-8'
});

const result = await parser.parse('document.pdf');

Handling Different Formats

// Parse PDF
const pdfResult = await parser.parse('document.pdf');

// Parse DOCX
const docxResult = await parser.parse('document.docx');

// Parse TXT
const txtResult = await parser.parse('document.txt');
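The calls above rely on format auto-detection. One plausible way to do this, shown here only as a sketch, is to look at the file extension. The helper guessFormatFromPath and the format identifiers 'docx' and 'txt' are assumptions; only 'pdf' appears as a format hint elsewhere on this page.

```javascript
// Formats named in the feature list; the identifier strings are assumed.
const KNOWN_EXTENSIONS = new Set(['pdf', 'docx', 'txt']);

// Hypothetical helper: returns a lowercase format name derived from the file
// extension, or undefined when the extension is missing or not recognized.
function guessFormatFromPath(filePath) {
  const fileName = filePath.split(/[\\/]/).pop();
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) {
    return undefined; // no extension, or a dotfile such as ".env"
  }
  const extension = fileName.slice(dot + 1).toLowerCase();
  return KNOWN_EXTENSIONS.has(extension) ? extension : undefined;
}
```

Extension checks are cheap but easy to fool, so a real implementation would likely also inspect the file contents.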

Working with Buffers

import { readFileSync } from 'node:fs';

// Read the whole file into memory as a Buffer
const buffer = readFileSync('document.pdf');

// Pass 'pdf' as an explicit format hint
const result = await parser.parseBuffer(buffer, 'pdf');
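When a buffer arrives without a file name, the format hint has to come from the bytes themselves. The sketch below checks two well-known file signatures: PDF files start with "%PDF-", and DOCX files are ZIP containers that start with the bytes 50 4B 03 04. The helper sniffFormat is hypothetical and is not part of the documented API.

```javascript
// Hypothetical helper: guesses a format hint from the first bytes of a buffer.
function sniffFormat(buffer) {
  const pdfMagic = Buffer.from('%PDF-', 'latin1');
  const zipMagic = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

  if (buffer.subarray(0, pdfMagic.length).equals(pdfMagic)) {
    return 'pdf';
  }
  // A ZIP signature only proves the file is a ZIP container. Treating it as
  // DOCX is an assumption; other ZIP-based formats share the same signature.
  if (buffer.subarray(0, zipMagic.length).equals(zipMagic)) {
    return 'docx';
  }
  return undefined;
}
```

The returned value could be passed as the second argument to parseBuffer, leaving it undefined when no signature matches.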

Error Handling

try {
  const result = await parser.parse('document.pdf');
} catch (error) {
  if (error.code === 'UNSUPPORTED_FORMAT') {
    console.error('File format not supported');
  } else if (error.code === 'CORRUPTED_FILE') {
    console.error('File is corrupted or invalid');
  } else {
    console.error('Parsing failed:', error.message);
  }
}
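If the same error handling is needed in several places, the branching above could be factored into helpers along these lines. Both describeParseError and tryParse are hypothetical names; the sketch assumes only the two error codes shown above and the documented parse method.

```javascript
// Messages for the two error codes shown in the example above.
const PARSE_ERROR_MESSAGES = new Map([
  ['UNSUPPORTED_FORMAT', 'File format not supported'],
  ['CORRUPTED_FILE', 'File is corrupted or invalid'],
]);

// Hypothetical helper: turns a parse error into a human-readable message.
function describeParseError(error) {
  const known = PARSE_ERROR_MESSAGES.get(error?.code);
  if (known !== undefined) {
    return known;
  }
  return `Parsing failed: ${error?.message ?? 'unknown error'}`;
}

// Hypothetical wrapper: resolves to a tagged result instead of throwing.
async function tryParse(parser, filePath) {
  try {
    return { ok: true, result: await parser.parse(filePath) };
  } catch (error) {
    return { ok: false, reason: describeParseError(error) };
  }
}
```

Callers can then branch on the ok flag and log reason, instead of repeating the try/catch block around every parse call.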

Next Steps