Parser

Overview

The Parser component parses documents in several formats and extracts their content. It exposes a single interface for all supported file types, so the same calls work for each format.

Features

Multi-format Support: PDF, DOCX, TXT, and more
Automatic Encoding Detection: Handles various text encodings automatically
Metadata Extraction: Extracts document metadata and properties
Structure Preservation: Maintains document structure during parsing
Error Handling: Reports unsupported formats and corrupted or invalid files through error codes

Basic Usage

import { Parser } from 'gaik-toolkit';

// Create a parser instance
const parser = new Parser();

// Parse a document
const result = await parser.parse('path/to/document.pdf');

console.log(result.content);
console.log(result.metadata);

API Reference

`Parser`

Main parser class for document processing.

Constructor

const parser = new Parser(options);

Options:

format (string): Specific format to parse (auto-detected if not provided)
encoding (string): Text encoding (default: 'utf-8')
preserveFormatting (boolean): Preserve original formatting (default: true)
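
The defaults listed above can be modeled as a plain object merge. This is a minimal sketch: `DEFAULT_OPTIONS` and `resolveOptions` are hypothetical names used for illustration, not part of the toolkit, and the real constructor may resolve its options differently.

```javascript
// Hypothetical model of the documented defaults (not the toolkit's own code).
const DEFAULT_OPTIONS = Object.freeze({
  format: undefined,        // undefined stands for "auto-detect"
  encoding: 'utf-8',
  preserveFormatting: true,
});

function resolveOptions(options = {}) {
  // Later spreads win, so caller-supplied values override the defaults.
  return { ...DEFAULT_OPTIONS, ...options };
}

// Only preserveFormatting is overridden; encoding keeps its default.
console.log(resolveOptions({ preserveFormatting: false }));
```

Under this model, any option that is omitted keeps its documented default.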

Methods

`parse(filePath: string): Promise<ParseResult>`

Parses a document from the given file path.

Parameters:

filePath: Path to the document to parse

Returns: A Promise that resolves to a ParseResult object

Example:

const result = await parser.parse('document.pdf');

`parseBuffer(buffer: Buffer, format?: string): Promise<ParseResult>`

Parses a document from a buffer.

Parameters:

buffer: Document content as Buffer
format: Optional format hint

Returns: A Promise that resolves to a ParseResult object
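
When a buffer originally came from a named file, the format hint can be derived from the file extension. The sketch below assumes that hints are lowercase extension names such as 'pdf', as in the buffer example elsewhere on this page. `formatHintFromPath` is a hypothetical helper, not a toolkit function.

```javascript
// Hypothetical helper: derives a format hint such as 'pdf' from a file path.
function formatHintFromPath(filePath) {
  // Keep only the file name, dropping POSIX or Windows directory parts.
  const name = filePath.split(/[\\/]/).pop();
  const dot = name.lastIndexOf('.');
  // No extension, a dotfile such as ".env", or a trailing dot:
  // return undefined so that format auto-detection can take over.
  if (dot <= 0 || dot === name.length - 1) return undefined;
  return name.slice(dot + 1).toLowerCase();
}

console.log(formatHintFromPath('reports/Q1.PDF'));
console.log(formatHintFromPath('notes'));
```

The returned value can be passed as the second argument to `parseBuffer`; because `format` is optional, an undefined hint leaves detection to the parser.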

`ParseResult`

Result object returned by parse methods.

Properties:

content (string): Extracted text content
metadata (object): Document metadata
structure (object): Document structure information
pages (array): Page-by-page content (for multi-page documents)
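
The property list above can be written down as a JSDoc type and consumed defensively. This is a sketch under assumptions: the source gives only the top-level types, so the contents of `metadata`, `structure`, and `pages` are not assumed here, and `summarizeResult` is a hypothetical helper.

```javascript
/**
 * Shape taken from the documented property list; nested contents are unspecified.
 * @typedef {Object} ParseResult
 * @property {string} content
 * @property {object} metadata
 * @property {object} structure
 * @property {Array} pages
 */

// Hypothetical helper: builds a small summary from a ParseResult-like object.
function summarizeResult(result) {
  const content = result.content ?? '';
  // Treat a missing or non-array pages field as "no pages".
  const pages = Array.isArray(result.pages) ? result.pages : [];
  return {
    characters: content.length,
    pageCount: pages.length,
    metadataKeys: Object.keys(result.metadata ?? {}),
  };
}

// Stand-in object for illustration; real values would come from parser.parse().
const sample = {
  content: 'Hello world',
  metadata: { title: 'Demo' },
  structure: {},
  pages: ['Hello world'],
};
console.log(summarizeResult(sample));
```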

Advanced Examples

Parsing with Custom Options

// Both values below match the documented defaults; they are set explicitly for illustration
const parser = new Parser({
  preserveFormatting: true,
  encoding: 'utf-8'
});

const result = await parser.parse('document.pdf');

Handling Different Formats

// Parse PDF
const pdfResult = await parser.parse('document.pdf');

// Parse DOCX
const docxResult = await parser.parse('document.docx');

// Parse TXT
const txtResult = await parser.parse('document.txt');
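
Because every format goes through the same `parse` call, a mixed batch of files can be handled in one pass. The sketch below assumes that a single parser instance can serve concurrent `parse` calls, which the source does not state. `parseAll` is a hypothetical helper that accepts any object with a `parse(filePath)` method.

```javascript
// Hypothetical helper: parses several paths concurrently and keeps going
// when some of them fail, pairing each outcome with its file path.
async function parseAll(parser, filePaths) {
  const settled = await Promise.allSettled(
    filePaths.map((filePath) => parser.parse(filePath))
  );
  return settled.map((outcome, i) =>
    outcome.status === 'fulfilled'
      ? { filePath: filePaths[i], ok: true, result: outcome.value }
      : { filePath: filePaths[i], ok: false, error: outcome.reason }
  );
}
```

A call such as `await parseAll(parser, ['document.pdf', 'document.docx', 'document.txt'])` would return one entry per file, in input order. If concurrent use turns out to be unsupported, a plain `for` loop with `await` gives the same result shape sequentially.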

Working with Buffers

import { readFileSync } from 'node:fs';

// Read the whole file into a Buffer
const buffer = readFileSync('document.pdf');

const result = await parser.parseBuffer(buffer, 'pdf');

Error Handling

try {
  const result = await parser.parse('document.pdf');
  console.log(result.content);
} catch (error) {
  if (error.code === 'UNSUPPORTED_FORMAT') {
    console.error('File format not supported');
  } else if (error.code === 'CORRUPTED_FILE') {
    console.error('File is corrupted or invalid');
  } else {
    console.error('Parsing failed:', error.message);
  }
}
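
The branching above can be factored into a reusable helper so that callers receive a tagged outcome instead of an exception. This is a sketch: only the two error codes shown in the example are assumed to exist, and `describeParseError` and `safeParse` are hypothetical names, not toolkit functions.

```javascript
// Codes taken from the example above; any other error takes the generic branch.
const MESSAGES = {
  UNSUPPORTED_FORMAT: 'File format not supported',
  CORRUPTED_FILE: 'File is corrupted or invalid',
};

function describeParseError(error) {
  const code = error && error.code;
  if (Object.prototype.hasOwnProperty.call(MESSAGES, code)) {
    return MESSAGES[code];
  }
  // Thrown values are not guaranteed to be Error instances.
  const message = error instanceof Error ? error.message : String(error);
  return `Parsing failed: ${message}`;
}

// Hypothetical wrapper: accepts any object with a parse(filePath) method.
async function safeParse(parser, filePath) {
  try {
    return { ok: true, result: await parser.parse(filePath) };
  } catch (error) {
    return { ok: false, message: describeParseError(error) };
  }
}
```

Callers can then check `outcome.ok` and read either `outcome.result` or `outcome.message`, which keeps the code-to-message mapping in one place.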

Next Steps

Learn about Knowledge Capture
Explore Examples
Check out the Extraction Tool Example
