Parser
Document parsing and processing component
Overview
The Parser component is responsible for parsing and processing various document formats. It provides a unified interface for handling different file types and extracting their content.
Features
- Multi-format Support: PDF, DOCX, TXT, and more
- Automatic Encoding Detection: Handles various text encodings automatically
- Metadata Extraction: Extracts document metadata and properties
- Structure Preservation: Maintains document structure during parsing
- Error Handling: Robust error handling for corrupted or invalid files
Basic Usage
import { Parser } from 'gaik-toolkit';
// Create a parser instance
const parser = new Parser();
// Parse a document
const result = await parser.parse('path/to/document.pdf');
console.log(result.content);
console.log(result.metadata);
API Reference
Parser
Main parser class for document processing.
Constructor
const parser = new Parser(options);
Options:
format (string): Specific format to parse (auto-detected if not provided)
encoding (string): Text encoding (default: 'utf-8')
preserveFormatting (boolean): Preserve original formatting (default: true)
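As a rough illustration of how the documented defaults combine with caller-supplied options, here is a minimal sketch. `resolveOptions` and `DEFAULT_OPTIONS` are hypothetical names introduced for this example; they are not part of gaik-toolkit, and the library may resolve its options differently.

```javascript
// Defaults as documented above. `format` has no default because it is
// auto-detected when omitted.
const DEFAULT_OPTIONS = Object.freeze({
  encoding: 'utf-8',
  preserveFormatting: true,
});

// Hypothetical helper (not a gaik-toolkit API): merges caller options over
// the documented defaults and checks the documented types.
function resolveOptions(options = {}) {
  const resolved = { ...DEFAULT_OPTIONS, ...options };
  if (resolved.format !== undefined && typeof resolved.format !== 'string') {
    throw new TypeError('format must be a string when provided');
  }
  if (typeof resolved.encoding !== 'string') {
    throw new TypeError('encoding must be a string');
  }
  if (typeof resolved.preserveFormatting !== 'boolean') {
    throw new TypeError('preserveFormatting must be a boolean');
  }
  return resolved;
}

console.log(resolveOptions({ format: 'pdf' }));
```

The resolved object could then be passed to `new Parser(...)`; validating up front mainly helps surface typos in option values before any file is read.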
Methods
parse(filePath: string): Promise<ParseResult>
Parses a document from the given file path.
Parameters:
filePath: Path to the document to parse
Returns: Promise resolving to ParseResult object
Example:
const result = await parser.parse('document.pdf');
parseBuffer(buffer: Buffer, format?: string): Promise<ParseResult>
Parses a document from a buffer.
Parameters:
buffer: Document content as Buffer
format: Optional format hint
Returns: Promise resolving to ParseResult object
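When the buffer comes from an upload or a download, the original file name is often the only clue to its format. The sketch below derives a format hint from a file name and passes it to `parseBuffer`. `formatHintFromName` and `parseNamedBuffer` are hypothetical helpers, and the assumption that hints are lowercase extensions (as with `'pdf'`) is not confirmed by this reference.

```javascript
// Hypothetical helper: derive a format hint such as 'pdf' from a file name.
// Assumes hints are lowercase file extensions; only 'pdf' appears in the docs.
function formatHintFromName(fileName) {
  const base = fileName.split(/[\\/]/).pop();
  const dot = base.lastIndexOf('.');
  // No dot, a leading dot only (".env"), or a trailing dot means no usable extension.
  if (dot <= 0 || dot === base.length - 1) return undefined;
  return base.slice(dot + 1).toLowerCase();
}

// Works with any object exposing parseBuffer(buffer, format), such as a Parser instance.
// When no extension is found, the hint is undefined and auto-detection is left to the parser.
async function parseNamedBuffer(parser, buffer, fileName) {
  return parser.parseBuffer(buffer, formatHintFromName(fileName));
}

console.log(formatHintFromName('reports/Q3.PDF')); // pdf
```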
ParseResult
Result object returned by parse methods.
Properties:
content (string): Extracted text content
metadata (object): Document metadata
structure (object): Document structure information
pages (array): Page-by-page content (for multi-page documents)
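To show how these properties might be consumed together, here is a small sketch that summarizes a ParseResult-shaped object. `summarizeResult` is a hypothetical helper. Because the reference lists `pages` only for multi-page documents, the sketch treats it as optional.

```javascript
// Hypothetical helper: builds a short summary from a ParseResult-shaped object.
function summarizeResult(result) {
  const content = typeof result.content === 'string' ? result.content : '';
  const trimmed = content.trim();
  return {
    characters: content.length,
    words: trimmed === '' ? 0 : trimmed.split(/\s+/).length,
    // `pages` is documented for multi-page documents, so it may be missing.
    pageCount: Array.isArray(result.pages) ? result.pages.length : 0,
    metadataKeys: Object.keys(result.metadata ?? {}),
  };
}

// Hand-written sample object standing in for a real parse result.
const sample = {
  content: 'Hello parser world',
  metadata: { title: 'Demo' },
  structure: {},
  pages: ['Hello parser world'],
};
console.log(summarizeResult(sample));
```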
Advanced Examples
Parsing with Custom Options
const parser = new Parser({
  preserveFormatting: true,
  encoding: 'utf-8'
});
const result = await parser.parse('document.pdf');
Handling Different Formats
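Because `parse` takes a path regardless of format, a mixed set of files can be parsed concurrently. The sketch below uses the standard `Promise.allSettled` so that one failing file does not discard the others. `parseAll` is a hypothetical helper, and `stubParser` is a stand-in used only to keep the example runnable without the library.

```javascript
// Hypothetical helper: parse several files concurrently and keep per-file outcomes.
// `parser` is any object exposing parse(filePath), such as a Parser instance.
async function parseAll(parser, filePaths) {
  // The async wrapper also turns synchronous throws into rejections.
  const settled = await Promise.allSettled(
    filePaths.map(async (filePath) => parser.parse(filePath))
  );
  return settled.map((outcome, i) =>
    outcome.status === 'fulfilled'
      ? { filePath: filePaths[i], ok: true, result: outcome.value }
      : { filePath: filePaths[i], ok: false, error: outcome.reason }
  );
}

// Stand-in parser for demonstration; replace with `new Parser()` from gaik-toolkit.
const stubParser = {
  async parse(filePath) {
    if (!/\.(pdf|docx|txt)$/i.test(filePath)) {
      const error = new Error(`Unsupported: ${filePath}`);
      error.code = 'UNSUPPORTED_FORMAT';
      throw error;
    }
    return { content: `text of ${filePath}`, metadata: {} };
  },
};

parseAll(stubParser, ['document.pdf', 'document.docx', 'notes.xyz']).then((outcomes) => {
  for (const o of outcomes) {
    console.log(o.filePath, o.ok ? 'parsed' : `failed (${o.error.code})`);
  }
});
```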
// Parse PDF
const pdfResult = await parser.parse('document.pdf');
// Parse DOCX
const docxResult = await parser.parse('document.docx');
// Parse TXT
const txtResult = await parser.parse('document.txt');
Working with Buffers
// ES module import, consistent with the import syntax and top-level await used in these examples
import fs from 'fs';
const buffer = fs.readFileSync('document.pdf');
const result = await parser.parseBuffer(buffer, 'pdf');
Error Handling
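The inline try/catch in this section can also be wrapped in a reusable helper. The following is a minimal sketch built on the two error codes this section names, `UNSUPPORTED_FORMAT` and `CORRUPTED_FILE`. `describeParseError` and `tryParse` are hypothetical names, and other error codes may exist that this reference does not list.

```javascript
// Messages for the two error codes named in this section.
const ERROR_MESSAGES = new Map([
  ['UNSUPPORTED_FORMAT', 'File format not supported'],
  ['CORRUPTED_FILE', 'File is corrupted or invalid'],
]);

// Hypothetical helper: turn a thrown value into a user-facing message.
function describeParseError(error) {
  const known = ERROR_MESSAGES.get(error?.code);
  return known ?? `Parsing failed: ${error?.message ?? String(error)}`;
}

// Hypothetical helper: returns an outcome object instead of throwing.
// `parser` is any object exposing parse(filePath), such as a Parser instance.
async function tryParse(parser, filePath) {
  try {
    return { ok: true, result: await parser.parse(filePath) };
  } catch (error) {
    return { ok: false, message: describeParseError(error) };
  }
}
```

A caller can then branch on `ok` rather than repeating the code checks at every call site.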
try {
  const result = await parser.parse('document.pdf');
} catch (error) {
  if (error.code === 'UNSUPPORTED_FORMAT') {
    console.error('File format not supported');
  } else if (error.code === 'CORRUPTED_FILE') {
    console.error('File is corrupted or invalid');
  } else {
    console.error('Parsing failed:', error.message);
  }
}
Next Steps
- Learn about Knowledge Capture
- Explore Examples
- Check out the Extraction Tool Example