OJT Parser

Table of Contents

  1. Overview
  2. Getting Started
  3. Input Data Requirements
  4. How to Prompt the Agent
  5. Example Usage
  6. Best Practices
  7. Troubleshooting
  8. FAQ

Overview

The OJT Data Parser Agent extracts structured information from On-the-Job Training (OJT) documents. It produces two main categories of information:

  • Project (OJT) Summary Information: Comprehensive metadata about the training program
  • Project (OJT) Curriculum Tasks: Detailed task breakdowns including steps, standards, and learning objectives

Key Capabilities

  • Extracts structured data from various document formats
  • Parses complex OJT task analysis tables
  • Generates CSV files for easy data import and manipulation
  • Displays extracted data in readable markdown table format
  • Preserves data integrity by extracting only information that is present in the source document

Getting Started

Prerequisites

  • Access to OJT training documents (PDF, Word, or text formats)

Input Data Requirements

Document Format

The agent accepts documents containing:

  1. Training Module Information

    • Module title and descriptions
    • Company and location details
    • Personnel assignments (approver, assignee, lead)
    • Timeline information
  2. Task Analysis Tables

    • Main task listings
    • OJT hours allocations
    • Task breakdown structures including:
      • Task steps
      • Key learning points
      • Standards and requirements
      • Required skills and knowledge
      • Training guidelines

Expected Structure

Documents should ideally contain:

  • Clear section headers (e.g., "MODULE", "MAIN TASKS", "OJT HOURS")
  • Structured tables for task analysis
  • Consistent formatting for dates and durations

How to Prompt the Agent

Basic Prompt Structure

Please extract OJT information from the uploaded document [filename]

Advanced Prompting Tips

  1. Specific Extraction: Request focus on particular sections

    Extract only the curriculum tasks from the OJT document
  2. Multiple Documents: Process several documents in one request

    Parse all OJT modules in the uploaded training materials
  3. Validation Requests: Verify extraction completeness

    Extract and validate all task analysis tables from the document

Example Usage

Example 1: Complete OJT Document Extraction

Input Document Content:

MODULE: Equipment Maintenance Training
COMPANY: TechCorp Industries
LOCATION: Singapore
APPROVER: John Smith
ASSIGNEE: Jane Doe
START DATE: 2024-01-15
END DATE: 2024-03-15

MAIN TASKS:
1. Preventive Maintenance - 40 hours
2. Troubleshooting - 60 hours

ON-THE-JOB TRAINING TASK ANALYSIS:
S/N || Main Task || Steps || Learning Points || Standards
1 || Preventive Maintenance || Check oil levels || Understanding viscosity || ISO 9001
2 || Troubleshooting || Diagnose issues || Problem identification || Company SOP

Expected Output:

  • Project Summary CSV with all metadata fields
  • Project Tasks CSV with formatted task descriptions
  • Markdown tables for immediate viewing
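
To make the relationship between the sample input and the two CSV outputs concrete, here is a minimal sketch of how the same content could be turned into a summary record and a semicolon-delimited task table. It is not the agent's implementation: the function names (`parse_summary`, `parse_task_table`, `to_semicolon_csv`) are hypothetical, and the output columns simply reuse the headers from the sample, since the agent's actual CSV column names are not specified here.

```python
import csv
import io

SAMPLE_DOCUMENT = """\
MODULE: Equipment Maintenance Training
COMPANY: TechCorp Industries
LOCATION: Singapore
APPROVER: John Smith
ASSIGNEE: Jane Doe
START DATE: 2024-01-15
END DATE: 2024-03-15

MAIN TASKS:
1. Preventive Maintenance - 40 hours
2. Troubleshooting - 60 hours

ON-THE-JOB TRAINING TASK ANALYSIS:
S/N || Main Task || Steps || Learning Points || Standards
1 || Preventive Maintenance || Check oil levels || Understanding viscosity || ISO 9001
2 || Troubleshooting || Diagnose issues || Problem identification || Company SOP
"""


def parse_summary(text):
    """Collect 'KEY: value' lines into a dict, skipping bare section headers."""
    summary = {}
    for line in text.splitlines():
        if "||" in line or ":" not in line:
            continue  # table rows and plain text carry no metadata
        key, value = line.split(":", 1)
        if value.strip():  # 'MAIN TASKS:' has no value on its own line
            summary[key.strip()] = value.strip()
    return summary


def parse_task_table(text):
    """Turn the '||'-delimited rows into a list of dicts keyed by the header row."""
    table_lines = [line for line in text.splitlines() if "||" in line]
    header = [cell.strip() for cell in table_lines[0].split("||")]
    return [
        dict(zip(header, (cell.strip() for cell in line.split("||"))))
        for line in table_lines[1:]
    ]


def to_semicolon_csv(rows):
    """Serialize row dicts as semicolon-delimited CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


summary = parse_summary(SAMPLE_DOCUMENT)
tasks = parse_task_table(SAMPLE_DOCUMENT)
csv_text = to_semicolon_csv(tasks)
print(summary)
print(csv_text)
```

Comparing a hand-built result like this against the agent's output is one way to review an extraction on a small document.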

Example 2: Task-Focused Extraction

Prompt:

Extract only the curriculum tasks with their detailed breakdown from the OJT document

Output Focus:

  • Detailed task_description_table with proper CSV formatting
  • Preserved table structure with || delimiters
  • All columns and rows maintained

Best Practices

Document Preparation

  1. Ensure Clear Structure: Use consistent headers and formatting
  2. Complete Tables: Include all columns even if empty
  3. Date Formatting: Use YYYY-MM-DD format for dates
  4. Duration Units: Clearly specify hours, days, or weeks
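
The date and duration guidelines above can be checked before a document is uploaded. The sketch below is one possible pre-flight check, not part of the agent; `is_iso_date` and `parse_duration` are hypothetical helper names, and the set of accepted units (hours, days, weeks) is taken from the guideline above.

```python
import re
from datetime import datetime

# Four digits, two digits, two digits: rejects forms such as 2024-1-5.
ISO_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

# A number followed by hour(s), day(s), or week(s).
DURATION_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(hour|day|week)s?", re.IGNORECASE)


def is_iso_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE_RE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, impossible date (e.g. 2024-02-30)
    return True


def parse_duration(value):
    """Return (amount, unit) for strings like '40 hours', or None if no unit is given."""
    match = DURATION_RE.fullmatch(value.strip())
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


print(is_iso_date("2024-01-15"))
print(parse_duration("40 hours"))
```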

Prompt Optimization

  1. Be Specific: Clearly state what information you need extracted
  2. Provide Context: Mention document type and expected content
  3. Request Validation: Ask for confirmation of extracted fields

Data Handling

  1. Review Output: Always verify extracted data against source
  2. Check CSV Format: Confirm that fields are separated by semicolons
  3. Validate Completeness: Confirm all expected fields are present
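
The completeness check can be partly automated. The following is a minimal sketch that reports which expected columns are absent from a summary CSV and which cells are empty. The field names in `EXPECTED_FIELDS` are hypothetical placeholders; substitute the column names that actually appear in your Project Summary CSV.

```python
import csv
import io

# Hypothetical column names, used only for illustration.
EXPECTED_FIELDS = [
    "module", "company", "location", "approver",
    "assignee", "start_date", "end_date",
]


def find_missing_fields(csv_text, expected=EXPECTED_FIELDS):
    """Return (absent_columns, empty_cells) for semicolon-delimited CSV text.

    empty_cells is a list of (data_row_number, field) pairs, counting data
    rows from 1 and excluding the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    header = reader.fieldnames or []
    absent_columns = [field for field in expected if field not in header]
    empty_cells = []
    for row_number, row in enumerate(reader, start=1):
        for field in expected:
            if field in header and not (row.get(field) or "").strip():
                empty_cells.append((row_number, field))
    return absent_columns, empty_cells


sample = (
    "module;company;location;approver;assignee;start_date\n"
    "Equipment Maintenance Training;TechCorp Industries;Singapore;John Smith;;2024-01-15\n"
)
absent, empty = find_missing_fields(sample)
print(absent, empty)
```

An empty cell is not necessarily an error, since the agent leaves a field blank when the source does not contain it; the report only shows where to look in the source document.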

Troubleshooting

Common Issues and Solutions

Issue 1: Missing Fields in Output

Symptom: Some expected fields are empty in the CSV.

Solution:

  • Verify the field exists in the source document
  • Check for alternative field names or variations
  • Ensure document quality is sufficient for parsing

Issue 2: Malformed Task Description Tables

Symptom: Task analysis tables are not properly formatted.

Solution:

  • Ensure source tables use consistent delimiters
  • Check for merged cells or irregular formatting
  • Manually clean the source document if necessary

Issue 3: CSV Export Fails

Symptom: Generated CSV files cannot be opened or imported.

Solution:

  • Verify delimiter consistency (semicolon)
  • Check for unescaped quotes in data
  • Ensure proper line endings
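
When a CSV will not import, a quick script can narrow down which of the causes above applies. This is a rough diagnostic sketch, assuming the file has already been read into a string; `diagnose_csv` is a hypothetical helper, not a tool shipped with the agent. It flags mixed line endings, rows whose field count differs from the first row, and quoting errors that Python's `csv` module rejects in strict mode.

```python
import csv
import io


def diagnose_csv(text, delimiter=";"):
    """Return a list of human-readable problems found in CSV text."""
    problems = []

    # Files that mix CRLF and LF line endings can trip up some importers.
    crlf = text.count("\r\n")
    lf_only = text.count("\n") - crlf
    if crlf and lf_only:
        problems.append(f"mixed line endings: {crlf} CRLF, {lf_only} LF")

    # strict=True makes the reader raise csv.Error on malformed quoting
    # instead of silently accepting it.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter, strict=True)
    expected_width = None
    try:
        for row in reader:
            if not row:
                continue  # ignore blank lines
            if expected_width is None:
                expected_width = len(row)
            elif len(row) != expected_width:
                problems.append(
                    f"line {reader.line_num}: {len(row)} fields, expected {expected_width}"
                )
    except csv.Error as exc:
        problems.append(f"line {reader.line_num}: {exc}")
    return problems


print(diagnose_csv("a;b;c\n1;2\n"))
```

An empty list means none of these particular checks failed; it does not guarantee that every importer will accept the file.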

Issue 4: Incomplete Extraction

Symptom: Agent misses entire sections of the document.

Solution:

  • Break document into smaller sections
  • Ensure clear section headers
  • Remove complex formatting or graphics

FAQ

Q1: What document formats are supported?

A: The agent primarily works with text-based formats including PDF, Word documents, and plain text files. Complex layouts or image-heavy documents may require preprocessing.

Q2: Can the agent handle multiple OJT modules in one document?

A: Yes, the agent can extract multiple modules. Each will be listed as a separate row in the Project Summary CSV, with corresponding tasks in the Tasks CSV.

Q3: How does the agent handle missing information?

A: The agent leaves fields empty when information is not found. It does not generate or infer missing data, maintaining data integrity.

Q4: What is the maximum document size the agent can process?

A: While there's no strict limit, very large documents (>100 pages) may benefit from being split into sections for optimal processing.

Q5: Can I customize the output format?

A: The agent outputs in CSV format with semicolon delimiters and markdown tables. The CSV structure is fixed but can be post-processed as needed.
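
As one example of post-processing, a tool that expects comma-separated input can be fed a converted copy. The sketch below uses Python's standard `csv` module; `semicolon_to_comma` is a hypothetical helper name, and the sample columns are made up for illustration.

```python
import csv
import io


def semicolon_to_comma(csv_text):
    """Re-serialize semicolon-delimited CSV text as comma-delimited CSV text."""
    reader = csv.reader(io.StringIO(csv_text, newline=""), delimiter=";")
    out = io.StringIO()
    # The writer quotes any field that contains a comma, so values stay intact.
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


converted = semicolon_to_comma("task;standard\nPreventive Maintenance;ISO 9001, Company SOP\n")
print(converted)
```

Going through a CSV reader and writer, rather than replacing `;` with `,` in the raw text, keeps fields that themselves contain commas or semicolons from being split.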

Q6: How accurate is the extraction?

A: Accuracy depends on document quality and structure. Well-formatted documents typically achieve 95%+ accuracy. Always review critical data.

Q7: Can the agent extract from scanned documents?

A: Scanned documents require OCR preprocessing. The agent works best with digital text rather than image-based content.

Q8: How are complex tables with merged cells handled?

A: The agent attempts to preserve table structure but merged cells may cause formatting issues. Consider reformatting complex tables before extraction.

Q9: Is there support for non-English documents?

A: The agent is optimized for English documents. Other languages may work but with reduced accuracy.

Q10: Can I extract specific date ranges or filter results?

A: The agent extracts all available data. Filtering should be done post-extraction using the CSV outputs.
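
A date-range filter over the summary CSV might look like the sketch below. The column name `start_date`, the helper `filter_by_start_date`, and the second sample row are assumptions made for illustration; use the column names from your own output.

```python
import csv
import io
from datetime import date


def filter_by_start_date(csv_text, earliest, latest, column="start_date"):
    """Return rows whose date in `column` falls within [earliest, latest]."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    kept = []
    for row in reader:
        try:
            value = date.fromisoformat(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # column missing, cell missing, or not a YYYY-MM-DD date
        if earliest <= value <= latest:
            kept.append(row)
    return kept


# Made-up sample data for illustration only.
summary_csv = (
    "module;start_date;end_date\n"
    "Equipment Maintenance Training;2024-01-15;2024-03-15\n"
    "Safety Induction;2024-06-01;2024-06-30\n"
)
first_quarter = filter_by_start_date(summary_csv, date(2024, 1, 1), date(2024, 3, 31))
print([row["module"] for row in first_quarter])
```

Rows with an empty or unparseable date are skipped rather than treated as matches, which is consistent with the agent leaving fields empty when the source lacks the information.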