Marktechpost

Allen Institute for AI Released olmOCR: A High-Performance Open Source Toolkit Designed to Convert PDFs and Document Images into Clean and Structured Plain Text

  • The Allen Institute for AI introduced olmOCR, an open-source Python toolkit for converting PDFs into structured text with logical reading order.
  • Traditional OCR tools struggle to extract coherent text from PDFs because the format encodes visual layout rather than logical reading flow.
  • olmOCR uses a 7-billion-parameter vision-language model (VLM), fine-tuned on 260,000 PDF pages, that combines textual and visual data for accurate extraction.
  • Using document anchoring, olmOCR aligns text metadata with visual elements to enhance model accuracy and readability.
  • The toolkit processes one million PDF pages for $190, making it significantly more cost-efficient than systems such as GPT-4o.
  • olmOCR surpasses competitors in accuracy and efficiency, achieving an alignment score of 0.875 and excelling in structured content recognition.
  • In human evaluation, olmOCR received the highest Elo rating among the OCR methods compared, and language models trained on its output improved by 1.3% on benchmark tasks.
  • The system is compatible with inference frameworks like vLLM and SGLang, facilitating deployment across hardware setups.
  • olmOCR's innovation lies in combining textual and image-based analysis for improved extraction accuracy and structured data recognition.
  • The toolkit's cost-effectiveness, high accuracy, and compatibility make it a valuable resource for large-scale document processing and language model training.
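
To put the quoted price in perspective, the sketch below converts the figure of $190 per million pages into an estimate for other corpus sizes. It assumes cost scales linearly with page count at a flat rate, which the summary does not state; real costs would depend on hardware, batching, and document complexity. The function name `cost_for_pages` is hypothetical and is not part of olmOCR.

```python
def cost_for_pages(pages: int, usd_per_million: float = 190.0) -> float:
    """Estimate the USD cost of processing `pages` pages.

    Assumption: a flat rate per million pages (default taken from the
    $190 figure quoted above) and purely linear scaling.
    """
    return pages * usd_per_million / 1_000_000


if __name__ == "__main__":
    # Illustrative corpus sizes only; these are not figures from the article.
    for pages in (10_000, 1_000_000, 25_000_000):
        print(f"{pages:>12,} pages -> ${cost_for_pages(pages):,.2f}")
```

At this rate a single page costs about $0.00019, so a different per-million figure can be compared directly by passing it as `usd_per_million`.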
