OCaRctic: the OCR web app enhanced by snowflake-arctic

Motivation

Genealogy is one of my hobbies, and I often come across archived records, newspaper articles, indexes, and other documents that are typewritten but not computer-searchable. Running a local instance of optical character recognition (OCR) software was an option, but the cumbersome work of editing individual images for focused transcription, and the lack of good downstream tools for correction, kept the project just an idea, until...

Inspiration

For the last few months I have been immersed in the Streamlit ecosystem, sharing components with the community and developing web applications for varied purposes. With those skills in hand, I recently had "an itch to scratch": I wanted to participate in a hackathon.

With the motivation above, and with the release of snowflake-arctic as an interesting open-source LLM within Snowflake and adjacent to the Streamlit ecosystem, I knew I had to jump on the opportunity to start developing the OCR app I had imagined. With the hackathon open... why not now?

What it does

OCaRctic takes images of typewritten text, lets the user select a specific section of interest, and performs optical character recognition on it. An AI-powered correction step then fixes common OCR problems (typos due to character misidentification, odd symbols, etc.). Additionally, the user can chat with the AI assistant to get further clarification on the corrections.
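To make the correction step more concrete, here is a minimal sketch of how raw OCR output could be wrapped in instructions before being sent to an LLM. The function name, the prompt wording, and the `<ocr>` delimiters are all assumptions for illustration, not the prompt OCaRctic actually uses.

```python
def build_correction_prompt(ocr_text: str) -> str:
    """Wrap raw OCR output in instructions for an LLM correction pass.

    Hypothetical helper: the wording and the <ocr> delimiters are
    illustrative assumptions, not the app's real prompt.
    """
    instructions = (
        "The text below came from OCR of a typewritten document. "
        "Fix misread characters and stray symbols, keep the original "
        "wording and line breaks, and do not add new content."
    )
    # Delimiters keep the instructions visibly separate from the noisy text.
    return f"{instructions}\n\n<ocr>\n{ocr_text.strip()}\n</ocr>"


if __name__ == "__main__":
    print(build_correction_prompt("J0hn Sm1th, b. 1887 in 0hio"))
```

Keeping the instructions separate from the OCR text makes it easier to adjust the wording later, for example when validating languages other than English.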

How we built it

OCaRctic uses Streamlit as the base app development platform (including experimental features!), the streamlit-cropper component to select sub-sections of images, pytesseract for OCR, and snowflake-arctic as a corrective and interactive LLM.
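The way these pieces fit together can be sketched roughly as follows. This is a simplified illustration, not the app's actual code: `ocr_region` and `transcribe` are hypothetical names, a plain pixel box stands in for the selection made with streamlit-cropper, and the LLM call is passed in as a generic `correct` callable, since the way snowflake-arctic is invoked is not described here. Running the OCR function requires Pillow, pytesseract, and a local Tesseract install.

```python
from typing import Callable, Tuple

# (left, upper, right, lower) in pixels, as expected by PIL's Image.crop
Box = Tuple[int, int, int, int]


def ocr_region(image, box: Box) -> str:
    """Crop a PIL image to `box` and run Tesseract on the cropped region."""
    # Imported lazily so the rest of the sketch loads without pytesseract.
    import pytesseract

    return pytesseract.image_to_string(image.crop(box))


def transcribe(
    image,
    box: Box,
    correct: Callable[[str], str],
    ocr: Callable[[object, Box], str] = ocr_region,
) -> dict:
    """Run OCR on the selected region, then pass the text to a corrector.

    `correct` is a stand-in for the LLM call (assumption: any function
    that takes raw text and returns corrected text).
    """
    raw = ocr(image, box)
    return {"raw": raw, "corrected": correct(raw)}
```

Returning both the raw and the corrected text lets the user compare them side by side, and passing `ocr` and `correct` in as arguments makes it possible to try the flow without Tesseract or an LLM endpoint.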

Challenges we ran into

Not overloading or over-engineering the app and its features before getting input from potential users.

Accomplishments that we're proud of

Having a simple product with a specific goal that can be easily used by everyone!

What's next for OCaRctic

Inclusion and validation of languages besides English. Assessing the need to perform specific training for language correction and context.

Built With

  • pytesseract
  • python
  • snowflake-arctic
  • st-social-media-links
  • streamlit
  • streamlit-cropper
  • tesseract