Minnur Yunusov, Senior Drupal Developer
March 26, 2024

Do you manage a website or multi-site system for a large enterprise? If so, you almost certainly have a PDF problem, since search engines prioritize HTML over PDF content.

Government agencies, healthcare organizations, universities, and non-profits tend to have a lot of PDF content on their websites. They usually have little choice, as HTML pages can be impractical or incompatible with public reporting requirements.

Chapter Three has developed a solution to the PDF problem: a module called Document OCR. Document OCR converts PDF content to text using optical character recognition (OCR) services like Google Document AI. The OCR service extracts the text from a PDF, and the user then sends that text to OpenAI, which condenses it into a single summary.

We introduced Document OCR in beta in June 2023. Today, Document OCR is fully stable and compatible with Drupal 10. According to Drupal.org, 16 sites and counting are already using it.

[Image: PDF to JSON screenshots]

Who Needs Document OCR?

Any organization that leans heavily on PDF content would benefit from Document OCR. Governments typically use a lot of PDFs on their websites, but many other types of organizations do so as well, including:

  • Universities and colleges
  • Hospitals and other healthcare organizations
  • Court systems and judicial organizations
  • Not-for-profit organizations with reporting requirements
  • Engineering firms and similar companies that house design documents on their websites
  • Human resource departments responsible for maintaining intranets and HR portals

How Does It Work?

The user installs and configures the module, fine-tuning the import process as needed. The AI tool extracts structured data from the selected PDFs and converts it to text. The user then sends the text to OpenAI, which returns a page-long summary of each PDF. For accessibility purposes, the user can also create an audio file of the summary using OpenAI’s audio API.
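To make the order of operations concrete, here is a minimal sketch of the two-stage flow in Python. It is not the module's code (Document OCR is a Drupal module), and every name in it (`process_pdf`, `ProcessedDocument`, `fake_ocr`, `fake_summarize`) is hypothetical. The OCR and summarization stages are stand-in functions; in a real setup they would call an OCR service and a language-model API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProcessedDocument:
    source: str   # path or URL of the PDF
    text: str     # text returned by the OCR stage
    summary: str  # summary returned by the summarization stage


def process_pdf(
    source: str,
    extract_text: Callable[[str], str],
    summarize: Callable[[str], str],
) -> ProcessedDocument:
    """Run the two stages in order: OCR first, then summarization."""
    text = extract_text(source)
    summary = summarize(text)
    return ProcessedDocument(source=source, text=text, summary=summary)


# Stand-in stages (hypothetical). A real setup would call an OCR service
# and a summarization API here instead of returning canned strings.
def fake_ocr(source: str) -> str:
    return f"Extracted text of {source}"


def fake_summarize(text: str) -> str:
    # Truncation stands in for a real summary.
    return text[:20]


doc = process_pdf("annual-report.pdf", fake_ocr, fake_summarize)
print(doc.summary)
```

Passing the stages in as functions keeps the flow independent of any particular vendor, which mirrors the idea that the summary step only needs the extracted text, not the original PDF.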

The module is built using a plugin system, which means it can be extended to employ different services. One such service is Mindee, an OCR API that processes many kinds of official documents, including European passports and license plates.

With the right extension, Document OCR can be used to fetch data from any kind of document and render it in HTML text form.
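The plugin idea can be illustrated with a small registry, sketched here in Python. This is an assumption-laden illustration, not the module's actual plugin API: Drupal plugins are PHP classes, and the names `OCR_PLUGINS`, `ocr_plugin`, `service_a`, `service_b`, and `extract` are all made up for this example.

```python
from typing import Callable, Dict

# Registry mapping a plugin ID to a function that turns a file path into text.
OCR_PLUGINS: Dict[str, Callable[[str], str]] = {}


def ocr_plugin(plugin_id: str):
    """Decorator that registers an OCR backend under the given ID."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        OCR_PLUGINS[plugin_id] = func
        return func
    return register


# Two hypothetical backends. Real ones would call an external OCR service.
@ocr_plugin("service_a")
def service_a(path: str) -> str:
    return f"[service_a] text from {path}"


@ocr_plugin("service_b")
def service_b(path: str) -> str:
    return f"[service_b] text from {path}"


def extract(path: str, plugin_id: str) -> str:
    """Look up the configured backend and delegate the extraction to it."""
    try:
        backend = OCR_PLUGINS[plugin_id]
    except KeyError:
        raise ValueError(f"Unknown OCR plugin: {plugin_id}") from None
    return backend(path)


print(extract("passport.pdf", "service_b"))
```

Supporting a new service then means registering one more backend; the calling code in `extract` does not change.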

[Image: Screenshots of the Document OCR module]

Does It Cost Anything?

The module is free, but OpenAI, Google Document AI, and similar tools are paid services. That said, most large organizations that depend heavily on PDFs will likely find the cost a worthwhile investment.

How Do I Try It Out?

Please contact us to schedule a demo of Document OCR. You can also read more about the module on our site and on the module page on Drupal.org.