Your Drupal and Next.js Experts in San Francisco

March 26, 2024

Do you manage a website or multi-site system for a large enterprise? If so, you almost certainly have a PDF problem, since search engines prioritize HTML over PDF content.

Government agencies, healthcare organizations, universities, and non-profits tend to have a lot of PDF content on their websites. They usually have little choice, as HTML pages can be impractical or incompatible with public reporting requirements.

Chapter Three has developed a solution to the PDF problem: a module called Document OCR. Document OCR converts PDF content to text using optical character recognition (OCR) services like Google Document AI. The OCR service extracts the text from a PDF, and the user then sends that text to OpenAI, which summarizes it in a single document.

We introduced Document OCR in beta mode in June 2023. Today Document OCR is fully stable and compatible with Drupal 10. According to Drupal.org, 16 sites and counting are already using it.

Who Needs Document OCR?

Any organization that leans heavily on PDF content would benefit from Document OCR. Governments typically use a lot of PDFs on their websites, but many other types of organizations do so as well, including:

Universities and colleges
Hospitals and other healthcare organizations
Court systems and judicial organizations
Not-for-profit organizations with reporting requirements
Engineering firms and similar companies that house design documents on their websites
Human resource departments responsible for maintaining intranets and HR portals

How Does It Work?

The user installs and configures the module, fine-tuning the desired import process. The AI tool extracts organized data from selected PDFs and converts it to text. The user then sends the text to OpenAI, which returns a page-long summary of each PDF. For accessibility purposes, the user can create an audio file of the summary using OpenAI’s audio API.
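The flow described above (OCR, then summary, then optional audio) can be pictured as a small pipeline. The Python sketch below is illustrative only and is not the module's code, which is a Drupal module. The names `process_pdf`, `ProcessedDocument`, `fake_ocr`, and `fake_summary` are hypothetical, and the three services are passed in as plain functions so the example runs without API keys.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProcessedDocument:
    source: str             # path or URL of the original PDF
    text: str               # text returned by the OCR step
    summary: str            # summary returned by the summarization step
    audio: Optional[bytes]  # optional spoken version of the summary


def process_pdf(
    source: str,
    extract_text: Callable[[str], str],
    summarize: Callable[[str], str],
    synthesize: Optional[Callable[[str], bytes]] = None,
) -> ProcessedDocument:
    """Run one PDF through OCR, summarization, and optional audio."""
    text = extract_text(source)
    if not text.strip():
        # Nothing to summarize, so stop before calling a paid service.
        raise ValueError(f"OCR returned no text for {source!r}")
    summary = summarize(text)
    audio = synthesize(summary) if synthesize is not None else None
    return ProcessedDocument(source, text, summary, audio)


# Stand-ins (assumptions) for the real OCR and summarization services.
def fake_ocr(source: str) -> str:
    return "Annual water quality report. All samples met state standards."


def fake_summary(text: str) -> str:
    # Pretend the first sentence is the summary.
    return text.split(".")[0] + "."


doc = process_pdf("report.pdf", fake_ocr, fake_summary)
print(doc.summary)  # prints "Annual water quality report."
```

In a real integration, the stand-in functions would be replaced by calls to the configured OCR and summarization services. Keeping each step as a separate function mirrors the fact that the user triggers the summary and audio steps explicitly.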

The module is built using a plugin system, which means it can be extended to employ different services. One such service is Mindee, an OCR API that processes many kinds of official documents, including European passports and license plates.
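To illustrate what a plugin system buys you, here is a minimal Python sketch of a plugin registry. It is not the module's implementation (the module relies on Drupal's own plugin system), and every name in it (`OcrPlugin`, `register`, `get_plugin`, the plugin IDs) is a hypothetical chosen for the example. The `extract` methods return placeholder strings rather than calling any real service.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class OcrPlugin(ABC):
    """Hypothetical base class: one subclass per OCR service."""

    @abstractmethod
    def extract(self, pdf_bytes: bytes) -> str:
        """Return the text found in the given PDF."""


# Maps a plugin ID to the class that implements it.
PLUGINS: Dict[str, Type[OcrPlugin]] = {}


def register(plugin_id: str) -> Callable[[Type[OcrPlugin]], Type[OcrPlugin]]:
    """Class decorator that adds a plugin to the registry."""
    def decorator(cls: Type[OcrPlugin]) -> Type[OcrPlugin]:
        PLUGINS[plugin_id] = cls
        return cls
    return decorator


@register("document_ai")
class DocumentAiPlugin(OcrPlugin):
    def extract(self, pdf_bytes: bytes) -> str:
        # Placeholder: a real plugin would call the OCR service here.
        return f"[document_ai] {len(pdf_bytes)} bytes processed"


@register("mindee")
class MindeePlugin(OcrPlugin):
    def extract(self, pdf_bytes: bytes) -> str:
        # Placeholder: a real plugin would call the OCR service here.
        return f"[mindee] {len(pdf_bytes)} bytes processed"


def get_plugin(plugin_id: str) -> OcrPlugin:
    """Instantiate the plugin registered under plugin_id."""
    if plugin_id not in PLUGINS:
        raise ValueError(f"No OCR plugin registered as {plugin_id!r}")
    return PLUGINS[plugin_id]()


print(get_plugin("mindee").extract(b"%PDF"))  # prints "[mindee] 4 bytes processed"
```

The calling code only knows the `extract` interface, so adding support for another service means registering one more class rather than changing the import process.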

With the right extension, Document OCR can be used to fetch data from any kind of document and render it in HTML text form.
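As a rough idea of what rendering extracted text as HTML involves, the sketch below wraps each block of text in a paragraph tag and escapes any characters that would otherwise be read as markup. It is a simplified assumption about the final step, not the module's actual rendering code, and `text_to_html` is a hypothetical name.

```python
from html import escape


def text_to_html(text: str) -> str:
    """Wrap blank-line-separated blocks in <p> tags, escaping markup."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Collapse line breaks inside a paragraph, since OCR output often
    # preserves the PDF's hard line wraps.
    return "\n".join(
        f"<p>{escape(' '.join(p.split()))}</p>" for p in paragraphs
    )


sample = "Budget & Finance\n\nRevenue rose in\nfiscal year 2023."
print(text_to_html(sample))
# <p>Budget &amp; Finance</p>
# <p>Revenue rose in fiscal year 2023.</p>
```

Escaping matters here because text pulled from a PDF may contain characters such as `&` or `<` that would break the page or be interpreted as tags if inserted unchanged.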

Does It Cost Anything?

The module is free, but OpenAI, Google Document AI, and similar tools are paid services. That said, most large organizations that depend heavily on PDFs will likely find it a worthwhile investment.

How Do I Try It Out?

Please contact us to schedule a demo of Document OCR. You can also read more about the module on our site and on the module page on Drupal.org.

Turning PDFs into HTML with Document OCR
