Created during December 2023

Processing PDFs


When working with LLMs, one of the most common use cases is extracting structured data from PDFs. PDFs are difficult to process in an automated fashion because they can have a very complex structure: a mix of text blocks, images, and tables, often without clear borders.

I have tried different packages for processing PDFs, including pyPDF, Unstructured, Tabula, Camelot, and the Adobe PDF Services API. For me, Adobe is the clear winner. It works amazingly well, especially with tables. Btw, you know that Adobe invented the PDF format, right? So you should not be surprised 😉

👉 Here’s how Adobe PDF Services API handles document extraction:
The output of this operation is a zip package containing the following:
• JSON files with all text organized into contextual blocks (paragraphs, headings, lists, footnotes).
• “table” folder: Table data is delivered in the JSON and also output as XLSX files, plus PNG images for easy visual verification.
• “figures” folder: Objects that are identified as images are extracted as PNG files.
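To make this concrete, here is a minimal sketch of walking such an extraction result with Python's standard library. The file names and JSON keys ("structuredData.json", "elements", "Path", "Text") reflect my understanding of the output layout but should be treated as assumptions; the zip here is a tiny stand-in built in memory so the snippet runs on its own:

```python
import io
import json
import zipfile

# Build a tiny stand-in zip mimicking the extraction output
# (file names and JSON keys are illustrative assumptions).
elements = {"elements": [
    {"Path": "//Document/P", "Text": "Revenue grew 12% year over year."},
    {"Path": "//Document/Table", "filePaths": ["tables/fileoutpart0.xlsx"]},
]}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("structuredData.json", json.dumps(elements))
    z.writestr("tables/fileoutpart0.xlsx", b"")   # placeholder table file
    z.writestr("figures/fileoutpart0.png", b"")   # placeholder figure file

# Walk the archive the way you would a real extraction result
with zipfile.ZipFile(buf) as z:
    data = json.loads(z.read("structuredData.json"))
    texts = [el["Text"] for el in data["elements"] if "Text" in el]
    tables = [n for n in z.namelist()
              if n.startswith("tables/") and n.endswith(".xlsx")]

print(texts)   # text blocks in reading order
print(tables)  # XLSX files to load separately
```

With a real result you would point `zipfile.ZipFile` at the downloaded archive instead of the in-memory buffer.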

👉 How I use it in practice: I read the tables from the XLSX files and combine them with the text from the JSON output to create a comprehensive input for the LLM. I have seen a big improvement in LLM accuracy when retrieving information from PDFs this way, especially when the information is located in tables.
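The combining step can be sketched like this: render each table as Markdown (LLMs handle that layout well) and interleave it with the text blocks into one prompt. The helper names and the sample data are hypothetical, not part of any API:

```python
# Hypothetical helpers: merge extracted text blocks with tables rendered
# as Markdown so the LLM sees both in a single prompt.
def table_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

def build_llm_input(text_blocks, tables):
    """Interleave text blocks and numbered Markdown tables into one prompt."""
    parts = list(text_blocks)
    for i, rows in enumerate(tables):
        parts.append(f"Table {i + 1}:\n{table_to_markdown(rows)}")
    return "\n\n".join(parts)

prompt = build_llm_input(
    ["Quarterly report summary.", "Figures are in EUR millions."],
    [[["Quarter", "Revenue"], ["Q1", "10"], ["Q2", "12"]]],
)
print(prompt)
```

In practice the `tables` argument would come from reading each XLSX file (e.g. with pandas or openpyxl) into rows of strings.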

👉 Choosing the right tool to parse PDFs for your needs:
• PyPDF or Unstructured: great for basic extraction needs where table structure is not a priority; completely free
• Camelot/Tabula: specialized in table extraction; free, with acceptable accuracy for many use cases
• Adobe PDF Services API: premium solution when high accuracy is critical