Agentic RAG workflow on tabular data from a PDF file
In this notebook, we build a workflow to extract accurate information from tabular data in a PDF file.
The highlights of the notebook are:
- Parse the PDF file and extract tables into images (optional).
- A single RAG agent fails to retrieve accurate information from tabular data.
- An agentic workflow using a group chat is able to extract the information accurately:
- the agentic workflow uses a RAG agent to extract document metadata (e.g. the image of a data table, given just the table name)
- the table image is converted to Markdown through a multi-modal agent
- finally, an assistant agent answers the original question with an LLM
Unstructured-IO is a dependency for this notebook to parse the PDF. Please install AG2 (with the neo4j extra) and the dependencies:
- Install Poppler https://pdf2image.readthedocs.io/en/latest/installation.html
- Install Tesseract https://tesseract-ocr.github.io/tessdoc/Installation.html
- pip install "ag2[neo4j]" unstructured==0.16.11 pi-heif==0.21.0 unstructured_inference==0.8.1 unstructured.pytesseract==0.3.13 pytesseract==0.3.13
Set Configuration and OpenAI API Key
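Below is a minimal configuration sketch; the model name and the OPENAI_API_KEY environment variable are assumptions, so adjust them to your own setup.

```python
import os

# Minimal LLM configuration used by the agents in this notebook.
# The model name and the OPENAI_API_KEY environment variable are assumptions;
# replace them with your own configuration.
config_list = [
    {
        "model": "gpt-4o",
        "api_key": os.environ["OPENAI_API_KEY"],
    }
]
```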
Parse PDF file
You can skip this step and use the already-parsed files to run the rest of the notebook. Parsing is expensive and time consuming; skip it if you don't need to regenerate the full data set. The estimated cost of parsing the PDF and building the knowledge graph from the entire parsed output is $10 to $15.
For the notebook, we use a common financial document, the NVIDIA 2024 10-K, as an example (file download link).
We use Unstructured-IO to parse the PDF; the tables and images in the PDF are extracted as .jpg files.
All parsed output is saved in a JSON file.
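A sketch of the parsing step is shown below, assuming the 10-K has been downloaded locally; the file name and output directory are placeholders.

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

# Parse the PDF with the hi_res strategy so that table structure is inferred
# and table/image blocks are written out as .jpg files.
# The file and directory names here are placeholders.
elements = partition_pdf(
    filename="./nvidia_10k_2024.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_output_dir="./parsed_pdf_info",
)

# Save the full parsed output to a JSON file for later reuse.
elements_to_json(elements, filename="./parsed_elements.json")
```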
Create sample dataset
Imports
If you want to skip the parsing of the PDF file, you can start here.
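The imports below cover the agents, the multimodal agent, and the Neo4j graph query engine used in the remainder of the notebook; the module paths follow AG2's graph RAG contrib package and LlamaIndex's OpenAI integrations and should be treated as a sketch of the required imports.

```python
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

from autogen import AssistantAgent, ConversableAgent, GroupChat, GroupChatManager, UserProxyAgent
from autogen.agentchat.contrib.graph_rag.document import Document, DocumentType
from autogen.agentchat.contrib.graph_rag.neo4j_graph_query_engine import Neo4jGraphQueryEngine
from autogen.agentchat.contrib.graph_rag.neo4j_graph_rag_capability import Neo4jGraphCapability
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent
```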
Create a knowledge graph with sample data
To save time and cost, we use a small subset of the data for this notebook.
Using the subset does not change the outcome: the native RAG agent solution still fails to provide the correct answer.
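Below is a sketch of building the knowledge graph from a sample of the parsed output, assuming the imports above and a local Neo4j instance; the file path, credentials, host, port, and model names are placeholders, and the exact constructor arguments may differ in your AG2 version.

```python
# A small sample of the parsed elements (placeholder path).
input_path = "./parsed_elements_sample.json"
input_documents = [Document(doctype=DocumentType.JSON, path_or_url=input_path)]

# Build the knowledge graph in a local Neo4j instance.
# Credentials, host, and port are placeholders for your own deployment.
query_engine = Neo4jGraphQueryEngine(
    username="neo4j",
    password="password",
    host="bolt://localhost",
    port=7687,
    database="neo4j",
    llm=OpenAI(model="gpt-4o", temperature=0.0),
    embedding=OpenAIEmbedding(model="text-embedding-3-small"),
)

# Ingest the documents and populate the graph (the expensive step).
query_engine.init_db(input_doc=input_documents)
```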
Connect to knowledge graph if it is built
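If the graph was already built in a previous run, you can reconnect to it instead of re-ingesting; a minimal sketch, assuming the same Neo4jGraphQueryEngine configuration as above.

```python
# Reuse the existing graph instead of rebuilding it.
query_engine.connect_db()
```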
Native RAG Agent Solution
The following shows that when a native RAG agent is used on the parsed data, it fails to retrieve the right information (5,282 instead of 4,430).
Our best guess is that the RAG agent cannot infer the table structure from plain text.
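A sketch of the native RAG agent solution, assuming the query_engine built above; the agent names and the question wording are illustrative.

```python
# A plain ConversableAgent with the Neo4j graph RAG capability attached.
rag_agent = ConversableAgent(
    name="nvidia_rag",
    human_input_mode="NEVER",
)
Neo4jGraphCapability(query_engine).add_to_agent(rag_agent)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# Asking the question directly against the parsed text tends to return the
# wrong figure, because the table structure is lost in plain text.
user_proxy.initiate_chat(
    rag_agent,
    message="What is the goodwill asset (in millions) reported in the "
    "NVIDIA Corporation and Subsidiaries Consolidated Balance Sheets?",
    max_turns=1,
)
```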
Agentic RAG workflow for tabular data
In the example above, when asked for the goodwill asset (in millions) from the table NVIDIA Corporation and Subsidiaries Consolidated Balance Sheets, the answer was wrong: the correct figure from the table is $4,430 million, not $4,400 million. To improve RAG performance on tabular data, we introduce the enhanced workflow below.
The workflow consists of a group of agents coordinated through a group chat. It breaks the RAG process into three main steps:
1. Find the parsed image of the corresponding table.
2. Convert the image into a structured Markdown table.
3. Answer the original question using the Markdown table, which contains the correct data.
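Below is a condensed sketch of the group chat wiring, assuming the query_engine and config_list defined earlier; the agent names, system messages, and speaker-selection behavior are simplified placeholders rather than the exact notebook setup.

```python
llm_config = {"config_list": config_list}

# Step 1: a RAG agent that retrieves table metadata (e.g. the image path of a
# table) from the knowledge graph.
table_finder = ConversableAgent(name="table_finder", human_input_mode="NEVER")
Neo4jGraphCapability(query_engine).add_to_agent(table_finder)

# Step 2: a multimodal agent that converts the table image into Markdown.
img_to_markdown = MultimodalConversableAgent(
    name="img_to_markdown",
    system_message="Convert the given table image into a structured Markdown table.",
    llm_config=llm_config,
)

# Step 3: an assistant agent that answers the original question from the
# Markdown table.
table_answerer = AssistantAgent(
    name="table_answerer",
    system_message="Answer the user's question using the provided Markdown table.",
    llm_config=llm_config,
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# The group chat manager coordinates the three steps.
groupchat = GroupChat(
    agents=[user_proxy, table_finder, img_to_markdown, table_answerer],
    messages=[],
    max_round=10,
)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(
    manager,
    message="What is the goodwill asset (in millions) in the "
    "NVIDIA Corporation and Subsidiaries Consolidated Balance Sheets?",
)
```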