Streamlining Document Workflows: The Power of LLMs in Knowledge Extraction

Written by Arvin Lim and Nicholas Anile.

Introduction

In recent years, Large Language Models (LLMs) have captured the attention of businesses and tech enthusiasts alike. At Mantel Group, we’ve witnessed this growth firsthand, engaging in numerous GenAI projects since ChatGPT first ignited the hype. Over the past year, we’ve worked across LLM projects of all kinds, ranging from feedback generation and chatbots to analytical agents that provide data-driven insights. As an AI & ML Engineering Team, we’ve been actively involved in every aspect of these projects, from fine-tuning and prompt engineering to integrating models into production-ready solutions. Looking back, it’s remarkable to see how far the domain of Generative AI has come since LLMs first exploded into the mainstream; while some of the excitement may be overstated, much of it is well-deserved. These models have become pivotal in automating a wide range of tasks, such as customer support chatbots that engage in natural conversations, code generation assistants, and creative writing tools. The enthusiasm surrounding LLMs stems from their remarkable ability to transform workflows, reduce manual labour, and unlock new levels of efficiency in entirely new use cases.

With the recent release of several large multi-modal models (i.e. models that accept images and other media in addition to text), another exciting frontier has emerged: document knowledge extraction through image data. This process automates the otherwise tedious task of retrieving structured data from unstructured documents, such as invoices and receipts. In this blog, we’ll explore how LLMs can be leveraged for document knowledge extraction, the strategies behind this automation, and its potential to revolutionise this traditionally labour-intensive task.

What Makes LLMs Suitable for Knowledge Extraction?

The development of multi-modal LLMs such as Claude 3.5 Sonnet and GPT-4o enables the use of multiple media types, such as images, in addition to text. This capability, combined with the models’ large context windows (the amount of input a model can consider when generating a response), unlocks the ability to accurately extract data from documents. LLMs also excel at generalising across various document types, much like a human reader. They can pick up contextual cues and infer where key information lies, even if they haven’t been explicitly trained on a specific document layout (one of the biggest challenges for traditional document processing techniques).

Thanks to their natural language interface, LLMs allow us to leverage human-readable definitions of fields - providing these definitions as context to the model eliminates the need to codify thousands of complex business rules to cover all the possible edge cases that exist across document types and language. This flexibility also means that adapting an existing extraction workflow to new document types or data schemas is relatively straightforward. By simply rewriting the field definitions in plain English, we can repurpose the solution for entirely different use cases, reducing the technical burden and accelerating implementation.

The Dual-Input Solution

The processing pipeline used for the LLM document knowledge extraction solution.


Our document knowledge extraction solution leverages a combination of LLMs and Optical Character Recognition (OCR) technology to automate the process of structuring unstructured data from documents such as invoices. The process begins with the input of three simple components: 

  • A schema that defines the ideal output structure

  • Field definitions that specify what data to extract

  • The document itself
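As an illustration, the three inputs might look something like the sketch below. The field names, definitions, and file path are hypothetical examples, not a real client schema:

```python
# Hypothetical inputs to the extraction pipeline: an output schema,
# plain-English field definitions, and a path to the source document.
schema = {
    "invoice_number": "string",
    "invoice_date": "string (ISO 8601)",
    "total_amount": "number",
    "supplier_name": "string",
}

field_definitions = {
    "invoice_number": "The unique identifier of the invoice, often labelled "
                      "'Invoice No.' or 'Tax Invoice #'.",
    "invoice_date": "The date the invoice was issued, not the due date.",
    "total_amount": "The final amount payable, including tax.",
    "supplier_name": "The legal name of the business issuing the invoice.",
}

document_path = "invoices/example_invoice.pdf"

# Every schema field should have a matching definition; the schema and
# definitions are later serialised into the LLM prompt as context.
assert set(schema) == set(field_definitions)
```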

Preprocessing techniques are employed on documents to ensure images of pages extracted from the document are of the best possible quality. This transformation often includes sharpening images for enhanced text readability and/or rotating documents to ensure upright text orientation. These efforts ensure that the image is best placed for accurate text extraction.
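A minimal preprocessing pass might look like the following sketch, using Pillow for sharpening and rotation. The specific filter settings are illustrative, and in practice the rotation angle would come from an orientation-detection step (e.g. Tesseract’s orientation and script detection mode) rather than being passed in directly:

```python
from PIL import Image, ImageFilter

def preprocess_page(image: Image.Image, rotation_degrees: int = 0) -> Image.Image:
    """Sharpen a page image and rotate it upright before OCR."""
    # Convert to greyscale: most OCR engines work best on single-channel input.
    page = image.convert("L")
    # An unsharp mask boosts edge contrast, improving character legibility.
    page = page.filter(ImageFilter.UnsharpMask(radius=2, percent=150))
    # Rotate so text is upright; expand=True keeps the full page in frame.
    if rotation_degrees:
        page = page.rotate(rotation_degrees, expand=True)
    return page

# Example: a blank landscape test page rotated 90 degrees back to portrait.
page = preprocess_page(Image.new("RGB", (200, 100), "white"), rotation_degrees=90)
```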

Once preprocessing is complete, the resulting image(s) are passed to an OCR engine to extract the raw text contained in the document. OCR engines such as Tesseract receive images as input and return detected text within the image as an output. Both the text and the original document image are then fed into the LLM. This dual-input strategy is key: when passing only the image, the LLM can usually identify the fields in the correct locations but tends to misread characters. Conversely, when passing only the OCR text, while character accuracy improves, the LLM sometimes struggles with identifying field values due to a lack of spatial awareness (e.g., missing context from tabular structures or logos). 

Examples of edge cases that might be difficult for a traditional document knowledge extraction solution to process correctly.

By providing both the image and the text, we achieve the best of both worlds—the LLM uses the image to accurately match fields and the OCR text to reliably extract characters.
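The dual-input strategy can be sketched as a single multimodal request carrying both the page image and its OCR text. The message layout below follows the OpenAI-style chat format as an illustration; other providers (e.g. Anthropic, or models hosted on Bedrock) use slightly different field names:

```python
import base64

def build_dual_input_messages(image_bytes: bytes, ocr_text: str,
                              schema_json: str, definitions: str) -> list[dict]:
    """Assemble a multimodal prompt with both the page image and its OCR text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    system = (
        "Extract the fields defined below from the invoice. "
        "Use the image to locate each field and the OCR text to read "
        "its exact characters. Respond with JSON matching this schema:\n"
        f"{schema_json}\n\nField definitions:\n{definitions}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": [
            # The image gives the model spatial context (tables, logos, layout).
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encoded}"}},
            # The OCR text gives it reliable character-level readings.
            {"type": "text", "text": f"OCR text of the page:\n{ocr_text}"},
        ]},
    ]

messages = build_dual_input_messages(
    b"<png bytes here>", "INVOICE #1234\nTOTAL $99.00",
    '{"invoice_number": "string"}',
    "invoice_number: the unique invoice identifier",
)
```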

This approach delivers several key benefits over traditional document knowledge extraction services:

  1. Superior accuracy: LLMs offer better accuracy by leveraging contextual clues within documents to identify fields that may not be explicitly labelled. They also excel at recognising patterns based on field definitions and handling edge cases that rigid, heuristic-based systems often struggle with.

  2. Cost-effectiveness: Depending on the use case and the model employed, LLM-based solutions can be more cost-efficient than other alternatives, thanks to the ever-decreasing cost of tokens.

  3. Generalisability: LLMs’ adaptability to new document types or business use cases is unparalleled. The processing logic is often predominantly contained within the natural language prompt, making it easy to update and maintain.

  4. Direct output into the required format: The LLM can be instructed to produce results directly in the desired schema, eliminating the need for complex post-processing to transform raw outputs into a usable format.

Business Benefits

We have productionised this solution for a client and observed significant improvements in accuracy compared to their OCR-only solution. Field accuracy increased from 80% to 98%, largely due to the LLM’s ability to handle more complex edge cases. In practical terms, this resulted in an increase in the proportion of documents processed fully automatically, reducing the burden on human officers who previously reviewed documents manually. Additionally, the solution eliminated a substantial portion of hard-coded business logic, reducing the maintenance effort and allowing the system to process document structures that were previously unsupported by the legacy system.

Below is an illustrative example, highlighting how an LLM-enabled processing workflow can lead to significant cost savings. Even with a very generous weighting towards humans in the loop and a low assumed model accuracy, the automated solution is 65% more cost effective and executes in a fraction of the time.


Some high-level assumptions:

  • 50,000 invoices per week.

  • LLM cost is $31.11 per 1,000 invoices (estimated).

  • Human cost is $200 per 1,000 invoices (1 invoice per 30 seconds at $24/hr).

  • 20% of LLM outputs still fail checks and require human intervention.
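The savings figure can be reproduced directly from these assumptions with a back-of-envelope calculation (ignoring infrastructure and development costs):

```python
invoices_per_week = 50_000
llm_cost_per_1000 = 31.11    # estimated LLM processing cost
human_cost_per_1000 = 200.0  # 1 invoice per 30 seconds at $24/hr
failure_rate = 0.20          # share of LLM outputs needing human review

# Fully manual baseline: every invoice handled by a human.
manual_cost = invoices_per_week / 1000 * human_cost_per_1000  # $10,000/week

# Automated: every invoice through the LLM, plus human review of failures.
llm_cost = invoices_per_week / 1000 * llm_cost_per_1000       # $1,555.50/week
review_cost = (invoices_per_week * failure_rate / 1000) * human_cost_per_1000  # $2,000/week
automated_cost = llm_cost + review_cost                       # $3,555.50/week

saving = 1 - automated_cost / manual_cost
print(f"Weekly saving: {saving:.0%}")  # prints "Weekly saving: 64%"
```

Rounded up, this is the roughly 65% cost reduction quoted above.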

This solution has proven not only to be effective within a single business unit but also highly portable across different use cases and clients across a wide range of industries. The key modifications required are simple - adjusting the output structure and field definitions - which underscores the adaptability of the LLM to new challenges.

Moreover, we’ve observed that the model exhibits a surprising ability to make ‘human-like’ decisions in ambiguous scenarios. When faced with poorly defined fields or unusually formatted documents, the model often makes rational choices that a human might arrive at when interpreting the same document. This capability further highlights the flexibility and intelligence of the solution; this ability to reason, combined with some post-processing for validation, makes it a powerful tool for automating complex document processing tasks.

While we have primarily focused on LLMs’ capability to extract knowledge from documents within an operational automation use case, transforming this unstructured data into formats more accessible to business intelligence and other machine learning development creates enormous business potential - for example, uncovering new patterns, trends and relationships that were previously unknown.

What We’ve Learned Implementing These Solutions

Prompt Engineering

Crafting an effective prompt is akin to writing clearly and unambiguously for another human. Throughout testing, we identified situations where our field definitions were not as robust as anticipated. Refining these definitions to provide the most accurate and useful context for the LLM is crucial for optimising the last few percentage points of accuracy. This process includes providing illustrative examples, clarifying edge cases, and alerting the LLM to avoid mistakes we’ve observed previously.

How the natural language field definitions can be easily adjusted to provide additional context for the LLM.
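In practice, this iterative refinement is often as simple as editing a definition string. The before/after wording below is a made-up example of adding an illustrative example, an edge case, and a previously observed mistake:

```python
# Before: a terse definition that left the model guessing on edge cases.
field_definitions = {
    "invoice_date": "The date of the invoice.",
}

# After: the same field, refined with an example, an edge-case rule, and a
# known failure mode observed during testing (all hypothetical wording).
field_definitions["invoice_date"] = (
    "The date the invoice was issued, in ISO 8601 format (e.g. 2024-03-15). "
    "If both an issue date and a due date appear, always take the issue date. "
    "Do not confuse it with the delivery date shown in the line-item table."
)
```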


The structure of the prompt - specifically the order of messages - can also significantly impact model performance. In general, prompts should prioritise important context or information at the beginning and end of the prompt. Adjusting the message order may be necessary, not just rewriting field descriptions. In one of our projects, a simple change in the overall structure led to one of the most substantial accuracy improvements we experienced.

Another critical consideration is the specific LLM being utilised: every model has been trained to expect prompts written in certain styles and using certain language. For example, different models use different text indicators for placeholder variables, and matching each model’s expected conventions can improve performance dramatically.

OpenAI’s recommended approach to inserting variables into prompts.


Fine-tuning prompts specifically for a particular LLM also necessitates redefining and iterating on those prompts when changing the underlying models. We advocate for solutions that are model-agnostic, especially since LLM advancements are still happening rapidly. While code can be written to relatively easily accommodate model changes, prompts that are tailored to specific models will require continual refinement, adding a degree of rework overhead.

Engineering Considerations

Scalability

One of the key challenges in scaling LLM-based document extraction solutions, especially in enterprise environments, is managing API rate limits imposed by LLM providers. For large-scale operations, careful design is needed to ensure that the system can meet the latency requirements, which may fluctuate depending on internal demand and external factors such as cloud provider load.

Hosted LLMs, like AWS Bedrock, impose rate limits on both API calls and token processing, which creates a mismatch between the ingestion rate and the model’s throughput capabilities. To address this, solutions like API gateways, load balancers, and queuing mechanisms can be implemented to manage input rates and ensure seamless processing.
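A common complement to gateways and queues is client-side retrying with exponential backoff when the provider throttles a request. The sketch below is illustrative: the exception type stands in for a provider’s rate-limit error (e.g. an HTTP 429), and the retry policy is not tied to any specific SDK:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's throttling exception (e.g. HTTP 429)."""

def call_with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry an LLM call with exponential backoff and jitter on rate limits."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff; jitter spreads retries across clients.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Example: a fake call that is throttled twice before succeeding.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "extracted fields"

result = call_with_backoff(flaky_call, base_delay=0.01)
```

In production this would sit behind the queueing layer, so that bursts of documents are smoothed before they ever reach the model endpoint.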

Additionally, design considerations around LLM usage are crucial. Decisions such as choosing between provisioned throughput or on-demand access, selecting the appropriate cloud provider, and determining the region where services are hosted can all impact performance. If multiple production solutions are utilising an API hosted on the same cloud account, it’s important to establish how resources will be allocated to avoid bottlenecks. For example, you could choose to deploy a solution with lower expected consumption into non-primary regions to maintain your API quota bandwidth in the regions with the highest capacity such as US-West. 

Operations

In a production setting, maintaining accuracy and consistency over time is equally important. Periodic regression testing is essential to detect any performance degradation or data drift. For static LLM models, it’s necessary to monitor new data inputs and label previously unseen invoices to ensure the service remains accurate over time. Furthermore, should the hosted LLM be deprecated or updated, businesses must have a process for regression testing to ensure a smooth transition to a new model without disrupting operations.

These operational practices ensure the system can scale while maintaining the reliability and accuracy required for enterprise document extraction workflows.

Data Challenges

Document knowledge extraction frequently involves handling Personal Identifiable Information (PII), which presents significant challenges, especially when transmitting data to APIs like LLMs hosted on cloud providers. It’s essential to adhere to relevant regulations governing PII handling. Often, this means using LLMs hosted within your own region to prevent sensitive information from being transferred outside your jurisdiction due to data sovereignty restrictions. This requirement may limit the range of models available, depending on your location. In certain industries where these additional requirements exist, significant pre- and post-processing logic is required throughout the workflow to maintain individuals’ anonymity and uphold security best practices.

Another major challenge is obtaining the document labels necessary for evaluating model performance. Organisations relying on manual processes often have poor-quality, inconsistent data that cannot be used to assess model outputs. In such cases, it is vital to invest time in generating a set of ‘ground truth’ labels for benchmarking different models' performance in model evaluation phases. Regular evaluation against this labelled set provides a measurable metric to track model performance progress over time.
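Once a ground-truth set exists, per-field accuracy can be computed with a simple comparison. The sketch below uses naive exact matching; real evaluations usually normalise values first (date parsing, whitespace stripping, currency formatting) before comparing:

```python
def field_accuracy(predictions: list[dict], ground_truth: list[dict]) -> dict:
    """Per-field accuracy of model outputs against labelled documents."""
    fields = ground_truth[0].keys()
    correct = {f: 0 for f in fields}
    for pred, truth in zip(predictions, ground_truth):
        for f in fields:
            # Naive exact match; production pipelines normalise values first.
            if pred.get(f) == truth[f]:
                correct[f] += 1
    return {f: correct[f] / len(ground_truth) for f in fields}

# Two labelled invoices and the model's outputs (hypothetical values).
truth = [{"invoice_number": "1234", "total": "99.00"},
         {"invoice_number": "5678", "total": "10.50"}]
preds = [{"invoice_number": "1234", "total": "99.00"},
         {"invoice_number": "5678", "total": "10.s0"}]  # one OCR-style misread
scores = field_accuracy(preds, truth)
print(scores)  # {'invoice_number': 1.0, 'total': 0.5}
```

Tracking these per-field scores over time is what makes regression testing against new models, or new document inputs, measurable rather than anecdotal.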

The Future of Document Knowledge Extraction

Looking ahead, the potential applications of this solution extend far beyond simple document knowledge extraction. The same approach could be applied to a variety of industries, from automating legal document analysis and financial report extraction to processing medical records and categorising documents for regulatory compliance. It could even play a role in fraud detection by analysing patterns in invoices and reports that are typically difficult for rule-based systems to catch. As we continue to see advancements in LLM technology, these broader applications are becoming increasingly feasible from both a technology and cost perspective.

The use of LLMs in automating document knowledge extraction showcases the transformative potential of AI in business processes. By reducing manual effort, improving accuracy, and enhancing scalability, we’ve seen how LLMs can revolutionise how a business handles unstructured data. 

AI and LLMs are poised to continue reshaping document processing and beyond. As these models grow more sophisticated and multi-modal capabilities improve, the range of tasks that can be automated will expand even further. The future of AI-driven automation is bright, and the possibilities for reducing manual work, streamlining operations, and uncovering new efficiencies seem limitless.

Looking ahead, we’re going to be diving deep into the LLMs and how they’re evolving the AI landscape right now with a second blog in this series: A Gen AI World: Powering Knowledge-Driven Applications

10/21/2024