Written by Brian Law, Nicholas Anile, Arvin Lim and Kaycee Low.
The Evolving Landscape of Generative AI (Gen AI)
Since ChatGPT burst onto the scene in late 2022, there has been a torrent of innovation promising everything from fully automated business operations to AI-assisted video generation and editing tools. For businesses, navigating this foundational shift is a challenge, with new capabilities, techniques, and approaches appearing monthly and sometimes even weekly.
Against this backdrop, it is important for enterprises to take a flexible approach, investing in projects that provide immediate value but also have the high-impact potential to stand the test of time. Though it is often forgotten, Gen AI is still a data-driven technology and requires converting documents, slides, spreadsheets, audio, and other media formats into AI-consumable content. Establishing and building solid, repeatable foundations is key. Just as Data Engineering is key to a thriving classical Machine Learning program, document processing via Data Engineering is key to a high-performing Gen AI program.
From Prompting to Retrieval Augmented Generation (RAG)
So why document processing? Large Language Models (LLMs) seem almost magical on first use, with a level of language comprehension and tailored responsiveness that puts even the most well-structured tree-based chatbot experience to shame. Yet it quickly becomes evident that, even when given detailed instructions, both open-source and commercial models have many gaps in their knowledge. That is not even counting the knowledge cliff every model possesses: because of extensive training times and the data engineering demands of the huge data volumes involved, a model's knowledge stops at a fixed point in time (often referred to as the knowledge cutoff).
Off the shelf LLMs will likely not understand:
Business Jargon & Acronyms
Business Processes
Niche knowledge / domain areas
Whilst it is entirely possible to further train an LLM on an enterprise’s specific business domain and processes (referred to as fine-tuning), this requires an extensive document processing pipeline, built via data engineering, to assemble the training datasets. In addition, fine-tuning frequently fails to address other common limitations of LLMs, such as hallucinations, which occur when the LLM confidently generates false but plausible-sounding content. The Retrieval Augmented Generation (RAG) architecture provides an approach that can augment an existing LLM with proprietary knowledge whilst reducing the likelihood and impact of hallucinations.
Just like building a training dataset for an LLM, RAG requires that unstructured media be processed into a structured format that works well with LLMs. In the case of RAG, this processed data is stored in a Vector Database to allow for quick retrieval and use when answering questions from end users.
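To make the retrieval half of RAG concrete, here is a toy, in-memory sketch of the lookup step. The chunk texts and hand-written vectors are invented for illustration and stand in for a real embedding model and Vector Database:

```python
import math

# Toy "embeddings": in a real pipeline these come from an embedding
# model and live in a vector database, not a Python dict.
chunks = {
    "Acme's fiscal year ends in June.": [0.9, 0.1, 0.0],
    "Submit expense claims via the HR portal.": [0.1, 0.8, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, k=1):
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]),
                    reverse=True)
    return ranked[:k]
```

In production the similarity search is delegated to the Vector Database, and the retrieved chunks are prepended to the user's question as grounding context for the LLM.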
Challenges with Unstructured Text Data - Making it accessible to our AI
Tabular (‘structured’) data can be readily processed and utilised within AI systems. On the other hand, documents, slides, and websites are unstructured media formats built for human visual consumption and can’t be utilised in Gen AI immediately; there is usually no clear metadata defining paragraphs, callout boxes, tables, and diagrams that would make the information they contain easily accessible and understandable for an LLM.
Just as with classical Machine Learning, Data Engineering comes to the rescue, enabling the conversion of unstructured data into structured, consumable data.
For Document Pipelines the key steps are:
Parsing: The process of decoding & extracting the information from a specific media format/type.
Chunking: The splitting of the media into smaller, more meaningful pieces; for example, the paragraphs or sections of a document.
Embedding: For data to be consumable within a model, it must be converted to a numerical representation via embedding.
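The three steps above can be sketched end to end. Everything here is deliberately simplified: parsing is plain UTF-8 decoding, chunks are paragraphs, and the letter-frequency "embedding" is a stand-in for a real embedding model:

```python
def parse(raw_bytes: bytes) -> str:
    # Format-specific decoding; plain UTF-8 text here for illustration.
    return raw_bytes.decode("utf-8")

def chunk(text: str) -> list[str]:
    # Split on blank lines, i.e. one chunk per paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def embed(pieces: list[str]) -> list[list[float]]:
    # Stand-in embedding: letter-frequency vectors. A real pipeline
    # would call an embedding model here.
    return [[p.lower().count(ch) / len(p) for ch in "etaoin"] for p in pieces]

doc = b"Policy Guide\n\nClaims are paid monthly.\n\nContact HR for details."
vectors = embed(chunk(parse(doc)))
```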
For this article, we will focus on the parse and chunk steps which offer the most room for customisation and tweaking.
Let’s start with parsing: different documents come in different formats, such as portable documents (PDF), Word documents (DOCX), spreadsheets (XLSX), and websites (HTML). Each format has its own nuances and requires different processing steps to unlock the data held within. There are a number of established pathways for this task, but given the huge volume of documents enterprises want to make use of, it requires a flexible, scalable solution like the Apache Spark engine to run efficiently at scale and in a timely manner.
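As a minimal, single-machine sketch of that per-format dispatch, the snippet below uses only the standard library and handles just HTML and plain text. The PDF, DOCX, and XLSX branches, and names like parse_document, are illustrative placeholders for libraries such as pypdf, python-docx, and openpyxl; at enterprise scale each parser would typically run inside a Spark UDF over a DataFrame of file paths:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, dropping the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def parse_html(raw: str) -> str:
    extractor = TextExtractor()
    extractor.feed(raw)
    return " ".join(extractor.parts)

# One entry per supported format; PDF/DOCX/XLSX branches would call
# format-specific libraries rather than these toy handlers.
PARSERS = {"html": parse_html, "txt": lambda s: s}

def parse_document(name: str, raw: str) -> str:
    ext = name.rsplit(".", 1)[-1].lower()
    return PARSERS[ext](raw)
```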
Once the data has been parsed you can get chunking! Chunking is probably the most complex step in a document pipeline because of the significant variation in established approaches and document types; no single solution will work for 100% of your documents. Chunking requires not just the technical skill to build the processing pipelines but also an understanding of the data being chunked and how it will be utilised downstream. At the most basic level, chunking can be achieved by splitting parsed text by character or word count, for example creating chunks that are n characters or words long. This often performs acceptably in limited contexts, but for a solution to meet the demands of a varied and demanding user base we need to do much better.
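A minimal sketch of that basic character-count approach, with an overlap so that a sentence crossing a chunk boundary still appears whole in at least one chunk (the size and overlap defaults are arbitrary):

```python
def chunk_by_chars(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks; consecutive chunks share `overlap`
    characters so boundary-spanning sentences survive intact."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```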
One way to parse and chunk text better is to utilise computer vision based technologies that essentially allow the computer to ‘read’ a document the way a human does. This lets us capture meaning inherent in the format, for example by recognising the headings, images, paragraphs, and columns in the text. Once a difficult task, decoding a document while taking its formatting and layout into account is now very achievable.
Visual processing of documents
Visual techniques have been used for processing text for a long time. Optical Character Recognition (OCR) is the most well known variant and has been powering things like handwritten cheque processing engines, customs forms, and the like for years. It is also the most common first step into the visual processing world.
To use OCR, we first convert our documents, page by page and slide by slide, into images before applying the OCR processing step. This may sound like an antipattern, since we can already use a text parsing engine to directly read most computer-authored documents, but doing so commonly loses information like font size and layout that can be key to understanding the document and the information contained within it.
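A sketch of that page-by-page loop, with the OCR engine left as a pluggable function. In practice ocr_fn might be pytesseract.image_to_string applied to pages rendered by a tool such as pdf2image; here a stub stands in so the control flow is visible:

```python
def ocr_document(page_images, ocr_fn):
    """Run OCR page by page, keeping page numbers so that downstream
    chunking knows where each piece of text came from."""
    results = []
    for page_no, image in enumerate(page_images, start=1):
        results.append({"page": page_no, "text": ocr_fn(image)})
    return results
```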
For many applications, existing OCR techniques can be enough, but an emerging technique is to leverage the same technology behind LLMs and self-driving cars to build more capable document processing engines. Unlike traditional OCR pipelines, where the developer has to provide guidance on how to process certain fields and write rules for what counts as a heading versus a figure description, Vision Language Models (VLMs) can take an image of a document directly as input and turn a complex document into structured data ready for embedding and insertion into a Vector Database.
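A hedged sketch of what such a VLM call might look like. The payload below follows the image-input shape of OpenAI-style chat-completions APIs, with the model name purely illustrative; other VLM APIs use different request formats:

```python
import base64

def build_vlm_request(page_png: bytes, model: str = "gpt-4o") -> dict:
    """Build a chat-completions style request asking a VLM to return a
    page as structured JSON, with the page image sent as a data URL."""
    image_b64 = base64.b64encode(page_png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Extract the headings, paragraphs and tables "
                          "on this page as JSON.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
```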
Getting even more advanced, you can have the best of both worlds by developing a processing engine that combines OCR and VLM components into a Compound System (also known as Compound AI).
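One way such a compound system might route work, sketched with toy components: a detector flags complex pages for the (more expensive) VLM, while simple pages go through cheap OCR. All the component functions here are invented stand-ins:

```python
def process_page(page, is_complex, ocr_fn, vlm_fn):
    """Compound routing: cheap OCR for simple pages, a VLM call
    only when the page is flagged as complex."""
    return vlm_fn(page) if is_complex(page) else ocr_fn(page)

# Toy components: a real detector might look for tables or figures,
# and the real ocr_fn/vlm_fn would call OCR and VLM engines.
pages = [{"id": 1, "tables": 0}, {"id": 2, "tables": 3}]
routed = [
    process_page(p,
                 is_complex=lambda p: p["tables"] > 0,
                 ocr_fn=lambda p: ("ocr", p["id"]),
                 vlm_fn=lambda p: ("vlm", p["id"]))
    for p in pages
]
```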
Let's Get Building on the Databricks Data Intelligence Platform!
The potential applications of AI within enterprise organisations are endless: compound AI document processing pipelines and RAG-uplifted Gen AI are two with huge potential for building enhanced knowledge-driven applications. To achieve this, organisations need teams building solutions capable of scaling to meet an enterprise's needs as it becomes more data driven. In our follow-up article, we will dive into the technical detail, demonstrating how you can leverage Databricks to build a state-of-the-art Document Processing Pipeline on the Data Intelligence Platform.