How to Build a B2B Document Extractor with Both Rules and LLM: A Step-by-Step Comparison


Introduction

Extracting structured data from B2B PDF invoices, purchase orders, and receipts is a common challenge. Many developers turn to rule-based approaches using OCR (like Tesseract) or explore modern LLMs (like LLaMA 3) for more flexible extraction. This guide walks you through building the same extractor twice — once with pytesseract rules and once with Ollama + LLaMA 3 — so you can compare performance, accuracy, and maintenance on a realistic B2B order scenario.

Source: towardsdatascience.com

What You Need

- Python 3 with pytesseract, pdf2image, Pillow, and the ollama client library
- The Tesseract OCR engine and Poppler installed on your system
- Ollama running locally with the llama3 model pulled
- A sample B2B PDF, such as a purchase order

Step-by-Step Guide

Step 1: Set Up the Environment and Sample Document

First, create a project folder and install the Python dependencies (pytesseract also needs the Tesseract binary on your PATH, and pdf2image needs Poppler):

pip install pytesseract pdf2image Pillow ollama

Place your sample B2B PDF in the folder. For this guide, we assume a purchase order containing fields such as Order ID, Supplier Name, Line Items, and Total Amount.

Step 2: Build the Rule-Based Extractor with pytesseract

Create a Python script rule_extractor.py. Use pdf2image to convert PDF pages to images, then apply Tesseract OCR:

from pdf2image import convert_from_path
import pytesseract

# Convert the first PDF page to an image (requires Poppler), then OCR it.
# A higher DPI generally improves OCR accuracy.
images = convert_from_path('order.pdf', dpi=300)
text = pytesseract.image_to_string(images[0])

Now define rules using regex and keyword matching. For example:
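A minimal sketch of such rules follows; the field labels and value formats below are assumptions about this particular purchase-order layout and will need adjusting for yours:

```python
import re

def extract_fields(text):
    """Rule-based extraction. Assumes labels like 'Order ID: PO-12345',
    'Supplier Name: ...', and 'Total Amount: $1,234.56'."""
    fields = {}
    m = re.search(r'Order\s*ID[:\s]+([A-Z0-9-]+)', text, re.IGNORECASE)
    fields['order_id'] = m.group(1) if m else None
    m = re.search(r'Supplier\s*Name[:\s]+(.+)', text, re.IGNORECASE)
    fields['supplier_name'] = m.group(1).strip() if m else None
    m = re.search(r'Total(?:\s*Amount)?[:\s]+\$?([\d,]+\.\d{2})', text, re.IGNORECASE)
    fields['total'] = m.group(1).replace(',', '') if m else None
    return fields
```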

Test with your PDF and adjust regex patterns. This approach works well for consistent layouts but fails if the format changes.

Step 3: Build the LLM-Based Extractor with Ollama and LLaMA 3

Create llm_extractor.py. Read the PDF text as before (or use OCR output). Then pass it to Ollama:

import json

import ollama

prompt = """You are a B2B document parser. Extract fields: Order ID, Supplier Name, Line Items (as list), Total. Output only JSON.
Document:
{ocr_text}
""".format(ocr_text=text)

response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': prompt}])
result = json.loads(response['message']['content'])  # fails if the model adds extra prose

This method is layout-agnostic and handles variations naturally. However, it requires running a local LLM and may be slower. You can also tweak the prompt to enforce schema.


Step 4: Compare Outputs and Handle Failures

Run both scripts on the same document. Compare extracted JSON:
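A small helper makes the field-by-field comparison explicit (the function and key names here are assumptions, matching the JSON shape used earlier):

```python
def diff_results(rule_result, llm_result):
    """Return the fields on which the two extractors disagree."""
    diffs = {}
    for key in set(rule_result) | set(llm_result):
        if rule_result.get(key) != llm_result.get(key):
            diffs[key] = {'rules': rule_result.get(key), 'llm': llm_result.get(key)}
    return diffs
```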

For failures, enhance rules with fallback patterns, or improve LLM prompt by providing examples. Consider using both in a hybrid pipeline where LLM acts as a backup.

Step 5: Optimize for Your Use Case

For production, measure accuracy, speed, and maintenance overhead. Rule-based extraction is fast and cheap but brittle. LLM-based extraction offers flexibility but typically needs a GPU for acceptable speed, plus careful prompt engineering.

You can also combine them: try rules first, then fall back to the LLM whenever the rules' confidence score drops below a threshold (say 90%).
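A sketch of that hybrid pipeline; the confidence heuristic (the share of expected fields the rules actually filled) and all function names are assumptions:

```python
EXPECTED_FIELDS = ('order_id', 'supplier_name', 'line_items', 'total')

def rule_confidence(result):
    """Crude confidence score: fraction of expected fields the rules filled."""
    found = sum(1 for f in EXPECTED_FIELDS if result.get(f) is not None)
    return found / len(EXPECTED_FIELDS)

def extract_hybrid(text, rules_fn, llm_fn, threshold=0.9):
    """Run the cheap rule extractor first; fall back to the LLM
    only when too many fields come back empty."""
    result = rules_fn(text)
    if rule_confidence(result) < threshold:
        result = llm_fn(text)
    return result
```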

Tips for Success

- Keep your regex patterns in one place so layout changes are easy to patch.
- Validate the LLM's JSON output against your expected schema before using it downstream.
- Benchmark both extractors on a representative sample of real documents, not just one.

By building the same extractor twice, you gain practical insight into trade-offs and can make an informed choice for your B2B document processing needs.
