Have you ever had trouble extracting data from PDF files in your environment? invoice2data is a modular Python library that helps with this process. It supports the data mining step in which usable data is extracted from a larger batch of raw data.


A brief overview of how invoice2data works:

  • Firstly, it extracts the text from the PDF file, using tools such as pdftotext, pdfminer or tesseract OCR.
  • Secondly, it searches that text for the regexes you have written in the YAML-based template system.
  • Lastly, it saves the result as JSON, CSV or XML, or renames the PDF to match the content.
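
The steps above can be sketched with the standard library alone. This is only an illustration of the idea, not invoice2data's actual implementation: the sample text stands in for the output of a PDF-to-text tool, and `TEMPLATE` and `extract_fields` are hypothetical names made up for this example.

```python
import json
import re

# Hypothetical text, standing in for what a PDF-to-text tool would return.
SAMPLE_TEXT = """The XYZ Company
INVOICE: INV12345678
TOTAL: 2,912.00
"""

# Hypothetical mini-template: keywords select the template, fields hold the regexes.
TEMPLATE = {
    "issuer": "The XYZ Company",
    "keywords": ["The XYZ Company"],
    "fields": {
        "invoice_number": r"INVOICE:\s+(\w{3}\d{1,8})",
        "amount": r"TOTAL:\s+(\d+,\d+\.\d\d)",
    },
}


def extract_fields(text, template):
    """Return the captured fields, or None when the template does not apply."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    result = {"issuer": template["issuer"]}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        if match:
            # Keep the first capture group as the field value.
            result[name] = match.group(1)
    return result


fields = extract_fields(SAMPLE_TEXT, TEMPLATE)
# Final step: save or print the result, here as JSON.
print(json.dumps(fields, indent=2))
```

The real library also converts types (for example, amounts to numbers and dates to datetime objects), which this sketch leaves out.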

With the flexible template system, you can do the following:

  • Use prebuilt plugins to match line items and tables.
  • Match the content of PDF files precisely.
  • Define static fields that are the same for every invoice.
  • Define custom fields needed in your organisation.
  • Define a regex for the currency in your PDF.
  • Have multiple regexes for similar fields.
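
To make two of these points concrete, here is a minimal sketch of static fields and of multiple regexes for one field, written in plain Python rather than as a template. The field names, the patterns and the helpers `match_first` and `apply_fields` are assumptions chosen for illustration; they are not part of invoice2data.

```python
import re

# Hypothetical static field: the same value for every invoice of this issuer.
STATIC_FIELDS = {"vat_country": "US"}

# Hypothetical patterns: two wordings of the same field, tried in order.
FIELD_PATTERNS = {
    "invoice_number": [
        r"INVOICE:\s+(\w+)",
        r"Invoice No\.?\s+(\w+)",
    ],
}


def match_first(patterns, text):
    """Return the first capture group of the first pattern that matches."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


def apply_fields(text):
    """Combine the static fields with the regex-based fields."""
    result = dict(STATIC_FIELDS)
    for name, patterns in FIELD_PATTERNS.items():
        result[name] = match_first(patterns, text)
    return result


old_layout = apply_fields("INVOICE: INV001")
new_layout = apply_fields("Invoice No. INV002")
print(old_layout)
print(new_layout)
```

Trying the patterns in order means a change in the invoice wording only requires adding one more regex, not a second template.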

Installation and Setup:

  • Installing Pdftotext
pip install pdftotext
  • Installing Poppler
## Using pip:
pip install python-poppler
## Using Anaconda:
conda install -c conda-forge poppler

Get the latest version of Poppler if possible. Poppler packages are available for several platforms, including macOS (via Homebrew), Debian and Ubuntu. Without Poppler, pdftotext won’t read the PDF correctly.

  • Installing Invoice2data
pip install invoice2data
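
After installing, a quick check like the one below can confirm that the pieces are visible from Python. It is a sketch that assumes the `pdftotext` command-line tool shipped with Poppler is what needs to be on your PATH; `check_setup` is a hypothetical helper name, not part of any of the packages.

```python
import importlib.util
import shutil


def check_setup():
    """Report whether the pdftotext binary and the invoice2data package are found."""
    return {
        # Assumption: Poppler's pdftotext executable should be reachable on PATH.
        "pdftotext_binary": shutil.which("pdftotext") is not None,
        # find_spec returns None when the package is not installed.
        "invoice2data_package": importlib.util.find_spec("invoice2data") is not None,
    }


status = check_setup()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```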

My experience working with Python invoice2data:

  • Sample Invoice
  • YAML Template
# -*- coding: utf-8 -*-
issuer: The XYZ Company
keywords:
- The XYZ Company
- US Supplier 123
fields:
    amount: TOTAL:\s+(\d+,\d+\.\d\d)
    date: Invoice Date:\s+(\d{1,2}\/\d{1,2}\/\d{4})
    delivery_date: Delivery Date:\s+(\d{1,2}\/\d{1,2}\/\d{4})
    invoice_number: INVOICE:\s+(\w{3}\d{1,8})
    sales_order: Sales Order:\s+(\w{2}\d+)
tables:
    -   start: Line\s+Product\s+Description\s+Quantity
        end: Prices
        body: (?P<Line>^\d{2})\s+(?P<Product>\w{2}\-\w{2}\-\w+\-\w+)\s+(?P<Description>\w+\s\w+\s\w+\-\w+)\s+(?P<Quantity>\d+)

options:
    remove_whitespace: false
    currency: USD
    date_formats:
        - '%d/%m/%Y'
    languages:
        - en
    decimal_separator: '.'
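  • Code
from invoice2data import extract_data
from invoice2data.extract.loader import read_templates

# Load every YAML template found in the Template/ folder
templates = read_templates('Template/')
# Extract the fields from the invoice using the first template that matches
result = extract_data('Invoice/invoice.pdf', templates=templates)
print(result)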
  • Output
{'issuer': 'The XYZ Company', 'amount': 2912.0, 'date': datetime.datetime(2019, 9, 22, 0, 0), 'delivery_date': datetime.datetime(2019, 5, 10, 0, 0), 'invoice_number': 'INV12345678', 'sales_order': 'SO99999999', 'currency': 'USD', 'Line': '10', 'Product': 'MM-QM-E118-09', 'Description': 'Testing Frame E108-01', 'Quantity': '2', 'desc': 'Invoice from The XYZ Company'}
  • Project Directory Structure
Python invoice2data project directory structure
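
The result shown in the output above contains `datetime` objects, which the standard `json` module cannot serialize on its own. If you want to store the dictionary as JSON yourself, a small converter is enough. This is a sketch using a shortened copy of that result; `encode_datetime` is a hypothetical helper name.

```python
import datetime
import json

# A shortened copy of the result shown in the output above.
result = {
    "issuer": "The XYZ Company",
    "amount": 2912.0,
    "date": datetime.datetime(2019, 9, 22, 0, 0),
    "invoice_number": "INV12345678",
    "currency": "USD",
}


def encode_datetime(value):
    """Convert date and datetime values to ISO 8601 strings; reject anything else."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# json.dumps calls encode_datetime only for values it cannot handle itself.
as_json = json.dumps(result, default=encode_datetime, indent=2)
print(as_json)
```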

Remember to write the regexes correctly, or the YAML template will not be read properly. Watch this video for basic regex parsing of an invoice using invoice2data.
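
One way to catch regex mistakes early is to try each pattern against a piece of invoice text with Python's `re` module before putting it into the template. The patterns below are copied from the template above. The sample text is hypothetical: it was reconstructed from the values in the output, so the layout of your real invoice may differ.

```python
import re

# Hypothetical invoice text, reconstructed from the output values shown earlier.
SAMPLE_TEXT = """INVOICE: INV12345678
Invoice Date: 22/09/2019
Delivery Date: 10/05/2019
Sales Order: SO99999999
Line  Product        Description            Quantity
10    MM-QM-E118-09  Testing Frame E108-01  2
Prices
TOTAL: 2,912.00
"""

# Field patterns copied from the template.
FIELD_PATTERNS = {
    "amount": r"TOTAL:\s+(\d+,\d+\.\d\d)",
    "date": r"Invoice Date:\s+(\d{1,2}\/\d{1,2}\/\d{4})",
    "delivery_date": r"Delivery Date:\s+(\d{1,2}\/\d{1,2}\/\d{4})",
    "invoice_number": r"INVOICE:\s+(\w{3}\d{1,8})",
    "sales_order": r"Sales Order:\s+(\w{2}\d+)",
}

# Table body pattern copied from the template.
BODY_PATTERN = (
    r"(?P<Line>^\d{2})\s+(?P<Product>\w{2}\-\w{2}\-\w+\-\w+)\s+"
    r"(?P<Description>\w+\s\w+\s\w+\-\w+)\s+(?P<Quantity>\d+)"
)


def check_patterns(text):
    """Map each field name to its captured value, or None if the regex fails."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        found[name] = match.group(1) if match else None
    return found


fields = check_patterns(SAMPLE_TEXT)
# re.MULTILINE lets ^ match at the start of each line, not only the whole text.
row = re.search(BODY_PATTERN, SAMPLE_TEXT, re.MULTILINE)

for name, value in fields.items():
    print(f"{name}: {value}")
print(row.groupdict() if row else "table body did not match")
```

Any field that prints `None` points to a regex that needs another look before it goes into the template.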


We strongly suggest using invoice2data for invoice processing, as long as the layout of your invoices does not change. You can add your own custom plugins for your template, and define the values that stay constant across your invoice PDFs to suit your requirements. There is a catch: writing the regexes for an invoice template takes some skill. But don’t worry, with that video we have got you covered.

“Where there is data smoke, there is business fire.”

Thomas Redman