Portable Document Format (PDF) files are everywhere, and importing a popular Python package like pdf2image, pdftotext, or popplerqt5 is a common approach to dealing with them. Unfortunately, unless you are working on a Linux machine, these packages often return errors because they rely on Poppler.
Never heard of Poppler?
Poppler is a utility for rendering PDFs. It is common on Linux systems, but not on Windows. So, naturally, if we want to use Poppler and the packages that depend on it, we need to bridge the gap.
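To see whether a machine is missing Poppler, you can look for its command-line tools on the PATH. The sketch below assumes that `pdftoppm` and `pdfinfo` are the tools your package calls (pdf2image relies on both); the helper name `missing_poppler_tools` is made up for this example.

```python
import platform
import shutil

# Assumption: these are the Poppler command-line tools your PDF package needs.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print(f"{platform.system()}: missing Poppler tools: {', '.join(missing)}")
    else:
        print(f"{platform.system()}: Poppler tools found")
```

On a plain Windows install this will most likely report both tools as missing, which is the gap the rest of this article closes.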
Let’s visit Google and see what our options are…
A quick Google (Stack Overflow) search reveals that many other people are having this problem and are still looking for solutions.
Poppler and Python’s PDF-libraries, which leverage Linux-utilities, don’t play well with Windows.
When we look for solutions, many of them are outdated, ineffective, or too difficult to follow.
Of the proposed solutions, one appears to work well: Windows Subsystem for Linux (WSL).
In fact, WSL is powerful enough to be a great solution for other problems that require Linux tools on a Windows machine.
Windows Subsystem for Linux is a compatibility layer for running Linux binary executables natively on Windows 10. It recently entered version two (WSL 2), which introduced a real Linux kernel. To put it plainly, WSL makes it feel like you’re working on a real Linux machine (and you are).
In this section, we will install and set up WSL in five short steps. Afterwards, we will install and set up Poppler in a few more.
Run Windows PowerShell as an administrator.
Enable WSL by executing the ‘Enable-WindowsOptionalFeature’ command:
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
Activate the changes by restarting your computer.
Note that Microsoft says, “This reboot is required in order to ensure that WSL can initiate a trusted execution environment.”
Now that you’re back from the restart, your system’s WSL is enabled and you are ready to install a Linux distribution.
Go to the Windows Store and search for WSL.
[Figure: Getting WSL from the Windows Store]
Click Ubuntu and choose to install. Note that mine is already installed, so you will have to use some imagination here.
Enter WSL through a terminal, like this one in VS Code. Notice that once you enter WSL, the terminal prompt changes. You are now operating within a Linux machine! Exciting!
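If you would rather confirm this from Python than from the prompt, one common heuristic is to look for "microsoft" in the kernel release string. This is a sketch of that heuristic, not an official API; the function name `looks_like_wsl` is made up here, and the release strings in the comments are only illustrative examples.

```python
import platform


def looks_like_wsl(kernel_release):
    """Heuristic: WSL kernels usually carry 'microsoft' in their release string."""
    return "microsoft" in kernel_release.lower()


if __name__ == "__main__":
    # Illustrative values: "5.10.16.3-microsoft-standard-WSL2" (WSL 2),
    # "4.4.0-19041-Microsoft" (WSL 1), "5.15.0-91-generic" (ordinary Ubuntu).
    release = platform.uname().release
    print(f"kernel release: {release}")
    print(f"running under WSL: {looks_like_wsl(release)}")
```

Passing the release string in as an argument keeps the check easy to try against strings copied from other machines.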
Run the following commands at the WSL prompt. Note that you can skip the steps that deal with Tesseract OCR and pytesseract. These are for the demo project which I share at the end of the article.
    # Author: Matthew E. Miller
    # Date: 1/1/2020
    # Medium: https://medium.com/@matthew_earl_miller (where this is being published)
    # Github: https://github.com/matmill5
    # Linkedin: https://www.linkedin.com/in/matthew-miller-engineer/
    # StackOverflow: https://stackoverflow.com/users/11937169/matthew-e-miller?tab=profile

    # Command 1: Enter Windows Subsystem for Linux
    PS C:\Users\Matthew\Desktop\Project> wsl

    # Command 2: Cleanup
    /mnt/c/Users/Matthew/Desktop/Project$ sudo apt-get clean

    # Command 3: Update
    /mnt/c/Users/Matthew/Desktop/Project$ sudo apt-get update

    # Command 4: Get Python 3 on your WSL
    /mnt/c/Users/Matthew/Desktop/Project$ sudo apt install python3

    # Command 5: Get Python PIP (the Python 3 version, to match Command 4)
    /mnt/c/Users/Matthew/Desktop/Project$ sudo apt install python3-pip

    # Command 6: Get poppler-utils
    /mnt/c/Users/Matthew/Desktop/Project$ sudo apt install poppler-utils

    # Command 7: Get pdf2image (dependent on poppler and inspiration for article)
    /mnt/c/Users/Matthew/Desktop/Project$ pip3 install pdf2image

    # Command 8: Get pathlib (only needed on Python 2; Python 3.4+ ships it in the standard library)
    /mnt/c/Users/Matthew/Desktop/Project$ pip3 install pathlib

    # Command 9: Get pytesseract (if you're doing OCR)
    /mnt/c/Users/Matthew/Desktop/Project$ pip3 install pytesseract

    # Command 10: Get tesseract-ocr (if you're doing OCR)
    /mnt/c/Users/Matthew/Desktop/Project$ sudo apt-get install tesseract-ocr
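Once the installs finish, a quick sanity check is to ask each tool for its version. The sketch below assumes the Poppler tools answer to `-v` and Tesseract to `--version`; the helper name `tool_version` is made up for this example, and it reads both stdout and stderr because tools differ on where they print version text.

```python
import shutil
import subprocess


def tool_version(tool, flag="-v"):
    """Return the first line a tool prints for its version flag, or None if it is not installed."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, flag], capture_output=True, text=True)
    # Some tools print their version to stderr, others to stdout, so check both.
    output = (result.stdout + result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    # Assumed version flags; adjust them if your tool versions behave differently.
    for tool, flag in (("pdftoppm", "-v"), ("pdfinfo", "-v"), ("tesseract", "--version")):
        version = tool_version(tool, flag)
        print(f"{tool}: {version if version is not None else 'NOT FOUND'}")
```

If any line says NOT FOUND, rerun the matching install command before moving on.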
Run a program with your newly acquired, ready-to-use Poppler utilities.
I’ve created this demo script, which you can use if you don’t have your own. You will still need a PDF to experiment with.
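Your PDF probably lives on the Windows side. As the prompts in the command listing show, C:\Users\Matthew\Desktop\Project appears inside WSL as /mnt/c/Users/Matthew/Desktop/Project. Here is a minimal sketch of that mapping, assuming the default /mnt/<drive letter> mount point (it can be configured differently); the function name `windows_to_wsl_path` is made up for this example.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl_path(windows_path):
    """Map an absolute Windows path such as C:\\Users\\me to /mnt/c/Users/me."""
    path = PureWindowsPath(windows_path)
    drive = path.drive.rstrip(":").lower()
    # Only plain drive-letter paths are handled; network shares are rejected.
    if not path.is_absolute() or len(drive) != 1:
        raise ValueError(f"expected an absolute drive-letter path, got {windows_path!r}")
    # parts[0] is the drive root (for example 'C:\\'), so skip it.
    return str(PurePosixPath("/mnt", drive, *path.parts[1:]))


if __name__ == "__main__":
    print(windows_to_wsl_path(r"C:\Users\Matthew\Desktop\Project"))
```

WSL also ships a `wslpath` command that performs this conversion for you at the prompt, so treat the function as an illustration of the rule rather than a replacement for it.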
    # Tesseract OCR
    import io
    import os

    import pytesseract
    from pdf2image import convert_from_path

    # If you need to assign tesseract to path
    # pytesseract.pytesseract.tesseract_cmd = r'C:\Users\Matthew\AppData\Local\Tesseract-OCR\tesseract.exe'

    pdf_path = 'pdfs/A Production Implementation of an Associative Arran Processor -STARAN - Rudolph.pdf'
    output_filename = "results.txt"

    # Render every page of the PDF as an image (this is the step that needs Poppler)
    pages = convert_from_path(pdf_path)
    pg_cntr = 1

    # Page images go in images/<first 20 characters of the PDF name>/
    sub_dir = str("images/" + pdf_path.split('/')[-1].replace('.pdf','')[0:20] + "/")
    if not os.path.exists(sub_dir):
        os.makedirs(sub_dir)

    for page in pages:
        # Only process the first 20 pages
        if pg_cntr <= 20:
            filename = "pg_"+str(pg_cntr)+'_'+pdf_path.split('/')[-1].replace('.pdf','.jpg')
            page.save(sub_dir+filename)
            # Append this page's OCR text to the results file
            with io.open(output_filename, 'a+', encoding='utf8') as f:
                f.write("======================================================== PAGE " + str(pg_cntr) + " ========================================================\n")
                f.write(pytesseract.image_to_string(sub_dir+filename)+"\n")
                f.write("======================================================== ========================= ========================================================\n")
            pg_cntr = pg_cntr + 1
This code works by converting each page of a PDF to a JPG image. Then, it runs OCR on each image and writes the results to an output file.
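If you want to reuse the idea in your own project, the same convert-then-OCR flow can be split into small functions. This is a sketch under a few assumptions, not the script above: the names `page_image_path` and `ocr_pdf` are made up, it uses the `first_page` and `last_page` parameters of `convert_from_path` to limit the page count, and it hands the page image directly to `pytesseract.image_to_string` instead of rereading the saved JPG.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_root="images"):
    """Build images/<first 20 chars of the PDF name>/pg_<n>_<PDF name>.jpg."""
    stem = Path(pdf_path).stem
    return Path(out_root) / stem[:20] / f"pg_{page_number}_{stem}.jpg"


def ocr_pdf(pdf_path, max_pages=20):
    """Yield (page_number, text) for the first `max_pages` pages of a PDF."""
    # Imported here so the path helper above works even without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, first_page=1, last_page=max_pages)
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number)
        target.parent.mkdir(parents=True, exist_ok=True)
        page.save(target)  # keep a JPG copy of the page
        yield number, pytesseract.image_to_string(page)


if __name__ == "__main__":
    print(page_image_path("pdfs/report.pdf", 3))
```

Asking Poppler for only the pages you need avoids rendering a long PDF in full, and a generator lets you write each page's text out as soon as it is ready.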
That’s it. You are certified Poppler-On-Windows.
Enjoy the spoils of war! You have gained some seriously new and powerful skills. You are well on your way to becoming a more flexible developer (if you aren’t already).
Newly Acquired Skills:
It’s so important to experiment with these new skills and solidify your understanding. True understanding comes with experience.
I built an OCR application to help document the historical work of emeritus professor and famous computer scientist, Dr. Kenneth E. Batcher. It uses a PDF-to-image tool for JPEG conversion. Then, it runs OCR on each image and writes the results to an output file. Since this proof of concept works well enough, it will eventually be used on document scans instead of PDFs.