How to Extract and Clean Data From PDF Files in R

Charles Bordet, Sep 2, 2017

Extracting data from PDF files can be cumbersome. Do you need to extract the right data from a list of PDF files but right now you’re stuck? If yes, you’ve come to the right place.

Note: This article deals with PDF documents that are machine-readable. If that’s not your case, I recommend you use Adobe Acrobat Pro, which will convert them automatically for you. Then, come back here.

In this article, you will learn:
- How to extract the content of a PDF file in R (two techniques)
- How to clean the raw document so that you can isolate the data you want

After explaining the tools I’m using, I will show you a couple of examples so that you can easily replicate them on your own problem.

Why PDF files?

When I started to work as a freelance data scientist, I did several jobs that consisted only of extracting data from PDF files. My clients usually had two options: either do it manually (or hire someone to do it), or try to find a way to automate it. The first option becomes really tedious and costly as the number of files increases, so they turned to the second, and I helped them with it.

For example, a client had thousands of invoices that all had the same structure and wanted to get important data from them:
- the number of sold items,
- the profits made at each transaction,
- the data about his customers.

Having everything in PDF files isn’t handy at all. Instead, he wanted a clean spreadsheet where he could easily find who bought what and when, and make calculations from it.

Another classic example is when you want to do data analysis on reports or official documents. You will usually find those saved as PDF files rather than freely accessible on webpages. Similarly, I needed to extract thousands of speeches made at the U.N. General Assembly.

So, how do you even get started?

Two techniques to extract raw text from PDF files

Use pdftools::pdf_text

The first technique requires you to install the pdftools package from CRAN. A quick glance at the documentation will show you the few functions of the package, the most important of which is pdf_text.

For this article, I will use an official record from the UN that you can find on this link.

This function will directly export the raw text into a character vector, one element per page, with spaces to show the white space and \n to show the line breaks. Having a full page in one element of a vector is not the most practical. Using strsplit will help you separate the lines from each other.

If you want to know more about the functions of the pdftools package, I recommend you read Introducing pdftools – A fast and portable PDF extractor, written by the package’s author himself.

Use the tm package

tm is the go-to package when it comes to doing text mining/analysis in R. For our problem, it will help us import a PDF document in R while keeping its structure intact. Plus, it makes it ready for any text analysis you want to do later.

The readPDF function from the tm package doesn’t actually read a PDF file like pdf_text from the previous example did. Instead, it will help you create your own reading function, the benefit being that you can choose whatever PDF extraction engine you want.

By default, it will use xpdf, available at http://www.xpdfreader.com/download.html

You have to:
- Download the archive from the website (under the Xpdf tools section).
- Unzip it.
- Make sure it is in the PATH of your computer.

Then, you can create your PDF extracting function. The control argument enables you to set up parameters as you would write them on the command line. With the -layout option, think of the function as running pdftotext -layout (the Xpdf text extraction tool) in the shell.

Then, you’re ready to import the PDF document. Notice the difference from the output of the first method: new empty lines appear, corresponding more closely to the layout of the document.
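As a minimal sketch of the pdftools technique described above: the helper name split_pages_into_lines, the made-up sample text, and the file name un_record.pdf are all assumptions for illustration, not part of the original article. The sample vector stands in for what pdf_text returns (one string per page), so the splitting step can be tried even without a PDF at hand. The real extraction only runs if pdftools is installed and the file exists.

```r
# One-time install from CRAN (uncomment if needed):
# install.packages("pdftools")

# Split each page (one string per page, as pdf_text() returns them)
# into a character vector of lines.
split_pages_into_lines <- function(pages) {
  strsplit(pages, "\n", fixed = TRUE)
}

# Stand-in for pdf_text() output: two pages of made-up text.
sample_pages <- c(
  "United Nations\n   General Assembly\nOfficial Records",
  "Page two, line one\nPage two, line two"
)
sample_lines <- split_pages_into_lines(sample_pages)
print(sample_lines[[1]])

# Hypothetical local file; replace it with the PDF you want to read.
pdf_path <- "un_record.pdf"
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_path)) {
  pages <- pdftools::pdf_text(pdf_path)           # one element per page
  lines_by_page <- split_pages_into_lines(pages)  # one list entry per page
  print(head(lines_by_page[[1]]))
}
```

The result is a list with one character vector per page, which keeps the page boundaries available for the cleaning steps that follow.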

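The tm technique described above could look roughly like the sketch below. The file name un_record.pdf and the variable names are assumptions for illustration. Recent tm releases list pdftools first among the available engines, so the sketch passes engine = "xpdf" explicitly rather than relying on the default. Everything is guarded, so the code only calls tm when the package, the pdftotext tool, and the file are all available.

```r
# Options handed to the extraction engine, written as on the command line.
pdf_control <- list(text = "-layout")

# Hypothetical local file; replace it with the PDF you want to read.
pdf_path <- "un_record.pdf"

# Only run when tm is installed, pdftotext is on the PATH, and the file exists.
tm_ready <- requireNamespace("tm", quietly = TRUE) &&
  nzchar(Sys.which("pdftotext")) &&
  file.exists(pdf_path)

if (tm_ready) {
  library(tm)

  # readPDF() returns a reader function; it does not read the file itself.
  read_pdf <- readPDF(engine = "xpdf", control = pdf_control)

  # mode = "" leaves the actual reading of the file to the reader function.
  corpus <- VCorpus(URISource(pdf_path, mode = ""),
                    readerControl = list(reader = read_pdf))

  # The document content is a character vector of extracted lines.
  doc_lines <- content(corpus[[1]])
  print(head(doc_lines))
}
```

Keeping the options in a separate list makes it easy to try other pdftotext flags without touching the rest of the code.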