Thursday, February 10, 2011

How-To: Turn a .pdf to plaintext using Google Docs (even if it's an image)

Every once in a while, I'll receive a large set of documents that I need to quickly read and categorize. Some day I hope to use NLP for those categorizations, but I still have much to learn. One document format that I always struggle to convert on Mac OS X with Python is .pdf. But not anymore...

Last year, Google Docs introduced optical character recognition (OCR) for uploaded files. Using a tiny bit of Python, I was able to upload a document and pull it back down as a plain text file. Here's how.

Step 1:

Install the gdata Python libraries.

Step 2:

Create pdf2txt.py:

import os.path
import gdata.data
import gdata.docs.client
import sys

if __name__ == "__main__":
    # open the pdf file named on the command line (binary mode, since a PDF is not text)
    f = open(sys.argv[1], 'rb')

    # setup your google docs client
    client = gdata.docs.client.DocsClient(source='pdf2txt')
    client.ssl = True  # Force all API requests through HTTPS

    user = 'YOURUSERNAME@gmail.xxx'
    password = 'TE$T'

    # login to Google Docs
    client.ClientLogin(user, password, client.source)

    # create the media source object for upload    
    ms = gdata.data.MediaSource(file_handle=f, content_type="application/pdf", content_length=os.path.getsize(f.name))        

    # upload your pdf (ocr=true asks Google Docs to run OCR on it)
    entry = client.Upload(ms, f.name, folder_or_uri="https://docs.google.com/feeds/default/private/full?ocr=true")

    # get the file as text (the ext sets the format, can also be .doc)
    client.Export(entry, f.name + ".txt")

Step 3:

Run your new script:
> python pdf2txt.py yourpdf_file.pdf
This creates a new text file in the directory you ran Python from.

Step 4:

Check out your file:
yourpdf_file.pdf.txt

Use my code at your own risk. Feel free to submit even better code that uses getopt for command-line args.
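
In that spirit, here is a minimal sketch of what the command-line handling could look like. It uses argparse from the standard library instead of getopt. The parse_args helper and the --format and --output options are my own assumptions for illustration and are not part of the script above. The default output name follows the same "input name plus extension" rule the script uses.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments for a pdf2txt-style script.

    Hypothetical helper: the option names are illustrative assumptions.
    """
    parser = argparse.ArgumentParser(
        description="Convert a PDF to text via Google Docs OCR.")
    parser.add_argument("pdf", help="path to the PDF file to upload")
    parser.add_argument("-f", "--format", default="txt",
                        choices=["txt", "doc"],
                        help="export format, taken from the file extension")
    parser.add_argument("-o", "--output", default=None,
                        help="output path (default: <pdf>.<format>)")
    args = parser.parse_args(argv)

    # Mirror the original script: append the extension to the input name.
    if args.output is None:
        args.output = args.pdf + "." + args.format
    return args


# Example with an explicit argument list; pass nothing to read sys.argv.
example = parse_args(["yourpdf_file.pdf"])
print("%s -> %s" % (example.pdf, example.output))
```

In the script, sys.argv[1] would become args.pdf and the name passed to client.Export would become args.output.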

1 comment:

  1. You sir, are my hero! I cannot explain how long I've been looking for a solution like this!
