Thursday, February 10, 2011

How-To: Turn a .pdf to plaintext using Google Docs (even if it's an image)

Every once in a while, I receive a large set of documents that I need to read and categorize quickly. Some day I hope to use NLP for those categorizations, but I still have much to learn. One document format I always struggle to convert on Mac OS X with Python is .pdf. But not anymore...

Last year, Google Docs introduced optical character recognition (OCR). Using a tiny bit of Python, I was able to upload a document and pull it back down as a plain text file. Here's how.

Step 1:

Install the gdata Python libraries.

Step 2:

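import os.path
import sys

import gdata
import gdata.docs.service

if __name__ == "__main__":
    # read in the pdf file (binary mode, since a pdf is not text)
    f = open(sys.argv[1], 'rb')

    # setup your google docs client
    # NOTE: the constructor was garbled in the post as published; DocsService
    # is assumed here because it provides ClientLogin, Upload, and Export
    client = gdata.docs.service.DocsService(source='pdf2txt')
    client.ssl = True  # Force all API requests through HTTPS

    user = ''  # your Google account login
    password = 'TE$T'

    # login to Google Docs
    client.ClientLogin(user, password, client.source)

    # create the media source object for upload
    ms = gdata.MediaSource(file_handle=f, content_type="application/pdf",
                           content_length=os.path.getsize(f.name))

    # upload your pdf
    # NOTE: the upload URI is truncated in the post as published, so it is
    # left empty here; supply the Docs feed URI that turns on OCR
    upload_uri = ""
    entry = client.Upload(ms, f.name, folder_or_uri=upload_uri)
    f.close()

    # get the file as text (the ext sets the format, can also be .doc)
    client.Export(entry, f.name + ".txt")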
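
The export step above picks the output format from the extension on the file name. Here is a minimal sketch of a helper that builds that name. The names `export_path` and `ALLOWED_EXTS` are hypothetical (they are not part of gdata), and the allowed set only lists the two extensions mentioned in the comment above, so treat it as an assumption rather than a full list of export formats.

```python
import os.path

# Extensions named in the post; an assumption, not a complete list of
# what Google Docs can export.
ALLOWED_EXTS = (".txt", ".doc")


def export_path(pdf_path, ext=".txt"):
    """Return the local file name to export to: the pdf path plus ext."""
    if ext not in ALLOWED_EXTS:
        raise ValueError("unsupported export extension: %r" % (ext,))
    return pdf_path + ext


if __name__ == "__main__":
    # Build the export name for a text export and for a .doc export.
    print(export_path("yourpdf_file.pdf"))
    print(export_path("yourpdf_file.pdf", ".doc"))
```

Keeping the extension check in one place means a typo in the extension fails locally, before anything is uploaded.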

Step 3:

Run your new script:
> python pdf2txt yourpdf_file.pdf
This creates a file in the directory you ran Python from, named:

Step 4:

Check out your file:

Use my code at your own risk, and feel free to submit even better code that uses getopt for command-line args.
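
As a starting point for that suggestion, here is a minimal sketch of command-line parsing. It uses the standard-library argparse module instead of getopt-style parsing, mainly because argparse generates the usage text on its own. The `parse_args` helper and the `--ext` option are hypothetical and not part of the script above.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line for a pdf-to-text script.

    argv defaults to sys.argv[1:] when None, which is argparse's own behavior.
    """
    parser = argparse.ArgumentParser(
        description="Upload a pdf to Google Docs and export it as text.")
    parser.add_argument("pdf", help="path to the pdf file to convert")
    parser.add_argument("--ext", default=".txt", choices=[".txt", ".doc"],
                        help="extension of the exported file")
    return parser.parse_args(argv)


# Example with an explicit argument list, so it runs without a real command line.
args = parse_args(["yourpdf_file.pdf", "--ext", ".doc"])
print("%s -> %s" % (args.pdf, args.pdf + args.ext))
```

In the real script, calling `parse_args()` with no arguments would read the command line, and `args.pdf` would take the place of `sys.argv[1]`.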
