Thursday, February 10, 2011

How-To: Turn a .pdf to plaintext using Google Docs (even if it's an image)

Every once in a while, I'll receive a large set of documents that I need to quickly read and categorize. Some day I hope to use NLP for those categorizations, but I still have much to learn. One document format that I always struggle to convert on Mac OS X with Python is .pdf. But not anymore...

Last year, Google Docs introduced the ability to do optical character recognition (OCR). Using a tiny bit of Python, I was able to upload a document and pull it back down as a plain text file. Here's how.

Step 1:

Install the gdata Python libraries.
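
If you want to confirm the install worked before going any further, a quick check like the sketch below should do. The helper name have_module is made up for illustration; all it does is try the import and report whether it succeeded.

```python
def have_module(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


# 'gdata' is the package the script in Step 2 imports.
if have_module("gdata"):
    print("gdata is installed")
else:
    print("gdata is missing, install it before running pdf2txt.py")
```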

Step 2:

Create pdf2txt.py:

import os.path
import gdata.data
import gdata.docs.client
import sys

if __name__ == "__main__":
    # read in the pdf file (binary mode, since a PDF is not text)
    f = open(sys.argv[1], 'rb')

    # setup your google docs client
    client = gdata.docs.client.DocsClient(source='pdf2txt')
    client.ssl = True  # Force all API requests through HTTPS

    user = 'YOURUSERNAME@gmail.xxx'
    password = 'TE$T'

    # login to Google Docs
    client.ClientLogin(user, password, client.source)

    # create the media source object for upload
    ms = gdata.data.MediaSource(file_handle=f, content_type="application/pdf", content_length=os.path.getsize(f.name))

    # upload your pdf
    entry = client.Upload(ms, f.name, folder_or_uri="https://docs.google.com/feeds/default/private/full?ocr=true")

    # get the file as text (the ext sets the format, can also be .doc)
    client.Export(entry, f.name + ".txt")

Step 3:

Run your new script:
> python pdf2txt.py yourpdf_file.pdf
This will create a new text file in the directory you ran Python from.

Step 4:

Check out your file:
yourpdf_file.pdf.txt
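
The output name is just the input name with the export extension tacked on, which is why you end up with .pdf.txt. Here is a small sketch of that naming rule. The function export_name is a hypothetical helper, not part of gdata; it only mirrors the f.name + ".txt" expression from the script.

```python
def export_name(pdf_path, ext=".txt"):
    """Build the export file name the same way the script does:
    the original path with the export extension appended."""
    return pdf_path + ext


print(export_name("yourpdf_file.pdf"))          # plain text export
print(export_name("yourpdf_file.pdf", ".doc"))  # the script's comment says .doc also works
```

If you would rather drop the .pdf part, os.path.splitext(pdf_path)[0] + ext would give yourpdf_file.txt instead.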

Use my code at your own risk. Feel free to submit even better code that uses getopt for command-line args.
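
As a starting point for that, here is a rough sketch of the argument handling. It uses argparse from the standard library rather than getopt, and the option names (--user, --format) are my own assumptions, not anything the script above defines. It only parses arguments; wiring the result into the upload code is left out.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line for a pdf2txt-style script.

    The flags here are hypothetical; adjust them to taste.
    """
    parser = argparse.ArgumentParser(
        description="Convert a PDF to text via Google Docs OCR.")
    parser.add_argument("pdf", help="path to the PDF file to upload")
    parser.add_argument("-u", "--user", default=None,
                        help="Google account user name")
    parser.add_argument("-f", "--format", default="txt",
                        choices=["txt", "doc"],
                        help="export format (sets the output extension)")
    return parser.parse_args(argv)


# Example: parse an explicit argument list instead of sys.argv.
args = parse_args(["yourpdf_file.pdf", "--user", "YOURUSERNAME"])
print(args.pdf, args.user, args.format)
```

The password is deliberately not a command-line option, since arguments can show up in shell history; prompting with getpass.getpass() at run time is one way around that.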

16 comments:

  1. You sir, are my hero! I cannot explain how long I've been looking for a solution like this!
