Pete Miron: How-To: Turn a .pdf to plaintext using Google Docs (even if it's an image)

Thursday, February 10, 2011

Every once in a while, I'll receive a large set of documents that I need to quickly read and categorize. Some day I hope to use NLP for those categorizations, but I still have much to learn. One document format that I always struggle to convert on Mac OS X with Python is .pdf. But not anymore...

Last year, Google Docs introduced optical character recognition (OCR). Using a tiny bit of Python, I was able to upload a document and pull it back down as a plain text file. Here's how.

Step 1:

Install the gdata Python libraries.
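
A quick way to confirm the install worked is to try the same imports the script uses. This is only a sketch; the helper name gdata_available is made up for illustration and is not part of gdata.

```python
from __future__ import print_function


def gdata_available():
    """Return True if the gdata modules used by pdf2txt.py can be imported."""
    try:
        import gdata.data  # noqa: F401
        import gdata.docs.client  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    if gdata_available():
        print("gdata is installed")
    else:
        print("gdata is missing; install it before running pdf2txt.py")
```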

Step 2:

Create pdf2txt.py:

import os.path
import gdata.data
import gdata.docs.client
import sys

if __name__ == "__main__":
    # open the pdf file in binary mode so the bytes upload unchanged
    f = open(sys.argv[1], 'rb')

    # set up your Google Docs client
    client = gdata.docs.client.DocsClient(source='pdf2txt')
    client.ssl = True  # Force all API requests through HTTPS

    user = 'YOURUSERNAME@gmail.xxx'
    password = 'TE$T'

    # login to Google Docs
    client.ClientLogin(user, password, client.source)

    # create the media source object for upload    
    ms = gdata.data.MediaSource(file_handle=f, content_type="application/pdf", content_length=os.path.getsize(f.name))        

    # upload your pdf
    entry = client.Upload(ms, f.name, folder_or_uri="https://docs.google.com/feeds/default/private/full?ocr=true")

    # get the file as text (the ext sets the format, can also be .doc)
    client.Export(entry, f.name + ".txt")
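
Hardcoding a password in the script makes it easy to leak. One alternative, sketched here, reads the login from environment variables and falls back to a prompt. The variable names GDOCS_USER and GDOCS_PASSWORD and the helper get_credentials are assumptions for this example, not something gdata defines.

```python
import getpass
import os


def get_credentials(environ=None, prompt=getpass.getpass):
    """Return (user, password) from the environment, prompting if needed.

    GDOCS_USER and GDOCS_PASSWORD are hypothetical variable names.
    """
    environ = os.environ if environ is None else environ
    user = environ.get("GDOCS_USER")
    if not user:
        raise ValueError("set GDOCS_USER to your Google account name")
    password = environ.get("GDOCS_PASSWORD")
    if not password:
        # Prompt without echoing the password to the terminal.
        password = prompt("Password for %s: " % user)
    return user, password
```

In pdf2txt.py, a line such as user, password = get_credentials() could stand in for the two hardcoded assignments before the ClientLogin call.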

Step 3:

Run your new script:

> python pdf2txt.py yourpdf_file.pdf

This will create a new text file in the directory you ran python from, named after the input file with .txt appended.

Step 4:

Check out your file:

yourpdf_file.pdf.txt
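
To sanity-check the OCR result without opening an editor, a few lines of Python can print the start of the exported file. This sketch assumes the export is UTF-8 text, possibly with a byte-order mark; the function name preview is made up here.

```python
from __future__ import print_function

import io


def preview(path, max_lines=10):
    """Return the first max_lines lines of a text file as a list of strings."""
    lines = []
    # utf-8-sig also reads plain UTF-8 and drops a leading BOM if present.
    with io.open(path, encoding="utf-8-sig") as handle:
        for number, line in enumerate(handle):
            if number >= max_lines:
                break
            lines.append(line.rstrip("\r\n"))
    return lines
```

Calling preview("yourpdf_file.pdf.txt") and printing each returned line shows whether the text came through in readable shape.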

Use my code at your own risk. Feel free to submit even better code that uses getopt for command-line args.
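
As a starting point for that, here is a rough sketch of argument parsing with the standard getopt module. The -f/--format and -o/--output flags and the parse_args name are my own invention for this example; the script above only takes the pdf file name. Options have to come before the file name, since getopt stops at the first non-option argument.

```python
import getopt


def parse_args(argv):
    """Parse pdf2txt.py arguments and return (pdf_path, output_path).

    -f/--format picks the export extension (txt by default).
    -o/--output overrides the output file name.
    Both flags are hypothetical additions to the original script.
    """
    opts, args = getopt.getopt(argv, "f:o:", ["format=", "output="])
    fmt, output = "txt", None
    for flag, value in opts:
        if flag in ("-f", "--format"):
            fmt = value.lstrip(".")
        elif flag in ("-o", "--output"):
            output = value
    if len(args) != 1:
        raise getopt.GetoptError("expected exactly one pdf file")
    pdf_path = args[0]
    if output is None:
        # Same naming as the original script: input name plus extension.
        output = pdf_path + "." + fmt
    return pdf_path, output
```

In the script, pdf_path, out = parse_args(sys.argv[1:]) would feed open(pdf_path, 'rb') and client.Export(entry, out).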

16 comments:

Anonymous, April 23, 2011 at 12:02 PM
You sir, are my hero! I cannot explain how long I've been looking for a solution like this!