Last year, Google Docs introduced optical character recognition (OCR). Using a tiny bit of Python, I was able to upload a document and pull it back down as a plain text file. Here's how.
Step 1: Install the gdata Python libraries
Step 2: Create pdf2txt.py
import os.path
import sys

import gdata.data
import gdata.docs.client

if __name__ == "__main__":
    # read in the pdf file named by the first command-line argument
    # (binary mode, so the bytes are uploaded unchanged)
    f = open(sys.argv[1], 'rb')

    # setup your google docs client
    client = gdata.docs.client.DocsClient(source='pdf2txt')
    client.ssl = True  # Force all API requests through HTTPS

    user = 'YOURUSERNAME@gmail.xxx'
    password = 'TE$T'

    # login to Google Docs
    client.ClientLogin(user, password, client.source)

    # create the media source object for upload
    ms = gdata.data.MediaSource(file_handle=f, content_type="application/pdf", content_length=os.path.getsize(f.name))

    # upload your pdf
    entry = client.Upload(ms, f.name, folder_or_uri="https://docs.google.com/feeds/default/private/full?ocr=true")

    # get the file as text (the ext sets the format, can also be .doc)
    client.Export(entry, f.name + ".txt")
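Since the script hands whatever file it is given straight to the upload, it may be worth checking the input first. The following is only a sketch and not part of the original script: looks_like_pdf is a hypothetical helper name, and it relies solely on the fact that PDF files begin with the bytes %PDF-.

```python
import os.path


def looks_like_pdf(path):
    """Return True if path is an existing file that starts with the PDF magic bytes.

    Hypothetical helper, not part of the original pdf2txt.py.
    """
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as handle:
        # Every PDF starts with "%PDF-" followed by the version number.
        return handle.read(5) == b"%PDF-"
```

One way to use it would be to call looks_like_pdf(sys.argv[1]) before opening the file and exit with a message if it returns False.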
Step 3: Run your new script:
> python pdf2txt.py yourpdf_file.pdf
This creates a file in the directory you ran Python from, named after the input with ".txt" appended (for example, yourpdf_file.pdf.txt).
Step 4: Check out your file:
Use my code at your own risk. Feel free to submit even better code that uses getopts() for command-line args.
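As a starting point for that, here is a rough sketch of the argument handling, assuming "getopts()" refers to Python's standard getopt module. The -f/--format and -u/--user options and the parse_args name are made up for illustration; the script above takes only the PDF path.

```python
import getopt


def parse_args(argv):
    """Parse pdf2txt-style arguments into (pdf_path, fmt, user).

    The -f/--format and -u/--user options are hypothetical; the original
    script only accepts the PDF path.
    """
    fmt = "txt"   # export extension, e.g. "txt" or "doc"
    user = None   # account name; None means "not given on the command line"
    try:
        opts, args = getopt.getopt(argv, "f:u:h", ["format=", "user=", "help"])
    except getopt.GetoptError as err:
        raise SystemExit("pdf2txt: %s" % err)
    for opt, value in opts:
        if opt in ("-h", "--help"):
            raise SystemExit("usage: pdf2txt.py [-f txt|doc] [-u USER] file.pdf")
        elif opt in ("-f", "--format"):
            fmt = value
        elif opt in ("-u", "--user"):
            user = value
    if len(args) != 1:
        raise SystemExit("pdf2txt: expected exactly one PDF file")
    return args[0], fmt, user
```

In the script, this would be called as parse_args(sys.argv[1:]), with the returned format used in place of the hard-coded ".txt" in the Export call.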