Last year, Google Docs introduced the ability to do optical character recognition (OCR). Using a tiny bit of Python, I was able to upload a PDF and pull it back down as a plain text file. Here's how.
Step 1:
install gdata python libraries
Step 2:
create pdf2txt.py
import os.path
import gdata.data
import gdata.docs.client
import sys
if __name__ == "__main__":
    # read in the pdf file (binary mode, since a PDF is not text)
    f = open(sys.argv[1], "rb")
    # setup your google docs client
    client = gdata.docs.client.DocsClient(source='pdf2txt')
    client.ssl = True  # Force all API requests through HTTPS
    user = 'YOURUSERNAME@gmail.xxx'
    password = 'TE$T'
    # login to Google Docs
    client.ClientLogin(user, password, client.source)
    # create the media source object for upload
    ms = gdata.data.MediaSource(file_handle=f, content_type="application/pdf", content_length=os.path.getsize(f.name))
    # upload your pdf (ocr=true asks Google Docs to run OCR on it)
    entry = client.Upload(ms, f.name, folder_or_uri="https://docs.google.com/feeds/default/private/full?ocr=true")
    # get the file as text (the ext sets the format, can also be .doc)
    client.Export(entry, f.name + ".txt")
    f.close()
Step 3:
Run your new script:
> python pdf2txt.py yourpdf_file.pdf
This will create a text file named after your PDF, with .txt added to the end.
Step 4:
check out your file:
yourpdf_file.pdf.txt
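If the doubled extension bothers you, here is a small sketch of how the output name comes together. The script simply appends the export extension to the input path, which is why you end up with yourpdf_file.pdf.txt. The two helper names below (export_name and export_name_replacing) are made up for illustration and are not part of gdata; the second one is just one possible alternative that swaps the extension instead of appending it.

```python
import os.path


def export_name(pdf_path, ext="txt"):
    """Mirror what the script does: append the export extension to the input path."""
    return pdf_path + "." + ext


def export_name_replacing(pdf_path, ext="txt"):
    """Hypothetical alternative: replace the original extension instead of appending."""
    root, _old_ext = os.path.splitext(pdf_path)
    return root + "." + ext


print(export_name("yourpdf_file.pdf"))
print(export_name_replacing("yourpdf_file.pdf"))
```

Either result could be passed as the second argument to client.Export in place of f.name + ".txt".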
Use my code at your own risk. Feel free to submit even better code that uses getopt for command line args.
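To get that started, here is a minimal sketch of what the argument handling might look like with Python's standard getopt module. The -f/--format option and the parse_args function are my own assumptions for illustration; the script above only takes a single PDF path. The idea is to let the caller pick the export extension (txt or doc) instead of hard-coding it.

```python
import getopt


def parse_args(argv):
    """Parse '[-f FORMAT] PDF_PATH' and return (pdf_path, export_format).

    The -f/--format option is a hypothetical addition, not part of the
    original script.
    """
    export_format = "txt"  # default matches the original script
    opts, args = getopt.getopt(argv, "f:", ["format="])
    for opt, value in opts:
        if opt in ("-f", "--format"):
            export_format = value
    if len(args) != 1:
        raise SystemExit("usage: pdf2txt.py [-f txt|doc] yourpdf_file.pdf")
    return args[0], export_format


# Example with an explicit argument list; in the script you would
# pass sys.argv[1:] instead.
print(parse_args(["-f", "doc", "yourpdf_file.pdf"]))
```

In the script, the returned path would replace sys.argv[1] and the returned format would replace the hard-coded ".txt" in the client.Export call.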
You sir, are my hero! I cannot explain how long I've been looking for a solution like this!