Monday, February 14, 2011

3 things you forgot about when adding a new language to your runtime environment.

Many engineers seek to use the right tool for the job (and they should). Most tech managers get a bit freaked out at the idea of endless language proliferation (who cares?). To help your tech managers be a little less freaked out and ensure you never ever again get stuck in a J2EE stack, there are 3 things that you need to keep in mind when adding Python, Ruby, Haskell or Lisp to your runtime environment.

  1. The MySQL bindings will suck.
  2. The initial appserver config will be wrong.
  3. The exception handling and reporting will be useless.
We'll miss you too Duke.

The MySQL bindings will suck.

I'm 0 for 3 (Java, Ruby, Python) in implementing a new language and having anything sort of like smart idle connection timeout handling. I've needed to implement wrappers in Java (hopefully this isn't still the case), patch the Ruby MySQL gems and have tried to figure out the one right Python wrapper for ensuring my app doesn't need a restart once connections are timed out by the server. These are all solvable. Just make sure you test for these issues and resolve them prior to production deployment.

The initial appserver config will be wrong.

Most languages run in some sort of appserver, typically that's wsgi or some mod_ apache plug-in. But, some appservers have strong language preferences - esp. in the Java world. Make sure you understand how your new language is going to run in a new appserver - multithreaded or multiple processes? how many threads or processes do you need? are database connections pooled?

The exception handling and reporting will be useless.

On our team, at Knewton, we email all engineers exceptions and stack traces. We've needed to implement this for both Ruby and Python. For both, we needed to add in important information for debugging - request params, headers, along with a usable stack trace. One other common issue we've encountered is that all libraries for connecting to external resources (curl) have useless default errors. Most of these libraries do not tell you explicitly what they were trying to connect to when the exception has occurred - "Destination Unreachable" is not useful; which destination (hostname, ip, port)? We've patched curb in Ruby and are currently patching Python to let us know what we were unable to connect to.

These are the most common issues I've seen when adding a new language to an environment. Are there any I'm missing (or conveniently forgetting about)? If you can think of any, drop me a comment below.

Thursday, February 10, 2011

How-To: Turn a .pdf to plaintext using Google Docs (even if it's an image)

Every once in awhile, I'll receive a large set of documents that I need to quickly read and categorize. Some day I hope to use NLP for those categorizations, but I still have much to learn. One document format that I always struggle with converting on Mac OS X with Python is .pdf. But, not anymore...

Last year, google docs introduced the ability to do optical character recognition (OCR). Using a tiny bit of Python, I was able to upload a document and pull it back down as a plain text file. Here's how.

Step 1:

install gdata python libraries

Step 2:

create pdf2txt.py
import os.path
import gdata.data
import gdata.docs.client
import sys

if __name__ == "__main__":
    # read in the pdf file
    f = open(sys.argv[1])

    # setup your google docs client
    client = gdata.docs.client.DocsClient(source='pdf2txt')
    client.ssl = True  # Force all API requests through HTTPS

    user = 'YOURUSERNAME@gmail.xxx'
    password = 'TE$T'

    # login to Google Docs
    client.ClientLogin(user, password, client.source)

    # create the media source object for upload    
    ms = gdata.data.MediaSource(file_handle=f, content_type="application/pdf", content_length=os.path.getsize(f.name))        

    # upload your pdf
    entry = client.Upload(ms, f.name, folder_or_uri="https://docs.google.com
/feeds/default/private/full?ocr=true")

    # get the file as text (the ext sets the format, can also be .doc)
    client.Export(entry, f.name + ".txt")

Step 3:

Run your new script:
> python pdf2txt yourpdf_file.pdf
this will add a file to the directory you ran python from and create a file named:

Step 4:

check out your file:
yourpdf_file.pdf.txt

Use my code at your own risk, feel free to submit even better code that uses getopts() for command line args.