Pete Miron: 2010

Saturday, December 11, 2010

There's no rollback?

Preventing bugs before production is really, really hard. Your code moves from the warm, safe, controlled environment of dev and testing to the cold, harsh reality of production. Since you know some of your deployments are going to fail, and you can't predict which ones, you always need to be able to roll back.

Last night, a vendor completed a hardware migration that hosed a critical component of one of my applications. I woke up this morning to a flurry of crisis emails. My response was simple: "rollback." To my surprise, rolling back was not an option. How in the world can any vendor, with customers, release a change where rolling back is not an option? The outage lasted over 3 hours - for those keeping score at home, one of those outages moves your app from 99.99% reliability to slightly better than 99.95%. For a portion of our students, it meant a rescheduled class, which really sucks.
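For anyone who wants to check the scorekeeping, here is a rough sketch of the arithmetic. The function names are mine, and it assumes the 3-hour outage is the only downtime in a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumes a non-leap year


def availability_pct(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Percentage of the period the app was up."""
    return 100.0 * (1 - downtime_hours / period_hours)


def downtime_budget_hours(target_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime you can afford and still hit target_pct."""
    return period_hours * (100.0 - target_pct) / 100.0


print(round(availability_pct(3), 3))           # one 3-hour outage in a year
print(round(downtime_budget_hours(99.99), 2))  # yearly budget at 99.99%
print(round(downtime_budget_hours(99.95), 2))  # yearly budget at 99.95%
```

Under those assumptions a single 3-hour outage works out to roughly 99.966% for the year. The whole yearly budget at 99.99% is under an hour, so one outage like this blows through it several times over.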

Rollbacks require a little bit of additional planning and sometimes additional work to support, but they are always, always, always worth the additional expense. When a bug occurs, you can't possibly know how long it will take to fix; just ask Tumblr.

your software without rollbacks.

Here are some basic tips for ensuring rollback:

Ask yourself or your engineers, "How do we roll back?"
Make sure your app and database changes are backwards compatible.
Deploy and validate database (and any other infrastructure) changes prior to any user-facing app changes.
Deploy and validate one app at a time.
Script all of your deployments and rollbacks. We have our own in-house built deployment system, but chef and puppet seem to be popular tools for creating these scripts.
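To make the "script it" tip concrete, here is a minimal sketch of a deploy step that always knows how to undo itself. It is not our in-house system: the `Deployer` class and the `activate` and `validate` hooks are hypothetical stand-ins for whatever your own scripts do to switch versions and check health.

```python
from typing import Callable, List, Optional


class Deployer:
    """Tracks deployed versions so the previous one can be restored."""

    def __init__(self, activate: Callable[[str], None]):
        self._activate = activate      # hypothetical hook that makes a version live
        self._history: List[str] = []  # oldest first; the last entry is live

    @property
    def live(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def deploy(self, version: str, validate: Callable[[str], bool]) -> bool:
        """Activate `version`; roll back automatically if validation fails."""
        self._activate(version)
        self._history.append(version)
        if validate(version):
            return True
        self.rollback()
        return False

    def rollback(self) -> str:
        """Re-activate the previous version and return its name."""
        if len(self._history) < 2:
            # A very first release has nothing to fall back to.
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        previous = self._history[-1]
        self._activate(previous)
        return previous
```

The point of the sketch is that rollback is one call with no arguments. If you have to think about what to type during an outage, you will type it wrong.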

You can check out a reminder I wrote to myself about how to identify when to rollback here.

There is exactly one type of change (that I've encountered) that is actually hard to roll back from: public DNS changes. So try really hard to not screw those up, or set your TTLs really low.

On behalf of all of your future customers, please make sure you have a way to roll back your software.

Tuesday, November 9, 2010

Start hiring interns, today!

If you're an engineering manager at ~~a startup~~ any company anywhere, you need to start hiring interns immediately. Here are the most common reasons I hear for not hiring interns:

I don't have the time to manage/mentor them

You have some set of work you need to get done, probably more work than you have people to do it. The question you need to answer is whether the work that interns will help get done is worth the time you invest in them. I'd say only about 1/4 of the interns I've hired need more management or mentoring than any other new hire we bring on. Usually, this is a problem with confidence and can quickly be remedied by pointing the intern toward the right places to find solutions to their problems before asking for help. This does not mean that you do not need to structure projects for the interns, but certainly not more so than for any of your other engineers. The overall complexity of the projects you give them may be lower than the problems you give your most senior engineers, but the projects themselves should not require more structure.

I don't have time to sift through all those resumes

This is a problem common to all hiring and an extension of the first excuse. Set up a simple set of criteria to strongly filter your set of resumes. We look for interns with a website, high GPAs, and some set of achievements and extracurriculars. Read Joel Spolsky's sorting resumes article for more on how to do this well for any job. When using the right process, reviewing resumes takes way less time than you think it will.

I need people with more experience

You probably need some people with more experience, but more likely than not, you have a lot of things you do every day that could be effectively completed by an intern. Furthermore, I am consistently floored by the experience level that most interns come in with. They have grown up with technology starting at younger ages than all of the more senior engineers have. They can't remember a time before computers. They know more than you might think. More importantly, if you're getting interns from great schools with great programs in their junior or senior years, they are likely learning skills and techniques that aren't even in common practice yet.

Once you've accepted that these 3 reasons for not hiring interns are fallacies, here's what you need to do:

Choose which schools you want to focus on. The more selective the school, the earlier you'll need to get started.
Post your internship position with those schools' career centers. The ideal time to do this is in September, but I've hired great interns as late as March/April. If you're in NYC, you need to apply with hackny and NYC Turing Fellows. At Knewton, we managed to land a hackny fellow last year. Check out Knewton's very own Stuart Partin at http://hackny.org/a/2010/08/video-of-the-hackny-summer-2010-demofest/. This summer will be our first sponsoring a NYC Turing Fellow.
If you're interviewing remotely, use a remote screen sharing tool to do the rough equivalent of a face-to-face interview. Dimdim and WebEx are great tools for this. These tools allow you to speak with the internship candidate and whiteboard in real-time - Skype with Etherpad works great, too.
Pay your interns! You will get real value from them, so you should give them something back beyond experience.
Get them releasing production-level code early. This is probably the most important process to demystify for a new engineer. Most have never even heard of a release or deployment procedure. Show them it's not that hard.
If they're really good, hire them full-time.

Once you've hired your interns, here is a sampling of the type of things you can expect based on what I've seen our interns accomplish:

Significant free trial conversion improvements.
My first hire at Knewton (from a previous company internship).
Faster loading web pages, thanks to some smart SQL tuning and rewriting and a pretty kickass performance report.
More user-friendly administrative interfaces.
Expanded functional testing coverage through Selenium.
Chat integration with Trac.
Movement of large sections of code to a single build process.
User research.
iPhone app specs.

At the end of the day, the excuses most people come up with for not hiring interns are easily outweighed by the production you can expect. Even when paying interns, the value outweighs the cost. If you get really good at hiring interns, you will not only get greater output from your team, you will also create an incredibly valuable recruiting channel.

Please feel free to drop me any questions or comments below. If you're interested in interning at Knewton, email your resume to techjobs@knewton.com.

Thursday, September 16, 2010

How to Scale Backend Infrastructure

Earlier this week, I participated in a talk at Hunch on Scaling Backend Infrastructure with Tom Pinckney (Hunch), Kiril Sheynkman (Thansys), and Jeff Hammerbacher (Cloudera).

How to scale inside the request loop

The majority of web applications have a user sitting behind a web browser who lands on a site, clicks on a button and expects something to happen - quickly. As more users visit your site, they compete for scarce resources - CPU cycles, RAM, hard disk access, and bandwidth. Your goal is to add more resources to support more users - here's how.

Build a system that can be scaled.

Make sure your hardware is clone-able.

If you lose one machine, you need to be able to build a new machine that exactly matches that old machine from OS to configuration.
Common tools for configuration: puppet, chef, or some bash scripting.

Be daring in your language/framework choice.

Choose a language that optimizes for your ability to quickly develop software and your ability to acquire more developers who can program in that language.
Language choice impacts per-request performance.
Common Tools: Python, Ruby, Scala, Groovy, Clojure.

Be conservative in infrastructure choice.

Prefer meat-and-potatoes over next-new-thing, assuming a similar price and available required features on the meat-and-potatoes option.
Make sure you have internal knowledge/capability, active community support, or commercial support options if you go with the next-new-thing.
Use infrastructure others have proven can scale.
NoSQL can be made to work. But, make sure you have good support.
Common Tools: MySQL, Apache Web Server.

Separate user data from reference data.

Enable eventual scaling by keeping users on separate databases.

Make sure it performs. Know when it doesn't.

Measure your application performance.

Measuring machine performance is not enough.
Know how long it takes each component in your system to respond. From MySQL explain plans to queue depth, to each webservice call, to end-user experience, know how long each component of a single request takes.
Most application monitoring and measurement requires some bit of custom coding.
Common tools: MySQL explain, Nagios, Cacti, AppFirst (looks promising, but I haven't tested it yet).
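The "custom coding" for per-component timing can be quite small. Here is a minimal sketch, assuming you can wrap each step of a request in a context manager. `RequestTimer` and the component names are made up for illustration, and a real setup would ship the numbers to your monitoring tool rather than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RequestTimer:
    """Records how long each named component of a request takes."""

    def __init__(self):
        self.durations = defaultdict(list)  # component name -> seconds per call

    @contextmanager
    def measure(self, component):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the wrapped step raised.
            self.durations[component].append(time.perf_counter() - start)

    def slowest(self):
        """Name of the component with the largest total time so far."""
        totals = {name: sum(times) for name, times in self.durations.items()}
        return max(totals, key=totals.get)


timer = RequestTimer()
with timer.measure("database"):
    time.sleep(0.02)  # stand-in for a real query
with timer.measure("render"):
    pass              # stand-in for template rendering
print(timer.slowest())
```

Once every request carries numbers like these, "the site is slow" turns into "this one query is slow", which is a problem you can actually fix.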

Performance test to find obvious bottlenecks and config flaws.

Analyze real-world scenarios to design performance testing. Look at what users actually do on the site to develop your performance test plans. Identify the most common paths or the most frequently accessed pages.
Design performance tests with product/user experience folks to make sure the tests capture how they expect users to use your site.
Design your tests to validate horizontal scalability. Run your tests with one of everything. Add machines. What happens to your performance?
Common tools: Selenium Grid, Grinder, BrowserMob, Sauce Labs.
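One way to answer "what happens to your performance?" is to turn each test run into a single scaling number. This is only a sketch: the function name is mine and the requests-per-second figures are made up, so plug in whatever your load tool reports.

```python
def scaling_efficiency(baseline_rps, scaled_rps, machines):
    """1.0 means perfectly linear scaling; lower means diminishing returns."""
    if baseline_rps <= 0 or machines < 1:
        raise ValueError("need a positive baseline and at least one machine")
    return scaled_rps / (baseline_rps * machines)


# Made-up results: machines in the tier under test -> requests per second.
runs = {1: 100.0, 2: 190.0, 4: 300.0}

for machines, rps in sorted(runs.items()):
    efficiency = scaling_efficiency(runs[1], rps, machines)
    print(f"{machines} machine(s): efficiency {efficiency:.2f}")
```

If the number falls off quickly as you add machines, something shared (usually the database) is the real bottleneck, and adding more app servers won't help.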

Address performance problems.

Check your configurations against recommended configurations of your infrastructure.
Typically, out of the box, you are not giving your database enough memory for the hardware it's running on.
Startup times for app server worker threads are often pretty slow. Try to start several at initial server startup and make sure they stay running as long as possible.

The database is usually the focal point of most performance problems. Here are a few suggestions to help db performance:

Add indexes.
Tom recommended limiting joins. I find you can get better overall performance by limiting queries to the database.
ORM is fine for initially building your product, but once you're looking to scale, you need to replace ORM-generated queries (like those from Rails' ActiveRecord) with hand-written, optimized SQL.
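Here is a small sketch of what "adding indexes" and "limiting queries" can look like, using Python's built-in sqlite3 so it runs anywhere. The `enrollments` table and its data are invented for the example; the same idea applies to MySQL.

```python
import sqlite3


def build_demo_db():
    """In-memory database with a hypothetical enrollments table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE enrollments (user_id INTEGER, course TEXT);
        CREATE INDEX idx_enrollments_user ON enrollments (user_id);
    """)
    conn.executemany(
        "INSERT INTO enrollments VALUES (?, ?)",
        [(1, "algebra"), (1, "geometry"), (2, "algebra"), (3, "verbal")],
    )
    return conn


def courses_one_query_per_user(conn, user_ids):
    """The pattern to avoid: one round trip to the database per user."""
    result = {}
    for uid in user_ids:
        rows = conn.execute(
            "SELECT course FROM enrollments WHERE user_id = ? ORDER BY course",
            (uid,),
        ).fetchall()
        result[uid] = [course for (course,) in rows]
    return result


def courses_single_query(conn, user_ids):
    """Same answer from a single round trip, served by the index."""
    if not user_ids:
        return {}
    placeholders = ",".join("?" for _ in user_ids)
    rows = conn.execute(
        "SELECT user_id, course FROM enrollments "
        f"WHERE user_id IN ({placeholders}) ORDER BY user_id, course",
        list(user_ids),
    ).fetchall()
    result = {uid: [] for uid in user_ids}
    for uid, course in rows:
        result[uid].append(course)
    return result
```

Both functions return the same dictionary. The difference is that the first one costs you a query per user, which is exactly the kind of thing an ORM will happily do for you inside a loop.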

Limit access to shared resources.

Decouple reads and writes with queues.

Reading and writing from the same logical or physical hard disk is a slow, expensive process. But you need to store your data somewhere.

MySQL slaving is essentially a cheap, easy to use queuing mechanism.
You can scale MySQL slaves to insanely large sizes. I've seen 90-way slaving.
There are queues available for most languages.
Common tools: beanstalkd, ActiveMQ, Mule.
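As a rough illustration of decoupling writes from the request loop, here is an in-process sketch built on Python's standard `queue` and `threading` modules. In production you would more likely put one of the queue servers above between your processes; the `WriteBehind` class and its `persist` callback are hypothetical names for this example only.

```python
import queue
import threading


class WriteBehind:
    """Accepts writes immediately and persists them on a background thread."""

    _STOP = object()  # sentinel telling the worker to exit

    def __init__(self, persist):
        self._persist = persist  # hypothetical slow write, e.g. a DB insert
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, record):
        """Called inside the request loop: enqueue and return right away."""
        self._queue.put(record)

    def _drain(self):
        # Single worker, so records are persisted in the order submitted.
        while True:
            record = self._queue.get()
            if record is self._STOP:
                break
            self._persist(record)

    def close(self):
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(self._STOP)
        self._worker.join()
```

The request returns as soon as `submit` does, and the slow write happens later. The tradeoff is that a read right after a write may not see it yet, and this sketch does nothing about a `persist` call that fails, so treat it as a starting point.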

Limit contention for hard disk access.

Keep as much of your database and indexes in RAM as possible.

Pin users to separate databases.

To limit the amount of reading and writing that needs to be done to any one database server, you can design your system to limit the amount of data stored on that server in a couple of common ways.

Hash by user id for user-specific database.
Provision users to a user-specific database, with a shared master index of databases for users.
Once you've done that, you can add more databases as you add more users.
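The two approaches above can be sketched in a few lines. Everything here is illustrative: the database names are placeholders, and a real master index would live in its own small, heavily cached database rather than in a Python dict.

```python
import hashlib


def shard_by_hash(user_id, shards):
    """Option 1: derive the database from a stable hash of the user id.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used to keep the mapping stable.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


class ShardDirectory:
    """Option 2: a master index that records which database owns each user."""

    def __init__(self, shards):
        self._counts = {name: 0 for name in shards}  # users per database
        self._owner = {}                             # user id -> database

    def add_shard(self, name):
        self._counts.setdefault(name, 0)

    def shard_for(self, user_id):
        if user_id not in self._owner:
            # Provision new users onto the least-loaded database.
            target = min(self._counts, key=self._counts.get)
            self._owner[user_id] = target
            self._counts[target] += 1
        return self._owner[user_id]
```

One design note: with plain modulo hashing, changing the number of databases remaps most existing users, so adding capacity means moving data. The directory costs an extra lookup per request, but existing users stay where they are when you add a database.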

Notes on scaling outside the request loop - analytics

These are notes from Jeff's talk. However, you might be better off checking out his book: Beautiful Data: The Stories Behind Elegant Data Solutions. The general idea is that you cannot run your analysis on the same systems and infrastructure you're using to run the site.

You can probably get by with MySQL for a bit.
Extract, Transform, and Load (ETL) will always take more time (both developer time and run time) than you think it should.
The easier it is to analyze the data, the more requests you'll get to analyze it.
SQL is not a standard. (I'd generalize to no software standard is standard.)
Hadoop (MapReduce) can parallelize data analysis across multiple machines to speed up analysis.
Powerful analytics can produce powerful features. (Think "people you might know" in Facebook.)
Use analytics infrastructure to precompute high-read information. (Think Google's entire web index.)

Please send me your comments on this post. This is really just a start at cataloging some of these points for others to use.

Monday, June 7, 2010

All failed deployments are anachronisms.

Your code doesn't care what day it was released on. If there is an extended outage or degradation as a result of a deployment, the code is always in the wrong place at the wrong time. Here are all the wrong days to deploy a broken release:

Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday

and the times:

One O'Clock
Two O'Clock
...

and obviously a bunch of holidays and unrelated business synchronizing events (Black Friday, sales deadlines, etc.) that you also shouldn't deploy on.

However, if you really believe this, you should stop writing, managing or using software NOW! Unfortunately, you will deploy broken code, because:

"I’m sorry to say so but, sadly, it’s true that Bang-ups and Hang-ups can happen to you."
- Dr. Seuss (Oh, the places you'll go!)

And when you do deploy that broken code, it probably had nothing to do with when it was deployed. In fact no one probably would've noticed if you did just one thing...

ROLLBACK!

Three rules:

Code must always be able to be rolled back.
Rollback must be a single command.
The rules for rollback must be simple, easy-to-follow and aggressive (e.g., a customer call about an issue with the release, an exception related to the release, etc.)... then, just...

ROLLBACK!

Then, figure out what went wrong, how to prevent it from happening again... rinse and repeat... any day... any time.

Thursday, June 3, 2010

Follow your debugging process, stupid. In 10 easy steps.

This post is a reminder to myself. I wasted a lot of my personal time and time from members of my team by not following a simple, repeatable debugging process. The process below won't work 100% of the time, but when it does it will save you hours and stomach lining, and leave you extra gas in the tank for handling problems that are actually hard to replicate and debug.

Verify the input and output creating the defect.
Replicate the defect in your development environment.
Write a unit test that replicates the defect using the same input as the defect and desired output. (You may want to extend this to other permutations of the defect as well.)
Verify the unit test breaks.
Fix the code.
Verify the unit test passes.
Verify that the defect is resolved in your local development environment.
Release to next environment (Prod/QA)
Verify that the defect is resolved in the next environment.
Repeat until defect is fixed for all users.
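Steps 3 through 6 are the ones I'm most tempted to skip, so here is a small sketch of what they look like. The `average` function and its defect (blowing up on an empty list) are invented for the example; the shape of the test is the point.

```python
import unittest


def average(values):
    """Hypothetical function under repair.

    The imagined defect: an empty list raised ZeroDivisionError in
    production. Step 5 ("Fix the code") added the guard below.
    """
    if not values:
        return 0.0
    return sum(values) / len(values)


class AverageRegressionTest(unittest.TestCase):
    def test_empty_list_from_defect_report(self):
        # Step 3: same input as the defect, asserting the desired output.
        self.assertEqual(average([]), 0.0)

    def test_related_permutations(self):
        # Step 3, extended: nearby inputs that could hide the same bug.
        self.assertEqual(average([4]), 4.0)
        self.assertEqual(average([1, 2, 3]), 2.0)
```

Run it with `python -m unittest` before the fix to see it break (step 4) and again after the fix to see it pass (step 6). The test then stays in the suite so the defect can't quietly come back.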

Do not try to skip steps in the process. You will miss something important. You will waste your time or other people's. It's not worth it.

Wednesday, March 31, 2010

Don't be the slowest gazelle

Every morning in Africa, a gazelle wakes up. It knows it must outrun the fastest lion or it will be killed. Every morning in Africa, a lion wakes up. It knows it must run faster than the slowest gazelle, or it will starve. It doesn't matter whether you're a lion or a gazelle -- when the sun comes up, you'd better be running.
-- proverb attributed to Roger Bannister (first man to break the 4 minute mile)

At a startup, there are inevitably moments when you question whether or not you're moving too fast. You see it in defects, regressions, and poor performance, in missed requirements and reimplemented requirements. Now that you have more customers and you're investing more in marketing, you can't afford the mistakes you used to be able to tolerate. Your startup team wants to begin establishing slower, more measured processes for greater predictability and fewer mistakes. You might think about hiring a more "experienced" program manager or project manager; maybe you think about moving to a waterfall process. Although it seems counter-intuitive, you must not slow down... you need to speed up. In his post, Speed up or slow down?, lean startup evangelist Eric Ries refers to this as the speed-up-or-slow-down moment. Eric advocates for speeding up -- "The day-to-day process that startups build should also attempt to maximize speed of learning." Not only do I strongly agree with the need to speed up, I believe you need to speed up regardless of your company's current size, age or market cap. You need to focus on staying lean and agile -- or suffer the fate of the slowest gazelle...

I'd like to use Eric's broadened definition of a startup, with minor modification:

"A startup is a human institution designed to create a ~~new~~ product or service under conditions of extreme uncertainty."
-- Eric Ries

I've struck out new. While this may be a useful modifier for the traditional view of a startup, I strongly believe it is not a precondition for applying many of the lean startup principles. All businesses, regardless of product or service (old or new), are operating under conditions of extreme uncertainty. Over half the companies in the Dow Jones Industrial Average, still a key benchmark for performance of the US Equities Market, were added in just the past 20 years. Even if you think you've moved beyond startup classification, there are lions stalking you -- competitors, regulators, patent trolls, your customers' whims. You need to be faster than all of them.

In 2007, at my previous company Vonage, we were found to be infringing on Verizon's patents. Verizon sought and won an injunction against our use of those patents -- which included technology at the core of Vonage's call processing infrastructure. I will abstain for now from editorializing on the sanity of these rulings, but I wrote about my general sentiments on a similar patent issue here. Vonage managed to appeal and receive a stay of the injunction. This stay set the timeline for designing our patent workarounds. A patent workaround is basically a new way of getting something to work that the patent does not read on. A judge then needs to approve the workaround. If you think involving business owners in designing software is hard, try working with lawyers... Then, drop a judge in as your QA team. Only, if the judge doesn't approve your workarounds, it is basically lights out. Goodbye 2 million customers; farewell $50 million in monthly revenue; goodnight future potential for pivots.

Prior to beginning the patent workarounds, it took months to deploy substantial changes to our call processing system. The planned deployment cycle was 6-8 weeks; 3-4 weeks for QA plus another 3-4 weeks to deploy. Inevitably, between week 6 and week 8, a critical defect would be found (memory leak, call loop, etc), the code would be rolled back and the cycle would begin again.

Our push to get QA and Operations to agree to faster code deploys was initially met with antagonism - "well, if your coders didn't put any bugs in the code, it would get deployed faster." Once our survival as a company was inextricably tied to our ability to modify and deploy our core processing system, that agreement became substantially less complicated. We had already begun to develop code quality testing and automated integration tests. The manager of the team ran to Best Buy for a switch to build a network that the development team could run automated deployments on. We worked with the QA team to define a set of tests, runnable in less than one work day, covering the most frequently used features. Then, we (substantially) automated the deployment and production integration testing of the call processing system.

In around 2 months, we went from taking months to deploy the system to, at the height of the workaround madness, deploying the entire system of ~100 machines in less than 24 hours -- for a system supporting 2 million customers making calls next door, across continents, to the other side of the world and, most importantly, to 911. All systems have bugs and defects, but when you can fix those bugs and redeploy on a daily basis, the impact of any one bug is substantially reduced. Most importantly, we learned what we needed to do to release quickly and often, with lower stress and higher quality thanks to fewer regressions, better automated testing and faster identification of bugs caused by rare or edge cases. By hitting those edge cases early on in the process, we knew we had a stable base to build from if there was a ruling that required any new change.

Being able to "run faster" helped ensure Vonage's future in a very uncertain time. No matter how big your company is, there is no way of knowing where the "lion" is going to come from. Rest assured they're out there; will you be ready to outrun them when they pounce? For a technology company, choosing to speed up improves the likelihood you'll be able to outrun your competitors, react to regulators, work around predatory patent litigation and most importantly react to your customers' needs. Don't be the slowest gazelle.

* As a footnote, Vonage did end up settling with Verizon before the workarounds came to a court decision. Even though we were confident we could quickly modify our code to whatever specs were needed to qualify for a workaround, in an appeals situation you're not guaranteed the judge will accept your workaround. A single day without using our call processing technology would have essentially put Vonage out of business.

Monday, March 1, 2010

On becoming a practicing software engineer

If you're a recent or soon-to-be college grad (or maybe you realized your undergrad degree in Art History ain't gonna pay the bills) and you are passionate about computers and computer programming, here are my tips for becoming a successful practicing software engineer. Many of these things probably aren't the things they taught you in your college programming classes, but all of these are important.

Practice! Practice! Practice! You learn to code by, um, reading and writing code! If you don't have much experience and want to get started, find an open source project you care about and contribute a patch or two. The "View Source" feature of Web Browsers and the open source movement are 2 of your greatest assets when learning to code. Use them! As someone who started his career copying BASIC programs from Compute! and Byte magazines, I can't tell you how great it was to discover the magical "View Source" menu item in Netscape. Open Source projects will teach you about packaging, style guidelines, automated testing, bug tracking, and version control -- while also giving you much needed practice.

Don't forget, your code doesn't *work* until someone else uses it. If you can't work at a startup and need to get code into the hands of users quickly, open source projects are a great way to go. I still recommend going with the startup, though. At Knewton, everyone from summer interns to new full-time engineers ships code (that customers actually use) within their first couple of weeks. I'd like to get this down to the first day.

In your early practicing, make sure to develop really strong habits. I learned most of my habits from Steve McConnell's Code Complete and Kent Beck's eXtreme Programming. Write code others can support if they need to, but try to make it so they don't need to support it :)

I'd also recommend checking out PragDave's Code Kata site to work on solving problems:
http://www.codekata.com

Work at a startup! You will learn more in your first month at a startup than you will in your first year in any other company. The first company I worked for was a 3-person shop in Syracuse, NY. I learned everything from how to become a practicing software engineer, to how to be a customer support person, estimate and bid consulting jobs, write requirements, QA, write user manuals, configure SQL Servers, configure IIS Servers, configure Linux firewalls ... If your first job entails you being handed requirements that you then write code for and hand "over the wall" to QA - run!

Don't just take my word for it. Here's what one of our former interns who recently graduated and landed a sweet job in CO had to say:

"I had NO experience as a coder. You guys gave me a LOT. In fact, more skillz than you can really understand. Things that transferred over beyond Ruby, RUnit, Rails, MySQL unix commands (I'm loving that I actually understand how to use the CLI in my Ubuntu set up btw...), etc - more the ability to take a bunch of instructions I barely understood and google my way/solve my way to a solution."

Chris Dixon also has a couple nice posts on this topic:
Joining a startup is less risky than you think.
Every time an engineer joins Google, a startup dies.

More specifically, work for my startup:
Knewton Jobs

If there's nothing interesting to you there, check out these couple of other sites:
Startuply.com
Jobnob.com

Don't be afraid to make mistakes! or Perfection is for popes and Chinese emperors.
In 1999, I got my first big job at an online brokerage (Datek) in NYC. This was back when everyone, probably including you, traded online and doubled their money daily - basic fundamental laws of economics changed... until they didn't. The day of my first big release, I took down the ability to log in to the production site at market open. This was, as my CTO at the time reminded me, "reeeeaaaaaaalllllly baaaaaddddd." However, we talked through the problems and took the appropriate steps to ensure I couldn't make the same mistake again (it involved improper DB connection handling, lack of performance testing, deployment timing and rollback procedures). If he had fired me, maybe I'd feel differently about making mistakes. Fortunately for me, he understood these types of mistakes will happen -- as long as you're willing to grow from the mistakes and not repeat the same mistake twice.

Don't forget that coding is a creative endeavor; typically, there is no one correct solution. Be prepared to try several.

Reality...zen and the art of "boring" tasks.
You studied sexy problems in school. You know how to solve nine-queens efficiently, find the shortest path in a graph, compute Chebyshev distance in a metric space... but the large part of your day as a software developer will not be spent working on such problems. More than likely, you will spend your day on a far wider range of tasks that are comparatively less interesting in an engineering/problem-solving sense. These lower-level tasks are generally simpler, and yet each decision made in their execution can be evaluated and perhaps improved upon. Design details as small as method signatures, naming conventions, loop constructs or recursion, tail-recursion or not (maybe even that's too sexy!), are both extremely important and regularly overlooked. A large program is built on many lines of code. Each line required prior decisions to produce. An appreciation for these small details will contribute a lot not only to the quality of the program as a whole, but to the education of the coder.

If you're not already in a coding-related field, look for ways to make your current job more efficient through automation. This can be as simple as creating Access databases, Word mail merges and batch files to automate tasks that used to take you hours or days to complete. This is actually how I got my start coding professionally, by building MS Access database apps that made week-long tasks take hours (mass mailings to customers at an HVAC rep and students at the SU Masters of Public Administration program).

Finish!
You don't get any points for effort. You need to finish what you start. Half-written, incomplete code atrophies very, very quickly. If you find yourself starting more than you finish, you need to revisit the scope of your problems. Code against smaller problems, but finish the code!

While not everyone is destined to be a great coder, if you're interested in learning how, I strongly recommend the list above. I'm not certain this is an exhaustive list; I'd love to hear any other suggestions. Good luck, and have fun!

Thursday, February 18, 2010

Continuous Deployment Prereq #1 - Maker's Mark Nagiorb

At Knewton, we've used a build orb to show continuous build success for the past 1.5 years. We're currently working on Continuous Deployment for one of our subsystems. As a prerequisite, we've set up Nagios to monitor system state. If you work at a startup, have customers and don't have Nagios set up yet... start setting it up today. It can easily monitor system state and MySQL (replication monitoring is especially useful), and scripts for new probes are straightforward. For example, we're currently monitoring our Zendesk support ticket queue; you can also synthetically test your web app. Out of the box, Nagios has email and web-based monitoring, and there are also scripts for sending IMs... The most useful of all of our custom extensions, from @devondjones, is the Maker's Mark Nagiorb. Using an Arduino, an LED array and a liquid-etched, empty Maker's Mark bottle, our entire office now knows if all systems are fully functional, or if we're going to need another bottle of whiskey.

All Systems Go!
No warning, critical or unknown Nagios alerts.

Warning!
A new ticket comes in from Zendesk, or MySQL replication falls behind. Also, while a bit harder to show in static photos, the orb will blink once for each open warning status. So, if you have Zendesk tickets and your MySQL replication is behind, it will blink twice every minute.

Critical
Critical messages are similar to warnings, but will flash red instead of yellow. If there are several blinks, it's time to hit the full Maker's Mark bottle :)
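For a sense of how little code a custom probe needs, here is a sketch of the logic behind a ticket-queue check and the orb's blink count. This is not Devon's script. The exit codes are the standard Nagios plugin convention; the thresholds, the function names, and the choice to show UNKNOWN results as warnings are my assumptions.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def check_ticket_queue(open_tickets, warn_at=1, crit_at=10):
    """Return (exit_code, status_line) for a support-queue probe.

    The thresholds are illustrative assumptions, not Knewton's settings.
    """
    if open_tickets is None:
        return UNKNOWN, "TICKETS UNKNOWN - could not read the queue"
    if open_tickets >= crit_at:
        return CRITICAL, f"TICKETS CRITICAL - {open_tickets} open"
    if open_tickets >= warn_at:
        return WARNING, f"TICKETS WARNING - {open_tickets} open"
    return OK, f"TICKETS OK - {open_tickets} open"


def orb_state(exit_codes):
    """Collapse a list of probe results into (state, blinks) for the orb."""
    criticals = exit_codes.count(CRITICAL)
    # Assumption: UNKNOWN results are displayed as warnings.
    warnings = exit_codes.count(WARNING) + exit_codes.count(UNKNOWN)
    if criticals:
        return "critical", criticals
    if warnings:
        return "warning", warnings
    return "ok", 0
```

A real plugin would print the status line and exit with the code, and Nagios takes it from there. The orb only needs the second function: one open ticket warning plus one replication warning gives `("warning", 2)`, which is the two blinks described above.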

DIY
If you're interested in having your own Nagiorb, you can get most of the way there by following Devon's build orb instructable. I'll try to coax him into adding a sample Nagios script in there as well.

Saturday, January 23, 2010

Microsoft does not want your money!

This just in... Microsoft does not want your money! At least for up to 24 hours beginning 10pm PT Monday, January 25th.

As part of our continued efforts to improve the Zune Marketplace service on Xbox LIVE, Zune Marketplace will be undergoing scheduled maintenance for up to 24 hours, starting 10:00 P.M. (PT) on January 25, 2010. During that time, you will be unable to rent or purchase video content.

How is it o.k. for any major web business to go offline for up to 24 hours? How many people would have started using the Zune Marketplace on January 25th that never will now? I know, you're thinking nobody - except maybe this guy:

All joking aside, designing solutions to required downtime is not hard to do. I'm responsible for an online education site that can't have that many more users than the Zune. Yet, my team (which I'm sure is significantly smaller, but better looking than the team working on the Zune Marketplace) managed to release many times per week with only :30 mins of scheduled downtime for the entire year (the :30 mins was avoidable, but would've required ~3 days of work).

Typically, all you need to do is decouple client releases from application server releases from database releases. Then release in reverse order... required db changes go before application server releases go before client releases.
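Here is a minimal sketch of that ordering, with a rollback path in case a tier fails validation partway through. The tier names and the `deploy`, `validate` and `rollback` hooks are placeholders for your own scripts, and the sketch leans on each change being backwards compatible with the tier that has not been released yet.

```python
RELEASE_ORDER = ["database", "app_server", "client"]  # dependencies first


def staged_release(tiers, deploy, validate, rollback):
    """Release tiers in order; undo everything if any tier fails validation.

    Returns (succeeded, tiers_left_deployed).
    """
    completed = []
    for tier in tiers:
        deploy(tier)
        completed.append(tier)
        if not validate(tier):
            # Unwind in reverse so nothing is left depending on a removed change.
            for done in reversed(completed):
                rollback(done)
            return False, []
    return True, completed
```

Because the old app server keeps working against the new database schema, and the old client keeps working against the new app server, users never see the seams between the steps. That is what makes the scheduled maintenance window unnecessary.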

Now, we're starting work on a continuous deployment system so that all deployments happen on check-in. (Don't worry, there will be automated tests.) If you're a Microsoft engineer, read Eric Ries's post on Continuous Deployment in 5 Easy Steps.

Taylor Mali: What teachers make | Video on TED.com

On behalf of my wife, my sister, my aunt, Ms. Mitchell, Doc Isles, Mrs. Drennen, Ms. Gramaglia, Prof. Leuthold, Prof. Swischuk, and all the teachers at Knewton.

What do teachers make ... Teachers make a goddamn difference, now what about you?
