Thursday, September 16, 2010

How to Scale Backend Infrastructure

Earlier this week, I participated in a talk at Hunch on Scaling Backend Infrastructure with Tom Pinckney (Hunch), Kiril Sheynkman (Thansys), and Jeff Hammerbacher (Cloudera).

How to scale inside the request loop
The majority of web applications have a user sitting behind a web browser who lands on a site, clicks on a button and expects something to happen - quickly. As more users visit your site, they compete for scarce resources - CPU cycles, RAM, hard disk access, and bandwidth. Your goal is to add more resources to support more users - here's how.
1. Build a system that can be scaled.
  • Make sure your hardware is clone-able.
    • If you lose one machine, you need to be able to build a new machine that exactly matches the old one, from OS to configuration.
    • Common tools: Puppet, Chef, or some bash scripting (a minimal provisioning sketch follows this list).
  • Be daring in your language/framework choice.
    • Choose a language that optimizes for your ability to develop software quickly and to acquire more developers who can program in that language.
    • Language choice impacts per-request performance.
    • Common tools: Python, Ruby, Scala, Groovy, Clojure.
  • Be conservative in your infrastructure choice.
    • Prefer meat-and-potatoes over next-new-thing, assuming similar cost and that the meat-and-potatoes option has the features you need.
    • If you do go with the next-new-thing, make sure you have internal knowledge, active community support, or commercial support options.
    • Use infrastructure others have proven can scale.
    • NoSQL can be made to work, but make sure you have good support.
    • Common tools: MySQL, Apache Web Server.
  • Separate user data from reference data.
    • Enable eventual scaling by keeping users on separate databases.
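
To make clone-ability concrete, here is a minimal sketch of the kind of idempotent provisioning script you might write if you go the bash/Python route instead of Puppet or Chef. The package names and config paths are hypothetical examples, not a real setup:

    #!/usr/bin/env python
    # Minimal idempotent provisioning sketch (a stand-in for Puppet/Chef).
    # Package names and file paths below are hypothetical examples.
    import filecmp
    import os
    import shutil
    import subprocess

    PACKAGES = ["mysql-server", "apache2"]  # example packages
    CONFIGS = {
        "configs/my.cnf": "/etc/mysql/my.cnf",  # example source -> target
        "configs/httpd.conf": "/etc/apache2/httpd.conf",
    }

    def ensure_packages(packages):
        # apt-get install skips packages that are already present,
        # so re-running this script is safe.
        subprocess.check_call(["apt-get", "install", "-y"] + packages)

    def ensure_configs(configs):
        # Only copy a config file when it is missing or differs.
        for src, dest in configs.items():
            if not os.path.exists(dest) or not filecmp.cmp(src, dest, shallow=False):
                shutil.copy(src, dest)

    if __name__ == "__main__":
        ensure_packages(PACKAGES)
        ensure_configs(CONFIGS)

Run the same script on a fresh box and you get a machine that matches its siblings, which is the whole point of clone-ability.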
2. Make sure it performs. Know when it doesn't.
  • Measure your application performance.
    • Measuring machine performance is not enough.
    • Know how long each component in your system takes to respond. From MySQL explain plans, to queue depth, to each web service call, to the end-user experience, know how long each piece of a single request takes (see the timing sketch after this list).
    • Most application monitoring and measurement requires some custom coding.
    • Common tools: MySQL explain, Nagios, Cacti, AppFirst (looks promising, but I haven't tested it yet).
  • Performance test to find obvious bottlenecks and configuration flaws.
    • Design performance tests from real-world scenarios. Look at what users actually do on the site, and identify the most common paths and the most frequently accessed pages (see the load-test sketch after this list).
    • Design performance tests with product/user-experience folks to make sure you capture how they expect users to use the site.
    • Design your tests to validate horizontal scalability: run your tests with one of everything, then add machines. What happens to your performance?
    • Common tools: Selenium Grid, Grinder, BrowserMob, Sauce Labs.
  • Address performance problems.
    • Check your configurations against the recommended configurations for your infrastructure.
    • Out of the box, you are typically not giving your database enough memory for the hardware it's running on.
    • App server worker threads are often slow to start. Start several at initial server startup and keep them running as long as possible.
    • The database is usually the focal point of most performance problems. A few suggestions to help database performance:
      • Add indexes.
      • Tom recommended limiting joins. I find you can get better overall performance by limiting the number of queries you send to the database.
      • An ORM (like Rails' ActiveRecord) is fine for initially building your product, but once you're looking to scale, you need to replace ORM-generated queries with hand-written, optimized SQL (see the SQL sketch after this list).
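
Since most application-level measurement requires a bit of custom coding, here is a minimal sketch of what that code can look like: a Python decorator that logs how long each component of a request takes. The component name and load_user function are hypothetical:

    import functools
    import logging
    import time

    def timed(component):
        """Log how long each call takes, so per-component latency is visible."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.time()
                try:
                    return func(*args, **kwargs)
                finally:
                    elapsed_ms = (time.time() - start) * 1000
                    logging.info("%s.%s took %.1f ms",
                                 component, func.__name__, elapsed_ms)
            return wrapper
        return decorator

    # Hypothetical usage: wrap each hop a request makes.
    @timed("db")
    def load_user(user_id):
        ...  # query the users table

Wrap the database layer, each web service call, and the template render with the same decorator, and your logs will tell you where a slow request actually spends its time.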
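
The performance-testing tools above will take you further, but a hand-rolled sketch shows the shape of the exercise: replay the most common user path with several concurrent clients and watch what happens to latency as you add machines. The URLs and client count are hypothetical:

    # Minimal load-test sketch: hit the most common pages with
    # concurrent clients and report latency percentiles.
    import threading
    import time
    import urllib.request

    URLS = ["http://localhost:8080/", "http://localhost:8080/item/42"]  # example paths
    CLIENTS = 20
    latencies = []
    lock = threading.Lock()

    def client():
        for url in URLS:
            start = time.time()
            urllib.request.urlopen(url).read()
            with lock:
                latencies.append(time.time() - start)

    threads = [threading.Thread(target=client) for _ in range(CLIENTS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    latencies.sort()
    print("p50 %.3fs  p95 %.3fs" % (latencies[len(latencies) // 2],
                                    latencies[int(len(latencies) * 0.95)]))

Run it against one of everything, add a web server or a database slave, and run it again; if p95 doesn't improve, you've found your bottleneck.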
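
To make the ORM point concrete, here is a sketch of the N+1 query pattern an ORM often generates, next to the single hand-written query that replaces it. It uses sqlite3 only so the example is self-contained; the schema is hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        CREATE INDEX idx_orders_user ON orders(user_id);  -- "add indexes"
    """)

    # The N+1 pattern an ORM often generates: one query per user.
    def totals_n_plus_one():
        users = conn.execute("SELECT id FROM users").fetchall()
        return {uid: conn.execute(
                    "SELECT SUM(total) FROM orders WHERE user_id = ?",
                    (uid,)).fetchone()[0]
                for (uid,) in users}

    # The hand-written replacement: one round trip to the database.
    def totals_single_query():
        return dict(conn.execute(
            "SELECT user_id, SUM(total) FROM orders GROUP BY user_id"))

With a million users, the first version is a million queries; the second is one.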
3. Limit access to shared resources.
  • Decouple reads and writes with queues.
    • Reading and writing to the same logical or physical hard disk is a slow, expensive process, but you need to store your data somewhere (see the queue sketch after this list).
    • MySQL slaving is essentially a cheap, easy-to-use queuing mechanism.
    • You can scale MySQL slaves to insanely large sizes; I've seen 90-way slaving.
    • Queues are available for most languages.
    • Common tools: beanstalkd, ActiveMQ, Mule.
  • Limit contention for hard disk access.
    • Keep as much of your database and its indexes in RAM as possible.
  • Pin users to separate databases.
    • To limit the reading and writing any one database server must do, you can limit the amount of data stored on that server in a couple of common ways (see the sharding sketch after this list):
      • Hash by user id to pick a user-specific database.
      • Provision each user to a user-specific database, with a shared master index mapping users to databases.
    • Once you've done that, you can add more databases as you add more users.
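
Here is a minimal sketch of the read/write decoupling pattern. In production you would put beanstalkd or ActiveMQ between the two halves; Python's in-process queue.Queue is used here only to illustrate the shape, and persist() is a hypothetical stand-in for the real database write:

    import queue
    import threading

    write_queue = queue.Queue()

    def handle_request(event):
        # Request loop: enqueue the write and return immediately,
        # so the user never waits on the disk.
        write_queue.put(event)
        return "ok"

    def writer():
        # Background worker: drain the queue and apply writes to the
        # database at its own pace.
        while True:
            event = write_queue.get()
            persist(event)  # hypothetical database write
            write_queue.task_done()

    def persist(event):
        pass  # stand-in for the actual INSERT/UPDATE

    threading.Thread(target=writer, daemon=True).start()

Reads keep hitting the database (or its slaves) directly; only the writes flow through the queue.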
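
And a sketch of the two user-pinning schemes from the list above: hashing user ids into a fixed set of shards, and looking users up in a shared master directory. The hostnames and placement policy are hypothetical:

    import hashlib

    SHARDS = ["db0.example.com", "db1.example.com", "db2.example.com"]

    def shard_by_hash(user_id):
        # Stable hash: the same user always lands on the same database.
        digest = hashlib.md5(str(user_id).encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Directory scheme: a shared master index maps each user to a
    # database, so new databases can be added without rehashing anyone.
    USER_DIRECTORY = {}  # user_id -> hostname; normally its own database

    def shard_by_directory(user_id):
        if user_id not in USER_DIRECTORY:
            # Naive round-robin placement; a real policy might pick the
            # least-loaded shard.
            USER_DIRECTORY[user_id] = SHARDS[len(USER_DIRECTORY) % len(SHARDS)]
        return USER_DIRECTORY[user_id]

The hash scheme is simpler; the directory scheme is what lets you add databases as you add users without moving anyone who is already placed.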
Notes on scaling outside the request loop - analytics
These are notes from Jeff's talk. However, you might be better off checking out his book: Beautiful Data: The Stories Behind Elegant Data Solutions. The general idea is that you cannot do your analysis on the same systems and infrastructure you're using to run the live system.
1. You can probably get by with MySQL for a bit.
2. Extract, Transform, and Load (ETL) will always take more time (both developer time and run time) than you think it should.
3. The easier it is to analyze the data, the more requests you'll get to analyze data.
4. SQL is not a standard. (I'd generalize: no software standard is standard.)
5. Hadoop (MapReduce) can parallelize data analysis across multiple machines to speed up analysis (see the sketch below).
6. Powerful analytics can produce powerful features. (Think "People You May Know" on Facebook.)
7. Use analytics infrastructure to precompute high-read information. (Think Google's entire web index.)
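
To ground point 5, here is a sketch of the kind of Python mapper and reducer you might run under Hadoop Streaming to precompute a high-read aggregate, in this case page views per URL. The log format (one "timestamp url user_id" line per view) is hypothetical:

    import sys

    def mapper():
        # Emit one (url, 1) pair per page view.
        for line in sys.stdin:
            fields = line.split()
            if len(fields) >= 2:
                print("%s\t1" % fields[1])

    def reducer():
        # Hadoop Streaming sorts mapper output, so equal keys arrive adjacent.
        current_url, count = None, 0
        for line in sys.stdin:
            url, n = line.rstrip("\n").split("\t")
            if url != current_url:
                if current_url is not None:
                    print("%s\t%d" % (current_url, count))
                current_url, count = url, 0
            count += int(n)
        if current_url is not None:
            print("%s\t%d" % (current_url, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

Hadoop runs many copies of the mapper in parallel across machines, which is where the speedup in point 5 comes from; the precomputed counts then serve reads cheaply, per point 7.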
Please send me your comments on this post. This is really just a start at cataloging some of these points for others to use.