Thursday, September 16, 2010

How to Scale Backend Infrastructure

Earlier this week, I participated in a talk at Hunch on Scaling Backend Infrastructure with Tom Pinckney (Hunch), Kiril Sheynkman (Thansys), and Jeff Hammerbacher (Cloudera).

How to scale inside the request loop
The majority of web applications have a user sitting behind a web browser who lands on a site, clicks on a button and expects something to happen - quickly. As more users visit your site, they compete for scarce resources - CPU cycles, RAM, hard disk access, and bandwidth. Your goal is to add more resources to support more users - here's how.
  1. Build a system that can be scaled.
    • Make sure your hardware is clone-able.
      • If you lose one machine, you need to be able to build a new machine that exactly matches the old one, from OS to configuration.
      • Common tools for configuration: Puppet, Chef, or some bash scripting.
    • Be daring in your language/framework choice.
      • Choose a language that optimizes for your ability to quickly develop software and your ability to acquire more developers who can program in that language.
      • Language choice impacts per-request performance.
      • Common tools: Python, Ruby, Scala, Groovy, Clojure.
    • Be conservative in infrastructure choice.
      • Prefer meat-and-potatoes over next-new-thing, assuming the meat-and-potatoes option has a similar price and the features you require.
      • Make sure you have internal knowledge/capability, active community support, or commercial support options if you go with the next-new-thing.
      • Use infrastructure others have proven can scale.
      • NoSQL can be made to work, but make sure you have good support.
      • Common tools: MySQL, Apache Web Server.
    • Separate user data from reference data.
      • Enable eventual scaling by keeping users on separate databases.
  2. Make sure it performs. Know when it doesn't.
    • Measure your application performance.
      • Measuring machine performance is not enough.
      • Know how long it takes each component in your system to respond. From MySQL EXPLAIN plans to queue depth, to each web service call, to end-user experience, know how long each component of a single request takes.
      • Most application monitoring and measurement requires some custom coding.
      • Common tools: MySQL EXPLAIN, Nagios, Cacti, AppFirst (looks promising, but I haven't tested it yet).
    • Performance test to find obvious bottlenecks and config flaws.
      • Analyze real-world scenarios to design performance testing. Look at what users actually do on the site to develop your performance test plans. Identify the most common paths or the most frequently accessed pages.
      • Design performance tests with product/user experience folks to make sure the tests capture how they expect users to use your site.
      • Design your tests to validate horizontal scalability. Run your tests with one of everything. Add machines. What happens to your performance?
      • Common tools: Selenium Grid, Grinder, BrowserMob, Sauce Labs.
    • Address performance problems.
      • Check your configurations against the recommended configurations for your infrastructure.
      • Typically, out of the box, you are not giving your database enough memory for the hardware it's running on.
      • Startup times for app server worker threads are often pretty slow. Try to start several at initial server startup and make sure they stay running as long as possible.
      • The database is usually the focal point of most performance problems. Here are a few suggestions to help db performance:
        • Add indexes.
        • Tom recommended limiting joins. I find you can get better overall performance by limiting the number of queries to the database.
        • ORM is fine for initially building your product, but once you're looking to scale, you need to replace the ORM (like Rails' ActiveRecord) with hand-written, optimized SQL.
    • Limit access to shared resources.
      • Decouple reads and writes with queues.
        • Reading and writing from the same logical or physical hard disk is a slow, expensive process. But you need to store your data somewhere.
        • MySQL slaving is essentially a cheap, easy-to-use queuing mechanism.
        • You can scale MySQL slaves to insanely large sizes. I've seen 90-way slaving.
        • There are queues available for most languages.
        • Common tools: beanstalkd, ActiveMQ, Mule.
      • Limit contention for hard disk access.
        • Keep as much of your database and indexes in RAM as possible.
      • Pin users to separate databases.
        • To limit the amount of reading and writing that needs to be done to any one database server, you can design your system to limit the amount of data stored on that server in a couple of common ways.
          • Hash by user id for a user-specific database.
          • Provision users to a user-specific database, with a shared master index of databases for users.
          • Once you've done that, you can add more databases as you add more users.
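Here is a minimal sketch of the two ways to pin users to databases described above. It is only an illustration, not what anyone at the talk runs in production: the shard names, the `shard_by_hash` function, and the `UserDirectory` class are all hypothetical, and the "master index" is an in-memory dict standing in for a table in a shared database.

```python
import hashlib

# Hypothetical shard names; a real deployment would keep connection settings here.
SHARDS = ["userdb0", "userdb1", "userdb2", "userdb3"]


def shard_by_hash(user_id, shards=SHARDS):
    """Option 1: derive the database from a stable hash of the user id."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


class UserDirectory:
    """Option 2: a shared master index that records each user's database."""

    def __init__(self, shards):
        self.shards = list(shards)
        self.index = {}  # user_id -> shard name (assumption: a table in practice)

    def add_shard(self, name):
        """Bring a new database online; existing users are not moved."""
        self.shards.append(name)

    def provision(self, user_id):
        """Assign a new user to the least-loaded shard; known users stay put."""
        if user_id not in self.index:
            load = {shard: 0 for shard in self.shards}
            for shard in self.index.values():
                load[shard] += 1
            self.index[user_id] = min(self.shards, key=lambda s: load[s])
        return self.index[user_id]

    def lookup(self, user_id):
        return self.index[user_id]
```

The trade-off, as I understand it: hashing needs no extra lookup, but with a plain modulo like this, changing the number of shards remaps most users. The directory costs one extra lookup per request, but adding a database only affects where new users land.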
Notes on scaling outside the request loop - analytics
These are notes from Jeff's talk. However, you might be better off checking out his book: Beautiful Data: The Stories Behind Elegant Data Solutions. The general idea is that you cannot run your analytics on the same systems and infrastructure you use to serve requests.
  1. You can probably get by with MySQL for a bit.
  2. Extract, Transform, and Load (ETL) will always take more time (both developer time and run time) than you think it should.
  3. The easier it is to analyze the data, the more requests you'll get to analyze data.
  4. SQL is not a standard. (I'd generalize: no software standard is really standard.)
  5. Hadoop (MapReduce) can parallelize data analysis across multiple machines to speed up analysis.
  6. Powerful analytics can produce powerful features. (Think "people you might know" in Facebook.)
  7. Use analytics infrastructure to precompute high-read information. (Think Google's entire web index.)
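To make the MapReduce and precompute points a bit more concrete, here is a rough single-process sketch of the pattern. It is not Hadoop and not from Jeff's talk: the log format (`"user_id page"`) and all the function names are made up for illustration. In a real cluster the map and reduce calls would be spread across machines and the framework would handle the grouping step.

```python
from collections import defaultdict


def map_phase(log_line):
    """Emit (page, 1) for one access-log line of the assumed form 'user_id page'."""
    _user, page = log_line.split()
    yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(page, counts):
    """Collapse all counts for one page into a single total."""
    return page, sum(counts)


def precompute_page_views(log_lines):
    """Run map, shuffle, and reduce in sequence and return page -> view count."""
    pairs = (pair for line in log_lines for pair in map_phase(line))
    return dict(reduce_phase(page, counts) for page, counts in shuffle(pairs).items())


logs = ["u1 /home", "u2 /home", "u1 /about"]
page_views = precompute_page_views(logs)  # {"/home": 2, "/about": 1}
```

The resulting totals could then be written somewhere cheap to read, so the request loop only does a lookup instead of counting log lines on every page view.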
Please send me your comments on this post. This is really just a start at cataloging some of these points for others to use.


                      1. Nice write up Pete, thanks for sharing.

                        I'm curious what your opinion is on in-memory caching systems like Redis or memcached?

                      2. In-memory caching is the logical extension of getting your db in RAM. We use memcached pretty extensively. The market data systems I've worked on kept all realtime quotes in in-memory hashtables.

                        I've not personally used Redis, but I was discussing it today as a possible replacement for a homegrown workflow system. If you use it, let me know what you find.

                      3. I'd like to highlight an implied concept, which is usually taken for granted until something bad happens: Build your systems with the idea of robust infrastructure. When you build systems that can really scale you know that hardware and software will break. Make your systems robust enough to withstand multiple hardware/software failures at the same time. Just because you can reproduce a system quickly doesn't mean you don't have a giant single point of failure. Provisioning and configuration management is not enough. The App must understand how to survive when components go offline.

                      4. Great point, Mike.

                        One thing I didn't mention in the initial post is that an old boss of mine used to threaten to do "Axe Testing": he would take a random box in the data center offline, and we needed to make sure the system stayed online. I've had a few boxes in AWS become totally inaccessible. We need to write a post on when bad things happen to good infrastructure.

                        - both plugs in a server plugged into same (failed) UPS.
                        - Data Center power runs generator but AC does not. (how hot did those machines get?)
                        - DNS failure causes app server to take 60 seconds to find its localhost IP. (I believe that was an Oracle JDBC driver.)
                        - Machines auto-negotiate network down to 10 Mbps. (personal fave.)
                        - Nasdaq shut off switch due to NetBIOS chatter on misconfigured server.
                        - Shower pump floods server closet and takes out T1 routers.

                        Things will go offline in unpredictable, unimaginable ways.