I Broke My Website!

OK, so my site looks good and is working fine on the resources I’ve allocated. How do I know if I’ve allocated enough for the day 30 people load my blog at the same time? Or 50? Perhaps I should increase my resources, but how do I know what is needed?

Enter the Load Test. This is a standard tool in web DevOps, used both to compare the same site over time and across version changes, and to compare a site against other similar sites. The Load Test uses a friendly bot to run a specific browsing profile at a set number of concurrent connections for a set duration.
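To make the idea concrete, here is a minimal sketch in Python of what such a bot does: a number of workers request a page back to back until the time is up, and every response time is recorded. This is not how Blazemeter works internally. The names `run_load` and `fetch_page` are my own, and the sketch only uses the standard library.

```python
import threading
import time
from typing import Callable, List, Tuple
from urllib.request import urlopen


def fetch_page(url: str, timeout: float = 10.0) -> None:
    """Download one page and discard the body (raises on network or HTTP errors)."""
    with urlopen(url, timeout=timeout) as response:
        response.read()


def run_load(fetch: Callable[[], None], users: int,
             duration_s: float) -> List[Tuple[float, bool]]:
    """Run `users` threads that call `fetch` back to back for `duration_s` seconds.

    Returns one (elapsed_seconds, succeeded) tuple per request.
    """
    results: List[Tuple[float, bool]] = []
    lock = threading.Lock()
    deadline = time.monotonic() + duration_s

    def worker() -> None:
        # No think time: each simulated user clicks again as soon as a request ends.
        while time.monotonic() < deadline:
            start = time.monotonic()
            try:
                fetch()
                ok = True
            except Exception:
                ok = False
            elapsed = time.monotonic() - start
            with lock:
                results.append((elapsed, ok))

    threads = [threading.Thread(target=worker) for _ in range(users)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling `run_load(lambda: fetch_page("https://staging.example.com/"), users=5, duration_s=30)` would hit a (placeholder) staging URL with five constant clickers for thirty seconds. Only point something like this at a site you own.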

Today I’m using Blazemeter, a freemium tool in which you can configure your tests and review reports comparing various tests. I’ve chosen a max of 50 concurrent users, stepped up in 5 stages over 15 minutes, with the total test running 20 minutes. This is the top end of what the free account gives you.
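For anyone who wants to picture that ramp, here is a small Python sketch of the schedule. It assumes the 5 stages are equal in size and length (10 users added every 3 minutes), with the top level held until the 20-minute mark. That is my reading of the settings, and Blazemeter may pace the steps differently. The function name `users_at` is made up for this example.

```python
def users_at(minute: float, max_users: int = 50, steps: int = 5,
             ramp_minutes: float = 15.0, total_minutes: float = 20.0) -> int:
    """Return how many concurrent users are active at a given minute of the test."""
    if minute < 0 or minute >= total_minutes:
        return 0  # before the test starts or after it has finished
    step_length = ramp_minutes / steps  # 3 minutes per stage with the defaults
    # Stage 1 starts at minute 0; after the ramp we stay on the last stage.
    stage = min(int(minute // step_length) + 1, steps)
    return max_users * stage // steps


if __name__ == "__main__":
    for minute in range(0, 21, 3):
        print(f"minute {minute:2d}: {users_at(minute)} users")
```

Under these assumptions the site sees 10 users for the first 3 minutes, 20 for the next 3, and so on, reaching 50 at minute 12 and staying there until the test ends.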

To be clear, 50 concurrent users in this test does not equal the traffic generated by 50 humans browsing your site at the same time. Real users have think time, read time and get-coffee time, all of which lower the load each one puts on the server. In this simple test, 50 concurrent users means 50 simultaneous clicks, with each user clicking again as soon as the page loads, so the test stands in for some multiple of that many real users. Elaborate real-user browsing profiles can be built, but here I’ve opted for quick-and-dirty load generation to see what happens, and at what point I can break my site.
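How big is that multiple? A rough back-of-the-envelope estimate: a user who waits R seconds for a page and then pauses Z seconds before the next click makes one request every R + Z seconds, while the bot makes one every R seconds. So each bot user stands in for roughly (R + Z) / R humans. The Python sketch below does that arithmetic. The 2-second response time and 30-second think time are illustrative guesses, not measurements from my tests, and the estimate assumes the response time is the same in both cases.

```python
def equivalent_human_users(bot_users: int, response_s: float, think_s: float) -> float:
    """Estimate how many real users generate the same request rate as
    `bot_users` zero-think-time bot users.

    Each bot makes one request per `response_s` seconds; each human makes
    one request per `response_s + think_s` seconds.
    """
    if response_s <= 0:
        raise ValueError("response_s must be positive")
    return bot_users * (response_s + think_s) / response_s


if __name__ == "__main__":
    # Illustrative numbers only: 2 s page load, 30 s of reading between clicks.
    print(equivalent_human_users(50, response_s=2.0, think_s=30.0))
```

With those guessed numbers, 50 bot users correspond to about 800 humans, which is why the simple test is much harsher than its user count suggests.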

At minimum, this test, together with our performance tracking in StatusCake, gives us a baseline for the existing configuration using Amazon’s CDN. A retest after major changes to the site (compressing images, switching CDN, etc.) then gives us an idea of the performance cost of various design decisions, developments and integrations.

Here’s my initial test:

Locked MySQL

As you can see, the 50-user test level was too much for the resources that I had contracted from the Akash provider. At the 20-user level the site was already responding more slowly than the average human would wait, and it was 100% down by the time we reached 50 users.

I killed the load test, but discovered the MySQL instance was no longer responding. I executed my well-practiced redeployment and site recovery process, and was back online.

This brings us to important production principles:

  1. Test breakpoints, new configurations and new functionality on a staging deployment built from the same site content and spec. You can do some wizardry with hosts file edits and DNS bypass to keep the same URLs without going live with the test site.
  2. Test your limits incrementally: I saw a lot of errors occurring at 30 users, but didn’t intervene.
  3. Adjust your contracted resources and keep testing! Remember that the complexity of your site affects resource demand, and plenty of articles exist on site optimization.
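One way to act on the second principle is to check each stage of a stepped test against your own limits and stop at the first stage that breaks them. Here is a minimal Python sketch of that check. The `StageResult` fields and the default thresholds (3 seconds average response, 1% errors) are assumptions for illustration, not values exported by Blazemeter or taken from my reports.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StageResult:
    users: int             # concurrent users during this stage
    avg_response_s: float  # average response time in seconds
    error_rate: float      # fraction of failed requests, 0.0 to 1.0


def first_breaking_stage(stages: Iterable[StageResult],
                         max_response_s: float = 3.0,
                         max_error_rate: float = 0.01) -> Optional[StageResult]:
    """Return the lowest-load stage that exceeds either threshold, or None."""
    for stage in sorted(stages, key=lambda s: s.users):
        too_slow = stage.avg_response_s > max_response_s
        too_many_errors = stage.error_rate > max_error_rate
        if too_slow or too_many_errors:
            return stage
    return None
```

Feed it one `StageResult` per step of the ramp. The user count of the stage it returns is the level to stop at, then adjust resources and retest.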

With a more limited test of 20 concurrent users, stepped up in 5 increments over 15 minutes, I obtained the following result:

Part of my objective in testing was to get a baseline against which I would test other developments. A few days ago I mentioned that I was using Amazon’s S3 and CloudFront as a CDN to offload my image delivery. So my next step was to set up a decentralized alternative to S3 on my dev site and compare the results with the live site:

In the response time graph (the blue line being S3), we can see that the response time is comparable over most of the load, only spiking towards the end of the test with 20 users clicking constantly.

Apply Your Values

All these findings are just measurements until you apply your specific use case and its value system. As a blog about decentralization, with a low expectation of traffic spikes, these findings tell me that the S3 alternative is worth going for, if only to support another decentralization product and give its potential a good test drive.

For a major production site given to unexpected traffic spikes, I would make different judgments and aggressively test until I’m comfortable I won’t be getting the 2 AM call with a panicked customer on the line.

