Adventures in load testing
I'm at a point now where POP Forums is mostly ported to ASP.NET Core, and I'm super happy with it. I've done some minor tweaks here and there, but mostly it's the "old" code running in the context of the new framework. I mentioned before that my intention isn't really to add a ton of features for this release, but I do want to make it capable of scaling out, which is to say, running on multiple nodes. I haven't done any real sniffing on SQL performance or anything, but I noticed that I had some free, simple load testing available as part of my free-level Visual Studio Team Services account, so that seemed convenient.
I want POP Forums to be capable of running across multiple nodes (scale out, as opposed to scale up, which is just faster hardware). Getting it there involves a few straightforward changes:
- Shared cache. v13 used an in-memory cache only, but swapping it out for something off the box is easy enough. Since Redis is all the rage, and available in Azure, I picked that.
- Changing the SignalR backplane for the real-time client notifications of updates. When there's a new post, we want the browser to know. Turns out, Redis does pub/sub as well, so I'll add that. It looks like it's almost entirely configuration (see the sketch after this list).
- Queued background activity. There are four background processes that run in the app, right under the web context: email, search indexing, scoring game calculation and session management. For all but the last of those, I had to do some proper queuing so multiple instances could pick up the next item (see the dequeue sketch after this list). It's been a solved problem for a long time, and wasn't hard to adjust.
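For context on the backplane, the wiring really is mostly configuration. Here's a minimal sketch, assuming current ASP.NET Core SignalR with the Microsoft.AspNetCore.SignalR.StackExchangeRedis package (older SignalR versions use a different, but similarly small, configuration call); the connection string is a placeholder:

```csharp
// Startup.ConfigureServices - a minimal sketch, assuming current ASP.NET Core
// SignalR and the Microsoft.AspNetCore.SignalR.StackExchangeRedis package.
public void ConfigureServices(IServiceCollection services)
{
    services.AddSignalR()
        // Point the backplane at the same Redis instance used for caching.
        // The connection string here is a placeholder, not a real endpoint.
        .AddStackExchangeRedis("your-redis.redis.cache.windows.net:6380,password=...,ssl=True,abortConnect=False");
}
```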
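The queuing pattern itself is the old, solved part: a node atomically claims the next row so two instances never process the same item. A sketch of that claim, with a hypothetical table and column names rather than the actual POP Forums schema:

```csharp
using Microsoft.Data.SqlClient;

// A sketch of "claim the next item" against SQL Server, so multiple nodes can
// pull work without grabbing the same row twice. The dbo.EmailQueue table and
// EmailId column are made up for illustration.
public class EmailQueueReader
{
    private readonly string _connectionString;

    public EmailQueueReader(string connectionString) => _connectionString = connectionString;

    public int? ClaimNextEmailId()
    {
        const string sql = @"
DELETE TOP (1) FROM dbo.EmailQueue WITH (ROWLOCK, READPAST)
OUTPUT DELETED.EmailId;";
        using var connection = new SqlConnection(_connectionString);
        using var command = new SqlCommand(sql, connection);
        connection.Open();
        // Returns the claimed ID, or null if the queue is empty.
        var result = command.ExecuteScalar();
        return result == null ? (int?)null : (int)result;
    }
}
```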
The incentive to run multiple nodes isn't strictly about scale; it's also about redundancy. Do apps fail in a bad way very often? Honestly, it hasn't happened for me ever. In the cloud, you only need redundancy for the few minutes when an instance's software is being updated. I suspect I've run into this at odd times (my stuff only runs on a single instance), when the app either slowed to a crawl or wasn't responsive.
Writing a cache provider for Redis was pretty straightforward. I scaled out to two instances in my Azure dev environment, and with some debugging info I exposed, I could see requests being served from both instances. I had to turn off the request routing affinity, but even then, it seems to stick to a particular instance based on IP for a while. More on that in a minute.

I've started to go down the road of hybrid caching. There are some things that almost never change, chief among them the "URL names," the SEO-friendly forum names that map to the forum IDs appearing in the URLs, and the object graphs that describe the view and post permissions per forum. I was storing these values in the local memory cache ("long term") and sending messages via Redis' pub/sub to both commit them to local cache and invalidate them. I think what I'm going to do is adopt Stack Overflow's architecture of an L1/L2 cache: cached data only crosses the wire if and when it's needed, but collectively it's only fetched from the database once.
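Roughly, that L1/L2 arrangement looks like the sketch below: local memory as L1, Redis as L2, and a pub/sub channel to tell the other nodes to drop their local copies on a change. The class and key names are illustrative, not the actual POP Forums code.

```csharp
using System;
using System.Text.Json;
using Microsoft.Extensions.Caching.Memory;
using StackExchange.Redis;

// Illustrative two-level cache: check local memory first (L1), then Redis (L2),
// and publish invalidations so other nodes evict their local copies.
public class TwoLevelCache
{
    private readonly IMemoryCache _local = new MemoryCache(new MemoryCacheOptions());
    private readonly ConnectionMultiplexer _redis;
    private const string InvalidationChannel = "cache-invalidate";

    public TwoLevelCache(string redisConnectionString)
    {
        _redis = ConnectionMultiplexer.Connect(redisConnectionString);
        // When any node publishes a key, evict it from this node's L1.
        _redis.GetSubscriber().Subscribe(InvalidationChannel,
            (channel, key) => _local.Remove((string)key));
    }

    public T Get<T>(string key, Func<T> fetchFromDatabase, TimeSpan localLifetime)
    {
        if (_local.TryGetValue(key, out T localValue))
            return localValue;                        // L1 hit, no network call

        var db = _redis.GetDatabase();
        var redisValue = db.StringGet(key);
        if (redisValue.HasValue)                      // L2 hit, cross the wire once
        {
            var value = JsonSerializer.Deserialize<T>((string)redisValue);
            _local.Set(key, value, localLifetime);
            return value;
        }

        var fresh = fetchFromDatabase();              // miss everywhere: hit SQL once
        db.StringSet(key, JsonSerializer.Serialize(fresh));
        _local.Set(key, fresh, localLifetime);
        return fresh;
    }

    public void Invalidate(string key)
    {
        _local.Remove(key);
        _redis.GetDatabase().KeyDelete(key);
        // Tell every other node to drop its local copy too.
        _redis.GetSubscriber().Publish(InvalidationChannel, key);
    }
}
```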
The first thing I did was run some load tests against the app on a single instance, using various app service and database configurations, and all of them ran without any exceptions logged. This was using the VSTS load testing, which is kind of a blunt instrument at "250 users," because it doesn't appear to space out requests. The tests result in tens of thousands of requests in a single minute, which wouldn't happen with 250 actual users unless they read really fast. If I set up just "5 users," using the lowest app service and database tiers, I get a comfortable average response time of 49 ms and 35 requests per second. In my experience, that's the real-life equivalent of 400-ish unique users.
| App Service | Database | Average response time | Requests per second |
|---|---|---|---|
| S1 (1 core) | B (5 DTUs) | 2.5 sec | 261 |
| S2 (2 cores) | S0 (10 DTUs) | 2.5 sec | 316 |
| S3 (4 cores) | S1 (20 DTUs) | 1.5 sec | 522 |
Honestly, these results were better than I expected. That's almost 2 million usable requests per hour, and you could go higher on the database if you had to push more through it. The app service certainly had plenty of headroom to work with.
Next up, I switched to the Redis cache and spun up three instances. This is where things got... not great. The first problem is that I couldn't really do an accurate test, because the load balancing mechanism doesn't spread the load very well. It appears to route to the same instance based on IP, even when the routing affinity is turned off. When you're using VMs in Azure, the affinity mechanism has three modes, the last of which essentially round-robins requests, even if they're from the same IP. I'm not certain there's any way to force this with App Services the way you can with VMs. The bottom line is that I couldn't really test this across three nodes.
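One thing that does seem to be available, besides the portal's ARR affinity toggle, is a per-response opt-out header that the App Service front end honors. It only suppresses the affinity cookie, so it doesn't change how requests from a single IP get routed, but for completeness, a sketch of setting it in Startup.Configure:

```csharp
// Startup.Configure - opt out of ARR's affinity cookie for every response.
// This suppresses the cookie; it does not change IP-based routing at the front end.
app.Use(async (context, next) =>
{
    context.Response.Headers["Arr-Disable-Session-Affinity"] = "true";
    await next();
});
```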
The second problem is that I was logging a ton of Redis timeouts. If the Redis call fails, I catch the exception and let the app read from the database (and further cause issues by trying to write the data back to the cache). I'm using the StackExchange.Redis client, and I left the timeout at the default of one second. As best I can tell from the exceptions, it's an issue with the client queuing up more requests than it can handle, because the resources of the Redis instance aren't even remotely stressed. The log entries indicate long queues on the client, in the thousands, even when I set it to allow asynchronous, non-serial responses. I'm still working this problem. It seems strange that the client can't handle a large volume of calls. I can increase the timeout a little, but it still fails, and by the time you get to two seconds, you might as well be calling the database. I'm sure going to the Stack Overflow cache architecture will mitigate this as well.
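For reference, the knobs in question live on the connection itself. A sketch with example values (the endpoint and password are placeholders, and PreserveAsyncOrder is the 1.x-era setting for non-serial responses):

```csharp
using StackExchange.Redis;

// A sketch of the StackExchange.Redis client settings in play; example values, not a fix.
var options = ConfigurationOptions.Parse("your-redis.redis.cache.windows.net:6380,password=...,ssl=True");
options.SyncTimeout = 2000;          // the 1.x default is 1000 ms; raising it mostly just delays the failure
options.AbortOnConnectFail = false;  // keep retrying the connection rather than giving up
var multiplexer = ConnectionMultiplexer.Connect(options);
// "Allow asynchronous, non-serial responses" - this property exists on the 1.x
// client; it was removed in 2.x, where out-of-order completion is the behavior anyway.
multiplexer.PreserveAsyncOrder = false;
```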
Performance and scale problems are fun, provided you're working in a solid code base. I'm not sure I would entirely call POP Forums solid, because it really is an evolutionary mess after 15 years or so (from ASP to WebForms to MVC, with a lot of recycled code), but at least the various concerns are generally well-factored. This is all largely academic for me, because none of the sites I run require enormous scale. I need to find someone to use the app in more extreme circumstances. Maybe I need to see if I still know anyone over at MSDN for those forums. Having worked in that code years ago, I know how much work it needs.