In late 2017 we had the opportunity to revamp a website that several members of our team had previously worked on while employed at the client company. The original goal was to reskin the site with a responsive design so that a single codebase could replace two separate websites, one for mobile and one for desktop. The current site was nearly six years old and had a lot of baggage carried over from its previous iteration.
After researching our options and the long-term goals of the company, we decided that the best approach was to rebuild the site from the ground up on ASP.NET Core as a hybrid of React and regular Razor views. After securing the client’s blessing, we set to work.
The first couple of months were a bit slow going as we got our feet wet with the technology and its differences from previous versions. We learned how to accomplish server-side rendering with React and pass it off to the client in a way that would take advantage of the .NET Core application. Time went on and we were ready for launch, or so we thought.
Launch day came and things seemed to be going well until 9 am, when the site crawled to a standstill and seemingly stopped responding to new requests. We panicked, thought it was perhaps a fluke, reset the site, and watched it go another two hours before suffering the same fate once again. What had happened? We made plenty of guesses and assumptions and decided we’d address them that coming week. With our heads low, we reverted the site to the previous version and went back to the drawing board.
The symptoms were that requests to the site would start off fast, but as more and more queued up, they would take longer and longer and eventually seem never to complete. In many cases they culminated in a 502 error (meaning that Kestrel, the server sitting behind the IIS reverse proxy, was failing to respond in a timely manner).
Initial research indicated that we were overloading the server with requests that used to be handled by IIS but were now being proxied through to Kestrel. That seemed highly plausible, seeing as the site had thousands of images linked to the blog posts and a high number of other “static” resources. We found an article by Rick Strahl detailing how to have IIS serve those requests itself rather than drop them down to Kestrel, leaving only the APIs to go through Kestrel and freeing up some resources.
Armed with some new knowledge, I set to work and munged the site until all static requests were served by IIS instead of being proxied through Kestrel. Initial results seemed promising. I might also add that Rick has a great tool for “surging” your site (aptly named WebSurge) that I used to test my results. Things seemed dandy so we tried another release. And failed. Once again, we were faced with a problem that we made many educated assumptions about, and we trudged forward attempting to fix it. We figured that since we still had tons of dynamic images being served up through the site, maybe we should offload those to a dedicated “CDN” site instead.
With that task in mind, I took some time and built out the “CDN” site to serve up all our dynamic images. Through our highly honed Google-Fu skills we also came across some threads suggesting that with ASP.NET Core we needed to use async calls as often as possible, due to the way the framework is built and intended to be used. We worked on converting as many of our “big” calls to async as we could, and found a few calls that, while useful, should not have been firing on every single page load, so we changed them to be on-demand instead. We also found a few forms that people were POST-spamming and fixed those.
Deployment attempt #3. Failed. By this point, we were all getting pretty down on ourselves, ready to change careers, consider hara-kiri, or whatever other thoughts crossed our minds. By all accounts, this site should have been working, and working just as *fast* as it was on our test servers. While we felt like we were getting representative results with the WebSurge tool, we weren’t entirely sure, so we needed a way to represent load more accurately and track down the bottlenecks bringing the site to its knees.
Enter K6. We had been asking our all-powerful leader (Nate Zaugg) whether he had any ideas on how to properly load test our site and any insight into tracking down the issues we were facing. We figured we knew the problem but had no way of properly identifying it.
Thus, in the last days, He descended from His holy throne and deigned to interact with us mere mortals. He created some scripts using the K6 tool, added some traces to our code to measure call lengths, and helped us identify the bottlenecks (which I might add were secretly my suspicion all along). We had added a whole new custom content integration system which, in and of itself, is and was freaking awesome, but the way we were consuming it on the new site was not so awesome. We were making WAY too many calls to the system, many (if not most) of which could have been cached, and were ultimately overloading it.
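To make the "could be cached" point concrete, here is a minimal sketch of the idea in Python rather than the C# the site was written in. Everything in it is an assumption for illustration: `TtlCache`, `fetch_content_block`, the 60-second lifetime, and the block names are hypothetical and are not the content system's real API. It only shows how putting a small time-based cache in front of a repeated call collapses hundreds of backend calls into a handful.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Tiny in-memory cache: each key hits the backend at most once per TTL window."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached value)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]            # still fresh: no backend call
        value = self._fetch(key)       # missing or expired: one backend call
        self._entries[key] = (now + self._ttl, value)
        return value


backend_calls = 0


def fetch_content_block(key: str) -> str:
    """Hypothetical stand-in for a call to the content integration system."""
    global backend_calls
    backend_calls += 1
    return f"<content for {key}>"


cache = TtlCache(fetch_content_block, ttl_seconds=60.0)

# Simulate 100 page loads that each need the same three content blocks.
for _ in range(100):
    for key in ("header", "footer", "promo"):
        cache.get(key)

print(backend_calls)  # 3 backend calls instead of 300
```

This sketch is deliberately naive: it is not thread-safe and does nothing to stop several simultaneous requests from refreshing the same expired key at once, both of which matter on a real site under load.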
I need to digress here for a moment and address some differences between “legacy” ASP.NET and ASP.NET Core. In the good old days, we developers had a lot of things abstracted away from us by ASP.NET and the way IIS handled requests for us. Timeouts? Handled. Too many requests? Handled. IIS had a great way of queuing requests and holding on to them and then, paired with ridiculous timeouts, the end-user wouldn’t actually see a dreaded 5xx error for a crazy long time. Now, let’s get real here for a moment: if you’re waiting for more than 2-3 seconds on a page, are you actually going to continue sitting there? Of course not. Under load, the previous site iteration could reach load times of up to 20s, which was pretty ridiculous. None of us actually realized that because none of us were ever on the site during those high load times, and since the logs weren’t being filled up with 5xx errors, we were none the wiser. The new site, on the other hand, was under high scrutiny since we had just launched it, so a) we noticed the problem and b) ASP.NET Core handles things differently since it is proxied through IIS but not actually directly handled via IIS. Key distinction.
Ok, back to the problem at hand. Bottlenecks identified in our code, we were now faced with having to fix a large portion of our code to cache a bunch of stuff but also to readdress the whole async pathway that I previously mentioned. ASP.NET Core was built to be async and you really do yourself a disservice if you don’t take advantage of that. That doesn’t mean fill your codebase up with a load of Task.Run statements either. For this to work, you need to start at the lowest level possible (i.e., database calls, service calls, IO, etc.) that *could* be async and then work your way back up the chain. In our case, the site doesn’t have a single database call directly but handles everything through WCF services. We added async options to nearly every single one of those service calls and then worked our way back up, adding asynchronous methods and restructuring some of the code in those methods to take better advantage of asynchrony.
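The "start at the bottom and work your way up" shape can be sketched as follows. This is an illustration in Python's asyncio, not our actual C# code, and the analogy is loose: asyncio runs a single-threaded event loop while ASP.NET Core uses a thread pool. The common principle is that awaiting real I/O frees the worker to serve other requests, whereas wrapping a blocking call in something like Task.Run only moves the blocking onto another thread. The names `call_service`, `load_article`, and `handle_request`, and the 0.1 s delay, are hypothetical.

```python
import asyncio
import time


async def call_service(name: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for an async service call (lowest level of the chain)."""
    await asyncio.sleep(delay)   # non-blocking wait, like awaiting network I/O
    return f"{name}-data"


async def load_article(article_id: int) -> dict:
    """Middle layer: awaits the low-level calls instead of blocking on them."""
    # The two calls are independent, so start both and await them together.
    body, comments = await asyncio.gather(
        call_service(f"body:{article_id}"),
        call_service(f"comments:{article_id}"),
    )
    return {"body": body, "comments": comments}


async def handle_request(article_id: int) -> dict:
    """Top layer (the request handler): async too, so nothing blocks on the way up."""
    return await load_article(article_id)


async def main() -> float:
    start = time.perf_counter()
    # 50 concurrent requests, each making two 0.1 s service calls.
    results = await asyncio.gather(*(handle_request(i) for i in range(50)))
    elapsed = time.perf_counter() - start
    assert len(results) == 50
    return elapsed


elapsed = asyncio.run(main())
print(f"50 requests finished in {elapsed:.2f}s")
```

Because every layer awaits rather than blocks, the total time stays close to the duration of a single service call instead of growing with the number of requests. If any layer in the chain blocked synchronously, that benefit would be lost for everything above it.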
Fixes in place, async up the wazoo, caching up the wazoo, we’re finally ready for what we hope is the real launch attempt. We’re all a bit nervous, of course; we’ve just stripped the undercarriage and although it’s been “tested”, we’re not really 100% sure anymore. We ran K6 tests like mad against all the pages, comparing the current live site with the new test site, and comparing several variations of the tests against the test site as it was before our fixes. Things all looked great and, in some cases, we saw a particular page run over 40x faster on the new site than the same page on the then-current live site.
It worked. Finally.
So, TLDR, what did we learn?
- Use async as much as possible in ASP.NET Core but don’t do it “just because” and/or through Task.Run (there are plenty of articles on why not to do that; feel free to not take my word for it and look them up).
- Use load testing early and often. K6 is a good option. Don’t take for granted that the page will be fast under load because it probably isn’t as fast as you think it is.
- Tracing is your friend. Use it to measure how long your calls are taking to aid in identifying bottlenecks.
- Offload static content through IIS if you can manage it; this will free up ASP.NET Core/Kestrel to do what it does best.
- Use your resources (i.e., swallow your pride and ask for help). The corollary here is if you are asked for help, provide it in a timely manner.
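For the tracing point in the list above, here is a rough sketch of the kind of timing trace that helps. It is written in Python for brevity and is not the instrumentation we actually added. The `trace` helper, the labels, and the simulated 10 ms dependency are all hypothetical. The idea is simply to record how long each labeled call takes and how often it fires, then sort by total time.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of observed durations in seconds
timings = defaultdict(list)


@contextmanager
def trace(label: str):
    """Record the wall-clock duration of the wrapped block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label].append(time.perf_counter() - start)


def get_content(key: str) -> str:
    """Hypothetical slow dependency (simulated with a 10 ms sleep)."""
    with trace("content.get"):
        time.sleep(0.01)
        return f"<{key}>"


def render_page() -> str:
    """Hypothetical page render that needs three content blocks."""
    with trace("page.render"):
        return "".join(get_content(k) for k in ("header", "body", "footer"))


for _ in range(5):
    render_page()


def report() -> list:
    """Return (label, call count, total seconds) rows, slowest total first."""
    rows = [(label, len(d), sum(d)) for label, d in timings.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


for label, count, total in report():
    print(f"{label:15s} calls={count:3d} total={total * 1000:7.1f} ms")
```

The call count is often as revealing as the duration: a dependency that fires three times per page render, as `content.get` does here, is a candidate for caching even when each individual call looks fast.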