First of all, let me say this is neither a postmortem report nor an incident timeline, but rather a self-critique of how we handled (or failed to handle) a production issue.
We recently received complaints from some of our clients about an error they got when clicking a button in our application. It was a button for a critical action, though; had we lost this functionality entirely, the complaints would have been far louder. At least, that was our reasoning at the time, to be honest.
The error message our clients saw was very generic and difficult to interpret, so we integrated Sentry to get more insight into what was happening on the client side. That didn’t help either: all we saw in the Sentry dashboards were bare “Failed to fetch” messages.
In the meantime, our application logs indicated that the requests were being processed and responded to successfully. That suggests a client timeout or an abrupt client-side connection close; but we had no timeouts configured on the clients, and our latencies were not high.
We knew we were using a Web Application Firewall (WAF) service to protect our applications from malicious traffic, so we raised the issue with our infra team. They temporarily whitelisted our endpoints in the WAF and then noticed that the CPU usage of our NGINX servers was high. We also looked at the NGINX logs in Elasticsearch and spotted sporadic responses with HTTP status code 499, which means the client closed the connection. Although the definition of 499 clearly states that it is the client that closes the connection, at the time we unconsciously ignored this fact and wishfully concluded that NGINX was the culprit and the WAF was innocent. (I’m sure the cognitive bias literature has a specific name for this kind of situation.)
So the infra team scaled up the NGINX machines and we began to watch the logs. For a while we didn’t see any 499 status codes, and even the clients acknowledged that things were working. But of course, that is the nature of intermittent errors: they happen intermittently, not all the time. Soon, the error logs reappeared.
The pressure was intensifying; it was hard to appease some of our infuriated clients and explain why we couldn’t solve such a seemingly simple problem, or even give an ETA for a fix. During a long Google Meet session with the infra team, I learned how to query our AWS ELB logs via Athena and noticed 499 status codes there as well. The infra team deployed an NGINX configuration change that stopped the 499s from appearing in the NGINX logs. After a short period of relief, I noticed that the errors were still present in the ELB logs. It turned out that the new configuration had merely suppressed the client connection closes: NGINX kept processing and logging normally by ignoring them.
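The post above doesn’t name the actual configuration change, but one NGINX directive with exactly the described effect is `proxy_ignore_client_abort` — so a sketch of that kind of change might look like this (the upstream name is illustrative, and this is a guess at the mechanism, not the team’s actual diff):

```nginx
location / {
    proxy_pass http://app_backend;    # upstream name is illustrative

    # With this on, NGINX finishes proxying the request even when the
    # client closes the connection early, so no 499 is logged. The
    # client aborts still happen; they just stop showing up in the
    # access log, which is exactly the false relief described above.
    proxy_ignore_client_abort on;
}
```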
The infra team drew a diagram of all the layers a client request passes through on its way to our application, including the WAF and AWS Route 53, and we discussed which layer might be causing the problem. Eventually, they remembered a WAF configuration change deployed a few weeks earlier to support different HTTP/2 behaviors; rolling it back solved the problem.
Later, what we went through reminded me of Joel Spolsky’s Law of Leaky Abstractions.
“All non-trivial abstractions, to some degree, are leaky.”
Our applications run on top of a stack of abstractions, from infrastructure layers to third-party libraries. We cannot solve all of our problems by focusing only on our own code and ignoring the underlying layers. The first step to addressing this flaw is basic awareness: knowing which hops a request passes through from the client’s browser to the database, so that you can reason about them and ask the right people for further details when necessary.
Secondly, even if your abstractions don’t leak, you still have to know how they behave. For example, it took us a long time to figure out why we couldn’t see the exception details. Had we known the Fetch API’s behavior before this incident, we could have interpreted the messages much earlier.
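The behavior in question is worth spelling out: the Fetch API collapses most network-level failures — DNS errors, resets, connections closed by a middlebox — into a bare `TypeError` (“Failed to fetch” in browsers, “fetch failed” in Node), while an abort we trigger ourselves surfaces as a distinguishable `AbortError`. A minimal sketch (Node 18+ or a browser; the function name is illustrative):

```javascript
// Classify a fetch failure: our own timeout vs. an opaque network error.
async function classifyFetchError(url, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await fetch(url, { signal: controller.signal });
    return "ok";
  } catch (err) {
    // A timeout we set ourselves is distinguishable by its name...
    if (err.name === "AbortError") return "client-timeout";
    // ...but everything else arrives as a TypeError with no detail
    // about where in the chain the connection actually died.
    if (err instanceof TypeError) return "network-failure";
    return "other";
  } finally {
    clearTimeout(timer);
  }
}
```

Had we internalized that only our own aborts are distinguishable, we could have ruled out client-side timeouts early and pointed at the layers between the browser and the application sooner.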
Furthermore, being familiar with your application’s normal behavior and the logs it generates in quiet times is very important (which, ironically, is one of the biggest challenges companies face in the era of the Great Resignation). We spent too much time trying to make sense of misleading logs. For instance, we saw that all client IP addresses were from the USA, which was, as a matter of fact, very unlikely. Later we learned that those were the IP ranges of our WAF.
In short, there is plenty of commentary out there on the difference between programming and software engineering, and I think the lessons above belong firmly in the software engineering camp.