Thoughts on an Annoying Production Issue

Balkan Güler
5 min read · Apr 20, 2022


First of all, let me say this is neither a postmortem report nor an incident timeline, but rather a self-critique of how we handled (or failed to handle) a production issue.

Photo by Dimitar Donovski on Unsplash

We recently received complaints from some of our clients that they were getting an error when clicking a button in our application. It was the button for a critical action, though; had we lost that functionality completely, the magnitude of the complaints would have been far bigger. At least, that was what we thought, to be honest.

The error message our clients saw was very generic and hard to interpret, so we integrated Sentry to get more insight into what was happening on the client side. But it didn’t help either, because all we saw in the Sentry dashboards was “Failed to Fetch” messages.

This was quite unexpected, because we were at least hoping to see the HTTP status code in the logs. So we dug into our JavaScript fetch middlewares to see whether any exception detail was being lost along the chain of propagation.
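For context, the kind of wrapper we were inspecting looks roughly like the sketch below. This is not our actual middleware (the names and the endpoint are made up for illustration); it just shows where an HTTP status code would normally be picked up, and where a network-level failure leaves us with nothing but the raw error.

```javascript
import * as Sentry from "@sentry/browser";

// Hypothetical fetch wrapper, sketched for illustration only.
async function fetchWithReporting(url, options = {}) {
  try {
    const response = await fetch(url, options);
    if (!response.ok) {
      // The request reached the server: we at least have a status code to report.
      Sentry.captureMessage(`HTTP ${response.status} for ${url}`);
    }
    return response;
  } catch (err) {
    // The request never completed at the HTTP level: there is no status code,
    // only a generic error such as "TypeError: Failed to fetch".
    Sentry.captureException(err);
    throw err;
  }
}
```

Every report we were seeing fell into the second branch, which is why the dashboards showed nothing beyond “Failed to Fetch”.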

An overly simplified diagram of the system.

In the meantime, our application logs indicated that the requests were being processed and responded to successfully. That brought to mind a client timeout or an abrupt client-side connection close, but we had no timeouts configured on the clients, nor were the latencies high.

We already knew that we were using a Web Application Firewall (WAF) service to protect our applications from malicious attacks, so we raised the issue with our infra team. They temporarily whitelisted our endpoints in the WAF and later noticed that the CPU usage of our NGinX servers was high. We also looked at the NGinX logs in Elasticsearch and spotted some sporadic responses with the HTTP 499 status code, which means the client closed the connection. Although the explanation of the 499 status code makes it very clear that it is logged when the connection is closed by the client, at that point we unconsciously ignored this fact and wishfully concluded that the culprit was NGinX and the WAF was innocent. (I’m sure the cognitive bias literature has a name for this kind of situation.)
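To make the 499 concrete: it is the status NGinX records when the caller gives up on the request before the response is sent. The snippet below is a purely illustrative sketch (the endpoint and timeout are invented) of what that looks like from the client side.

```javascript
// Purely illustrative: a client that closes the connection mid-request.
// An aborted request surfaces as an AbortError in the browser and is
// recorded as a 499 in the NGinX access log.
const controller = new AbortController();
setTimeout(() => controller.abort(), 100);

fetch("/critical-action", { method: "POST", signal: controller.signal })
  .then((response) => console.log("status:", response.status))
  .catch((err) => console.error(err.name, err.message)); // "AbortError" when aborted
```

In our case nobody was calling abort() explicitly; something between the browser and NGinX was dropping the connection, which, from NGinX’s point of view, looks exactly the same.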

Fog dissipates slowly.

So the infra team scaled up the NGinX machines and we started watching the logs. For some time we didn’t see any 499 status codes, and even the clients acknowledged that things were working. But of course, that is the nature of intermittent errors: they happen intermittently, not all the time. Soon, we saw the error logs again.

Meanwhile, the reason we couldn’t see the details of the error at the JavaScript level or in Sentry was still a mystery. I checked the Fetch API documentation and saw that in some cases (CORS errors, for example) fetch rejects with exactly this kind of opaque exception. So it turned out to be perfectly normal not to see any details at the JavaScript level. But I knew CORS error messages do show up in browser consoles, so we enabled the console capture functionality in Sentry. Unfortunately, our Sentry quota had been exceeded and we had to wait a few days for it to reset. Since that wasn’t a good excuse to give our clients for putting off the solution, we contacted a few clients who experienced the problem more frequently and got their browser console logs. Nothing significant showed up.
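This behavior is easy to reproduce. The toy example below (not our code; the URL is a placeholder) shows what a blocked or failed request looks like to JavaScript: the promise rejects with a bare TypeError that carries no status code and no response body, while the real reason, such as a CORS violation, is printed only in the browser console.

```javascript
// Toy example: fetching a cross-origin URL that does not send CORS headers.
fetch("https://example.com/no-cors-here")
  .then((response) => console.log("status:", response.status))
  .catch((err) => {
    // This is all JavaScript (and therefore Sentry) gets to see:
    console.error(err.name, err.message); // "TypeError", "Failed to fetch"
    // The actual CORS explanation is written to the console by the browser
    // itself; scripts cannot read it.
  });
```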

The pressure was mounting. It was getting very hard to appease some of our infuriated clients and explain why we couldn’t solve such a seemingly simple problem, or even give an ETA for a fix. During a long Google Meet session with the infra team, I learned how to query our AWS ELB logs via Athena and noticed the 499 status codes there as well. The infra team then deployed an NGinX configuration change that made the 499 entries disappear from the NGinX logs. After a short period of relief, I noticed that the errors were still there in the ELB logs. It turned out the new configuration had merely told NGinX to ignore the client connection closes, so it kept processing and logging requests as if nothing had happened.

The infra team drew a diagram of all the layers a client request passes through on its way to our application, including the WAF and AWS Route 53, and together we tried to figure out which layer might be causing the problem. Eventually, they remembered a WAF config change that had been deployed a few weeks earlier to support different HTTP/2 behaviors, and rolling it back solved the problem.

We needed this clarity to solve it.

Later, what we went through reminded me of Joel Spolsky’s Law of Leaky Abstractions.

“All non-trivial abstractions, to some degree, are leaky.”

Our applications run on top of a pile of abstractions, from infrastructure layers to third-party libraries. We cannot solve all of our problems by focusing only on our own code and ignoring the underlying layers. The first step to addressing this flaw is basic awareness: knowing which hops a request passes through on its way from the client’s browser to the database, so that you can reason about them and ask the right people for more details when necessary.

Secondly, even if your abstractions don’t leak, you still have to know how they behave. For example, it took us a long time to figure out why we couldn’t see the exception details. Had we known this Fetch API behavior beforehand, we could have interpreted the messages much earlier.

Furthermore, being familiar with your application’s behavior and the logs it generates during normal times is quite important (which, ironically, is one of the biggest challenges companies face in the era of the Great Resignation). We spent too much time trying to make sense of misleading logs. For instance, we saw that all client IP addresses belonged to the USA, which was in fact very unlikely; later we learned that those were the IP ranges of our WAF.

In short, there is plenty of commentary out there on the differences between programming and software engineering, and I think the lessons above belong firmly in the software engineering camp.
