As software developers, we all strive to write bug-free and reliable code. However, unexpected issues sometimes arise, and we need to react quickly to mitigate their impact. In this article, we will discuss several real-world incidents that occurred in a production environment, examine their root causes, and discuss the lessons learned from the experience.

Prior events

On March 2nd, we migrated to the latest container platform, accompanied by a huge release for the squad I’m in.
On March 9th, the first release after the platform migration was deployed to production.

Production incidents

Container restart caused by memory leak

On March 7th, I was notified by a tech lead from a different squad about a memory leak in our containers running in production. The leak had forced the production containers to restart twice since the deployment. This was obviously not a good sign, so I put aside all my work and started investigating the issue immediately.

Investigating what had changed

Since March 7th was only the 3rd working day after the release on March 2nd, I started by reviewing the code changes that had been merged. I spent roughly 15 minutes on this, and nothing in the code stood out as an obvious cause of the leak.

Another thought immediately came to mind: if this were indeed a code issue, it should have been caught by testers, so digging further into the code might not be useful after all.

My next step was to look into the logs. They showed a large number of 403 errors related to a 3rd party SDK. The SDK kept trying to connect to an external address even after receiving 403 responses, flooding our logs with errors.

This made me suspect the actual cause of the issue lay at the network layer, so I contacted the network team to look into it.

Eventually, we found that one of the URLs was blocked at the Squid proxy level, and adding it to the whitelist resolved the problem.

Root Cause Analysis

The root cause of the memory leak was the continuously failing network calls. Because these calls were not properly released and garbage collected, memory grew without bound, eventually resulting in an OOMKill (out-of-memory kill) from OpenShift 4.
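The SDK’s internals were not available to us, so the exact leaking code is unknown. The sketch below is purely hypothetical (the connect stand-in, the retry interval, and the URL are assumptions) and only illustrates the general shape of the problem: an endless retry loop that keeps every failed attempt reachable, so the garbage collector can never reclaim the associated memory.

```js
// Hypothetical sketch only; the 3rd party SDK's source was not available.
// `connect` stands in for the SDK's outbound call, which the proxy rejected with a 403.
const connect = (url) => Promise.reject(new Error(`403 Forbidden: ${url}`));

// The leak pattern: every failed attempt stays referenced here and is never
// removed, so memory grows with each retry until OpenShift OOMKills the container.
const failedAttempts = [];

function reconnectForever(url) {
  connect(url).catch((err) => {
    failedAttempts.push({ url, err, at: new Date() }); // grows without bound
    setTimeout(() => reconnectForever(url), 1000);     // retries forever
  });
}

reconnectForever('https://third-party.example.com/feed'); // placeholder URL
```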

Mitigations

  • Short term: allow outbound traffic to the 3rd party site.
  • Long term: investigate why the failing network call causes a memory leak. Even in the case of a network failure, well-written code should not leak memory.

Lessons Learnt

  • Platform migrations need to be treated with care. At a minimum, basic monitoring should be done in the lower environments after the migration is complete.
  • Think in advance about what network-layer issues the migration could cause, such as ports, firewalls, and proxies.

Container back-off in production environment after deployment

This one was really unexpected. The deployment of the weekly release started at 1 pm on March 9th. Shortly after the release, the DevOps team notified us that our containers were restarting every few minutes. Because we had no immediate solution, we had to roll back the weekly release.

Investigation

Due to the importance of the items in this weekly release, we had to begin the investigation immediately. I led the investigation and started with almost the same steps as before: checking the list of changes, then checking the logs.

One of the most suspicious items on the list was the Node 18 upgrade. To confirm this, I looked into the logs, where I could see that an error was thrown and the container crashed. The error was related to an error response from a downstream service.

Looking at the logs from previous days, I found that this error had existed for quite some time; in fact, it had been there since the release on March 2nd. At this point, I was confident the crashes were related to the Node 18 upgrade.

Root Cause Analysis

  • The Node 18 upgrade was clearly the direct cause of the crash. Starting in Node 15, the default behavior changed: an unhandled promise rejection now crashes the application. In previous versions of Node (e.g. Node 12), an unhandled promise rejection did not crash the app; it only produced a deprecation warning. A minimal example of this behavior change is shown after this list.
  • However, there is more to it. Several other questions also needed to be answered:
    • What was wrong with the request sent to the downstream service?
    • Why was the error response returned from the downstream service not handled correctly? In other words, why did it surface as an unhandled promise rejection rather than a graceful failure?
    • Why wasn’t this caught by the unit tests?
  • Besides the code itself, I also had doubts about testing and release management:
    • How was the Node 18 migration tested, and why wasn’t this issue found during manual testing?
    • What were the risks of upgrading to Node 18, and were they made clear to the product owner and the team?
    • What was the rationale for bundling the Node 18 migration with items of high importance and urgency to the business?
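To make the behavior change concrete, here is a minimal, self-contained demonstration (not code from our application):

```js
// demo.js: a demonstration of the runtime behavior change, not application code.
async function callDownstream() {
  // Simulates a downstream error response surfacing as a rejected promise.
  throw new Error('400 Bad Request from downstream');
}

// The rejection is never awaited or .catch()-ed, so it becomes an
// unhandled promise rejection.
callDownstream();

// Node 12:  logs an UnhandledPromiseRejectionWarning and keeps running.
// Node 15+: terminates the process with a non-zero exit code by default.
// Running it as `node --unhandled-rejections=warn demo.js` restores the old behavior.
```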

Mitigations

  • Short term:
    • On the night of the incident, we developed a workaround. I call it a workaround because the downstream error was still not fixed and we were still seeing it in the logs, but we added the --unhandled-rejections=warn flag to the Node runtime so that it would treat unhandled promise rejections the same way as older versions did. This mitigation was deployed the next day.
    • At the same time, the developers of the downstream service notified us of an issue with the way we were calling it, which was what caused the error response. Because this issue had a high customer impact, we had to do another production deployment on the same day.
  • Long term:
    • Thoroughly review the way we use promises in the codebase and make sure every place with this issue has been fixed before we remove the flag.
    • Investigate and find ways to cover this specific scenario (unhandled promise rejection) through unit tests; see the test sketch after this list.
    • Investigate what can be improved in the process to prevent similar issues from happening again.
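As a sketch of what that unit-test coverage might look like, here is a hypothetical Jest test; fetchQuotes and the injected client are placeholders rather than our real code. The goal is to assert that a downstream failure settles into a graceful result instead of escaping as an unhandled rejection.

```js
// Hypothetical Jest test; fetchQuotes and the injected client are placeholders.
const { fetchQuotes } = require('./quotes-service');

test('downstream failure is handled gracefully, not left as an unhandled rejection', async () => {
  // Simulate the downstream service returning an error response.
  const failingClient = {
    get: jest.fn().mockRejectedValue(new Error('400 Bad Request from downstream')),
  };

  // If fetchQuotes catches the rejection and returns a failure object, this
  // assertion passes; if the rejection escapes instead, the test fails here
  // rather than the error surfacing as a crash in production.
  await expect(fetchQuotes(failingClient)).resolves.toEqual({
    ok: false,
    error: '400 Bad Request from downstream',
  });
});
```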

Lessons Learnt

  • The main issue with the code turned out to be a misunderstanding of how promises work: in JavaScript, a promise starts executing as soon as it is created; Promise.all is merely a mechanism to wait for promises that are already running to finish. This misunderstanding produced the erroneous code that led to the unhandled promise rejection and the eventual crash. A sketch of this kind of mistake is shown after this list.
  • It was also discovered that the incorrect request to the downstream service had existed since at least the previous October, due to a miscommunication at the time, and the issue was amplified after the new journey went live on March 2nd.
  • Testing: testers need to cover negative scenarios when doing regression. The direct cause of this issue would have been caught if negative scenarios had been covered as part of the regression testing.
  • Product ownership: the product owner should evaluate the risks associated with release items together with the development and testing teams. Both teams should give an independent evaluation of the risk of each item, and the product owner should make the decision based on each item’s importance and risk. Items with high business value, and therefore high importance, should not be bundled with high-risk items.
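To illustrate the kind of mistake this misunderstanding leads to, here is a simplified sketch; callDownstream is a placeholder, not our actual code. Both calls start the moment the promises are created, so awaiting them one at a time inside a try/catch leaves the second rejection unobserved when the first one throws, while Promise.all waits on both and observes whichever rejections arrive.

```js
// Placeholder downstream call that always rejects, standing in for the real request.
const callDownstream = (path) =>
  Promise.reject(new Error(`400 Bad Request: ${path}`));

async function buggy() {
  // Both requests start executing here, as soon as the promises are created.
  const first = callDownstream('/quotes');
  const second = callDownstream('/rates');
  try {
    await first;        // rejects, so control jumps to catch...
    await second;       // ...this line never runs, and second's rejection is
  } catch (err) {       // never handled: an unhandled promise rejection that
    console.error(err); // crashes Node 15+ by default.
  }
}

async function fixed() {
  const first = callDownstream('/quotes');
  const second = callDownstream('/rates');
  try {
    // Promise.all waits for both already-running promises and observes every
    // rejection, so nothing is left unhandled.
    await Promise.all([first, second]);
  } catch (err) {
    console.error(err);
  }
}
```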

Conclusion

These incidents highlighted the importance of understanding how promises work in JavaScript, comprehensive testing, effective communication and collaboration between teams, and risk evaluation in the release process. The team was able to identify the root causes of the issues, develop short-term and long-term mitigations, and learn valuable lessons for future development and testing processes.