This blog post is the post-mortem for the AutoTest incident that caused intermittent outages during the night of April 20th and the day of April 21st, 2022. We understand that this issue impacted teachers and students, and we apologize to everyone who was affected. We hope that this post-mortem gives insight into what went wrong and how we are going to prevent these kinds of issues from happening in the future.
Between Wednesday April 20th 2022 21:00 CEST and Thursday April 21st 2022 14:30 CEST, problems with AutoTest occurred. During this period some users were not able to use AutoTest at all, and at times CodeGrade was unreachable entirely. The root cause of this incident was that some queries that locked certain important rows returned too much data, causing requests to time out. Because the locks on the database were still held, other requests timed out as well. This in turn put a very high load on our database, causing unrelated requests to fail too.
Timeline
- April 20th 2022 12:30 CEST: Monitoring detected a single assignment running too slowly.
- April 20th 2022 14:30 CEST: Temporary fix deployed; full investigation and permanent fix planned for the next day.
- April 20th 2022 21:00 CEST: Problem returns, now also causing a very high database load.
- April 20th 2022 21:30 CEST: Problem is escalated to the highest internal severity level.
- April 20th 2022 22:00 CEST: Database is upgraded, and assignments causing issues are paused.
- April 20th 2022 23:00 CEST: On-call plan for the night is created; first work on a permanent fix has started.
- April 21st 2022 07:00 CEST: All development effort is focused on finalizing the permanent fix; temporary fixes are no longer developed and pausing assignments is stopped.
- April 21st 2022 07:15 CEST: The issue can be replicated on our staging environment.
- April 21st 2022 10:00 CEST: Affected customers are notified individually.
- April 21st 2022 13:00 CEST: All customers are notified of downtime.
- April 21st 2022 13:15 CEST: A fix that is thought to permanently solve the issue is deployed to our staging environment.
- April 21st 2022 14:30 CEST: After testing, the permanent fix is deployed to all customers. This drastically reduces load right away.
- April 21st 2022 15:15 CEST: All customers are notified that the issue has been resolved. Internally, priority shifts to careful manual monitoring of all our systems.
- April 21st 2022 23:59 CEST: No issues are detected and the incident is marked internally as resolved.
Root cause
The root cause of this issue was twofold. The initial problem was that some AutoTest configurations generated more output than anticipated. CodeGrade has limits in place so that this should not impact our database; however, due to an oversight on our side, we were storing too much data in single database rows. That would normally not be a problem on its own, but the issue was made worse by the fact that this data was stored in rows that we often lock using `SELECT ... FROM ... FOR UPDATE` for correctness guarantees. These large rows caused requests to time out, but because of a missing configuration value the timeouts did not abort the corresponding transactions. As a result, other queries that tried to lock the same rows became slow, and those requests timed out as well. These two issues combined produced high load on an external broker service, which locks rows in the same database. The broker had a programming error that caused it to deadlock under such high load. This resulted in an even higher database load (caused by the deadlock detection algorithm), causing even non-AutoTest-related requests to time out.
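The failure mode can be illustrated with a simplified, hypothetical simulation (plain Python threads standing in for database sessions; this is not CodeGrade code): a client-side timeout alone does not release a server-side row lock. Unless the server also aborts the transaction, the lock stays held and every other request queues behind it.

```python
import threading
import time

row_lock = threading.Lock()  # stands in for a row locked with SELECT ... FOR UPDATE
events = []

def slow_request():
    # Acquires the "row lock", then reads an oversized row (simulated by sleep).
    with row_lock:
        events.append("slow: lock acquired")
        time.sleep(0.5)              # reading the huge row takes a long time
        events.append("slow: done")  # the lock is only released here

def other_request(client_timeout):
    # A second request that needs the same row, with a client-side timeout.
    if row_lock.acquire(timeout=client_timeout):
        row_lock.release()
        events.append("other: got lock")
    else:
        # The client gave up, but nothing aborted the slow transaction,
        # so the lock is STILL held and the next request will queue too.
        events.append("other: timed out, lock still held")

t1 = threading.Thread(target=slow_request)
t2 = threading.Thread(target=other_request, args=(0.1,))
t1.start()
time.sleep(0.05)  # ensure the slow request grabs the lock first
t2.start()
t2.join()
assert not row_lock.acquire(blocking=False)  # lock is indeed still held
t1.join()
print(events)
```

In PostgreSQL, for example, server-side settings such as `statement_timeout` and `idle_in_transaction_session_timeout` exist precisely so that such transactions are aborted and their locks released, instead of being kept alive after the client has already given up.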
Resolution and recovery
The initial problem with AutoTest was detected at 12:30 CEST on April 20th and remedied by a temporary fix. At that moment we did not yet know the root cause, but as our temporary fix seemed to work we decided to work on the permanent fix the next day and not increase the severity level. When the problem returned at 21:00 CEST we first tried to remedy the situation by upgrading our database to a machine approximately 16 times the size. This did have some effect, but it did not solve the issue completely. At this point we started pausing multiple assignments we thought were causing the issue. About two hours later, multiple engineers had been notified and started investigating the root cause. This effort was made more difficult by the fact that the database server was responding very slowly, and that initially the issue could not be replicated on our staging environment.
The next day we were able to fix the issue completely with multiple fixes:
- We removed a large part of the locking in the broker, instead using a WAL-like (write-ahead log) technique to allow lock-free updating of rows. Some locks that were in place to prevent job starvation have been removed entirely; instead we are now able to detect this rare issue and recover from it quickly.
- We no longer retrieve step output from our database when we don't need it. This means we never retrieve logs when we lock database rows. Because of the data type used, this should also mean that the logs are no longer read from disk by our database.
- If logs exceed a certain size we no longer store them in the database at all; instead we store them in S3. This is transparent to the user and does not change our API.
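The third fix can be sketched as follows. This is a minimal illustration with made-up names and an in-memory dict standing in for both S3 and the database, not CodeGrade's actual implementation: small logs stay inline in the row, larger ones are uploaded to object storage and replaced by a reference, and readers never see the difference.

```python
# Hypothetical sketch: names and the threshold are illustrative only.
MAX_INLINE_SIZE = 64  # bytes; a real threshold would be far larger

object_store = {}   # stands in for S3
database_rows = {}  # stands in for the logs table

def store_log(step_id: int, log: bytes) -> None:
    if len(log) <= MAX_INLINE_SIZE:
        database_rows[step_id] = {"inline": log}
    else:
        key = f"logs/{step_id}"
        object_store[key] = log                # upload to "S3"
        database_rows[step_id] = {"ref": key}  # only a small pointer in the DB

def load_log(step_id: int) -> bytes:
    # Callers don't know or care where the log actually lives.
    row = database_rows[step_id]
    if "inline" in row:
        return row["inline"]
    return object_store[row["ref"]]

store_log(1, b"short log")
store_log(2, b"x" * 1000)
assert load_log(1) == b"short log"
assert load_log(2) == b"x" * 1000
assert "inline" not in database_rows[2]  # the big log never lands in the DB row
```

Because both `store_log` and `load_log` hide the storage location, the API that serves logs to users does not change: only the size of the database rows does.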
Corrective and preventative measures
Over the weekend we’ve conducted an internal review of the outage. The aspects we’ve identified that will be improved are listed below:
- Better identify which columns in database tables can grow in size, and make sure that each instance is restricted, either by setting limits or by storing fields over a certain size in S3.
- Further reduce locking where it is not needed. In many cases invalid rows can easily be filtered out while reading, removing the need to lock a metadata row.
- Identify more columns that can be loaded only on demand. Our ORM has built-in support for this practice; we just need to continue our effort to identify the places where it would improve performance.
- Improve our monitoring for database load. We were not able to identify quickly enough what was causing the high database load, as the load was not reflected in CPU or disk usage.
- Create better internal processes for deciding when to increase severity. We lost valuable time on April 20th because we postponed developing a permanent fix and starting our root cause analysis.
- Tied to the previous point: We are improving how we communicate during outages. We weren’t as open about the state of the outage as we believe we should have been. We did update our status page (https://status.codegrade.com), however we should have communicated earlier and with more detail.
- We are also working towards a clearer separation between subsystems, so that load caused by AutoTest has less of an effect, or none at all, on basic functionality such as submitting and viewing feedback.
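The on-demand loading measure above corresponds to what ORMs usually call deferred column loading. A minimal hand-rolled sketch of the idea (with hypothetical names; a real ORM would issue the extra `SELECT` itself) shows why it helps: the expensive column is only fetched on first access, so queries that merely lock or list rows never touch it.

```python
class deferred:
    """Descriptor that loads an expensive column only on first access."""

    def __init__(self, loader):
        self.loader = loader

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # First access: run the loader (one extra query) and cache the result
        # in the instance dict, so the descriptor is never consulted again.
        obj.__dict__[self.name] = self.loader(obj)
        return obj.__dict__[self.name]

FETCHES = []  # records every simulated "SELECT log FROM ..." query

def _load_log(obj):
    FETCHES.append(obj.step_id)
    return f"log for step {obj.step_id}"

class StepResult:
    log = deferred(_load_log)  # expensive column, loaded lazily

    def __init__(self, step_id):
        self.step_id = step_id  # cheap columns load immediately

r = StepResult(7)
assert FETCHES == []                  # locking/listing the row never reads the log
assert r.log == "log for step 7"      # fetched here, on demand
assert FETCHES == [7]
r.log
assert FETCHES == [7]                 # cached: no second fetch
```

Since `deferred` only defines `__get__`, it is a non-data descriptor: once the value is cached in the instance dict, Python's normal attribute lookup finds it there first, which is what makes the second access free.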
Again, we apologize for the disruption. We hope that this post makes it clearer what happened and how we are going to prevent these kinds of issues from causing significant disruptions in the future.
Co-founder & CTO