2024-04-22 - Hub Incident Post Mortem
Beginning on Monday, April 22nd, 2024, at 8:45, the Hugging Face Hub experienced an extended outage. The incident lasted until Wednesday, April 24th, at 10:03, when service was fully restored. During the incident, the Hugging Face website was slow or inaccessible for most users.
This blog post describes the timeline of the event, what failed, and the changes we made to ensure that such an event does not happen again.
All timestamps/dates are in CET unless explicitly stated otherwise.
Context
Here is a simple diagram showing the architecture of the Hugging Face Hub in production:
And some links to the relevant building blocks:
Summary
22/04 at 8:45, instabilities started, escalating until 22/04 at 9:30, when the hub became unresponsive.
Two read-only nodes were down, as we can see in the Atlas metrics; they kept restarting:
The cluster autoscaled once, but we then had to manually scale to the next tier, which put the cluster back in a somewhat normal state. From the support case:
At 04/22/24 - 07:04:24 AM UTC, the cluster started scaling from tier 1 to tier 2; the upgrade completed and the cluster returned to the HEALTHY state at 04/22/24 - 07:33:52 AM UTC.
Another manual scale-up was triggered at 04/22/24 - 12:46:48 PM UTC to bring the cluster to tier 3.
We noticed a spike in traffic coming from an internal source using a Java/17 user-agent, so at 22/04 10:22 we added a firewall rule to block all requests with that UA, fearing this was the cause. While it alleviated the load, this didn't fix the underlying issue.
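The block itself was a firewall rule, not application code; purely as an illustration of its effect, a hypothetical application-level check might look like this:

```javascript
// Illustration only: the real block was a firewall rule, not application code.
// This hypothetical check rejects any request whose user-agent starts with "Java/".
function isBlockedUserAgent(userAgent) {
  return /^Java\//.test(userAgent || "");
}

console.log(isBlockedUserAgent("Java/17"));     // true -> request would be rejected
console.log(isBlockedUserAgent("Mozilla/5.0")); // false -> request passes
```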
22/04 at 10:44, we blocked the quicksearch feature on the hub.
The hub was still unstable although the database seemed healthy, so we scaled up the nodes of the hub's k8s cluster manually; a configuration issue had prevented this from happening automatically.
22/04 at 14:26, we tweeted about the incident.
22/04 at 14:50 Switched to maintenance mode and displayed a 503 page. This stopped traffic in an attempt to alleviate load and return to a nominal state.
22/04 at 15:00, downscaled all the hub’s pods.
22/04 at 15:51, everything was back online.
23/04 at 21:30, hub down
This time, the number of connections and the system memory drastically increased, then the primary OOMed. The cluster elected a new primary node, but the same thing happened again, and the cluster re-elected the previous primary.
23/04 at 21:56, we decided to switch the hub to maintenance mode and block all incoming traffic. That reduced the load on the database and the cluster returned to its nominal state.
24/04 at 8:00, hub was slow / unresponsive again. We could see the same pattern as the previous night occur.
24/04 at 8:12, the hub was scaled to 0 and a maintenance page was set. Around 24/04 at 8:18, we lifted the maintenance page as everything was stable, and things went back to normal.
24/04 at 10:03, suspicious metrics pointed at the hub going down again, so we set an aggressive rate limit. Every metric went back to normal and the hub did not go down.
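The actual limit was applied at the infrastructure level; as a minimal sketch of the idea, a fixed-window rate limiter (all names and numbers below are illustrative, not our production values) looks like this:

```javascript
// Minimal fixed-window rate limiter sketch; the real limit was set at the
// infrastructure level, and WINDOW_MS / MAX_REQ below are made-up numbers.
const WINDOW_MS = 60_000;
const MAX_REQ = 100;
const windows = new Map(); // client key -> { start, count }

function allow(key, now = Date.now()) {
  const w = windows.get(key);
  if (!w || now - w.start >= WINDOW_MS) {
    windows.set(key, { start: now, count: 1 }); // open a new window for this client
    return true;
  }
  w.count += 1;
  return w.count <= MAX_REQ; // reject once the per-window budget is spent
}
```

A request is served only while `allow` returns true; anything above the budget is rejected until the window rolls over.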
Traffic then remained stable while we cleaned up old unused indexes and gathered more metrics about the database. The following tools & resources proved very useful:
- keyhole
- https://github.com/simagix/keyhole/wiki/
- hatchet
- Peek at your MongoDB clusters like a pro with keyhole part 1
- Peek at your MongoDB clusters like a pro with keyhole part 2
- https://www.mongodb.com/community/forums/t/how-to-analyze-mongodb-ram-usage/12108/3
Root Cause Identification
We had been through all the charts and graphs trying to spot something, but couldn't find anything. In a last-ditch attempt, we decided we needed a graph breaking down the number of requests per URL called. We built it, and when looking at the second crash (23/04 at 21:30), we observed a suspiciously timed request:
When calling https://huggingface.co/api/models/sentence-transformers/all-mpnet-base-v2/revision/main, we could observe a list of 300+ spaces associated with this model.
What we had been observing when going through all the logs & metrics of the Hub's mongodb database is that the *_info_cache collections were overused: poor average latency, very high network out.
Note that the response times here are biased by the latencies observed during the outage. Also, the two screenshots come from different mongodb replicas.
This prompted us to think that this call was in fact using space_info_cache (which it was) and might be the cause of excessive database usage. So we investigated further: we isolated that specific URL, and the increase in traffic to that endpoint correlated perfectly with an increase in database network-out usage, which had been a precursor to some of the crashes.
We finally decided to attempt to reproduce the problem by running the following k6 script:
import http from "k6/http";

export const options = {
  vus: 512,
  duration: "10m",
};

export default function () {
  const response = http.get("https://huggingface.co/api/models/sentence-transformers/all-mpnet-base-v2/revision/main");
}
$ k6 run kill_hub.js
Running the script resulted in an increase in all the aforementioned metrics, and when attempting to load huggingface.co we observed either slowness or total unresponsiveness. We stopped the script to avoid breaking the mongodb cluster. That was it!
Note: we ran the script 3 times, at 24/04 18:05, 18:15, and lastly ~18:45.
From our observations, the first crash was not caused by a call to this URL; when looking at the breakdown per URL at the time of the incident, we can see other suspicious-looking queries:
This call saw a big increase in requests right before the incident, and given that it has some of the properties of the call analyzed above (it uses datasets_info_cache, one of the 3 collections that put a significant strain on the database), we can conclude that it was the trigger for the incident.
We had trouble reproducing with the call to https://huggingface.co/api/datasets/lighteval/mmlu/revision/main given that the next morning a lot of work had been done by the hub team to mitigate the root cause.
Lastly, on the 23/04 evening crash, we saw an increase in requests to the same endpoint:
These very resource-intensive requests put the database in a degraded state. What made things worse is that our Hub's NodeJS code didn't cancel database requests when users cancelled theirs. This caused a thundering-herd issue, where all already-ongoing db requests would continue running until completion, even if no one was listening at the other end. For example, when a page was slow to respond and a user hard-refreshed it, all ongoing queries would keep running.
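A sketch of the missing behaviour, with a fake slow query standing in for a real MongoDB call (recent Node driver versions expose similar abort support): propagating the client's abort signal lets the orphaned work stop immediately.

```javascript
// Sketch: propagate client aborts to in-flight DB work. slowQuery is a
// stand-in for a real MongoDB call; the names here are illustrative.
function slowQuery(signal) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("rows"), 5000);
    signal.addEventListener("abort", () => {
      clearTimeout(timer);            // free the resource immediately
      reject(new Error("aborted"));
    });
  });
}

async function handleRequest(clientSignal) {
  try {
    return await slowQuery(clientSignal);
  } catch (e) {
    return e.message;                 // the query stopped, no orphaned work
  }
}

const ac = new AbortController();
const pending = handleRequest(ac.signal);
ac.abort();                           // simulate the user hard-refreshing
pending.then((r) => console.log(r));  // prints "aborted"
```

Without the abort listener, the fake query (like our real ones) would run to completion for a caller that is no longer there.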
So, with back-of-the-envelope numbers: if a page triggered 10 db requests, and 50k users trying to do something on a Monday morning each refreshed their page 3 times after waiting 30s, mongodb would have 3 × 10 × 50,000 = 1.5 million more requests than usual at that time. Meaning that if our DB was not in ship-shape, i.e. someone hit it real hard, things would snowball, ultimately leading to a crash.
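Taking the three refreshes at face value, the amplification works out to:

```javascript
// Back-of-the-envelope numbers from the text: each refresh re-issues the page's
// db queries, which keep running server-side even after the client gives up.
const dbRequestsPerPage = 10;
const users = 50_000;
const refreshes = 3;

const extraRequests = refreshes * dbRequestsPerPage * users;
console.log(extraRequests); // 1500000 extra in-flight queries piling up
```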
Mitigation / Solution
To prevent any further snowball effect, we added timeouts to requests based on their usual completion time. We also hunted down and optimised all the requests that, during the events, took too long to respond or responded with a large body.
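As a hypothetical sketch of the idea (the names and numbers below are illustrative, not our actual query layer), MongoDB supports a per-query `maxTimeMS` cap that can be derived from a query's usual completion time:

```javascript
// Hypothetical sketch: derive a per-query maxTimeMS cap from the query's
// usual completion time. Names and numbers are illustrative.
const typicalMs = { listSpaces: 120, quickSearch: 300 };

function queryOptions(name) {
  // Allow a generous multiple of the usual latency before MongoDB kills the query.
  return { maxTimeMS: 5 * typicalMs[name] };
}

// e.g. collection.find(filter, queryOptions("listSpaces")) with the Node driver
console.log(queryOptions("quickSearch").maxTimeMS); // 1500
```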
We won't go over the details of what was fixed, as quite a large number of PRs went through! But tl;dr: poorly designed database queries were reworked.
Long Term Solution
- Moving *_info_cache to Redis.
- Making it possible to abort any db request if a client aborts theirs.
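A minimal sketch of the cache-aside pattern the first item implies, with an in-memory Map standing in for Redis (in production this would be Redis GET/SET with a TTL; the function names are made up):

```javascript
// Cache-aside sketch for *_info_cache data; a Map stands in for Redis here.
const cache = new Map();

async function getModelInfo(id, fetchFromDb) {
  if (cache.has(id)) return cache.get(id);   // hot path: no DB round-trip
  const info = await fetchFromDb(id);        // cold path: hit MongoDB once
  cache.set(id, info);                       // subsequent reads stay off the DB
  return info;
}

// usage: the fetcher only runs on a cache miss
getModelInfo("sentence-transformers/all-mpnet-base-v2", async () => ({ spaces: 300 }))
  .then((info) => console.log(info.spaces)); // prints 300
```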
What went well / wrong?
The Good
- Improved our understanding of the system, which enabled us to anticipate failures and address them proactively before any downtime
- We cleaned up the database, amongst other things we removed old indexes, improving efficiency
- Our in-depth inspection of the database revealed architectural flaws and bottlenecks, such as a collection with large unused indexes
- New dashboards and monitoring metrics were built during the incident, which will be valuable for future monitoring
- We learned which metrics need closer monitoring to anticipate and prevent system crashes
- Lots of team members were involved and got their hands dirty to fix things asap
The Bad
- Extended downtimes
- It took 2-3 days to identify the root cause
- We had unrelated simultaneous system failures, increasing the stress and difficulty in restoring normal operations
Timeline
| timestamp | event |
|---|---|
| 22/04 8:45 | Instabilities start |
| 22/04 9:04 | Autoscale tier 1 → tier 2 |
| 22/04 9:30 | Hub unresponsive |
| 22/04 9:33 | mongodb cluster is in HEALTHY state |
| 22/04 10:22 | Add firewall rule to block all traffic with Java/* user agent |
| 22/04 10:44 | Disable quicksearch feature on the hub |
| 22/04 14:26 | Public communication about the outage |
| 22/04 14:46 | Manual scale tier 2 → tier 3 |
| 22/04 14:50 | Enable maintenance mode with WAF |
| 22/04 15:00 | Downscale all hub pods to 0 |
| 22/04 15:51 | Everything is back to normal |
| -- | -- |
| 23/04 21:30 | Hub down |
| 23/04 21:56 | Maintenance mode + block all traffic |
| 23/04 22:19 | Maintenance is lifted and things go back to normal |
| -- | -- |
| 24/04 8:00 | Hub unresponsive |
| 24/04 8:12 | Maintenance page + downscaling all hub's pods |
| 24/04 8:18 | Lift maintenance & things are back to normal |
| 24/04 10:03-10:13 | Sudden spikes in metrics; aggressive rate limit is set and things go back to normal |