Between 10:32 UTC 2024-08-06 and 20:40 UTC 2024-08-07, we experienced three events affecting S3 and user services in all regions.
Starting at 10:32 UTC on 2024-08-06, our queueing service reached full capacity, which impacted our database cache and caused it to become unresponsive. The Wasabi Operations team restarted the primary database in an attempt to clear all stale database connections while simultaneously clearing the queueing service's queue. When this action failed to return the database to a fully operational state, the secondary database instance was promoted to primary. At 11:20 UTC the S3 service was fully operational again. Between 13:17 UTC and 13:23 UTC, Operations restarted the database once more in order to fully incorporate our queueing service library.
Between 02:55 UTC and 03:35 UTC on 2024-08-07, a second event occurred when our Operations team identified a configuration issue within the queueing service and the previously promoted secondary database instance. This configuration issue was causing timeouts on user services such as our Web Console, WAC API, and WACM interface. Our Operations team then promoted the primary database back to production, alleviating these issues. There was no impact to S3 services during this event.
Between 20:30 UTC and 20:44 UTC on 2024-08-07, a third event occurred when an automation cluster was not being detected by our automation service, causing a small decrease in accepted traffic to our S3 vaults. Our Operations team then recreated and redeployed this cluster, fully restoring the S3 service.