

As someone who works at another AWS-dependent org, it took us out too. It was awful, and there was nothing I could do on my end. Why the fuck didn't it get rolled back immediately? Why did it spread to a second region? Fucking idiots on the big team's side.
I got paged 140 times between 12 and 4 am PDT. Then there was another incident I had to hand off at 7am because I needed fucking sleep, and my team handled it until 1pm. I love my team, but it's awful that this was even able to happen. All of our own fuck-ups take 5-30 minutes to roll back or manually intervene on. This took them 2+ hours, and it was painful. Then it HAPPENED AGAIN! Like, what the fuck.
Last night from 12-4am almost every region was impacted, so failing over didn't help that much.
But we do have failovers for customers, which they have to activate themselves to start working in another region.
Our canaries and infrastructure alarms can't do that, though, since they exist specifically to alert on their own region.
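For anyone curious what customer-activated failover looks like in practice, here's a minimal sketch of the client-side pattern: walk an ordered region list and take the first region that answers a cheap liveness call. The region list, the choice of S3, and the probe call are all hypothetical stand-ins, not our actual tooling.

    import boto3
    from botocore.config import Config

    # Hypothetical preference order: primary region first, failovers after.
    REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

    def get_working_client(service="s3"):
        """Return a client for the first region that answers a cheap probe."""
        for region in REGIONS:
            client = boto3.client(service, config=Config(
                region_name=region,
                retries={"max_attempts": 2, "mode": "standard"},
                connect_timeout=3,
                read_timeout=5,
            ))
            try:
                # Cheap liveness probe; swap in a call your service actually uses.
                client.list_buckets()
                return client
            except Exception:
                continue  # region looks unhealthy, try the next one
        raise RuntimeError("all regions failed the health probe")

This is exactly why the canaries can't follow: the alarm that tells you us-east-1 is down has to live in (and watch) us-east-1, so it has nothing to fail over to.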