This week we had a technical incident, and I was brought in at the last minute to open a P1 ticket, declare it a critical incident, and inform teams and stakeholders. In the moment it was nerve-wracking: it was my first time doing so, it was the end of the US workday (most of my team is in Europe and was already off-hours), and I had many questions without answers. Still, I managed it by reading team correspondence, tickets, and the system demo video and notes. The issue was resolved early the next morning, and the root cause turned out to be a country firewall blocking access to the third-party service rather than a problem with our system. Nevertheless, I received questions about the delay in declaring the incident and a potential overestimate of the impact.
Looking back, several things were fortunate and went well: (1) the issue was resolved quickly by the third party after the case was raised; (2) we had a demo of the incident management system that morning and access was granted to additional people, otherwise I would not have had access or any idea where to start raising a P1; and (3) the teamwork in the resolution.
Unclear responsibilities and scale of impact, the root-cause investigation, time zone differences and a key member being out of office, and a team new to the processes (internal and external reporting) all added time to the declaration. I took responsibility for my part and learned a lot along the way. I’m glad we had a post-mortem to discuss the incident. Nobody pointed fingers; instead, we discussed where we can improve and agreed on 5 actions to make incident detection and management better going forward. That’s another piece of ‘luck’ I have: a great team who are not only talented but also honest and constructive in making each other better and more successful!