
FS#51415 — Outage ApiServer GRA7 June 5th Post-Mortem

Attached to Project: Kubernetes
Incident
Backend / Core
CLOSED
100%
|--------------------------------------------------------------------------------------------------------|

2021-06-05 19:37 ~ 20:03 UTC

On-call intervention on incidents raised by OpsGenie, such as:
- customer's node re-installation
- workload rescheduling to spread the load between several admin nodes

|--------------------------------------------------------------------------------------------------------|

2021-06-05 20:05 UTC

Detected that one of our proxies in front of customers' ApiServers was consuming too much CPU.
This proxy is called Pokeflute.
Started recycling Pokeflute.

|--------------------------------------------------------------------------------------------------------|

2021-06-05 20:15 UTC

Received many different alerts symptomatic of a communication problem with customers' ApiServers.
Stopped some alerting bots.
Started the investigation.

|--------------------------------------------------------------------------------------------------------|

2021-06-05 20:15 ~ 20:45 UTC

Analyzed logs to understand the situation and identify the root cause.

|--------------------------------------------------------------------------------------------------------|

2021-06-05 21:23 ~ 22:20 UTC

Cleaned up the last traces of the outage:
- redeployed some customer clusters/nodes/components in ERROR state
- cleared alerts and restarted all monitoring/alerting bots

|--------------------------------------------------------------------------------------------------------|

2021-06-05 22:20 UTC

End of the incident

|--------------------------------------------------------------------------------------------------------|
Date: Wednesday, 16 June 2021, 17:46
Reason for closing:  Done