Incident report covering the disturbance on 2022-12-02 between 11:48 and 12:23
At 11:48 one host in our oVirt virtualization cluster failed and rebooted. Unfortunately, the VMs running on this host got stuck in an unreachable state, and high availability failed to restart them. We then tried to restart them manually, without success. Restarting the engine (controller) took some time; once it was back, the VMs got unstuck and high availability did its job and restarted them on the other hosts.
This failure brought down four important VMs:
* rabbit01 – our message broker for the UC2 platform
* sip03 – one of six sip proxies for the UC2 platform
* sipac03 – a sip account server
* siptrunk02 – one of two upstream-facing sip proxies in NgCore
The failure of rabbit01 was the most serious: it caused all events to fail, including CDR, presence and more. It also caused the web portal to fail (we are currently investigating why).
The loss of siptrunk02 should not have been a major fault, but we had problems with fail-over of calls going to the PSTN via siptrunk02. This should not happen, as such calls should fail over to siptrunk01.
Actions we are going to take:
* Build a RabbitMQ cluster. When we started to use RabbitMQ in 2013, it did not have a good way to cluster. Quorum queues are now available, and we have started testing them. The plan is to make RabbitMQ resilient; if that is not possible, we will have to look for a message broker that can be.
* Find out why we failed to fail over calls in NgCore, and fix it.
* Find the reason why the host failed.
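
For the RabbitMQ action above, this is a minimal sketch of how a quorum queue could be declared from Python, to illustrate what we are testing. It is not our production code. The pika client library, the host name and the queue name are assumptions made for this example. The one fixed detail is the queue argument `x-queue-type` set to `quorum`, which is how a client asks RabbitMQ for a quorum queue.

```python
# Sketch only: host and queue names are hypothetical, not taken from our setup.


def quorum_queue_arguments():
    """Return the optional arguments that request a quorum queue."""
    return {"x-queue-type": "quorum"}


def declare_quorum_queue(channel, name):
    """Declare a quorum queue on an already open pika channel."""
    # Quorum queues have to be durable and cannot be exclusive.
    return channel.queue_declare(
        queue=name,
        durable=True,
        arguments=quorum_queue_arguments(),
    )


def connect_and_declare(host, name):
    """Open a blocking connection, declare the queue, then close again."""
    import pika  # third-party client, assumed to be installed

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    try:
        return declare_quorum_queue(connection.channel(), name)
    finally:
        connection.close()
```

A call such as `connect_and_declare("localhost", "uc2.events")` would create the queue on a local test broker. The queue is only replicated when the broker is part of a cluster with several nodes, so a single node like rabbit01 would not gain resilience from this alone.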