Service Disruption
Incident Report for Infracom Infinity
Postmortem

Incident report covering the disturbance on 2022-12-02 between 11:48 and 12:23.

At 11:48, one host in our oVirt virtualization cluster failed and rebooted. Unfortunately, the VMs running on this host got stuck in an unreachable state, and high availability failed to restart them. We then tried to restart them manually, without success. It took some time to restart the engine (controller), after which the VMs became unstuck and high availability did its job and restarted them on the other hosts.

This failure brought down four important VMs:

* rabbit01 – our message broker for the UC2 platform

* sip03 – one of six SIP proxies for the UC2 platform

* sipac03 – a SIP account server

* siptrunk02 – one of two upstream-facing SIP proxies in NgCore

The failure of rabbit01 was the most serious: it caused all events to fail, including CDR, presence and more. It also caused the web portal to fail (we are currently investigating why).

The loss of siptrunk02 should not have been a major fault, but we had problems with failover of calls going to the PSTN via siptrunk02. This should not happen, as such calls should fail over to siptrunk01.
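The intended behaviour, trying the next trunk when the first one does not respond, can be sketched as below. This is only an illustration under stated assumptions, not how NgCore is implemented: `route_call`, `AllTrunksFailed` and the `send` callback are hypothetical names, and a real SIP proxy would make this decision from SIP timers and response codes.

```python
from typing import Callable, Sequence


class AllTrunksFailed(Exception):
    """Raised when no trunk in the list accepted the call."""


def route_call(
    call_id: str,
    trunks: Sequence[str],
    send: Callable[[str, str], bool],
) -> str:
    """Try each trunk in order and return the first one that accepts the call.

    `send(trunk, call_id)` is a hypothetical callback. It returns True when
    the trunk accepts the call, and returns False or raises OSError
    (for example TimeoutError) when the trunk is unavailable.
    """
    for trunk in trunks:
        try:
            if send(trunk, call_id):
                return trunk
        except OSError:
            # A network error or timeout means this trunk is unavailable,
            # so fall through to the next trunk in the list.
            continue
    raise AllTrunksFailed(f"no trunk accepted call {call_id}")
```

With the trunk list `["siptrunk02", "siptrunk01"]` and a `send` callback that times out for siptrunk02, the call would be routed via siptrunk01 instead of being dropped.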

Actions we are going to take:

* Build a RabbitMQ cluster. When we started using RabbitMQ in 2013, it did not have a good way to cluster. Now there are Quorum Queues, and we have started testing them. The plan is to make RabbitMQ resilient; if that is not possible, we will have to look for a message broker that can be.

* Find and fix the reason we failed to fail over calls in NgCore.

* Find the reason why the host failed.
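As background to the RabbitMQ action above: a quorum queue is replicated across several nodes and stays available as long as a majority of its replicas are online. A queue is declared as a quorum queue by setting the queue argument `x-queue-type` to `quorum`. The sketch below only illustrates the majority arithmetic. The function names are hypothetical, and the replica counts are examples, not our planned cluster size.

```python
# Queue argument that asks RabbitMQ to create a quorum queue.
QUORUM_QUEUE_ARGUMENTS = {"x-queue-type": "quorum"}


def quorum_size(replicas: int) -> int:
    """Smallest number of nodes that forms a majority of `replicas`."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Number of nodes that can be lost while a majority stays online."""
    return replicas - quorum_size(replicas)


def queue_available(replicas: int, nodes_up: int) -> bool:
    """True when enough replicas are online to form a majority."""
    return nodes_up >= quorum_size(replicas)
```

Under this rule a single broker such as rabbit01 tolerates no host failure, while a queue with three replicas would survive the loss of one host.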

Posted Dec 05, 2022 - 13:39 CET

Resolved
Monitoring
We have now finished monitoring the affected services and consider the problem resolved.

We apologize for any inconvenience this incident caused you and your customers.

If you or your customers experience any remaining issues after the resolution time above, please contact us.

Post mortem
The root cause of the server outage will be investigated, and information about it will be published in this incident.
Posted Dec 02, 2022 - 15:19 CET
Monitoring
Services restored
A service in our VM cluster was restarted at approximately 12:15.
Servers were back up at approximately 12:25.
Remaining minor errors were resolved by approximately 13:05.

Monitoring
We will continue to monitor the traffic until 16:00 while investigating the root cause.

If you or your customers experience any remaining issues after the resolution time above, please contact us.
Posted Dec 02, 2022 - 13:12 CET
Investigating
Time
2022-12-02 11:45

Description
We are currently experiencing a service disruption and are troubleshooting with the highest priority.
Problems logging in to applications
Events are not being sent out
Problems with two SIP servers that handle individual users' phones and SIP clients

Next update
12:20
Posted Dec 02, 2022 - 12:06 CET
This incident affected: Infracom Infinity (API, Web Portal, Apps, Desk Phones & other SIP Devices).