Delayed queue board presence

Incident Report for Sangoma

Resolved

RFO: Sangoma Contact Center (SC3) CX Outage


Date: December 6th, 2023


Issue Description:

On Monday, December 4th, at around 4:20 PM EST, the Sangoma Contact Center (SC3) CX platform experienced an issue caused by a network outage at an upstream internet provider in the US East region. The service disruption lasted about 55 minutes (between 4:20 PM EST and 5:15 PM EST).

At 4:10 PM EST, our monitoring system alerted us to an abnormal level of unprocessed telephony event messages in the CX event queue. These messages, which come from the CX platform voice servers, are used to provide real-time feedback on the supervisor and agent panels. When the event occurred, the size of the queue receiving those events suddenly spiked to 15,000 events, creating a significant processing backlog and causing agent and supervisor panels to become unresponsive.

Sangoma does not take outages lightly and is constantly striving to provide quality, reliable service to our customers. We understand that you rely on Sangoma for 24/7 service, and we appreciate your trust and understanding. Sangoma has already invested in new Juniper routers for all of its data centers and is actively working on expanding CCaaS data center redundancy with its latest Las Vegas data center.


Root Cause:

The monitoring system detected an unusual number of events in queues, prompting the CX development team to investigate. They found that 90% of these events were calls from our system to agents in a specific US eastern region. Approximately 600 agents in that area were logged in, but their network connection failed due to an outage with their internet service provider (ISP). The ACD attempted to send new calls to these agents, but due to the connectivity issues, the calls failed. The ACD then looped, continuously attempting to reach the disconnected agents, creating a backlog of telephony events. This backlog delayed the processing of events from other agents without connectivity issues, resulting in slow queue panel loading and unprocessed call answer events.
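The backlog mechanism described above can be illustrated with a toy model: when events are produced faster than they are consumed, queue depth grows every minute. This is only a sketch; the function name and all rates below are hypothetical and are not measurements from the incident.

```python
def simulate_backlog(minutes, normal_events_per_min, retry_events_per_min, consumed_per_min):
    """Return the queue depth at the end of each minute for a toy queue model.

    All rates are illustrative assumptions, not figures from the incident.
    """
    depth = 0
    history = []
    for _ in range(minutes):
        produced = normal_events_per_min + retry_events_per_min
        # Depth cannot go below zero: an empty queue stays empty.
        depth = max(0, depth + produced - consumed_per_min)
        history.append(depth)
    return history


# Normal traffic only: the consumer keeps up and the queue stays empty.
healthy = simulate_backlog(10, normal_events_per_min=400, retry_events_per_min=0, consumed_per_min=500)

# Normal traffic plus retry events from a looping dialer: the queue grows steadily.
looping = simulate_backlog(10, normal_events_per_min=400, retry_events_per_min=600, consumed_per_min=500)
```

In this model, events from healthy agents sit behind the retry events in the same queue, which is why their panels are delayed as well.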

How was service restored:
After identifying the root cause of the issue, our main focus was to reduce the backlog of events in the queue so that events could again be processed in real time. At 4:45 PM EST, we deployed an additional event consumer to speed up consumption of the queue backlog and allow the system to catch up. That strategy worked: the level of unprocessed events in the queues dropped from 15,000 to 5,000 in less than 15 minutes, and customers began to see improvements in the system's health. By 5:15 PM EST, the system had processed the entire event backlog and service returned to a normal state.
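As a rough worked example of the figures above, the net drain rate after the extra consumer was added can be estimated from the two queue depths and the elapsed time. The helper name is hypothetical, and because the drop took less than 15 minutes, the result is a lower bound on the actual rate.

```python
def net_drain_rate(start_depth, end_depth, elapsed_minutes):
    """Net events removed from the queue per minute (consumption minus new arrivals)."""
    return (start_depth - end_depth) / elapsed_minutes


# Queue depths reported above: 15,000 events down to 5,000 in under 15 minutes.
rate = net_drain_rate(15_000, 5_000, 15)
print(f"at least {rate:.0f} events per minute")
```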


What is Sangoma doing so this does not happen again?

The additional event workers now deployed will help the system handle spikes in queued events in the near term.

Furthermore, the Sangoma CX development team has identified potential improvements to the ACD software code that will prevent the ACD from flooding the event queues with call attempts when a large number of agents suddenly become unreachable. Those improvements will be implemented in the upcoming version 3.0 of the ACD software, scheduled for early January 2024.
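One common way to keep a dialer from retrying unreachable endpoints in a tight loop is to stop offering calls to an agent for a cool-down period after several consecutive failures. The sketch below illustrates that general idea only; it is not Sangoma's ACD code, and the class name, thresholds, and method names are all hypothetical.

```python
import time


class AgentReachability:
    """Hypothetical helper: pause call offers to an agent after repeated failures."""

    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures          # assumed threshold
        self.cooldown_seconds = cooldown_seconds  # assumed cool-down length
        self.clock = clock                        # injectable for testing
        self._failures = {}       # agent_id -> consecutive failed attempts
        self._blocked_until = {}  # agent_id -> time at which offers may resume

    def can_offer(self, agent_id):
        """True if the agent is not currently in a cool-down period."""
        blocked_until = self._blocked_until.get(agent_id)
        return blocked_until is None or self.clock() >= blocked_until

    def record_failure(self, agent_id):
        """Count a failed attempt; start a cool-down once the threshold is hit."""
        count = self._failures.get(agent_id, 0) + 1
        self._failures[agent_id] = count
        if count >= self.max_failures:
            self._blocked_until[agent_id] = self.clock() + self.cooldown_seconds
            self._failures[agent_id] = 0

    def record_success(self, agent_id):
        """A successful call clears the failure count and any cool-down."""
        self._failures.pop(agent_id, None)
        self._blocked_until.pop(agent_id, None)
```

A dispatcher using this sketch would check `can_offer(agent_id)` before each attempt, so a regional outage produces a bounded number of failed-call events per agent per cool-down period rather than a continuous stream.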
Posted Dec 11, 2023 - 16:25 EST

Monitoring

Our development team has implemented a fix to resolve the delays experienced on the queue board, along with the delayed hangups. We will continue to monitor the situation. If you have any further problems, please reach out to our support team at 844-302-7827.
Posted Dec 04, 2023 - 17:56 EST

Investigating

We are currently looking into reports that agents are having trouble accessing their queue board, seeing delays in their presence on the queue board, and experiencing delayed hangups.
Posted Dec 04, 2023 - 17:15 EST
This incident affected: Business Voice Unified Communications Platform (UCaaS) (Sangoma CX (Business Voice)).