Post-Mortem February 27, 2019
(all times in Mountain Standard Time)
1:30pm - The team identified abnormally low call volume and began to investigate.
1:33pm - We identified that Calgary was having trouble processing calls and registrations from devices. This was immediately escalated to our softswitch vendor while we investigated the core logs.
1:34pm - We attempted to redirect all traffic to the Ottawa and Vancouver servers with a SIP 503 (Service Unavailable) response.
1:56pm - After investigating the system logs, the softswitch vendor requested a core reboot, which was completed on Calgary to restore services.
2:09pm - Calgary remained stable for a few minutes, then began failing to process calls again.
2:10pm - A second core restart was completed, and the 503 response was left in place until the softswitch engineering team could complete further investigation.
At this point most endpoints and calls had properly failed over to either Ottawa or Vancouver. Inbound and outbound calls were being redirected from Calgary and processed by Ottawa and Vancouver.
During this period, some devices either did not support failover or were manually connected to Calgary (such as SIP trunks, softphones, manually configured devices, and paging devices). These devices could not fail over and were offline.
2:15pm - We began reaching out to customers with phones that do not support DNS-SRV for failover to assist with directing their traffic.
2:50pm - It was discovered that some SIP trunks were not configured properly to fail over, and we began working to provide a solution for any out-of-service trunks.
3:00pm - An issue with SIP over TLS was found to be the cause of the calls and registrations not processing. We began adjusting Calgary to disable TLS processing and testing to confirm the failure would not recur.
3:03pm - Once a solution was found for SIP trunks without failover, we reached out to partners to point them to sip.siplogin.ca, which provides failover without DNS-SRV.
3:15pm - After verifying that the issue would not persist, we brought Calgary back online with limited functionality for TLS.
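For readers unfamiliar with the DNS-SRV failover referenced in the timeline, the following is a minimal sketch of how a client might order SRV targets and skip one that is unavailable (for example, one answering with SIP 503). It is a simplified illustration of the selection rules in RFC 2782, not a description of our configuration: the hostnames, priorities, weights, and the is_up callback are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first
    weight: int    # relative share among records with equal priority
    port: int
    target: str


def order_srv_targets(records, rng=None):
    """Return records in the order a client would try them.

    Simplified from RFC 2782: priority groups in ascending order, with a
    weighted random draw inside each group.
    """
    rng = rng or random.Random()
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            total = sum(r.weight for r in group)
            threshold = rng.randint(0, total)
            running = 0
            pick = group[-1]
            for record in group:
                running += record.weight
                if running >= threshold:
                    pick = record
                    break
            group.remove(pick)
            ordered.append(pick)
    return ordered


def first_reachable(records, is_up, rng=None):
    """Return the first record whose target is_up() accepts, or None."""
    for record in order_srv_targets(records, rng):
        if is_up(record.target):
            return record
    return None


# Hypothetical records: one primary site and two equal-priority backups.
records = [
    SrvRecord(10, 100, 5060, "calgary.example.net"),
    SrvRecord(20, 50, 5060, "ottawa.example.net"),
    SrvRecord(20, 50, 5060, "vancouver.example.net"),
]

# Simulate the primary being unavailable.
chosen = first_reachable(records, lambda host: host != "calgary.example.net")
print(chosen.target)
```

A device that is configured with a single fixed server address never walks a list like this, which is why manually configured endpoints could not fail over on their own.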
The TLS limitation should affect very few customers, most likely only those who were moved to TLS to work around SIP ALG issues on their local network. If your phone is still unable to connect or make calls, please contact support and we will assist.
The softswitch vendor is developing a patch to resolve the TLS issue. There is no specific timeline for its availability, but we will provide a maintenance notification once it is available.
We believe all services have been restored in Calgary, and we will continue monitoring. Please contact support if you experience any further issues.
Calls and phones should fail over to our other data centers in Ottawa and Vancouver. We'll update this alert as soon as services in Calgary are normalized.
At this time we are investigating intermittent issues affecting some calls.