Issue Summary:
On Thursday 11 April 2024, customers experienced performance degradation within Ci Anywhere. A small number of customers reported varying performance degradation from 10.30am. From 3pm, more customers reported consistent performance degradation, and at 3.18pm analysis identified that all of the reported issues were related. Status page updates were provided regularly from 3.23pm while the incident was investigated and mitigated, and performance returned to normal for all customers at 9.45pm.
Root Cause Analysis:
The background service used to cache records for users reached its maximum number of connections, began producing errors, and prevented any further users from connecting to the service. The failover service became overloaded, which caused the DPs to stall. The alerts that were in place did not highlight that the number of connections was reaching threshold limits.
Corrective Measures:
Updated the configuration to force new connections to be created for users and old connections to be dropped. This change did not improve performance.
Recycled every app server. This change did not improve performance.
Built two new services for background caching and split the customer data sets between them to further balance the required connections.
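As an illustration only, splitting customer data sets between two caching services can be sketched with a stable hash of a customer identifier. This report does not describe how the split was actually performed, so the service names, the customer identifier format and the hashing approach below are all assumptions.

```python
import hashlib

# Hypothetical service names; the report does not name the new caching services.
CACHE_SERVICES = ["cache-service-a", "cache-service-b"]


def service_for_customer(customer_id, services=CACHE_SERVICES):
    """Map a customer to one caching service using a stable hash.

    A cryptographic digest is used (rather than Python's built-in hash())
    so the mapping stays the same across processes and restarts.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(services)
    return services[index]


def split_customers(customer_ids, services=CACHE_SERVICES):
    """Group customer identifiers by the caching service they map to."""
    groups = {name: [] for name in services}
    for customer_id in customer_ids:
        groups[service_for_customer(customer_id, services)].append(customer_id)
    return groups
```

A hash-based split spreads customers roughly evenly by count, not by connection load. Where a few customers account for most connections, an explicit assignment table may balance the services better.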
Preventive Measures:
The alert threshold for errors on the background service has been adjusted, and additional alerts have been created for the background caching service.
The playbook has been updated to reorder the steps to be taken should a similar issue occur and to cater for the new alerts.
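The kind of alert that was missing during the incident, one that fires as connections approach the maximum rather than only once errors begin, can be sketched as below. This is a minimal sketch and not the monitoring configuration actually deployed: the class name, the 75% and 90% ratios and the connection limit in the usage example are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConnectionAlertRule:
    """Classify connection usage against a service's connection limit.

    The ratios are illustrative assumptions, not values from the incident.
    """

    max_connections: int
    warn_ratio: float = 0.75
    critical_ratio: float = 0.90

    def __post_init__(self):
        if self.max_connections <= 0:
            raise ValueError("max_connections must be positive")
        if not 0 < self.warn_ratio < self.critical_ratio <= 1:
            raise ValueError("expected 0 < warn_ratio < critical_ratio <= 1")

    def evaluate(self, active_connections):
        """Return 'ok', 'warning' or 'critical' for the current usage."""
        usage = active_connections / self.max_connections
        if usage >= self.critical_ratio:
            return "critical"
        if usage >= self.warn_ratio:
            return "warning"
        return "ok"


# Usage example with a hypothetical limit of 1000 connections.
rule = ConnectionAlertRule(max_connections=1000)
status = rule.evaluate(active_connections=800)  # 80% of the limit
```

Alerting on the ratio of active to maximum connections gives warning before the limit is reached, whereas an error-rate alert only fires after users are already being refused.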