Issue Summary:
On Thursday 27 February at 12:00am GMT, alert monitoring indicated that response times on our cloud orchestration platform were spiking above normal levels. The TechnologyOne team began investigating immediately. Users were impacted by DP jobs queuing or failing to submit, and also experienced longer run times on worksheet processes because DP jobs took longer to be picked up and processed.
Root Cause Analysis:
Queue and processing limits were reached because long-running processes locked the cloud orchestration database, driving CPU utilisation to 100%. Whilst the cloud orchestration database was recovered within 45 minutes, the Cloud DP Service did not recover on its own due to the backlog of DP jobs. The TechnologyOne team undertook several actions to clear the backlog from the Cloud DP Service, and the service was stabilised at 9:33am GMT.
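For illustration only, the check below is a minimal sketch of how long-running database sessions of the kind described above can be surfaced. It assumes a PostgreSQL-style catalogue (pg_stat_activity) and a hypothetical 300-second threshold; the actual engine, thresholds, and tooling used by the cloud orchestration database are not stated in this report.

import psycopg2  # assumed PostgreSQL client library for this sketch

LOCK_THRESHOLD_SECONDS = 300  # hypothetical alerting threshold

def find_long_running_sessions(dsn: str):
    # List active sessions whose current statement has run past the threshold.
    query = """
        SELECT pid, state, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state <> 'idle'
          AND now() - query_start > interval '%d seconds'
        ORDER BY runtime DESC;
    """ % LOCK_THRESHOLD_SECONDS
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()

if __name__ == "__main__":
    for pid, state, runtime, text in find_long_running_sessions("dbname=orchestration"):
        print(f"pid={pid} state={state} runtime={runtime} query={text[:80]}")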
Corrective Measures:
Restarted all tasks supporting the Cloud DP Service.
Built additional microservice clusters and scaled out the existing cluster to handle the load.
Recycled servers in the microservice cluster.
Scaled back the number of DP servers that auto scaling had added, to reduce the load (a simplified sketch of this kind of queue-based scaling rule follows this list).
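The sketch below shows, in simplified form, a queue-depth based scale-out/scale-back rule of the kind referred to above. The worker counts, thresholds, and function names are hypothetical and are not TechnologyOne's implementation.

def desired_dp_workers(queued_jobs: int,
                       jobs_per_worker: int = 50,
                       min_workers: int = 2,
                       max_workers: int = 20) -> int:
    """Return the number of DP workers needed to drain the current backlog."""
    needed = -(-queued_jobs // jobs_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))

# Example: a backlog of 600 queued DP jobs would request 12 workers,
# then scale back toward min_workers as the queue drains.
print(desired_dp_workers(600))  # -> 12
print(desired_dp_workers(0))    # -> 2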
Preventive Measures:
A full review of the Cloud DP Service, in conjunction with an upstream provider, is underway, with the expectation that additional mitigations will be implemented.
An ongoing project to further enhance the scalability and performance of the DP microservice under load is being accelerated and is planned for completion by August 2025.