Issue Summary:
On Thursday 27 February at 12:00am GMT, alert monitoring indicated that response times on our cloud orchestration platform were spiking above normal levels. The TechnologyOne team began investigating immediately. Users were impacted by DP jobs queuing or failing to submit, and also experienced longer run times on worksheet processes because DP jobs took longer to be picked up and processed.
Root Cause Analysis:
Queue and processing limits were reached because long-running processes locked the cloud orchestration database, driving CPU utilisation to 100%. Whilst the cloud orchestration database was recovered within 45 minutes, the Cloud DP Service did not recover on its own due to the backlog of DP jobs. The TechnologyOne team undertook several actions to clear the backlog from the Cloud DP Service, and the service was stabilised at 9:33am GMT.
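For illustration only, the check below is a minimal sketch of how long-running database sessions of the kind described above can be surfaced. It assumes a PostgreSQL-style catalogue (pg_stat_activity) and a hypothetical 300-second threshold; the actual engine, thresholds, and tooling used by the cloud orchestration database are not stated in this report.

import psycopg2  # assumed PostgreSQL client library for this sketch

LOCK_THRESHOLD_SECONDS = 300  # hypothetical alerting threshold

def find_long_running_sessions(dsn: str):
    # List active sessions whose current statement has run past the threshold.
    query = """
        SELECT pid, state, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state <> 'idle'
          AND now() - query_start > interval '%d seconds'
        ORDER BY runtime DESC;
    """ % LOCK_THRESHOLD_SECONDS
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()

if __name__ == "__main__":
    for pid, state, runtime, text in find_long_running_sessions("dbname=orchestration"):
        print(f"pid={pid} state={state} runtime={runtime} query={text[:80]}")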
Corrective Measures:
Restarted all tasks supporting the Cloud DP Service.
Built additional microservice clusters and scaled out the existing cluster to handle the load.
Recycled servers in the microservice cluster.
Scaled back the number of DP servers that auto scaling had added, to reduce the load (a simplified sketch of this kind of queue-based scaling rule follows this list).
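The sketch below shows, in simplified form, a queue-depth based scale-out/scale-back rule of the kind referred to above. The worker counts, thresholds, and function names are hypothetical and are not TechnologyOne's implementation.

def desired_dp_workers(queued_jobs: int,
                       jobs_per_worker: int = 50,
                       min_workers: int = 2,
                       max_workers: int = 20) -> int:
    """Return the number of DP workers needed to drain the current backlog."""
    needed = -(-queued_jobs // jobs_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))

# Example: a backlog of 600 queued DP jobs would request 12 workers,
# then scale back toward min_workers as the queue drains.
print(desired_dp_workers(600))  # -> 12
print(desired_dp_workers(0))    # -> 2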
Preventive Measures:
A full review of the Cloud DP Service, in conjunction with an upstream provider, is underway, with the expectation that additional mitigations will be implemented.
An ongoing project to further enhance the scalability and performance of the DP microservice under load is being accelerated and is planned for completion by August 2025.