DP Service Disruption for All Customers, All Releases

Incident Report for TechnologyOne

Postmortem

Issue Summary:
On Thursday 27 February at 10.40am AEST, alert monitoring indicated that response times on our cloud orchestration platform were spiking above normal levels. The TechnologyOne team began an investigation immediately. Users saw DP jobs queuing or were unable to submit them. Users also experienced longer run times on worksheet processes because DP jobs were taking longer to be picked up and processed.

Root Cause Analysis:
Queue and processing limits were reached because long-running processes were locking the cloud orchestration database. This caused CPU utilisation to max out at 100%. Whilst the cloud orchestration database was recovered within 45 minutes, the Cloud DP Service did not recover due to the backlog of DP jobs. The TechnologyOne team undertook several actions to clear the backlog from the Cloud DP Service, and the service was stabilised at 7.33pm AEST.
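As a rough illustration of why a backlog can keep a service degraded after the underlying database has recovered, the sketch below estimates how long a queue takes to drain at a given completion rate. The function name, the backlog size and the arrival rate are hypothetical assumptions for illustration only; the two completion rates are the hourly figures reported in the updates for this incident. It is not a model of the actual Cloud DP Service.

```python
def hours_to_drain(backlog, arrivals_per_hour, completions_per_hour):
    """Estimate hours needed to clear a job backlog.

    Returns 0.0 if there is no backlog, and None if jobs arrive at least
    as fast as they complete (the backlog never clears).
    """
    if backlog <= 0:
        return 0.0
    net_per_hour = completions_per_hour - arrivals_per_hour
    if net_per_hour <= 0:
        return None
    return backlog / net_per_hour


# Hypothetical values, not taken from the incident data.
ASSUMED_BACKLOG = 100_000           # jobs waiting in a submitted state
ASSUMED_ARRIVALS_PER_HOUR = 40_000  # new jobs submitted each hour

# Completion rates reported in the 17:31 and 18:34 updates.
for completions_per_hour in (46_000, 90_000):
    estimate = hours_to_drain(
        ASSUMED_BACKLOG, ASSUMED_ARRIVALS_PER_HOUR, completions_per_hour
    )
    print(f"{completions_per_hour} jobs/hour -> {estimate} hours to drain")
```

Under these assumed numbers, the queue only shrinks quickly once the completion rate is well above the arrival rate, which is consistent with the service stabilising only after capacity was scaled out.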

Corrective Measures:
Restarted all tasks supporting the Cloud DP Service.
Built additional microservice clusters and scaled out the microservice cluster to handle the load.
Recycled servers in the microservice cluster.
Scaled back the number of DP Servers (due to auto scaling) to reduce the load.

Preventive Measures:
A full review of the Cloud DP Service in conjunction with an upstream provider is underway, with the expectation that additional mitigations will be implemented.
An ongoing project to further enhance the scalability and performance under load of the DP microservice is being accelerated and is planned for completion by August 2025.

Posted Mar 04, 2025 - 18:09 AEST

Resolved

After 2 hours of monitoring, this incident is now resolved.

We will perform a post-incident review to identify the underlying cause and preventive actions to avoid a repeat in the future, and will post the findings here on completion.

We apologise for how you and your business may have been affected by this incident.
Posted Feb 27, 2025 - 22:09 AEST

Monitoring

Our team has verified that the implementation of a fix is complete.

We will monitor the logs for the next 2 hours to ensure no further impacts.
Posted Feb 27, 2025 - 19:33 AEST

Update

Our logs show a large increase in new DP jobs progressing, with over 90,000 completed in the last hour.

We continue to apply mitigations, and the next update will be provided within 60 minutes or sooner.
Posted Feb 27, 2025 - 18:34 AEST

Update

Our logs continue to show that error rates on DP servers have decreased. We can see new DP jobs progressing, with over 46,000 completed in the last hour; however, this is a much lower rate than normal.

You may see this presenting as:

- Users see DP jobs as "submitted" for a longer time than normal.
- Users in a worksheet waiting on a DP job to be completed will see that part of the process is continuing to spin.
- An error appears if a user forces a job to run.

We continue to apply mitigations, and the next update will be provided within 60 minutes or sooner.
Posted Feb 27, 2025 - 17:31 AEST

Update

Our logs continue to show that error rates on DP servers have decreased. We can see new DP jobs progressing; however, the queue of submitted jobs is growing.

This issue will be presenting as:

- Users see DP jobs as "submitted"
- Users in a worksheet waiting on a DP job to be completed will see that part of the process is continuing to spin.
- An error appears if a user forces a job to run.

We continue to apply mitigations, and the next update will be provided within 60 minutes or sooner.
Posted Feb 27, 2025 - 16:33 AEST

Update

Our logs continue to show that error rates on DP servers have decreased. We can see DP jobs progressing; however, we also see growth in the number of submitted jobs.

This will be presenting as:

- Users see DP jobs as "submitted"
- Users in a worksheet waiting on a DP job to be completed will see that part of the process is continuing to spin.

We continue to apply mitigations, and the next update will be provided within 60 minutes or sooner.
Posted Feb 27, 2025 - 15:37 AEST

Update

We have identified the root cause of the issue and have taken steps to mitigate it.
Our logs show that error rates on DP servers have decreased; however, we still see a large number of DP logs in a submitted state.

This will be presenting as:

- Users see DP jobs as "submitted"
- Users in a worksheet waiting on a DP job to be completed will see that part of the process is continuing to spin.

The next update will be provided within 60 minutes or sooner.
Posted Feb 27, 2025 - 14:42 AEST

Update

Our team is continuing to investigate an issue impacting the DP service and to identify a solution to resolve it.

We can see the impact for customers as follows:

- Users see DP jobs as "submitted"
- Users in a worksheet waiting on a DP job to be completed will see that part of the process is continuing to spin.

Due to the investigation, the next update will be provided in 60 minutes, or sooner if new information becomes available.
Posted Feb 27, 2025 - 13:43 AEST

Investigating

We are investigating an issue impacting the DP service for ANZ Region / All Releases.

Impact/Error/How to verify: A subset of customers are experiencing DP jobs stalling or major delays.

Due to the investigation, the next update will be provided in 60 minutes, or sooner if new information becomes available.
Posted Feb 27, 2025 - 13:01 AEST
This incident affected: Software as a Service - Australia & New Zealand (Batch Services (DP Jobs)).