Performance Degradation - ANZ Region / 2023A Release
Incident Report for TechnologyOne
Postmortem

Summary: On January 11, 2024, an unexpected surge in user traffic to one of our customers on a multi-tenant cluster of servers degraded performance for environments on that cluster. The incident required immediate intervention from our SaaS Operations team to restore and ensure the stability of our services.

Root Cause: The primary cause of the incident was an extraordinary increase in user traffic directed towards a specific customer in our multi-tenant environment. This sudden spike in demand led to a situation where the existing auto-scaling configurations were insufficient to handle the increased load effectively.
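To illustrate the failure mode (this is a hypothetical sketch, not TechnologyOne's actual scaling configuration), a reactive auto-scaler that adds a fixed number of replicas per evaluation cycle keeps pace with gradual growth but falls several cycles behind a 10x spike:

```python
def scale_decision(current_replicas, load, capacity_per_replica=100,
                   max_step=2, max_replicas=20):
    """Reactive scaler: add at most `max_step` replicas per evaluation cycle.
    All parameters here are illustrative, not production values."""
    needed = -(-load // capacity_per_replica)  # ceiling division
    return min(max(needed, current_replicas),
               current_replicas + max_step,
               max_replicas)

# A gradual ramp keeps up; a sudden 10x spike leaves the cluster
# under-provisioned for several consecutive cycles.
replicas = 4
for load in [400, 450, 4000, 4000, 4000]:
    replicas = scale_decision(replicas, load)
    print(f"load={load} -> {replicas} replicas")
```

With a step limit of 2, the spike to a load needing 40 replicas is answered with only 7, 9, then 11 replicas over three cycles, which is why step-limited configurations alone could not absorb the surge.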

Impact: The unexpected traffic surge strained our multi-tenant infrastructure, risking the performance and availability of other customers hosted on the same cluster. This necessitated urgent measures to isolate the affected customer and prevent a cascading effect on overall service quality.

Resolution Steps: To mitigate the impact and resolve the issue, the following actions were taken by our SaaS Operations engineers:

  1. Immediate Isolation: The customer experiencing the traffic surge was swiftly isolated from other customers in the cluster. This action was crucial to ensure that service quality and availability for other customers were not compromised.
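The principle behind this step can be sketched with a per-tenant token bucket (a hypothetical illustration, not the production isolation mechanism): each tenant draws from its own capacity budget, so one tenant exhausting its budget does not consume another's.

```python
import time

class TenantThrottle:
    """Per-tenant token bucket. `rate` is tokens refilled per second,
    `burst` is the bucket capacity. Illustrative values only."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.buckets = {}  # tenant -> (tokens, last_refill_time)

    def allow(self, tenant, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(tenant, (self.burst, now))
        # Refill proportionally to elapsed time, capped at burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[tenant] = (tokens - 1, now)
            return True
        self.buckets[tenant] = (tokens, now)
        return False

throttle = TenantThrottle(rate=1, burst=2)
# A surging tenant is refused once its own bucket is empty...
print([throttle.allow("surging", now=0.0) for _ in range(3)])
# ...while another tenant on the same cluster is unaffected.
print(throttle.allow("steady", now=0.0))
```

Because each bucket is keyed by tenant, the surging tenant's third request is rejected while the steady tenant's first request still succeeds.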

Following the resolution, a comprehensive review is being conducted to analyze the incident and develop strategies for future prevention. As part of our commitment to continual service improvement, the SaaS Operations team is developing a detailed runbook. This runbook will outline specific actions and protocols to follow in similar situations, enabling faster and more effective responses to future incidents.

Conclusion: This incident has been a valuable learning experience for our team. We remain committed to providing reliable and high-quality services to all our customers and will continue to improve our systems and processes to prevent similar incidents in the future. We appreciate the understanding and support of our customers during this time and assure them of our unwavering dedication to service excellence.

Posted Feb 01, 2024 - 08:22 AEST

Resolved
We're pleased to announce that the recent service disruption has been resolved. Our team has successfully applied a fix and is actively monitoring the service to ensure stability is maintained.

Preliminary investigations indicate that a significant increase in traffic was the primary cause of the incident. We're committed to a thorough analysis and will share the root cause analysis review once the investigation is complete.

We sincerely apologize for any inconvenience this disruption may have caused. Your patience and understanding during this time have been greatly appreciated.
Posted Jan 11, 2024 - 10:38 AEST
Identified
Hi all,
Our engineers have identified the issue and are in the process of applying a fix.
The next update will be provided once we have applied and tested the fix. Thank you
Posted Jan 11, 2024 - 09:26 AEST
Investigating
We have identified a selection of customer environments experiencing performance degradation within ANZ.
Our engineers are currently investigating the root cause.

We aim to provide you with an update within the next 60 minutes. Thank you
Posted Jan 11, 2024 - 09:08 AEST
This incident affected: Software as a Service - Australia & New Zealand (User Experience).