Performance issues

Incident Report for TrekkSoft

Postmortem

Post Mortem of the incident

Summary:

On the morning of February 26th, we migrated the TrekkSoft servers from our hosting provider Cloudscale in Zurich to our new provider, Amazon Web Services (AWS), in Ireland.

After the migration was complete, usage of the system increased as the merchants began to take bookings and use the system.

We then spotted a major drop in performance. The root cause was one database host that was struggling under the number of requests per second. This database slowdown caused a significant drop in performance across our applications (merchants' landing pages - CMS, Backoffice, public and private API, and mobile apps), in some cases rendering them inoperable.

What Happened

6:45am - 8:29am CET - We completed the AWS migration.
We tested all the main cases and monitored all hosts and the preliminary results were satisfactory.
9:00am CET - Our applications began handling an increasing number of requests as the system came back online and usage ramped up.
One of the main database hosts (MySQL) began struggling with the number of requests. This degraded the performance of our applications and prevented normal functionality.

Contributing Factors

Uncertainty regarding the performance of the new AWS infrastructure compared with Cloudscale.
We compared all hosts in Cloudscale and AWS to ensure the same hardware specifications.
However, the two infrastructures are different, so matching specifications did not guarantee matching performance.

Steps Taken

Phase 1:

We attempted to increase the size of the database in AWS to improve performance (no downtime was required at this point).
We contacted AWS support for more information about the resizing time.
The database resize would have taken too long to deploy, so we decided to apply another workaround, described below.

Phase 2:

We put all the web apps in maintenance mode (downtime).
We created a new, larger database (downtime was required to avoid data loss).
We copied all data from the old database to the new one using a migration system in AWS.
The newly created database failed.
This required a new approach, described below.

Phase 3:

We created a new empty database (again, downtime was required to avoid data loss).
We proceeded with a manual dump of the data from the old database to the new one. The process took 4 hours and was successful.
Approx. 3:20pm CET - The new infrastructure was ready to be released.
We have been monitoring and tweaking the system over the last 24 hours to improve performance.

Impact

Low number of bookings from 7:00am to 4:30pm CET (about 9.5 hours). Some merchants were unable to process any bookings, while others still managed to take some. The impact here is financial loss to all parties.

Benefits

The objectives behind the migration that caused the issue:

Overall long term increase in performance.
Up-to-date, industry-standard infrastructure.
More direct control over our infrastructure.
Infrastructure ready for autoscaling in case of a peak in requests per second.

Lessons learned

We will strategically time operations of this scale so that we avoid peak booking hours and have more time to react.
Triple-check hardware and settings specifications.
Prepare a bulletproof checklist.
Replicate the system and run stress tests.
Build our infrastructure with extra capacity and resources/have a larger infrastructure as a backup.
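
The stress-testing point above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a description of the tooling used for this migration: the function names (`timed_call`, `run_stage`, `ramp`, `fake_query`) are hypothetical, and `fake_query` is a placeholder that would need to be replaced with real requests against a replica of the system. The idea is to step concurrency upward, similar to how usage ramps up in the morning, and watch how throughput and latency change at each step.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(target):
    """Run target once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    target()
    return time.perf_counter() - start


def run_stage(target, workers, requests_per_stage):
    """Issue requests_per_stage calls using `workers` concurrent threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        started = time.perf_counter()
        latencies = list(
            pool.map(lambda _: timed_call(target), range(requests_per_stage))
        )
        elapsed = time.perf_counter() - started
    return {
        "workers": workers,
        "requests": len(latencies),
        "throughput_rps": len(latencies) / elapsed,
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def ramp(target, worker_steps=(1, 4, 16), requests_per_stage=50):
    """Step concurrency upward and collect one result per step."""
    return [run_stage(target, w, requests_per_stage) for w in worker_steps]


def fake_query():
    # Hypothetical stand-in for a real request against a staging replica.
    time.sleep(0.001)


if __name__ == "__main__":
    for stage in ramp(fake_query):
        print(stage)
```

If throughput stops growing, or the maximum latency climbs sharply from one step to the next, that step marks roughly where a host starts to struggle, which is the kind of signal a pre-migration stress test is meant to surface.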

We apologise deeply for this incident.

Posted Feb 27, 2020 - 16:51 CET

Resolved

The issue is now resolved. Booking functionality has been resumed, as well as access to all pages and accounts. We will continue to monitor the situation closely to make sure that the system performs properly. Loading times might still be longer than you are used to while we continue to improve. We will be updating this status page accordingly.

Posted Feb 26, 2020 - 16:35 CET

Update

As we continue to monitor and work to fix the issue, we have identified the source of the problem. The root cause lies within one of our service providers. We rely on this partner to provide a solution, and therefore it is difficult to give an exact time frame at this time. However, we are doing everything in our power to get our systems running as soon as possible. We will continue to update this page as more information becomes available. Apologies for the inconvenience caused.

Posted Feb 26, 2020 - 12:42 CET

Update

As we continue to monitor the situation and take steps to provide a solution, we have identified multiple failed bookings. As a precaution, we have temporarily disabled bookings in order to avoid failed attempts until we can remedy the situation. Thanks for your patience.

Posted Feb 26, 2020 - 10:59 CET

Update

We are continuing to monitor for any further issues.

Posted Feb 26, 2020 - 10:54 CET

Update

We are continuing to monitor for any further issues.

Posted Feb 26, 2020 - 10:22 CET

Monitoring

TrekkSoft systems are currently experiencing performance issues, including difficulty accessing some pages, longer-than-usual loading times, timeouts and possibly other issues as well. The issues are a side effect of a scheduled maintenance operation to improve the overall reliability and speed of our services. We are monitoring the situation closely and taking steps to counteract the issue.

Posted Feb 26, 2020 - 10:09 CET

This incident affected: TrekkSoft Application, TrekkSoft API, Backend Mobile Applications, POS Desk, TrekkSoft Website, Payyo, and Channel Manager.