Registered downtime 3rd of October 2024
Incident Report for TrekkSoft
Postmortem

Incident Date: October 3rd 2024
Incident Duration: Approximately 20 minutes
Affected Services: TrekkSoft API, TrekkSoft Application, POS Desk

Incident Description:
At approximately 12:15 PM CET on October 3rd, 2024, the TrekkSoft API, the TrekkSoft Application, and POS Desk became unavailable.

Impact:
The Redis node used by the API for session storage was rebooted and came back online approximately 20 minutes later. The reboot happened outside its maintenance window. We have opened a support ticket with AWS to understand why the node was rebooted.

Resolution:
The incident was resolved once the rebooted Redis node (used by the API for session storage) came back online.

Learnings:
The API uses redis-core-production for session storage. This is a single-node instance, so there is no replica to take over when the node goes down.
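
A check for this kind of single point of failure could be sketched as follows. This is only an illustration, assuming the node is described as an ElastiCache replication group and that the input dictionary follows the shape returned by the AWS describe-replication-groups call (the MemberClusters, AutomaticFailover, and MultiAZ fields). The function name failover_gaps and the member cluster name are hypothetical.

```python
def failover_gaps(group):
    """Return the reasons why a replication group cannot fail over on its own.

    `group` is assumed to be one entry of the "ReplicationGroups" list from
    an ElastiCache describe-replication-groups response.
    """
    gaps = []
    if len(group.get("MemberClusters", [])) < 2:
        gaps.append("fewer than two nodes, so there is no replica to promote")
    if group.get("AutomaticFailover") != "enabled":
        gaps.append("automatic failover is not enabled")
    if group.get("MultiAZ") != "enabled":
        gaps.append("Multi-AZ is not enabled")
    return gaps


# Hypothetical description of a single-node group (values are illustrative).
single_node = {
    "ReplicationGroupId": "redis-core-production",
    "MemberClusters": ["redis-core-production-001"],  # hypothetical node name
    "AutomaticFailover": "disabled",
    "MultiAZ": "disabled",
}

for reason in failover_gaps(single_node):
    print(f"{single_node['ReplicationGroupId']}: {reason}")
```

An empty result would mean that none of the three gaps applies. It does not by itself prove that a failover would be seamless for the API.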

Preventive Measures:

  • Review AWS Fault Tolerance reference: Mitigating Failures - Amazon ElastiCache (Redis OSS) 
  • Ensure proper configuration is used for the redis-core-production instance
  • Review and improve our API session handling logic, or even consider other types of persistence for session storage
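
One possible direction for the session handling item above is a session store that writes to two backends and reads from whichever one is reachable. The sketch below is a minimal illustration using only in-memory stand-ins. It is not TrekkSoft's implementation, and every class name in it (SessionBackendUnavailable, InMemoryBackend, FallbackSessionStore) is hypothetical.

```python
class SessionBackendUnavailable(Exception):
    """Raised when a session backend cannot be reached."""


class InMemoryBackend:
    """Stand-in for a real backend such as Redis or a database table."""

    def __init__(self):
        self._data = {}
        self.available = True  # flip to False to simulate an outage

    def _check(self):
        if not self.available:
            raise SessionBackendUnavailable("backend is down")

    def get(self, session_id):
        self._check()
        return self._data.get(session_id)

    def set(self, session_id, value):
        self._check()
        self._data[session_id] = value


class FallbackSessionStore:
    """Writes to every reachable backend, reads from the first that has the session."""

    def __init__(self, *backends):
        self._backends = backends

    def set(self, session_id, value):
        written = False
        for backend in self._backends:
            try:
                backend.set(session_id, value)
                written = True
            except SessionBackendUnavailable:
                continue
        if not written:
            raise SessionBackendUnavailable("no session backend reachable")

    def get(self, session_id):
        reachable = False
        for backend in self._backends:
            try:
                value = backend.get(session_id)
            except SessionBackendUnavailable:
                continue
            reachable = True
            if value is not None:
                return value
        if not reachable:
            raise SessionBackendUnavailable("no session backend reachable")
        return None  # backends answered, but the session is unknown
```

With two backends configured, a session written before an outage of the first backend can still be read from the second. A first backend that comes back empty after a reboot also falls through to the second one. The trade-off is a double write on every session update.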
Posted Oct 04, 2024 - 17:06 CEST

Resolved
The incident has been resolved and all TrekkSoft functionalities are operating as expected.

We have determined that the issue originated from one of our infrastructure services unexpectedly stopping and restarting, which triggered a cascading effect and led to a brief system outage. The responsibility for maintaining this service lies with our cloud services provider, AWS, and we have reached out to them for further clarification.

We will explore measures to mitigate this type of issue on our end and will provide a postmortem of the incident in the coming days.

We apologize once again for any inconvenience this may have caused.
Posted Oct 03, 2024 - 15:07 CEST
Update
We are continuing to investigate this issue.
Posted Oct 03, 2024 - 15:05 CEST
Investigating
TrekkSoft experienced downtime during the last hour, with the outage lasting approximately 10 minutes.

Since then, all systems have come back online. We are actively investigating to determine whether this was due to a potential cyber attack, while also reviewing our infrastructure for other possible causes.

We will provide further updates as soon as we have more information.

We apologize for this inconvenience.
Posted Oct 03, 2024 - 13:28 CEST
This incident affected: TrekkSoft Backoffice, TrekkSoft API, and POS Desk.