Case: chaos engineering experiment

Chaos engineering for a sports giant

Exciting news from our recent collaboration with Gluo at an A-brand in the sports industry! We conducted the first-ever chaos engineering experiments in Belgium during the Innovation Days. 

Chaos engineering is a powerful extension of non-functional testing: we deliberately disrupt a system’s normal operation in a controlled manner to build confidence in its resilience and recovery capabilities, while minimizing the impact on end users.

With Black Friday fast approaching, our client wanted to be fully prepared for the anticipated surge in online traffic and the high demands of their customers. The focus was threefold: uncovering pain points in the system, defining and implementing improvement actions, and training the team to analyse and pinpoint problems rapidly.

In an intense four-day workshop, we worked with the customer to introduce chaos engineering to the ticket printing system used for logistics.

Brainstorm sessions

Once we had met the team, the project started. We were given a brief introduction to the system, which consists of cloud components, third-party software and on-premises subsystems.

Pairing the right system component with the right chaos experiment is crucial. Since this sports industry leader was taking its first steps into chaos engineering, we adopted the approach of “starting small and ending big.” The guiding question was which experiments, aimed at which components, would teach us the most. We identified the most crucial components of the system and chose experiments such as hard shutdowns and high CPU usage, targeting standalone components so that the impact of each experiment could be measured accurately.

For each combination of experiment and component, we analysed the normal behaviour, the expected impact of the experiment, the expected recovery process, a possible fallback mechanism to abort the test, and the monitoring needed to study the impact and results.
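To make that analysis concrete, here is a minimal sketch of how one experiment/component pairing could be written down and checked for gaps before a Game Day. It is an illustration, not the template we used with the client: the names `ExperimentPlan` and `missing_items` and all example values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class ExperimentPlan:
    """One experiment/component pairing with the items analysed up front."""
    component: str          # target of the experiment
    experiment: str         # e.g. "hard shutdown" or "high CPU usage"
    normal_behaviour: str   # what the system does when undisturbed
    expected_impact: str    # hypothesis about what the experiment disrupts
    expected_recovery: str  # how the system is expected to recover
    fallback: str           # how to abort the test
    monitoring: str         # where impact and results are observed


def missing_items(plan: ExperimentPlan) -> list[str]:
    """Return the names of plan items that are still blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# All values below are made up for illustration.
plan = ExperimentPlan(
    component="virtual printer",
    experiment="hard shutdown",
    normal_behaviour="tickets are printed as requests arrive",
    expected_impact="printing pauses while the component is down",
    expected_recovery="",  # not yet analysed
    fallback="restart the component manually",
    monitoring="",  # not yet analysed
)
print(missing_items(plan))
```

A plan with blank items is not ready to run: without an expected recovery or monitoring in place, there is nothing to compare the observed behaviour against.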

To get meaningful results and build confidence in the system, it was important to replicate the load and infrastructure of a Black Friday. The Chaos Game Days were therefore well prepared, with a performance script that simulated a representative load and with an expanded final component: the virtual printers.

Implementation

The long-awaited day has arrived! We are ready to perform the first chaos experiments. Our analyses are displayed on large screens as we begin the initial tests. 

Since this was the team’s first interaction with chaos, automation was not yet feasible, so we chose a manual approach. Time was limited and the priority was to build confidence in the system, so we concentrated on the content of each experiment, its execution, the monitoring, and its impact on the system. For the future, we propose automating the successful experiments and running them at random intervals, so that the team receives prompt feedback throughout the development process.
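As a rough idea of what “random intervals” could look like once experiments are automated, the sketch below assigns each experiment a random start time within a time window. The function name `schedule_runs`, the window length and the experiment names are assumptions for illustration; the actual fault injection is deliberately left out.

```python
import random


def schedule_runs(experiments, window_minutes, seed=None):
    """Give each experiment a random start offset (in minutes) within the window."""
    rng = random.Random(seed)  # a fixed seed makes a schedule reproducible
    schedule = [(rng.randrange(window_minutes), name) for name in experiments]
    return sorted(schedule)  # earliest run first


# Hypothetical example: two experiments spread over an eight-hour working day.
for offset, name in schedule_runs(["hard shutdown", "high CPU usage"], window_minutes=480):
    print(f"+{offset:>3} min: {name}")
```

Passing a seed is optional. It is mainly useful for replaying the exact schedule of an earlier run when analysing its results.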

Lessons learned

The Chaos Game Days proved to be fruitful, as they led to the identification of several technical action points and opportunities for improvement in infrastructure and monitoring. The team gained a deeper understanding of the system’s weaknesses and how they manifest, which will better prepare them for busy days like Black Friday. 

The primary takeaway from this experience is that thorough preparation and a solid set of performance scripts for non-functional testing are invaluable.

As a result of the success of these Game Days, the next ones have already been scheduled in our calendar! 

Do you have a similar challenge that you need help with? Feel free to get in touch!