【LPD】Chaos Engineering at Rakuten and Quick Hands-on


 

Hello, I'm Burak, a Senior Systems Engineer in the Solution Engineering Section of Rakuten's Leisure Products Department. My role is to ensure that our systems operate efficiently and reliably, in line with Rakuten's mission to deliver innovative and high-quality products.

Passionate about engineering, I focus on resolving challenges and implementing solutions that enhance our services, contributing to our goal of being a global leader in the industry.

At the Rakuten Leisure Products Department, we recognize that downtime can result in lost customer trust and lost revenue. To address this, we have incorporated Chaos Engineering into our Site Reliability Engineering (SRE) practices. By intentionally introducing controlled failures into our test environments, we proactively identify and address system weaknesses before they affect our users. This approach ensures our systems remain robust, reliable, and capable of delivering seamless experiences to our customers.

 

Understanding Chaos Engineering

Chaos Engineering is the practice of intentionally introducing failures or errors into systems to identify weaknesses and improve reliability. It involves designing controlled experiments that simulate real-world failure scenarios, such as network outages, application errors, or sudden traffic spikes, without requiring any changes to the infrastructure or application configuration.
This non-intrusive method allows us to quickly and easily run multiple experiments, testing resilience in realistic scenarios while preserving the integrity of the system's original setup.

For example, we might simulate a surge in user traffic to evaluate scalability or inject fake data to test how effectively our systems can handle errors. These experiments help uncover vulnerabilities and address potential issues before they escalate into major problems, reducing the risk of downtime, financial loss, or customer dissatisfaction.

At Rakuten, adopting Chaos Engineering has already yielded measurable benefits. By proactively identifying and addressing system vulnerabilities, we have significantly reduced downtime, improved system reliability, and enhanced our ability to deliver seamless customer experiences.
This proactive approach has strengthened customer trust and helped us maintain our competitive edge in the industry.

In today's complex, distributed system architectures, Chaos Engineering is an essential practice for building confidence in a system's ability to handle unexpected disruptions. By proactively testing resilience in a way that mirrors real-world conditions, we ensure our systems remain reliable, robust, and ready to meet the challenges of tomorrow's online services.

 

The Intersection of Chaos Engineering and SRE

Site Reliability Engineering (SRE) is a discipline dedicated to designing, maintaining, and improving the reliability of large-scale distributed systems. Its core goals include ensuring high availability, optimizing performance, and enabling scalability. Chaos Engineering complements these objectives by offering a structured approach to testing and enhancing system resilience.
Together, they form a powerful framework for building robust, reliable systems that can withstand real-world challenges.

By simulating failures, Chaos Engineering helps identify bottlenecks, single points of failure, and critical dependencies that might otherwise go unnoticed. For example, a Chaos Engineering experiment might simulate a database outage to reveal whether the system can gracefully handle the loss of a critical component. This proactive approach allows teams to address vulnerabilities before they impact users.

Controlled experiments evaluate how well a system can recover from failures. For instance, a test might involve shutting down a specific server or service to see if the system can reroute traffic and maintain functionality. These experiments ensure that even during unexpected disruptions, downtime is prevented, and our systems continue to operate as designed.
Chaos Engineering can also simulate scenarios like sudden traffic spikes to test whether the system can scale up resources quickly and handle increased demand. For example, an experiment might simulate the traffic surge of a sales campaign to ensure the system maintains performance under heavy load.

 

Tools and Techniques

At Rakuten, we utilize a range of tools and techniques to implement Chaos Engineering effectively, ensuring our systems remain resilient and reliable.

One of the primary tools we rely on is Chaos Mesh (https://chaos-mesh.org/), an open-source Chaos Engineering platform hosted by the Cloud Native Computing Foundation (CNCF).
Chaos Mesh offers a powerful and flexible framework for simulating failures in distributed systems, making it an excellent fit for our advanced infrastructure. Its Kubernetes-native design and functions align well with Rakuten's infrastructure and operational needs and allow us to seamlessly integrate Chaos Engineering practices into our existing workflows and services.

Using Chaos Mesh, we apply various techniques to uncover potential vulnerabilities. These include:

Fault Injection: Introducing controlled failures, such as shutting down specific servers or services, to test how effectively the system reroutes traffic and maintains functionality.

Traffic Simulation: Simulating sudden spikes in traffic to evaluate scalability and ensure our systems can handle peak loads, such as during high-demand events like major sales campaigns.

Error Injection: Injecting invalid or corrupted data into the system to test error-handling mechanisms and ensure data integrity under abnormal conditions.

By leveraging Chaos Mesh and these techniques, we can proactively identify weaknesses, improve system design, and ensure our services deliver a seamless and reliable experience for our customers, even in the face of unexpected events.

A Quick Hands-on with Chaos Mesh

Let’s take a closer look at Chaos Mesh. In this section, we’ll walk through setting up Chaos Mesh in a local environment and running a simple chaos experiment step by step. I hope this will provide a practical, hands-on introduction to using Chaos Mesh.

Step 1: Set Up a Local Kubernetes Cluster

We can use any Kubernetes cluster, but for demonstration purposes, we will use Minikube in a local environment. Minikube is a lightweight tool for running a local Kubernetes cluster, and it is straightforward to install by following the official getting started guide (https://minikube.sigs.k8s.io/docs/start/).

 

Once Minikube is up and running, we can verify the status of our cluster by running:
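A minimal sketch of this check, assuming it is done with Minikube's built-in `minikube status` subcommand and the default profile name `minikube`. The `CLUSTER_PROFILE` variable and the fallback message are only illustrative.

```shell
# Assumption: the cluster was started with Minikube's default profile name.
CLUSTER_PROFILE="minikube"

# "minikube status" reports the state of the host, kubelet, apiserver
# and kubeconfig for the given profile.
minikube status -p "$CLUSTER_PROFILE" \
  || echo "minikube is not installed or the cluster is not running" >&2
```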

This will display the cluster status:

 

Step 2: Install Chaos Mesh

Chaos Mesh can be installed easily using Helm, a package manager for Kubernetes, so first we need to make sure that we have Helm (https://helm.sh/) installed.
To check, we execute the following command:
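As a rough sketch, the check can be done with `helm version --short`, and the result compared against the 3.5.4 baseline mentioned in this article. The `version_at_least` helper is hypothetical, relies on `sort -V` being available, and treats 3.5.4 itself as acceptable, which is an assumption.

```shell
REQUIRED_HELM="3.5.4"   # baseline version mentioned in this article

# Hypothetical helper: succeeds if version $1 is at least version $2.
# Relies on "sort -V" (version sort) being available.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if command -v helm >/dev/null 2>&1; then
  # "helm version --short" prints something like v3.x.y+g<commit>;
  # strip the leading "v" and the "+g<commit>" suffix before comparing.
  current=$(helm version --short | sed -e 's/^v//' -e 's/+.*$//')
  if version_at_least "$current" "$REQUIRED_HELM"; then
    echo "Helm $current is new enough"
  else
    echo "Helm $current is older than $REQUIRED_HELM" >&2
  fi
else
  echo "Helm is not installed" >&2
fi
```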

This should display the installed version (the version may vary but must be newer than 3.5.4):

If Helm is not installed, it can easily be installed using the official installation guide:
https://helm.sh/docs/intro/install/

After confirming Helm is available in our environment, we can follow the official installation guide:
https://chaos-mesh.org/docs/production-installation-using-helm/

First, we add the Chaos Mesh Helm repository to our local Helm configuration:
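A minimal sketch of this step, using the chart repository URL from the official installation guide. The local repository name `chaos-mesh` follows that guide; the variable names and fallback messages are illustrative.

```shell
CHART_REPO_NAME="chaos-mesh"
CHART_REPO_URL="https://charts.chaos-mesh.org"

# Register the Chaos Mesh chart repository in the local Helm configuration
helm repo add "$CHART_REPO_NAME" "$CHART_REPO_URL" \
  || echo "helm repo add failed (is Helm installed?)" >&2

# List the charts now available from that repository
helm search repo "$CHART_REPO_NAME" \
  || echo "helm search failed" >&2
```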

Then, we install Chaos Mesh in a namespace called chaos-mesh. Note that here we use the install command for the Docker container runtime; the official guide lists the settings for other runtimes such as containerd.
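A sketch of the installation, assuming the release is named `chaos-mesh` and the chart defaults are used. No `--version` pin is given here, whereas the official guide pins a specific release, so adjust as needed.

```shell
CHAOS_NS="chaos-mesh"

# Create the namespace that will hold the Chaos Mesh components
kubectl create ns "$CHAOS_NS" \
  || echo "could not create namespace $CHAOS_NS" >&2

# Install the chart from the chaos-mesh repository into that namespace.
# Assumption: release name "chaos-mesh", chart defaults, no version pin.
helm install chaos-mesh chaos-mesh/chaos-mesh -n "$CHAOS_NS" \
  || echo "helm install failed" >&2
```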

To check the running status of Chaos Mesh, we execute the following command:

kubectl get pods -n chaos-mesh

This should display all the pods running:

 

Step 3: Run a Basic Chaos Experiment

Now that Chaos Mesh is installed, let’s run a simple experiment to simulate a pod failure in our cluster. For this we will deploy a simple nginx server by creating deployment.yaml and applying it:
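One possible deployment.yaml is sketched below, written out with a shell heredoc and then applied. The deployment name, the `app: nginx` label, and the `nginx:stable` image tag are assumptions chosen for illustration; only the two replicas come from this article.

```shell
# Write a minimal Nginx Deployment manifest with two replicas
cat > deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:stable
          ports:
            - containerPort: 80
EOF

# Create the Deployment in the current namespace
kubectl apply -f deployment.yaml \
  || echo "kubectl apply failed (is the cluster running?)" >&2

# List the pods created by the Deployment
kubectl get pods -l app=nginx \
  || echo "kubectl get pods failed" >&2
```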


We should see two Nginx pods running:

 

Now let’s create a chaos experiment. We will create an experiment.yaml file for a pod failure experiment.
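A sketch of what such an experiment.yaml could look like, using Chaos Mesh's `PodChaos` resource with the `pod-failure` action. The experiment name, the `default` namespace, and the `app: nginx` label selector are assumptions; the one-pod mode and 30-second duration follow the description in this article.

```shell
# Write a PodChaos manifest that makes one matching pod fail for 30 seconds
cat > experiment.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: nginx-pod-failure
  namespace: default
spec:
  action: pod-failure   # make the pod unavailable rather than delete it
  mode: one             # pick a single pod at random from the matches
  duration: "30s"       # how long the failure is kept in place
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
EOF

echo "wrote experiment.yaml"
```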


This experiment randomly selects one Nginx pod and makes it fail for 30 seconds.

Let’s apply the experiment to our cluster:
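A minimal sketch of this step, assuming the experiment file is named experiment.yaml as above and lives in the current directory. Listing the `podchaos` resources afterwards is an optional extra check.

```shell
EXPERIMENT_FILE="experiment.yaml"

# Submit the chaos experiment to the cluster
kubectl apply -f "$EXPERIMENT_FILE" \
  || echo "kubectl apply failed (is the cluster running?)" >&2

# Optional: list PodChaos experiments in the current namespace
kubectl get podchaos \
  || echo "could not list PodChaos resources" >&2
```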

Now we can monitor the experiment by checking the status of the Nginx pods. Note the Error status:
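One way to follow the pods through the experiment is to poll `kubectl get pods` for about a minute, which covers the 30-second failure window and the recovery. The polling interval and count are arbitrary choices for this sketch.

```shell
POLL_INTERVAL=10   # seconds between checks
POLL_COUNT=6       # roughly one minute in total

i=0
while [ "$i" -lt "$POLL_COUNT" ]; do
  # Show the current pod status in the current namespace
  kubectl get pods || { echo "kubectl could not reach the cluster" >&2; break; }
  sleep "$POLL_INTERVAL"
  i=$((i + 1))
done
```

Alternatively, `kubectl get pods --watch` streams status changes until interrupted.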

After 30 seconds, the pod will go back to normal.

Many other types of experiments are available. See the Chaos Mesh documentation for details (https://chaos-mesh.org/docs/basic-features/#fault-injection).

Once done, we can run the following commands to clean up the resources and stop the Minikube cluster:
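A possible cleanup sequence is sketched below, assuming the file names, release name, and namespace used in this walkthrough. The `run` helper is hypothetical; it only keeps the script going if a resource is already gone.

```shell
# Hypothetical helper: run a command and report, but do not stop, on failure
run() { "$@" || echo "skipped or failed: $*" >&2; }

run kubectl delete -f experiment.yaml        # remove the chaos experiment
run kubectl delete -f deployment.yaml        # remove the Nginx deployment
run helm uninstall chaos-mesh -n chaos-mesh  # remove the Chaos Mesh release
run kubectl delete ns chaos-mesh             # remove the now-empty namespace
run minikube stop                            # stop the local cluster
```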

 

Conclusion

At Rakuten, Chaos Engineering is a vital component of our SRE practices, empowering us to proactively uncover and resolve potential risks. This ensures our systems consistently deliver high availability and provide seamless experiences for our customers. Within the Leisure Products Department (LPD), this approach plays a key role in driving Rakuten’s overall success, enabling us to maintain our leadership position while setting new benchmarks for reliability, resilience, and innovation.

 

Looking ahead, we plan to expand our Chaos Engineering practices by exploring additional tools and techniques to further enhance system reliability. We aim to integrate even more advanced failure scenarios into our experiments. By continuously evolving our Chaos Engineering practices, we are committed to staying at the forefront of innovation and ensuring our systems are prepared to meet the challenges of the future.

 

*The information in this article is current as of September 12, 2025.

Come work with us!

Commerce & Marketing Company Leisure Product Department (LPD) is seeking talented individuals to join our team in developing new services, managing daily operations, and driving continuous improvements. Recruitment is open for a variety of positions, including engineers and product managers. We look forward to receiving your application!
