Introduction
Hi, this is Trueman, an Application Engineer at a Rakuten Group branch. When I started working with the car services team three years ago, another, more senior newcomer, Burak Dogruoz, had just migrated the local dev environment from Vagrant to Docker. I'm not sure if any of us knew it at the time, but that would be the start of our gradual transition towards Continuous Integration (CI). However, progress on this front was slow and intermittent, as we had to balance improving our software release life cycle with pushing out new features and maintaining our services. Now, three years later, I believe we've achieved a CI system that works well enough to share with everyone. Of course, there's still a lot of room for improvement and we'll continue to build on it, but I think it's better to have an imperfect system that works now than a "perfect" system next year (which would probably be outdated by the time it's released anyway).
While I can't share the full setup with you here (and it might not be usable as-is on public clouds, since it's built on Rakuten's own private OneCloud), I do want to share our journey towards CI, to give you some ideas for guiding your own team there (if you're not already using it). Looking back, I think we probably could've skipped a step or two, so maybe this will help you get there faster and more efficiently than we did.
Step 1. Recognize the Problem
When I first joined the team, it was common practice to first run all the tests in whatever repository you were working on, and fix or note down which tests were already failing, before even starting on your own feature. And there would always be failing tests before you started... far more than was reasonable. Thankfully, the team recognized the problem for what it was and began putting some effort into measuring the number of failing tests and their cost in pre-dev time, in preparation for resolving the problem. I won't tell you exactly how many failing tests we found, but I can tell you the number was up in the hundreds.
Step 2. Fix the Problem (spoilers: we only fixed the symptom)
Seeing how many failing tests we had, you might be just as alarmed as I was at first. We put a concerted effort into investigating and fixing as many of these tests as we could. Thankfully, the large majority of them failed only because of the differences between the Docker dev environment and the old Vagrant VMs we used to use. Another large subset failed because the tests hadn't been updated to keep up with changing code and specs. Of course, we also found a handful of actual bugs that we fixed in the process, but thankfully nothing major.
So there you go: we fixed all the tests, problem solved, right? Who needs CI anyway?
Yeah, of course it wasn't that easy. Basically, as soon as we fixed one batch of failing tests, a new release would introduce a fresh set of them.
Step 2.1 Fix the Problem (try again)
At first, being the naive, optimistic new developer that I was, I didn't understand it. Why didn't people just run the tests before release? Eventually, after looking into it, we discovered there were several reasons why, despite our best efforts, failing tests still snuck into our code on a regular basis:
1. Running all the tests was slow: the slower set of integration tests could take 10-30 minutes, and running all the page tests of a single repository took over an hour. Developers didn't want to run every test after each small change, and they would forget to run the full suite at least once just before release.
2. We had some repositories being used as dependencies by multiple other repositories. So when we changed dependency repo A and project repo B for a new feature, we would only run the tests for repos A and B, and not realize that the changes to repo A broke something in repo C, which also depended on repo A (see the sketch after this list).
3. We had some developers using Windows and others using Macs, and some tests would pass on one OS but fail on the other.
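To make #2 concrete, here's roughly what the dependency structure looked like in Gradle terms. The coordinates and version below are made up for illustration, not our actual artifacts:

build.gradle (in both repo B and repo C)

dependencies {
    // both projects pull in repo A as a shared library
    implementation 'com.example:repo-a:1.2.3'
}

A change to repo A that passes repo A's and repo B's tests can still quietly break repo C, whose tests nobody thought to run.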
At first, we tried addressing #1 by refactoring tests to run faster, training developers to remember to run tests (and to determine which subset of tests mattered for their changes), and just putting a bigger emphasis on running tests in general. It worked surprisingly well... until we started getting a steady influx of new people joining our team.
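In practice, the "which subset of tests matter" part mostly came down to Gradle's built-in test filtering, roughly like this (the package name here is a hypothetical stand-in):

# run only the tests for the area you changed
./gradlew test --tests 'com.example.booking.*'

# still run the full suite at least once before release
./gradlew clean test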
Eventually, we took on a migration project that gave us our chance to move away from the architecture of repositories depending on other repositories (resolving #2), and to take our first shot at a fledgling CI system to resolve #1 and #3.
Step 2.2 Fix the Problem (with CI this time!)
With the migrated codebase, we decided to include test-running in our Jenkins job for deploying to staging (STG) and production (PROD). Basically, we force the developer to prove that the tests pass before we allow any deployment to either STG or PROD. The main challenge in setting this up was figuring out how to provide the external dependencies for the tests, as we didn't want our tests interfering with our STG or PROD databases or API providers. The solution we went with was to run a set of Docker containers on the Jenkins server to replicate our local dev environment.
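For illustration, the compose file was along these lines. The images, ports, and credentials below are placeholders rather than our actual stack; the real file mirrors the containers from our local dev environment:

docker-compose.yml

version: "3.8"
services:
  db:
    # stand-in for the service's real database
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: test
      MYSQL_DATABASE: app_test
    ports:
      - "3306:3306"
  api-stub:
    # stubs out external API providers so tests never touch STG/PROD systems
    image: wiremock/wiremock
    ports:
      - "8080:8080"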
Here's the structure of our Jenkinsfile to give you an idea of how this works:
Jenkinsfile
pipeline {
    agent any
    environment { ... }
    parameters { ... }
    stages {
        stage('Checkout latest from repositories here') {
            steps {
                dir('repo-1') {
                    checkout([
                        $class: 'GitSCM',
                        branches: [[name: "${REPO_1_BRANCH}"]],
                        userRemoteConfigs: [[credentialsId: 'SomeCredentialId', url: "REPO1_URL"]]
                    ])
                }
                ...
            }
        }
        stage('Test') {
            steps {
                dir('path to jenkins docker files') {
                    sh 'docker-compose down'
                    sh 'docker-compose up --build -d'
                    ...
                }
                dir("path to repository for testing") {
                    sh './gradlew clean && ./gradlew test --tests package.name.for.integration.tests.*'
                }
            }
        }
        stage('Parallel prune and build') {
            parallel {
                stage('docker-compose prune') {
                    ...
                    steps {
                        dir('path to jenkins docker files') {
                            sh 'docker-compose down'
                        }
                        sh 'docker volume prune -f'
                    }
                }
                stage('build') {
                    steps {
                        ...
                        dir("path to repository for deploying") {
                            ...
                            sh "./gradlew assemble --info"
                        }
                    }
                }
            }
        }
        stage('Deploy') {
            ...
        }
        ...
    }
    post { ... }
}
Step 3. Enjoy the Results (with new pain points)
This system showed its benefits right away. Less than a week after it was up and running, Jenkins blocked its first deployment, forcing the dev team to fix their broken tests before release. With the new system in place, releasing new features for this repository could happen much faster, with basically no pre-development time spent fixing old broken tests. However, it did have a downside: since the tests ran inside the deployment job, they added a lot of waiting time to every STG deployment and PROD release.
Step 4. Improve the Solution
Another developer, Tasuku Nakagawa, from a different team in the same group, decided to take a different approach for his own service. Instead of running the tests on deployment, which inflated our STG test and PROD release times, he set up a system that runs the tests on each git push, with the results reported back and displayed on Bitbucket. Our team wanted to adapt his system to our own service as well.
The CI system he set up for his team used a Jenkins deployment running on OneCloud's Container-as-a-Service (CaaS) platform (Rakuten's proprietary counterpart to offerings like AWS, GCP, and AKS), instead of the physical Jenkins servers we use for deployment. Since our group as a whole wants to eventually migrate everything to the cloud, we decided to follow suit. Configuring the necessary Kubernetes files and modifying the Docker files was actually quite challenging for me, as I had only limited prior experience with Kubernetes. However, with Nakagawa-san's help and his proof of concept as a reference, we were able to adapt a similar system to our service.
With this change, the CI flow now separates testing from deployment: the tests run on every push to an open pull request.
We then set up a rule on Bitbucket that prevents merging feature branches into the master branch until Jenkins has reported to Bitbucket that the latest commit passed all tests. This achieves the same gatekeeping against letting failing tests into releases, while keeping our deployments and releases as slim as possible.
Now the actual deployment can be simplified back to checkout, build, and deploy, with no test stage adding to the wait.
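As a minimal sketch of that simplified pipeline (the branch parameter, credential ID, repository URL, and deploy step below are placeholders, not our exact job):

Jenkinsfile

pipeline {
    agent any
    stages {
        stage('Checkout') {
            steps {
                checkout([
                    $class: 'GitSCM',
                    branches: [[name: "${params.BRANCH}"]],
                    userRemoteConfigs: [[credentialsId: 'SomeCredentialId', url: 'REPO_URL']]
                ])
            }
        }
        stage('Build') {
            steps {
                sh './gradlew assemble --info'
            }
        }
        stage('Deploy') {
            steps {
                // hypothetical deploy step; the real one depends on your infrastructure
                sh './deploy.sh'
            }
        }
    }
}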
For those of you who are interested, here's the structure of the Jenkinsfile and ci-pod.yaml we use for running tests on Kubernetes:
Jenkinsfile
pipeline {
    agent {
        kubernetes {
            // run the build steps inside the pod defined in ci-pod.yaml
            defaultContainer 'project'
            yamlFile 'ci-pod.yaml'
        }
    }
    environment { ... }
    stages {
        stage('Set up DB') {
            steps {
                container('db') { ... }
            }
        }
        stage('Test') {
            steps {
                script {
                    try {
                        sh './gradlew clean build'
                        // report the result back to Bitbucket so the merge check can see it
                        notifyBitbucket(buildStatus: 'SUCCESSFUL')
                    } catch (e) {
                        notifyBitbucket(buildStatus: 'FAILED')
                        error 'test failed'
                    }
                }
            }
        }
    }
    ...
}
ci-pod.yaml
apiVersion: v1
kind: Pod
metadata: ...
spec:
  imagePullSecrets: ...
  securityContext: ...
  containers:
    - name: project
      image: base-image-name-for-running-tests
      resources: ...
      # keep the container alive; the Jenkins Kubernetes plugin execs the build steps into it
      command: [ "sleep" ]
      args: [ "infinity" ]
      volumeMounts:
        # persistent Gradle home, so dependencies are cached between builds
        - name: volume-name-for-gradle-conf
          mountPath: "/home/gradle/.gradle"
    - name: db
      image: base-image-name-for-db-container
      resources: ...
      command: [ "sleep" ]
      args: [ "infinity" ]
      ...
  volumes:
    - name: volume-name-for-gradle-conf
      persistentVolumeClaim:
        claimName: ...
Conclusion
That's basically where we are today. Since this is a story that spans three years, there's obviously a lot of detail I had to leave out to keep it to a reasonable length, but I think there's enough here for you to see the big picture. In hindsight, I think we could have skipped a step or two, but I don't think any of it was a waste, as we improved and learned a little each time. If you're working in a team without its own CI pipelines, hopefully this gives you some ideas on how to get started.
Obligatory Plug
Thanks for reading. I hope this will be useful (or at least mildly entertaining) for you. If you’re interested in looking into and solving interesting problems like these, consider applying to join us at Rakuten.