Cloud-Era Testing Infrastructure: Test everything, everywhere

Get to know our solution for shifting prod-like tests left without sacrificing dev velocity - one that also makes the testing environment, whether it's containerized CI runners or a developer running the tests locally, completely transparent to the test suite.

With microservice-oriented architectures becoming the go-to choice, engineers face many new challenges. Let’s dive into one in particular - creating tests that guarantee two core goals of any testing suite:

• Correctness: tests must mimic real-world scenarios
• Velocity: tests must be easy to create, execute and investigate

Usually, as we try to shift correctness guarantees left (towards the development phases), we incorporate prod-like components into the test suites.

A major enabler for that strategy is the ability to use containers programmatically as part of the test suite. Containers are the perfect vessel for mimicking real-world conditions, as they are often the very same components used in production, excluding configuration and resource differences that usually do not compromise correctness.

Additionally, containers can be used to execute the tests themselves, as they provide a consistent runtime environment, eliminating the infamous class of problems known as “but it works on my machine”. However, taking a closer look at the container-based testing approach, there are important things to consider.

It starts with one

Let’s look at a simple case - db migration tests. There are many approaches to implementing such tests, but a valid solution is to spin up a db container, run the migrations on top of it, and verify they were executed successfully.

That’s great! This test case by itself sets a high bar for correctness, as the migrations will be executed against a prod-like db, and booting one container is a speedy task these days.
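The flow of such a test can be sketched in Go. Here `startDB` and `migrate` are injected placeholders standing in for whatever container library and migration tool you actually use (e.g. testcontainers-go and golang-migrate) - the stubs in `main` only show the shape of the test:

```go
package main

import "fmt"

// runMigrationTest boots a prod-like database, runs the migrations
// against it, and verifies the resulting schema version. The start
// and migrate steps are injected so the sketch stays library-agnostic.
func runMigrationTest(
	startDB func() (dsn string, terminate func(), err error),
	migrate func(dsn string) (version int, err error),
	wantVersion int,
) error {
	dsn, terminate, err := startDB()
	if err != nil {
		return fmt.Errorf("boot db container: %w", err)
	}
	// Always tear the container down, even if the test fails -
	// otherwise it hangs around forever.
	defer terminate()

	version, err := migrate(dsn)
	if err != nil {
		return fmt.Errorf("run migrations: %w", err)
	}
	if version != wantVersion {
		return fmt.Errorf("schema version = %d, want %d", version, wantVersion)
	}
	return nil
}

func main() {
	// Stub wiring: a fake container and a fake migration runner.
	err := runMigrationTest(
		func() (string, func(), error) { return "postgres://localhost:5432/test", func() {}, nil },
		func(string) (int, error) { return 3, nil },
		3,
	)
	fmt.Println(err) // <nil>
}
```

The deferred `terminate` is the important detail: it is what keeps a failing test from leaking containers.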

Taking a closer look at the above solution, we’ll see we’ve introduced new challenges to our testing ecosystem:

• That container will now boot every time the tests run, whether on CI workers or developers' machines, and if the tests do not terminate cleanly and the container is mishandled - it will hang there forever.
• If we are using containerized test runners, we will need docker-in-docker, which is not always trivial.

When trying to scale this approach, we will soon encounter performance issues, as spinning up multiple containers during the test phase increases resource usage significantly, especially in shared development environments and on CI workers that run multiple workloads in parallel against the same Docker daemon.

Test velocity usually decreases as we introduce more production-like dependencies.

At groundcover, we faced those challenges and decided to look for a new solution that would preserve three core testing goals:

  1. Production-like. Preserve the same proximity to production we’ve achieved with containers.
  2. Fast. Accelerate test execution by at least 2x, including during “rush hours” (periods with a high volume of test suite executions).
  3. Transparent to context. Do not branch the testing framework according to different testing contexts (CI/dev/local/containerized).

Flat earth

After analyzing our stack from a testing perspective, and researching existing solutions and tools that can be leveraged for the task, we came up with the following strategy: 

  1. Centralize the containers used in the tests and boost their resources
  2. Use app-level multi-tenancy in the centralized containers, as we concluded that it does not compromise correctness
  3. Use a cloud-native VPN solution to bring dev environments, CI and the centralized containers under one LAN.

In our case, the designated containers to be centralized were our logs, metrics and traces databases (Loki, VictoriaMetrics and TimescaleDB respectively).

All of those databases support multi-tenancy, which allowed us to isolate each test suite (or a specific test if needed) in its own tenant. A cool, unexpected added value of this solution is that since the databases are now long-running, we could connect them to our Grafana and explore the data generated during the testing phase - nice!
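As a sketch of how that isolation can look in practice: each suite tags its requests with its own tenant ID. Loki, for example, scopes requests by the `X-Scope-OrgID` header (the URL and the tenant naming scheme below are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
)

// newTenantRequest builds a request scoped to one test suite's tenant.
// X-Scope-OrgID is the multi-tenancy header Loki uses to isolate
// tenants; each suite run gets its own tenant ID.
func newTenantRequest(url, tenantID string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Scope-OrgID", tenantID)
	return req, nil
}

func main() {
	// Hypothetical tenant naming: "suite-<run id>".
	req, err := newTenantRequest("http://loki:3100/loki/api/v1/query", "suite-1234")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("X-Scope-OrgID")) // suite-1234
}
```

Because the tenant lives entirely in the request, suites sharing one centralized database never see each other's data.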

To flatten the network layer, we used Tailscale as a VPN sidecar in each database pod and connected the CI runners and our dev workstations to the same VPN. This made the centralized databases accessible from dev and CI contexts (whether containerized or local) in exactly the same way, keeping the testing framework simple and branchless in that sense.
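A minimal sketch of the branchless endpoint resolution this enables - the same lookup runs in every context, with an env override for special cases (the env key and the VPN hostname below are placeholders):

```go
package main

import (
	"fmt"
	"os"
)

// dbEndpoint resolves a database address identically in every context
// (local dev, containerized CI): take an env override if one is set,
// otherwise fall back to the VPN hostname that the Tailscale sidecar
// makes reachable from everywhere.
func dbEndpoint(envKey, vpnDefault string) string {
	if v := os.Getenv(envKey); v != "" {
		return v
	}
	return vpnDefault
}

func main() {
	// No override set: every caller, in any context, gets the VPN address.
	fmt.Println(dbEndpoint("TESTS_LOKI_ADDR", "http://loki-test.tailnet:3100"))
}
```

The point is what is absent: no "am I in CI?" or "am I in a container?" branches anywhere in the testing framework.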

Here is a diagram of our current solution:


• Test execution time is ~2.5x faster.
• Lower resource consumption on CI workers and dev workstations, as almost no containers are being booted.
• Logs during tests are much more accessible, as we can pause a test and check the data in a Grafana dashboard instead of logging into a container.

Ease of use
Tests can be run by simply running `go test`, or by using the build target that runs them within a container - there are no differences between the two testing scenarios.

Long term
More scenarios can be shifted left towards development phases and basic CI checks, making release cycles much shorter as there are no long, blocking integration tests.

Of course, there’s no free lunch, and as with any k8s deployment, we monitor the new centralized databases. Fortunately, we have a really nice monitoring solution :) You should give it a try too!

In the SaaS era, velocity rules. Although cloud technologies introduce new complexities, they also bring new ways to scale and accelerate software crafting. We are happy with our current solution, but we are always on the lookout for better ones - feel free to share yours with us!

November 18, 2022

