Skip to content

asatarin/testing-distributed-systems

Repository files navigation

List of resources on testing distributed systems curated by Andrey Satarin. If you are interested in my other stuff, check out public talks. For any questions or suggestions you can reach out to me on Twitter, Bluesky @asatarin.bsky.social or other platforms.

{% comment %} Private notes https://docs.google.com/document/d/1xHt_PK9yGMTP6JNDMydQLF4SHIdlq-BF9IZeTOXtIOg/edit {% endcomment %}

Table of Contents

  • A Markdown unordered list which will be replaced with the ToC {:toc}

Overview of Testing Approaches

Research Papers

Bugs

Testing

Fault Tolerance

Resilience In Complex Adaptive Systems

These materials are not directly related to testing distributed systems, but they greatly contribute to general understanding of such systems.

Jepsen

State-of-the-art approach to testing stateful distributed systems.

  • Jepsen Analyses — most recent Jepsen analyses of different distributed systems
  • Jepsen Talks — talks by Kyle Kingsbury at various conferences
  • Aphyr's Jepsen posts — older Jepsen analyses on Kyle Kingsbury's (Aphyr) personal site
  • Jepsen Talks on GitHub — Jepsen talks slides before 2015 on GitHub
  • Kyle Kingsbury on InfoQ
  • Call me maybe: Jepsen and flaky networks — talk on Jepsen, not by Kyle
  • Jepsen is used by Microsoft CosmosDB — founder of Azure CosmosDB confirms, that they are using Jepsen
  • Consistency Models — overview of various consistency models for distributed systems with transactional and non-transactional semantics. This page gives bird's-eye view on guarantees distributed systems might provide with references to do a deep dive.
  • Maelstrom — A workbench for writing toy implementations of distributed systems. Provides tests and simple I/O protocol to test simple implementation of distributed systems written in any language. All testing happens on one node, network is fully simulated.

Elle transactional consistency checker for black-box databases:

Some notable Jepsen analyses:

Jepsen is used by CockroachDB, VoltDB, Cassandra, ScyllaDB, YDB, MariaDB and others.

Formal Methods

TLA+

Companies using TLA+ to verify correctness of algorithms:

Deterministic Simulation

Pioneered by FoundationDB, deterministic simulation approach to testing distributed systems gained more popularity in recent years.

More companies and systems adopt deterministic simulation as a primary testing strategy:

See also autonomous testing, FoundationDB.

Autonomous Testing

This approach is currently represented by Antithesis — pioneers in autonomous testing, defining the space and the state of the art. Will Wilson (of FoundationDB fame) is one of the founders.

See also deterministic simulation, FoundationDB and fuzzing.

Lineage-driven Fault Injection

Netflix adopted lineage-driven fault injection techniques for testing microservices.

Chaos Engineering

Netflix pioneered chaos engineering discipline.

Fuzzing

There are two flavors of fuzzing. First, randomized concurrency testing, where the ordering of messages is fuzzed:

And input fuzzing, where message contents or user inputs are fuzzed:

See also autonomous testing.

Microservices

Amazing and comprehensive overview of different strategies to test systems built with microservices by Cindy Sridharan.

Series of blog posts specifically on testing in production — best practices, pitfalls, etc:

Performance and Benchmarking

See also benchmarking tools.

Misc

Testing in a Distributed World

Great overview of techniques for testing distributed systems from practitioner, the video did age well and still an excellent overview of the landscape. Additional materials could be found in this GitHub repo

Game Days

Technologies for Testing Distributed Systems

Colin Scott shares his viewpoint from academia on testing distributed systems, specifically regression testing for correctness and performance bugs.

Test Case Reduction

Specific Approaches in Different Distributed Systems

Google

Amazon Web Services

See also formal methods and deterministic simulation sections.

Netflix

Automated failure injection (see also Lineage-driven Fault Injection):

Random/manual failure injection testing:

See also chaos engineering and lineage-driven fault injection.

Microsoft

See also formal methods section.

Meta

  • BellJar: A new framework for testing system recoverability at scale — BellJar is a testing framework focused on answering question "What service dependencies are required for the service to recover after large scale disaster?". BellJar puts service in a vacuum environment with only handful of direct dependencies allow-listed to verify that recovery procedures succeed under those constraints. It checks those recovery procedures in CI/CD pipeline preventing unconstrained growth of dependency graph and circular dependencies. Based on BellJar tests one can construct the entire dependency graph of the services allowing to boostrap them in the correct order from bottom to top.
  • Vacuum Testing for Resiliency: Verifying Disaster Recovery in Complex — talk on how BellJar is used at Meta to test recovery of distributed systems

FoundationDB

See also deterministic simulation and autonomous testing.

Cassandra

ScyllaDB

They published series of blog posts on testing ScyllaDB:

Dropbox

  • Mysteries of Dropbox Property-Based Testing of a Distributed Synchronization Service— example of how to use QuickCheck to test synchronization in Dropbox and similar tools (Google Drive). John Hughes gave a talk on this. See also QuickCheck.
  • Data Checking at Dropbox — If you have lots of data, you have to verify that it did not suffer from bit rot and protect it against rare bugs (e.g. race conditions) to guarantee long term durability. This talks explains intricacies of building data consistency checker(s) at Dropbox scale.
  • Dropbox's Exabyte Storage System (aka Magic Pocket) talk by James Cowling — describes number of strategies to achieve extremely high durability. This includes:
    • guard against faulty disks,
    • guard against software defects,
    • guard against black swan events,
    • operational safeguards to reduce blast radius,
    • safeguards against deletes with multi stage soft-delete,
    • comprehensive testing strategy in-depth with increased scale,
    • redundancy across various axis in software and hardware stacks,
    • continuous data integrity validation on many levels,
    • etc
  • Testing sync at Dropbox — comprehensive overview of two test frameworks at Dropbox for new sync engine implementation. CanopyCheck — single threaded and fully deterministic randomized testing framework with minimization for synchronization planner component of the engine. The other framework Trinity focuses on concurrency and larger surface area of components. Great discussion on tradeoffs between determinism, strength of test oracles vs width of coverage and size of the system under test.

Elastic (Elasticsearch)

See also formal methods section.

MongoDB

See also formal methods section.

Confluent (Kafka)

See also formal methods section.

CockroachLabs (CockroachDB)

See also:

SingleStore

Formerly known as MemSQL.

Twitter

LinkedIn

Salesforce

VoltDB

Series of post on testing at VoltDB:

Additional resources:

PingCap (TiDB)

See also:

Cloudera

Wallaroo Labs

There is also talk from Sean T. Allen on testing stream processing system at Wallaroo Labs (ex. Sendence)

YugabyteDB

FaunaDB

Shopify

Hazelcast

Basho (Riak)

Etcd

Red Planet Labs

See also deterministic simulation section.

Atomix Copycat

Druid.io

TigerBeetle

See also deterministic simulation section.

Convex

See also QuickCheck, FoundationDB, Dropbox, Jepsen, deterministic simulation.

RisingWave

In a series of two blog posts, RisingWave team talks about their experience using deterministic simulation for testing distributed SQL-based stream processing platform:

As a result of this work, they open sourced MadSim — Magical Deterministic Simulator for the Rust language ecosystem.

See also deterministic simulation section.

YDB

See also Jepsen.

Feldera

  • Correctness at Feldera — overview of correctness approaches used at Feldera, a strongly consistent incremental compute engine, includes:
    • machine checked proof of the foundational DBSP algorithm using Lean theorem prover
    • differential (shadow) testing of the implementation
    • large corpus of tests reused from other SQL systems (MySQL, Postgres, Data Fusion, SQL Logic Tests, etc)
    • metamorphic tests with SQLancer
    • manually written automatic tests
    • fault tolerance, model-based and fuzzing tests for the control plane
  • Formalization of DBSP — GitHub repository with machine checked proof of the DBSP algorithm using Lean theorem prover

Single Node Systems

These examples are not about distributed systems, but they demonstrate testing concurrency and level of sophistication required in distributed systems.

Concurrency

Testing concurrent code is one of the challenges in single node as well as distributed systems. These tools help to test both lock based and lock-free concurrent code on various platforms.

JCStress

LinCheck

Other

SQLite

SQLite is not a distributed system by any stretch of the imagination, but provides good example of comprehensive testing of a database implementation.

Sled

See also deterministic simulation section.

Clickhouse

MariaDB

Tools

Network Simulation

QuickCheck

Benchmarking

Linkbench

YCSB