Skip to content

Latest commit

 

History

History
49 lines (40 loc) · 4.08 KB

README.org

File metadata and controls

49 lines (40 loc) · 4.08 KB

rust_1brc

rust_1brc is a Rust implementation for the One Billion Requests Challenge (1BRC). The challenge involves processing one billion temperature measurements to calculate the minimum, mean, and maximum temperatures per weather station. This project aims to explore Rust’s capabilities for efficient data handling and processing. The main motivation for undertaking this challenge was to leverage Rust’s std::mpsc and std::thread libraries. The challenge is well-suited for parallelization, making it an ideal choice for exploring these libraries.

The input file can be obtained by using one of the official scripts, such as this Python version. This script generates a text file containing one billion temperature measurements, approximately 13GB in size, referred to as measurements.txt. The input file uses names from the weather_stations.csv allowing us to optimize our hash function specifically for the official dataset. By tailoring the hash function to this specific set, we achieve higher performance compared to Rust’s standard library implementation, which is generally more collision-resilient and secure.

This project will load the _entire_ file into memory and process it using all available CPU cores to maximize performance.

Results

To measure performance we use hyperfine on the release build with the --warmup 2 option. On a MacBook M1 Pro (2021) the project processes the input file in ~2.75 seconds.

Benchmark 1: ./target/release/rust_1brc ./measurements.txt
  Time (mean ± σ):      2.746 s ±  0.017 s    [User: 23.851 s, System: 1.659 s]
  Range (min … max):    2.718 s …  2.777 s    10 runs

Usage

To use this project start by cloning the repository,

git clone https://github.com/hesampakdaman/rust_1brc.git
cd rust_1brc

then you can build and run the project with the following command.

cargo run --release /path/to/measurements.txt

Architecture

This section provides a high-level overview of the system’s architecture: the main components and their interactions.

Main Components

main.rs
The entry point of the application. Handles command-line arguments.
pipeline.rs
Orchestrating the processing workflow by calling other modules.
pre_processing.rs
Manages the initial parsing and preparation of data, utilizing memory-mapped files for efficient access.
compute.rs
Contains the core logic for processing the temperature data, including calculations for min, mean, and max temperatures.
aggregate.rs
Responsible for aggregating the results of the temperature data processing.
weather.rs
Defines data structures and utility functions that are used throughout the application.

Workflow

  1. Initialization: The application starts in main.rs, where it parses command-line arguments to get the path of the input file.
  2. Orchestration: pipeline.rs sets up the workflow by calling other modules.
  3. Data Loading: pre_processing.rs handles the loading of the input data file using memory-mapped files to efficiently manage large data volumes.
  4. Data Processing: compute.rs processes the loaded data, calculating the required statistics (min, mean, max) for each weather station.
  5. Aggregation: aggregate.rs aggregates the computed results for final output.
  6. Output: Results are then output in the format specified by the challenge requirements.

Boundaries

  • Module Boundaries: Clear separation between orchestration (pipeline.rs), data loading (pre_processing.rs), data processing (compute.rs), aggregation (aggregate.rs), and utility functions (weather.rs).
  • Separating I/O: The application logic is off-loaded to business logic functions within pipeline.rs, ensuring that core algorithms remain focused on computation without being coupled to input/output operations.