Benchmarking Learned Bloom Filters

About ℹ️

Motivation

This project was created as part of a Bachelor project at the IT University of Copenhagen in spring 2023. It aims to provide an open and transparent way to benchmark various Learned Bloom Filters. The Bloom Filter implementations are located in /bloom_filters.

Data sets

Two different data sets are provided.

URL data set

A data set containing 450,176 labeled URLs: 345,738 benign and 104,438 malicious. The data set is provided in /data/raw/url_data (source).
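The counts above imply a noticeably imbalanced data set. As a quick sanity check, the following minimal sketch recomputes the total and the class shares from the two numbers quoted in this README; it does not read the actual files in /data/raw/url_data, and the function name is only illustrative.

```python
# Counts quoted in this README for the URL data set.
BENIGN = 345_738
MALICIOUS = 104_438
TOTAL = BENIGN + MALICIOUS


def class_shares(benign: int, malicious: int) -> tuple[float, float]:
    """Return the (benign, malicious) fractions of the whole data set."""
    total = benign + malicious
    return benign / total, malicious / total


benign_share, malicious_share = class_shares(BENIGN, MALICIOUS)
print(f"total URLs: {TOTAL}")
print(f"benign: {benign_share:.1%}, malicious: {malicious_share:.1%}")
```

Roughly three out of four URLs are benign, which is worth keeping in mind when interpreting classifier accuracy on this data.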

To vectorize the URL data run:

make vectorize

Synthetic Zipfian data set

A synthetic Zipfian data set is provided. It can be regenerated with:

make zipf
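For readers unfamiliar with Zipfian data, the sketch below illustrates the general idea: the element of rank i is drawn with probability proportional to 1 / i^s. This is not the code behind `make zipf`; the function names, the exponent `s`, and the seed are assumptions chosen for illustration, and only the Python standard library is used.

```python
import random


def zipf_probabilities(n: int, s: float = 1.0) -> list[float]:
    """Return p_1..p_n with p_i proportional to 1 / i**s (hypothetical helper)."""
    weights = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_zipf(n: int, size: int, s: float = 1.0, seed: int = 0) -> list[int]:
    """Draw `size` ranks from 1..n according to the Zipfian probabilities."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    probs = zipf_probabilities(n, s)
    return rng.choices(range(1, n + 1), weights=probs, k=size)


probs = zipf_probabilities(n=5)
samples = sample_zipf(n=5, size=10)
print(probs)
print(samples)
```

With `s = 1.0`, rank 1 is twice as likely as rank 2 and five times as likely as rank 5; larger values of `s` make the distribution more skewed.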

Installation ⚙️

1. Clone the repository

git clone git@github.com:BSc-learned-indexes/daisy-bf.git
cd daisy-bf

2. Recommended: Creating a virtual environment

We recommend that you install this project's dependencies in an isolated environment. If you are unfamiliar with this concept, you can read more about it here.

Create the environment

python -m venv ~/.virtualenvs/daisy

Source the environment

source ~/.virtualenvs/daisy/bin/activate
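If you are unsure whether the environment is active, a small check like the following may help. It is a minimal sketch that relies on the standard library only: inside an environment created with `venv`, `sys.prefix` points at the environment while `sys.base_prefix` points at the base Python installation. The function name is illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

Run it with the same `python` you intend to use for `pip install`; if it reports `False`, source the environment again before installing dependencies.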

3. Installing dependencies

pip install -r requirements.txt 

Usage 📈

Benchmarking the Bloom Filters

We have provided a template to run a benchmarking experiment with the following settings:

  • Large Random Forest Classifier as model
  • qx = 1 - px as the query distribution
  • URL data set
  • Full key set
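The query distribution setting can be illustrated with a short sketch. This README does not define px, so the example assumes that px is a per-element score in [0, 1] (for instance, a model's estimated probability that x is a key) and that the query weights 1 - px are normalized to sum to 1. Both assumptions, and the function name, are illustrative rather than taken from the project code.

```python
def query_distribution(p: list[float]) -> list[float]:
    """Assumed reading: q_x is proportional to 1 - p_x, normalized to sum to 1."""
    if any(not 0.0 <= px <= 1.0 for px in p):
        raise ValueError("every p_x must lie in [0, 1]")
    weights = [1.0 - px for px in p]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("at least one p_x must be below 1")
    return [w / total for w in weights]


# Hypothetical scores for three elements, from "likely a key" to "unlikely".
p = [0.9, 0.5, 0.1]
q = query_distribution(p)
print(q)
```

Under this reading, elements that look least like keys are queried most often, so the element with the lowest score receives the largest query weight.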

A series of make commands are provided to build the filters:

Build Adaptive Learned Bloom Filter

make adabf 

Build Partitioned Learned Bloom Filter

make plbf 

Build Daisy Bloom Filter

make daisy 

Build all Bloom Filters

Note: this command takes a while 🐌

make all 

Plot all Bloom Filters

make plot_all 

Plot all the Learned Bloom Filters (excludes the regular Bloom Filter)

make plot_learned_bf

Example output 🖼️

Example: 0.1% key to non-key ratio, query distribution: qx = 1 - px, model: Large Random Forest Classifier.

Extra 🤓

  • The directory /experiments contains all the data presented in the Experiments section of the Bachelor's thesis.
  • The thesis can be read in /thesis.