ImputeGAP is a comprehensive framework designed for time series imputation algorithms. It offers a streamlined interface that bridges algorithm evaluation and parameter tuning, utilizing datasets from diverse fields such as neuroscience, medicine, and energy. The framework includes advanced imputation algorithms from five different families, supports various patterns of missing data, and provides multiple evaluation metrics. Additionally, ImputeGAP enables AutoML optimization, feature extraction, and feature analysis. The framework enables easy integration of new algorithms, datasets, and evaluation metrics.
- Documentation: https://exascaleinfolab.github.io/ImputeGAP/
- PyPI: https://pypi.org/project/imputegap/
- Datasets: Repository
Requirements | Installation | Preprocessing | Contamination | Auto-ML | Explainer | Integration | References | Contributors
The following prerequisites are required to use ImputeGAP:
- Python version 3.10 / 3.11 / 3.12 (recommended)
- Unix-compatible environment for execution
To create and set up an environment with Python 3.12, please refer to the installation guide.
To quickly install the latest version of ImputeGAP along with its dependencies from the Python Package Index (PyPI), run the following command:
$ pip install imputegap
To modify the code of ImputeGAP or contribute to its development, you can install the library from source:
- Clone the project from GitHub:
$ git clone https://github.com/eXascaleInfolab/ImputeGAP
$ cd ./ImputeGAP
- Once inside the project directory, run the following command to install the package in editable mode:
$ pip install -e .
The data management module can load any time series dataset stored as plain text, provided the file follows the (values, series) layout: columns are separated by a single space and rows by a newline.
You can find this example in the file runner_loading.py.
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils
# 1. initiate the TimeSeries() object that will stay with you throughout the analysis
ts_1 = TimeSeries()
# 2. load the timeseries from file or from the code
ts_1.load_timeseries(utils.search_path("eeg-alcohol"), max_series=5, max_values=15)
ts_1.normalize(normalizer="z_score")
# [OPTIONAL] you can plot your raw data / print the information
ts_1.plot(input_data=ts_1.data, max_series=10, max_values=100, save_path="./imputegap/assets")
ts_1.print(limit_series=10)
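The text layout and the z-score step described above can be sketched with NumPy alone. This is an illustration, not ImputeGAP code: the toy values and the file name are invented, and the choice to normalize each series separately is an assumption (the library may compute its statistics over the whole matrix instead).

```python
import os
import tempfile

import numpy as np

# Toy data, invented for illustration: 4 values (rows) x 3 series (columns).
matrix = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])

# Write one row per line, with a single space between columns.
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.txt")
np.savetxt(path, matrix, delimiter=" ", fmt="%.4f")

# Read the file back; np.loadtxt splits on whitespace by default.
loaded = np.loadtxt(path)

# Z-score each series (column): subtract its mean, divide by its standard deviation.
# Assumption: per-series statistics, which may differ from the library's behavior.
z_scored = (loaded - loaded.mean(axis=0)) / loaded.std(axis=0)
```

After this step every column of `z_scored` has mean 0 and standard deviation 1, which puts series recorded on very different scales on a common footing.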
ImputeGAP can contaminate a complete dataset with missing-data patterns that mimic real-world scenarios. The available patterns are: MCAR, MISSING PERCENTAGE, and BLACKOUT.
For more details, please refer to the documentation on this page.
You can find this example in the file runner_contamination.py.
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils
# 1. initiate the TimeSeries() object that will stay with you throughout the analysis
ts_1 = TimeSeries()
# 2. load the timeseries from file or from the code
ts_1.load_timeseries(utils.search_path("eeg-alcohol"))
ts_1.normalize(normalizer="min_max")
# 3. contamination of the data with MCAR pattern
incomp_data = ts_1.Contamination.mcar(ts_1.data, series_rate=0.2, missing_rate=0.2, seed=True)
# [OPTIONAL] you can plot your raw data / print the contamination
ts_1.print(limit_timestamps=12, limit_series=7)
ts_1.plot(ts_1.data, incomp_data, max_series=9, subplot=True, save_path="./imputegap/assets")
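To make the two rates in the MCAR call above concrete, here is a minimal NumPy sketch of one way such a contamination could work. `mcar_contaminate` is a hypothetical helper, not the library's implementation: it assumes `series_rate` is the share of series that receive gaps and `missing_rate` is the share of values hidden in each of those series, with positions drawn uniformly at random. The real pattern may differ, for example by removing contiguous blocks.

```python
import numpy as np


def mcar_contaminate(data, series_rate=0.2, missing_rate=0.2, seed=42):
    """Hypothetical helper: hide values uniformly at random in a subset of series.

    `data` holds one row per value (timestamp) and one column per series.
    """
    rng = np.random.default_rng(seed)
    contaminated = np.array(data, dtype=float)  # np.array copies by default
    n_values, n_series = contaminated.shape

    # Choose which series receive missing values.
    n_hit = max(1, round(series_rate * n_series))
    hit_series = rng.choice(n_series, size=n_hit, replace=False)

    # In each chosen series, hide a fixed share of positions.
    n_missing = round(missing_rate * n_values)
    for s in hit_series:
        rows = rng.choice(n_values, size=n_missing, replace=False)
        contaminated[rows, s] = np.nan
    return contaminated


# Toy matrix, invented for illustration: 10 values x 5 series.
complete = np.arange(50, dtype=float).reshape(10, 5)
incomplete = mcar_contaminate(complete, series_rate=0.4, missing_rate=0.3)
```

Fixing the seed makes the contamination reproducible, so different imputation algorithms can be compared on exactly the same gaps.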
ImputeGAP provides a diverse selection of imputation algorithms, organized into five main categories: Matrix Completion, Deep Learning, Statistical Methods, Pattern Search, and Graph Learning. You can also add your own custom imputation algorithm by following the min-impute template and replacing its placeholder code with your own logic.
You can find this example in the file runner_imputation.py.
from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils
# 1. initiate the TimeSeries() object that will stay with you throughout the analysis
ts_1 = TimeSeries()
# 2. load the timeseries from file or from the code
ts_1.load_timeseries(utils.search_path("eeg-alcohol"))
ts_1.normalize(normalizer="min_max")
# 3. contamination of the data
incomp_data = ts_1.Contamination.mcar(ts_1.data)
# [OPTIONAL] save your results in a new Time Series object
ts_2 = TimeSeries().import_matrix(incomp_data)
# 4. imputation of the contaminated data
# choice of the algorithm, and their parameters (default, automl, or defined by the user)
cdrec = Imputation.MatrixCompletion.CDRec(ts_2.data)
# imputation with default values
cdrec.impute()
# OR imputation with user defined values
# >>> cdrec.impute(params={"rank": 5, "epsilon": 0.01, "iterations": 100})
# [OPTIONAL] save your results in a new Time Series object
ts_3 = TimeSeries().import_matrix(cdrec.recov_data)
# 5. score the imputation with the raw_data
cdrec.score(ts_1.data, ts_3.data)
# 6. display the results
ts_3.print_results(cdrec.metrics, algorithm="cdrec")
ts_3.plot(input_data=ts_1.data, incomp_data=ts_2.data, recov_data=ts_3.data, max_series=9, subplot=True, save_path="./imputegap/assets")
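The scoring step above compares the recovered matrix with the original data. As a rough sketch of what such a comparison involves, the snippet below computes RMSE and MAE on the positions that were hidden. The helper `score_on_missing`, the toy values, and the choice to score only the hidden positions are assumptions made for illustration, not a description of the library's metrics.

```python
import numpy as np


def score_on_missing(truth, incomplete, recovered):
    """Hypothetical helper: error metrics restricted to the hidden positions."""
    mask = np.isnan(incomplete)
    errors = truth[mask] - recovered[mask]
    return {
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
    }


# Toy data, invented for illustration: 3 values x 2 series with two hidden cells.
truth = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
incomplete = truth.copy()
incomplete[1, 0] = np.nan
incomplete[2, 1] = np.nan

# Baseline imputer: fill each gap with the mean of the observed values in its column.
column_means = np.nanmean(incomplete, axis=0)
recovered = np.where(np.isnan(incomplete), column_means, incomplete)

scores = score_on_missing(truth, incomplete, recovered)
```

Here the first gap is recovered exactly (the column mean is 3) and the second is off by 3, so MAE is 1.5 and RMSE is the square root of 4.5.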
ImputeGAP provides optimization techniques that automatically identify the optimal hyperparameters for a specific algorithm in relation to a given dataset. The available optimizers are: Greedy Optimizer (GO), Bayesian Optimizer (BO), Particle Swarm Optimizer (PSO), and Successive Halving (SH).
You can find this example in the file runner_optimization.py.
from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils
# 1. initiate the TimeSeries() object that will stay with you throughout the analysis
ts_1 = TimeSeries()
# 2. load the timeseries from file or from the code
ts_1.load_timeseries(utils.search_path("eeg-alcohol"))
ts_1.normalize(normalizer="min_max")
# 3. contamination of the data
miss_matrix = ts_1.Contamination.mcar(ts_1.data)
# 4. imputation of the contaminated data
# imputation with AutoML which will discover the optimal hyperparameters for your dataset and your algorithm
cdrec = Imputation.MatrixCompletion.CDRec(miss_matrix).impute(user_def=False, params={"input_data": ts_1.data, "optimizer": "bayesian", "options": {"n_calls": 3}})
# 5. score the imputation with the raw_data
cdrec.score(ts_1.data, cdrec.recov_data)
# 6. display the results
ts_1.print_results(cdrec.metrics)
ts_1.plot(input_data=ts_1.data, incomp_data=miss_matrix, recov_data=cdrec.recov_data, max_series=9, subplot=True, save_path="./imputegap/assets", display=True)
# 7. save hyperparameters
utils.save_optimization(optimal_params=cdrec.parameters, algorithm="cdrec", dataset="eeg", optimizer="t")
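The idea behind hyperparameter optimization can be illustrated with a much simpler stand-in than the optimizers listed above: try each candidate value, impute, score against the known ground truth, and keep the best. The sketch below does this exhaustively for the rank of a toy low-rank imputer. `svd_impute` and `search_rank` are hypothetical names; the imputer is a plain iterative truncated SVD, not CDRec, and the search is a brute-force scan, not Bayesian optimization.

```python
import numpy as np


def svd_impute(incomplete, rank, iterations=20):
    """Toy imputer (NOT CDRec): repeatedly refill gaps from a rank-`rank` SVD."""
    mask = np.isnan(incomplete)
    # Start from column means, then refine the gaps iteratively.
    filled = np.where(mask, np.nanmean(incomplete, axis=0), incomplete)
    for _ in range(iterations):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        filled[mask] = low_rank[mask]
    return filled


def search_rank(truth, incomplete, candidate_ranks):
    """Score every candidate rank by RMSE on the hidden cells and return the best."""
    mask = np.isnan(incomplete)
    scores = {}
    for rank in candidate_ranks:
        recovered = svd_impute(incomplete, rank)
        scores[rank] = float(np.sqrt(np.mean((truth[mask] - recovered[mask]) ** 2)))
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic rank-2 matrix, invented for illustration: 30 values x 6 series.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 6))
incomplete = truth.copy()
hidden = rng.choice(truth.size, size=18, replace=False)
incomplete.flat[hidden] = np.nan

best_rank, rank_scores = search_rank(truth, incomplete, candidate_ranks=(1, 2, 3))
```

A brute-force scan is only practical for a handful of candidates. Model-based or population-based optimizers aim to reach a good configuration with fewer evaluations when the search space is larger.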
ImputeGAP allows users to explore which features of the data affect the imputation results through SHapley Additive exPlanations (SHAP). To make the SHAP results easier to interpret, ImputeGAP groups the extracted features into four categories: geometry, transformation, correlation, and trend.
You can find this example in the file runner_explainer.py.
from imputegap.recovery.manager import TimeSeries
from imputegap.recovery.explainer import Explainer
from imputegap.tools import utils
# 1. initiate the TimeSeries() object that will stay with you throughout the analysis
ts_1 = TimeSeries()
# 2. load the timeseries from file or from the code
ts_1.load_timeseries(utils.search_path("eeg-alcohol"))
# 3. call the explanation of your dataset with a specific algorithm to gain insight on the Imputation results
shap_values, shap_details = Explainer.shap_explainer(input_data=ts_1.data, extractor="pycatch22", pattern="mcar", missing_rate=0.25, limit_ratio=1, split_ratio=0.7, file_name="eeg-alcohol", algorithm="cdrec")
# [OPTIONAL] print the results with the impact of each feature.
Explainer.print(shap_values, shap_details)
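For readers unfamiliar with Shapley values, the quantity that SHAP estimates, the sketch below computes them exactly for four players named after the feature categories above. It is a textbook enumeration over all coalitions, not the explainer used by ImputeGAP, and the contribution numbers are invented for the example. The payoff is deliberately additive so that the result is easy to check by hand.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution of each player."""
    n = len(players)
    result = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {player}) - value(coalition))
        result[player] = total
    return result


# Invented toy contributions, for illustration only.
toy_contributions = {
    "geometry": 0.4,
    "transformation": 0.1,
    "correlation": 0.3,
    "trend": 0.2,
}


def toy_value(coalition):
    """Additive payoff: a coalition is worth the sum of its members' contributions."""
    return sum(toy_contributions[name] for name in coalition)


attributions = shapley_values(list(toy_contributions), toy_value)
```

With an additive payoff each player's Shapley value equals its own contribution, and the values sum to the payoff of the full coalition. Exact enumeration grows exponentially with the number of players, which is why practical SHAP tools rely on approximations.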
ImputeGAP enables users to comprehensively evaluate the efficiency of algorithms across various datasets.
You can find this example in the file runner_benchmark.py.
from imputegap.recovery.benchmark import Benchmark
# VARIABLES
save_dir = "./analysis"
nbr_run = 2
# SELECT YOUR DATASET(S) :
datasets_demo = ["eeg-alcohol", "eeg-reading"]
# SELECT YOUR OPTIMIZER :
optimiser_bayesian = {"optimizer": "bayesian", "options": {"n_calls": 15, "n_random_starts": 50, "acq_func": "gp_hedge", "metrics": "RMSE"}}
optimizers_demo = [optimiser_bayesian] # add optimizer you want to test
# SELECT YOUR ALGORITHM(S) :
algorithms_demo = ["mean", "cdrec", "stmvl", "iim", "mrnn"]
# SELECT YOUR CONTAMINATION PATTERN(S) :
patterns_demo = ["mcar"]
# SELECT YOUR MISSING RATE(S) :
x_axis = [0.05, 0.1, 0.2, 0.4, 0.6, 0.8]
# START THE ANALYSIS
list_results, sum_scores = Benchmark().eval(algorithms=algorithms_demo, datasets=datasets_demo, patterns=patterns_demo, x_axis=x_axis, optimizers=optimizers_demo, save_dir=save_dir, runs=nbr_run)
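The structure of such a benchmark (algorithms crossed with missing rates, averaged over several runs) can be sketched in a few lines of NumPy. Everything below is a hypothetical stand-in: the two baseline imputers, the uniform contamination, the synthetic data, and the `run_benchmark` helper are invented for illustration and do not reproduce what `Benchmark().eval` does internally.

```python
import numpy as np


def contaminate(data, rate, rng):
    """Hide a share `rate` of all cells, chosen uniformly at random."""
    out = data.copy()
    hidden = rng.choice(data.size, size=max(1, int(rate * data.size)), replace=False)
    out.flat[hidden] = np.nan
    return out


def impute_mean(incomplete):
    """Fill each gap with the mean of the observed values in its column."""
    return np.where(np.isnan(incomplete), np.nanmean(incomplete, axis=0), incomplete)


def impute_zero(incomplete):
    """Fill each gap with zero."""
    return np.nan_to_num(incomplete, nan=0.0)


def rmse(truth, incomplete, recovered):
    """Root mean squared error on the hidden cells only."""
    mask = np.isnan(incomplete)
    return float(np.sqrt(np.mean((truth[mask] - recovered[mask]) ** 2)))


def run_benchmark(data, algorithms, rates, runs, seed=0):
    """Average RMSE per (algorithm, missing rate) over `runs` random contaminations."""
    rng = np.random.default_rng(seed)
    results = {}
    for name, impute in algorithms.items():
        for rate in rates:
            scores = []
            for _ in range(runs):
                incomplete = contaminate(data, rate, rng)
                scores.append(rmse(data, incomplete, impute(incomplete)))
            results[(name, rate)] = float(np.mean(scores))
    return results


# Synthetic data, invented for illustration: 40 values x 3 series oscillating around 5.
t = np.linspace(0.0, 4.0 * np.pi, 40)
data = np.column_stack([5 + np.sin(t), 5 + np.cos(t), 5 + np.sin(2 * t)])

results = run_benchmark(
    data,
    algorithms={"mean": impute_mean, "zero": impute_zero},
    rates=[0.1, 0.2, 0.4],
    runs=2,
)
```

Because the synthetic series stay between 4 and 6, filling gaps with zero is much worse than filling them with the column mean, and the averaged RMSE values in `results` reflect that at every missing rate.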
To add your own imputation algorithm in Python or C++, please refer to the detailed integration guide.
Mourad Khayati, Quentin Nater, and Jacques Pasquier. ImputeVIS: An Interactive Evaluator to Benchmark Imputation Techniques for Time Series Data. Proceedings of the VLDB Endowment (PVLDB). Demo Track 17, no. 1 (2024), 4329–32.
Mourad Khayati, Alberto Lerner, Zakhar Tymchenko, and Philippe Cudre-Mauroux. Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. In Proceedings of the VLDB Endowment (PVLDB), Vol. 13, 2020.
- Quentin Nater
- Dr. Mourad Khayati