'Right to Be Forgotten': Analyzing the Impact of Forgetting Data Using K-NN Algorithm in Data Stream Learning
The aim of this project is to investigate, in light of new regulations concerning the right of individuals to request that their personal data be erased, the behavior of Lazy Learning algorithms (k-nearest neighbors models, in particular) in a Stream Learning setting, where data can be periodically erased from the models' memories.
Unlike in Batch Machine Learning, where the learning process happens in a discrete fashion, in Stream Machine Learning the data is processed in a continuous fashion. In this context, machine learning models have to cope with data arriving sequentially and in a possibly unlimited way. Therefore, in comparison with Batch Machine Learning, Stream Machine Learning requires models to operate under strict time and memory constraints.
k-nearest neighbors is a pattern recognition method in which input data is classified based on its distance to previously seen (and labeled) data points. Because previously seen data points are required for classification, and because memory is a concern in Stream Learning, k-nearest neighbors is adapted to Stream Learning by adding a sliding window, which stores a previously chosen number of the most recent data points and is updated with every new input.
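As an illustration, here is a minimal sketch of a sliding-window k-NN in scikit-multiflow. It assumes a scikit-multiflow version where the class is `KNNClassifier` and the window length is set via `max_window_size`; the generator and parameter values are only for demonstration, not the project's actual configuration:

```python
from skmultiflow.data import SEAGenerator
from skmultiflow.lazy import KNNClassifier

# Synthetic stream with abrupt, real concept drift (SEA concepts)
stream = SEAGenerator(random_state=1)

# k-NN whose memory is a sliding window of the 5000 most recent samples
knn = KNNClassifier(n_neighbors=3, max_window_size=5000)

correct, seen = 0, 0
while seen < 10000 and stream.has_more_samples():
    X, y = stream.next_sample()        # one instance at a time
    if seen > 0:                       # test-then-train: predict first...
        correct += int(knn.predict(X)[0] == y[0])
    knn.partial_fit(X, y)              # ...then update the sliding window
    seen += 1

print(f"Mean accuracy over {seen} samples: {correct / (seen - 1):.4f}")
```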
Concept Drift refers to the phenomenon of the data distribution changing over time, something very likely to happen in the context of Stream Learning. There are 2 different concept drift types:
- Real: Refers to changes in the distribution of the target variable with respect to the input data, even if the latter is unchanged.
- Virtual: Refers to changes in the distribution of the input data without affecting the target variable distribution [1].
Additionally, changes may occur at different paces and frequencies over time, so time-related classifications also exist. These are illustrated in the image below [1].
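For instance, scikit-multiflow lets you compose streams with a chosen drift profile. The sketch below (generators and parameter values are illustrative assumptions, not the project's setup) builds one abrupt and one gradual drift stream by varying the transition width:

```python
from skmultiflow.data import SEAGenerator, ConceptDriftStream

# Two SEA concepts that differ in their classification function
concept_a = SEAGenerator(classification_function=0, random_state=1)
concept_b = SEAGenerator(classification_function=2, random_state=1)

# Abrupt drift: the concept switches almost instantly at sample 25000
abrupt_stream = ConceptDriftStream(stream=concept_a, drift_stream=concept_b,
                                   position=25000, width=1)

# Gradual drift: the two concepts are mixed over a 5000-sample transition
gradual_stream = ConceptDriftStream(stream=concept_a, drift_stream=concept_b,
                                    position=25000, width=5000)
```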
- The experiments were conducted using scikit-multiflow, a Python framework for machine learning on data streams.
- matplotlib was used for plot visualization.
Initial experiments were conducted using 7 datasets with different concept drift properties. 5 of them are artificial, and 2 are real-world datasets:
- SEA Concepts
- 50000 instances
- Drift Properties:
- Abrupt
- Real
- 2 classes
- 3 features
- Rotating Hyperplane
- 200000 instances
- Drift Properties:
- Incremental
- Real
- 2 classes
- 10 features
- Interchanging RBF [2]
- 200000 instances
- Drift Properties:
- Abrupt
- Real
- 2 classes
- 10 features
- Transient Chessboard [2]
- 200000 instances
- Drift Properties:
- Abrupt
- Reoccurring
- Virtual
- 8 classes
- 2 features
- Mixed Drift [2]
- 600000 instances
- Drift Properties:
- Various
- Real
- Virtual
- Various
- 15 classes
- 2 features
- Weather (original source)
- 18159 instances
- Drift Properties:
- Virtual
- 2 classes
- 8 features
- Poker Hand (original source)
- 829201 instances
- Drift Properties:
- Virtual
- 10 classes
- 10 features
All datasets used were retrieved from Viktor Losing's drift datasets page.
- The only metric evaluated in these initial experiments was Mean Accuracy
- A typical Stream Learning evaluation method was used in these initial experiments, namely Prequential Evaluation, also known as Interleaved-Test-Then-Train Evaluation (see the sketch after this list)
- k = 3, 5
- window_size = 5000
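A minimal sketch of this setup with scikit-multiflow's `EvaluatePrequential` follows. The CSV path is a placeholder, and the evaluator options beyond the accuracy metric (sample budget, pretrain size) are assumptions for illustration:

```python
from skmultiflow.data import FileStream
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.lazy import KNNClassifier

# Load one of the drift datasets from CSV (last column is the class label)
stream = FileStream("weather.csv")  # placeholder path

# Two k-NN models sharing the same sliding-window size
models = [KNNClassifier(n_neighbors=3, max_window_size=5000),
          KNNClassifier(n_neighbors=5, max_window_size=5000)]

# Prequential (interleaved-test-then-train) evaluation on mean accuracy
evaluator = EvaluatePrequential(metrics=['accuracy'],
                                max_samples=20000,   # assumed sample budget
                                pretrain_size=200,
                                show_plot=False)
evaluator.evaluate(stream=stream, model=models,
                   model_names=['kNN k=3', 'kNN k=5'])
```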
In addition to the experiments conducted, there are a few possibilities for further experiments:
- Conducting similar experiments, but using SAM-KNN [3] instead of KNN.
- Dropping the whole memory and rebuilding it from scratch every time data is erased, and comparing this against the conducted experiments.
- Erasing not only the most recent examples periodically, but also random examples (see the sketch below).
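The following sketch shows one way such forgetting could be simulated using only scikit-multiflow's public API: the model is rebuilt from the retained samples rather than editing its internal window, so the sketch does not depend on library internals. The retention buffer and the `forget_random` helper are hypothetical names introduced here for illustration:

```python
import numpy as np
from skmultiflow.lazy import KNNClassifier

def forget_random(X_mem, y_mem, n_forget, rng):
    """Hypothetical helper: drop n_forget random samples from the retained memory."""
    keep = rng.choice(len(X_mem), size=len(X_mem) - n_forget, replace=False)
    keep.sort()
    return X_mem[keep], y_mem[keep]

def rebuild_model(X_mem, y_mem, k=3, window_size=5000):
    """Rebuild the k-NN from scratch on the retained samples (simulated erasure)."""
    model = KNNClassifier(n_neighbors=k, max_window_size=window_size)
    if len(X_mem) > 0:
        model.partial_fit(X_mem, y_mem)
    return model

# Example: keep a parallel buffer of recent samples, erase 100 of them at random
rng = np.random.default_rng(42)
X_mem = rng.normal(size=(5000, 3))          # stand-in for buffered stream samples
y_mem = rng.integers(0, 2, size=5000)
X_mem, y_mem = forget_random(X_mem, y_mem, n_forget=100, rng=rng)
knn = rebuild_model(X_mem, y_mem)
```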
1. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. 46(4) (2014)
2. Losing, V., Hammer, B., Wersing, H.: KNN Classifier with Self Adjusting Memory for Heterogeneous Concept Drift. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, pp. 291-300 (2016). doi:10.1109/ICDM.2016.0040
3. Losing, V., Hammer, B., Wersing, H.: Self-Adjusting Memory: How to Deal with Diverse Drift Types. In: Proceedings of IJCAI 2017, pp. 4899-4903 (2017). doi:10.24963/ijcai.2017/690