The benchmark computation was originally unstable due to the way training data points were generated. The benchmark generated random points such that each dimension of each point was a random double between zero and one, rounded to one decimal place. This produced uniform noise inside the hypercube spanned by [0, 0, ..., 0] and [1, 1, ..., 1], rather than points drawn from Gaussian distributions. Additionally, Spark GMM model fitting returns inconsistent results when different thread counts are used, making it impossible to achieve benchmark stability even with a fixed seed for the random generator.
Currently, the benchmark generates input data points grouped into clusters. These clusters are positioned with increasing offset from the vertices of a hypercube. To enhance model fit quality, the clusters have varying deviations. After generation, the points are split into three sets: training, validation, and test. The training data points are used to train multiple models, each with different training parameters: iteration count, seed, and K.
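The generation scheme described above can be sketched as follows. This is an illustrative Python/NumPy sketch, not the benchmark's Scala implementation: the concrete offsets, deviations, cluster sizes, and the 60/20/20 split ratio are assumptions.

```python
import numpy as np
from itertools import product

def generate_points(dims=4, points_per_cluster=50, seed=42):
    """Generate Gaussian clusters anchored near hypercube vertices,
    then split them into training, validation, and test sets.

    Offsets and deviations below are assumed values for illustration.
    """
    rng = np.random.default_rng(seed)
    # All vertices of the unit hypercube in `dims` dimensions.
    vertices = np.array(list(product([0.0, 1.0], repeat=dims)))
    clusters = []
    for i, vertex in enumerate(vertices):
        offset = 0.1 * i               # increasing offset per vertex (assumed)
        sigma = 0.02 + 0.01 * (i % 3)  # varying deviation per cluster (assumed)
        center = vertex + offset
        clusters.append(rng.normal(center, sigma, size=(points_per_cluster, dims)))
    points = np.vstack(clusters)
    rng.shuffle(points)
    # Assumed 60/20/20 split into training, validation, and test sets.
    n = len(points)
    return np.split(points, [int(0.6 * n), int(0.8 * n)])
```

With the defaults above, 2^4 = 16 vertices yield 800 points, split into 480 training, 160 validation, and 160 test points.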
Each model is trained with a different seed to explore variations in model initialization, which can lead to better fits. To ensure consistency, the benchmark uses an initial seed and increments it by one for each subsequent trained model.
The K parameter represents the number of Gaussian clusters the model should fit. In the benchmark, K is set to either 1.5 or 2.0 times the actual number of generated centroids. Using a higher K improves the model fit quality.
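The resulting grid of training parameters (iteration count, seed, K) might be enumerated as in the sketch below. The alternation between the 1.5x and 2.0x factors and the default iteration count are assumptions; only the seed increment and the two K multipliers come from the description above.

```python
def training_configs(base_seed, n_models, n_centroids, iterations=100):
    """Enumerate (iterations, seed, k) triples for the trained models.

    Seeds increment by one per model; k is 1.5x or 2.0x the true
    centroid count (alternating here as an assumption).
    """
    configs = []
    for i in range(n_models):
        factor = 1.5 if i % 2 == 0 else 2.0
        configs.append((iterations, base_seed + i, int(factor * n_centroids)))
    return configs
```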
After all configured models are trained, the best one is selected. Each model predicts cluster membership for every point in the validation set, and its prediction accuracy is computed: a prediction counts as correct when the distance between the expected mean and the predicted Gaussian's mean (mu) is 0.25 or less. The best model then predicts the points in the test set, and its prediction accuracy on that set is computed. The benchmark validation requires the best model's test-set prediction accuracy to exceed 99%.
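The accuracy criterion above can be sketched like this. The `assign` callable is an assumed stand-in for the Spark model's per-point cluster prediction; the 0.25 threshold comes from the description.

```python
import numpy as np

def accuracy(points, expected_mus, model_mus, assign, threshold=0.25):
    """Fraction of points whose predicted Gaussian mean lies within
    `threshold` of the expected cluster mean.

    `assign(point)` returns the index of the Gaussian the model
    predicts for `point` (a stand-in for the Spark GMM prediction).
    """
    correct = 0
    for point, expected_mu in zip(points, expected_mus):
        predicted_mu = model_mus[assign(point)]
        if np.linalg.norm(predicted_mu - expected_mu) <= threshold:
            correct += 1
    return correct / len(points)
```

The best model is the one maximizing this accuracy on the validation set; the same function applied to the test set yields the value checked against the 99% threshold.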
I expect the duration of the gauss-mix benchmark to change because of the modifications to its computation. Specifically, I reduced the point count and dimension count while increasing the number of trained models from 1 to 8 to improve validation stability. To document the change in benchmark duration, I measured 15 runs (each with 150 repetitions) using JDK 21 on a 6-core (12-thread) system, both before and after the modifications. These measurements include JVM warmup, and no outliers were filtered. The collected measurements are visualized in the graphs below:
The average benchmark duration before validation was 532.923 ms.
The average benchmark duration after validation is 4608.173 ms.