The benchmark computation was originally unstable due to the way training data points were generated. The benchmark generated random points such that each dimension of each point was a random double between zero and one, rounded to one decimal place. This produced uniform noise inside the hypercube spanned by [0, 0, ..., 0] and [1, 1, ..., 1], rather than points drawn from Gaussian distributions. Additionally, Spark GMM model fitting returns inconsistent results when different thread counts are used, making it impossible to achieve benchmark stability even with a fixed seed for the random generator.
Currently, the benchmark generates input data points grouped into clusters. These clusters are positioned with increasing offset from the vertices of a hypercube. To enhance model fit quality, the clusters have varying deviations. After generation, the points are split into three sets: training, validation, and test. The training data points are used to train multiple models, each with different training parameters: iteration count, seed, and K.
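The generation scheme described above can be sketched as follows. This is an illustrative Python/NumPy sketch, not the benchmark's Scala implementation: the concrete offsets, deviations, cluster sizes, and the 60/20/20 split ratio are assumptions.

```python
import numpy as np
from itertools import product

def generate_points(dims=4, points_per_cluster=50, seed=42):
    """Generate Gaussian clusters anchored near hypercube vertices,
    then split them into training, validation, and test sets.

    Offsets and deviations below are assumed values for illustration.
    """
    rng = np.random.default_rng(seed)
    # All vertices of the unit hypercube in `dims` dimensions.
    vertices = np.array(list(product([0.0, 1.0], repeat=dims)))
    clusters = []
    for i, vertex in enumerate(vertices):
        offset = 0.1 * i               # increasing offset per vertex (assumed)
        sigma = 0.02 + 0.01 * (i % 3)  # varying deviation per cluster (assumed)
        center = vertex + offset
        clusters.append(rng.normal(center, sigma, size=(points_per_cluster, dims)))
    points = np.vstack(clusters)
    rng.shuffle(points)
    # Assumed 60/20/20 split into training, validation, and test sets.
    n = len(points)
    return np.split(points, [int(0.6 * n), int(0.8 * n)])
```

With the defaults above, 2^4 = 16 vertices yield 800 points, split into 480 training, 160 validation, and 160 test points.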
Each model is trained with a different seed to explore variations in model initialization, which can lead to better fits. To ensure consistency, the benchmark uses an initial seed and increments it by one for each subsequent trained model.
The K parameter represents the number of Gaussian clusters the model should fit. In the benchmark, K is set to either 1.5 or 2.0 times the actual number of generated centroids. Using a higher K improves the model fit quality.
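The resulting grid of training parameters (iteration count, seed, K) might be enumerated as in the sketch below. The alternation between the 1.5x and 2.0x factors and the default iteration count are assumptions; only the seed increment and the two K multipliers come from the description above.

```python
def training_configs(base_seed, n_models, n_centroids, iterations=100):
    """Enumerate (iterations, seed, k) triples for the trained models.

    Seeds increment by one per model; k is 1.5x or 2.0x the true
    centroid count (alternating here as an assumption).
    """
    configs = []
    for i in range(n_models):
        factor = 1.5 if i % 2 == 0 else 2.0
        configs.append((iterations, base_seed + i, int(factor * n_centroids)))
    return configs
```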
After all configured models are trained, the best one is selected. Each model predicts cluster membership for every point in the validation set, and its prediction accuracy is computed: a prediction counts as correct when the distance between the expected mean and the predicted Gaussian's mean (mu) is 0.25 or less. The best model then predicts the points in the test set, and its prediction accuracy on that set is computed. The benchmark validation requires the best model's test-set prediction accuracy to exceed 99%.
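The accuracy criterion above can be sketched like this. The `assign` callable is an assumed stand-in for the Spark model's per-point cluster prediction; the 0.25 threshold comes from the description.

```python
import numpy as np

def accuracy(points, expected_mus, model_mus, assign, threshold=0.25):
    """Fraction of points whose predicted Gaussian mean lies within
    `threshold` of the expected cluster mean.

    `assign(point)` returns the index of the Gaussian the model
    predicts for `point` (a stand-in for the Spark GMM prediction).
    """
    correct = 0
    for point, expected_mu in zip(points, expected_mus):
        predicted_mu = model_mus[assign(point)]
        if np.linalg.norm(predicted_mu - expected_mu) <= threshold:
            correct += 1
    return correct / len(points)
```

The best model is the one maximizing this accuracy on the validation set; the same function applied to the test set yields the value checked against the 99% threshold.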
I expect the duration of the gauss-mix benchmark to change because of the modifications to its computation. Specifically, I reduced the point count and dimension count while increasing the number of trained models from 1 to 8 to improve validation stability. To document the change in benchmark duration, I measured 15 runs (each with 150 repetitions) using JDK 21 on a 6-core (12-thread) system, both before and after the modifications. These measurements include JVM warmup, and no outliers were filtered. The collected measurements are visualized in the graphs below:
The average benchmark duration before validation was 532.923 ms.
The average benchmark duration after validation is 4608.173 ms.