Introduce configurations and workflow automation necessary to execute LST-Bench on Spark 3.3.1 in Azure (#229)

Showing 47 changed files with 1,523 additions and 183 deletions.

<!--
{% comment %}
Copyright (c) Microsoft Corporation.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# LST-Bench: Configurations and Results

This folder contains configurations for running LST-Bench on various systems as depicted in the [LST-Bench dashboard](/metrics/app), along with details about the setups used to generate those results.

## Systems Included

- [x] Apache Spark 3.3.1
  - [x] Delta Lake 2.2.0
  - [x] Apache Hudi 0.12.2
  - [x] Apache Iceberg 1.1.0
- [ ] Trino 420
  - [ ] Delta Lake
  - [ ] Apache Iceberg

## Folder Structure

While the folders for each engine may differ slightly in structure, they generally contain the following:

- `scripts/`:
  This directory contains the SQL files used to execute LST-Bench workloads on the respective engine.
  These SQL files may vary slightly across engines and LSTs depending on the supported SQL dialect.
- `config/`:
  This directory houses the LST-Bench configuration files required to execute the workload.
  It includes the LST-Bench phase/session/task libraries that reference the aforementioned SQL scripts.
- Additional infrastructure and configuration automation folders, e.g., `azure-pipelines/`:
  These folders contain scripts or files that automate running the benchmark on a specific infrastructure/engine,
  for instance, Azure Pipelines scripts to deploy an engine with different LSTs and execute LST-Bench.
  Generally, these folders should include an additional README.md file offering further details.
- `results/`:
  This folder stores the results of LST-Bench runs as captured by LST-Bench telemetry using DuckDB.
  These results are processed and visualized in the [LST-Bench dashboard](/metrics/app).

<!--
{% comment %}
Copyright (c) Microsoft Corporation.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# Azure Pipelines Deployment for LST-Bench on Apache Spark 3.3.1

This directory contains the tooling needed to execute LST-Bench on Apache Spark 3.3.1 with different LSTs using Azure Pipelines. The tooling consists of:

- `run-lst-bench.yml`:
  An Azure Pipelines script that deploys Apache Spark with various LSTs and executes LST-Bench.
- `sh/`:
  A directory containing shell scripts and engine configuration files that support deploying Spark with different LSTs and running the experiments.
- `config/`:
  A directory with the LST-Bench configuration files needed to execute the experiments whose results are reported.

## Prerequisites

- Automation for deploying the infrastructure in Azure to run LST-Bench is not implemented. As a result, the Azure Pipelines script expects the following setup:
  - A VM named 'lst-bench-client', connected to the pipeline environment, to run the LST-Bench client.
  - A VM named 'lst-bench-head', also connected to the pipeline environment, to run the head node of the Spark cluster.
  - A VMSS cluster that will serve as the Spark worker nodes, within the same VNet as the head node.
  - An Azure Storage Account accessible by both the VMSS and the head node.
  - An Azure SQL Database (or a SQL Server-flavored RDBMS) that will run the Hive Metastore.
    The Hive Metastore schema for version 2.3.0 should already be installed in the instance.
- Before running the pipeline, the following variables must be defined in your Azure Pipeline (the sketch after this list shows one way they might reach the deployment scripts):
  - `data_storage_account`: Name of the Azure Blob Storage account where the source data for the experiment is stored.
  - `data_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the source data for the experiment is stored.
  - `hms_jdbc_driver`: JDBC driver for the Hive Metastore.
  - `hms_jdbc_url`: JDBC URL for the Hive Metastore.
  - `hms_jdbc_user`: Username for the Hive Metastore.
  - `hms_jdbc_password` (secret): Password for the Hive Metastore.
  - `hms_storage_account`: Name of the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog (can be the same as `data_storage_account`).
  - `hms_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
  - `hms_storage_account_container`: Name of the container in the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
- The versions and configurations of the LSTs to run can be modified via input parameters for the pipeline, either in the Azure Pipelines YAML file or from the Web UI.
  Default values are assigned to these parameters.
  Parameters also include the experiment scale factor, machine type, and cluster size.
  Note that these parameters are not used to deploy the data or the infrastructure, as this process is not automated in the pipeline.
  Instead, they are recorded in the experiment telemetry for proper categorization and visualization of the results later on.
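
As a hedged illustration of the variables above, the snippet below shows how they could be surfaced to a run script as environment variables in an Azure Pipelines definition. It is a minimal sketch, not the contents of `run-lst-bench.yml`: the step, script path, and example values are assumptions, and secret variables are mapped explicitly under `env` because Azure Pipelines does not expose secrets to scripts automatically.

```yaml
# Minimal sketch of an Azure Pipelines step; this is NOT the repository's
# run-lst-bench.yml. The script path and values below are assumptions.
variables:
  data_storage_account: 'mylstbenchdata'
  hms_jdbc_driver: 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
  hms_jdbc_url: 'jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=hms'
  hms_jdbc_user: 'hmsuser'

steps:
  - script: ./sh/run-lst-bench.sh   # hypothetical entry point under sh/
    displayName: 'Deploy Spark with an LST and run LST-Bench'
    env:
      # Secret variables are not injected into scripts by default; map them explicitly.
      DATA_STORAGE_ACCOUNT: $(data_storage_account)
      DATA_STORAGE_ACCOUNT_SHARED_KEY: $(data_storage_account_shared_key)
      HMS_JDBC_URL: $(hms_jdbc_url)
      HMS_JDBC_PASSWORD: $(hms_jdbc_password)
      HMS_STORAGE_ACCOUNT_SHARED_KEY: $(hms_storage_account_shared_key)
```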
`run/spark-3.3.1/azure-pipelines/config/connections_config.yaml` (7 additions, 0 deletions)

```yaml
# Description: Connections Configuration
---
version: 1
connections:
  - id: spark_0
    driver: org.apache.hive.jdbc.HiveDriver
    url: jdbc:hive2://${SPARK_MASTER_HOST}:10000
```

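For context, this connection uses the Hive JDBC driver to reach Spark's Thrift JDBC/ODBC server, which listens on port 10000 by default; the `${SPARK_MASTER_HOST}` placeholder is presumably filled in with the head node's address when the pipeline runs. A hypothetical resolved form, assuming a head node at `10.0.0.4`, might look like:

```yaml
# Hypothetical resolved connection; the host address is an assumption.
version: 1
connections:
  - id: spark_0
    driver: org.apache.hive.jdbc.HiveDriver
    url: jdbc:hive2://10.0.0.4:10000
```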
`run/spark-3.3.1/azure-pipelines/config/experiment_config-cow-delta-2.2.0.yaml` (29 additions, 0 deletions)

```yaml
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: delta
  table_format_version: 2.2.0
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: delta
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/delta/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ''
```

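Two substitution layers are at play in this file: the `${...}` placeholders such as `${EXP_SCALE_FACTOR}` and `${DATA_STORAGE_ACCOUNT}` appear to be resolved from the environment set up by the pipeline, and the resulting `parameter_values` are in turn substituted into the workload statements. Purely as an illustration, assuming `EXP_SCALE_FACTOR=1000`, `DATA_STORAGE_ACCOUNT=lstbenchdata`, and `DATA_STORAGE_ACCOUNT_CONTAINER=tpcds` (all hypothetical values), the two data paths would resolve to:

```yaml
# Illustrative expansion only; the account, container, and scale factor are assumed values.
external_data_path: "abfss://tpcds@lstbenchdata.dfs.core.windows.net/tpc-ds/csv/sf_1000/"
data_path: "abfss://tpcds@lstbenchdata.dfs.core.windows.net/tpc-ds/run/delta/sf_1000/"
```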
`run/spark-3.3.1/azure-pipelines/config/experiment_config-cow-hudi-0.12.2.yaml` (29 additions, 0 deletions)

```yaml
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: hudi
  table_format_version: 0.12.2
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: hudi
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/hudi/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "type"="cow"'
```

`run/spark-3.3.1/azure-pipelines/config/experiment_config-cow-iceberg-1.1.0.yaml` (29 additions, 0 deletions)

```yaml
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: iceberg
  table_format_version: 1.1.0
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: iceberg
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/iceberg/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "format-version"="2", "write.delete.mode"="copy-on-write", "write.update.mode"="copy-on-write", "write.merge.mode"="copy-on-write"'
```

`run/spark-3.3.1/azure-pipelines/config/experiment_config-mor-hudi-0.12.2.yaml` (29 additions, 0 deletions)

```yaml
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: hudi
  table_format_version: 0.12.2
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: mor
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: hudi
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/hudi/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "type"="mor"'
```

`run/spark-3.3.1/azure-pipelines/config/experiment_config-mor-iceberg-1.1.0.yaml` (29 additions, 0 deletions)

```yaml
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: iceberg
  table_format_version: 1.1.0
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: mor
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: iceberg
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/iceberg/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "format-version"="2", "write.delete.mode"="merge-on-read", "write.update.mode"="merge-on-read", "write.merge.mode"="merge-on-read"'
```

`run/spark-3.3.1/azure-pipelines/config/setup_experiment_config.yaml` (20 additions, 0 deletions)

```yaml
# Description: Experiment Configuration
---
version: 1
id: setup_experiment
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  scale_factor: "${EXP_SCALE_FACTOR}"
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
```

`run/spark-3.3.1/azure-pipelines/config/telemetry_config.yaml` (13 additions, 0 deletions)

```yaml
# Description: Telemetry Configuration
---
version: 1
connection:
  id: duckdb_0
  driver: org.duckdb.DuckDBDriver
  url: jdbc:duckdb:./telemetry-spark-3.3.1
execute_ddl: true
ddl_file: 'src/main/resources/scripts/logging/duckdb/ddl.sql'
insert_file: 'src/main/resources/scripts/logging/duckdb/insert.sql'
# The following parameter values will be used to replace the variables in the logging statements.
parameter_values:
  data_path: ''
```

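Because the DuckDB URL above is relative, the telemetry database file `telemetry-spark-3.3.1` is created in the working directory of the LST-Bench client, and the referenced DDL/insert scripts are the ones bundled with LST-Bench. If the telemetry should land somewhere more durable, the URL can presumably point at an absolute path instead, as in this hedged variant:

```yaml
# Hedged variant: write the DuckDB telemetry file to an absolute path instead
# of the client's working directory; the mount point below is an assumption.
connection:
  id: duckdb_0
  driver: org.duckdb.DuckDBDriver
  url: jdbc:duckdb:/mnt/lst-bench-results/telemetry-spark-3.3.1
```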