Introduce configurations and workflow automation necessary to execute LST-Bench on Spark 3.3.1 in Azure (#229)
jcamachor committed Feb 21, 2024
1 parent a7a17a2 commit 3e2dcc7
Showing 47 changed files with 1,523 additions and 183 deletions.
65 changes: 0 additions & 65 deletions .azure-pipelines/workflows/periodic_reporting.yml

This file was deleted.

46 changes: 46 additions & 0 deletions run/README.md
@@ -0,0 +1,46 @@
<!--
{% comment %}
Copyright (c) Microsoft Corporation.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# LST-Bench: Configurations and Results
This folder contains configurations for running LST-Bench on the various systems displayed in the [LST-Bench dashboard](/metrics/app), along with details about the setups used to generate those results.

## Systems Included
- [x] Apache Spark 3.3.1
  - [x] Delta Lake 2.2.0
  - [x] Apache Hudi 0.12.2
  - [x] Apache Iceberg 1.1.0
- [ ] Trino 420
  - [ ] Delta Lake
  - [ ] Apache Iceberg

## Folder Structure
While the folder for each engine may have a slightly different structure, it generally contains the following (an illustrative layout is sketched after this list):

- `scripts/`:
This directory contains SQL files used to execute LST-Bench workloads on the respective engine.
These SQL files may vary slightly across engines and LSTs depending on the supported SQL dialect.
- `config/`:
This directory houses LST-Bench configuration files required to execute the workload.
It includes LST-Bench phase/session/task libraries that reference the aforementioned SQL scripts.
- Additional infrastructure and configuration automation folders, e.g., `azure-pipelines/`:
These folders contain scripts or files facilitating automation for running the benchmark on a specific infrastructure/engine.
For instance, Azure Pipelines scripts to deploy an engine with different LSTs and execute LST-Bench.
Generally, these folders should include an additional README.md file offering further details.
- `results/`:
This folder stores the results of the LST-Bench runs as captured by LST-Bench telemetry using DuckDB.
These results are processed and visualized in the [LST-Bench dashboard](/metrics/app).
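
As an illustration, the layout of an engine folder such as `spark-3.3.1` might look roughly as follows. This is a sketch based on the description above; the actual contents vary by engine.

```
run/
└── spark-3.3.1/
    ├── scripts/            # SQL workload scripts in the dialect supported by the engine/LSTs
    ├── config/             # LST-Bench phase/session/task libraries and configuration files
    ├── azure-pipelines/    # automation for a specific infrastructure (see its README.md)
    └── results/            # DuckDB files with the telemetry captured during the runs
```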
50 changes: 50 additions & 0 deletions run/spark-3.3.1/azure-pipelines/README.md
@@ -0,0 +1,50 @@
<!--
{% comment %}
Copyright (c) Microsoft Corporation.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# Azure Pipelines Deployment for LST-Bench on Apache Spark 3.3.1
This directory comprises the necessary tooling for executing LST-Bench on Apache Spark 3.3.1 with different LSTs using Azure Pipelines. The included tooling consists of:
- `run-lst-bench.yml`:
An Azure Pipelines script designed to deploy Apache Spark with various LSTs and execute LST-Bench.
- `sh/`:
A directory containing shell scripts and engine configuration files supporting the deployment of Spark with different LSTs and the execution of experiments.
- `config/`:
A directory with the LST-Bench configuration files needed to execute the experiments included in the reported results.

## Prerequisites
- Automation for deploying the infrastructure in Azure to run LST-Bench is not implemented. As a result, the Azure Pipelines script expects the following setup:
  - A VM named 'lst-bench-client' connected to the pipeline environment to run the LST-Bench client.
  - A VM named 'lst-bench-head' to run the head node of the Spark cluster, also connected to the pipeline environment.
  - A VMSS cluster that will serve as the Spark worker nodes, deployed within the same VNet as the head node.
  - An Azure Storage Account accessible by both the VMSS and the head node.
  - An Azure SQL Database (or SQL Server-flavored RDBMS) that will host the Hive Metastore.
    The Hive Metastore schema for version 2.3.0 should already be installed in the instance.
- Prior to running the pipeline, the following variables need to be defined in your Azure Pipeline (a sketch of one way to declare them follows this list):
  - `data_storage_account`: Name of the Azure Blob Storage account where the source data for the experiment is stored.
  - `data_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the source data for the experiment is stored.
  - `hms_jdbc_driver`: JDBC driver for the Hive Metastore.
  - `hms_jdbc_url`: JDBC URL for the Hive Metastore.
  - `hms_jdbc_user`: Username for the Hive Metastore.
  - `hms_jdbc_password` (secret): Password for the Hive Metastore.
  - `hms_storage_account`: Name of the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog (can be the same as the `data_storage_account`).
  - `hms_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
  - `hms_storage_account_container`: Name of the container in the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
- The versions and configurations of the LSTs to run can be modified via pipeline input parameters, either in the Azure Pipelines YAML file or from the Web UI.
  Default values are assigned to these parameters.
  Parameters also include experiment scale factor, machine type, and cluster size.
  Note that these parameters are not used to deploy the data or the infrastructure, as this process is not automated in the pipeline.
  Instead, they are recorded in the experiment telemetry for proper categorization and visualization of results later on.
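
As a rough sketch of how these values could be wired up, the experiment knobs might be exposed as runtime parameters and the non-secret values as pipeline variables. The fragment below is hypothetical and is not the `run-lst-bench.yml` shipped in this directory; parameter names, defaults, and placeholder values are assumptions.

```yaml
# Hypothetical Azure Pipelines fragment; names, defaults, and values are
# illustrative placeholders, not the contents of run-lst-bench.yml.
parameters:
  - name: exp_scale_factor      # experiment scale factor (recorded in telemetry only)
    type: string
    default: '1000'
  - name: exp_machine           # machine type, recorded for categorization of results
    type: string
    default: 'Standard_E8s_v5'
  - name: exp_cluster_size      # number of Spark worker nodes, recorded only
    type: number
    default: 8

variables:
  - name: data_storage_account
    value: 'mydatastorageaccount'                         # placeholder
  - name: hms_jdbc_driver
    value: 'com.microsoft.sqlserver.jdbc.SQLServerDriver' # placeholder
  - name: hms_jdbc_url
    value: 'jdbc:sqlserver://myserver.database.windows.net:1433;database=hms'  # placeholder
  - name: hms_jdbc_user
    value: 'hmsuser'                                      # placeholder
  # Secrets such as hms_jdbc_password, data_storage_account_shared_key, and
  # hms_storage_account_shared_key should be defined as secret variables in the
  # pipeline (or a variable group) rather than committed to the YAML file.
```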
@@ -0,0 +1,7 @@
# Description: Connections Configuration
---
version: 1
connections:
- id: spark_0
  driver: org.apache.hive.jdbc.HiveDriver
  url: jdbc:hive2://${SPARK_MASTER_HOST}:10000
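
Since `connections` is a list, additional endpoints can be registered in the same file. A minimal sketch, where the id and host below are hypothetical:

```yaml
# Hypothetical second entry appended under `connections:`; id and host are illustrative.
- id: spark_1
  driver: org.apache.hive.jdbc.HiveDriver
  url: jdbc:hive2://another-spark-head:10000
```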
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: delta
  table_format_version: 2.2.0
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: delta
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/delta/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ''
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: hudi
  table_format_version: 0.12.2
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: hudi
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/hudi/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "type"="cow"'
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: iceberg
  table_format_version: 1.1.0
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: iceberg
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/iceberg/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "format-version"="2", "write.delete.mode"="copy-on-write", "write.update.mode"="copy-on-write", "write.merge.mode"="copy-on-write"'
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: hudi
  table_format_version: 0.12.2
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: mor
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: hudi
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/hudi/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "type"="mor"'
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: iceberg
  table_format_version: 1.1.0
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: mor
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: iceberg
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/iceberg/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "format-version"="2", "write.delete.mode"="merge-on-read", "write.update.mode"="merge-on-read", "write.merge.mode"="merge-on-read"'
@@ -0,0 +1,20 @@
# Description: Experiment Configuration
---
version: 1
id: setup_experiment
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  scale_factor: "${EXP_SCALE_FACTOR}"
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
13 changes: 13 additions & 0 deletions run/spark-3.3.1/azure-pipelines/config/telemetry_config.yaml
@@ -0,0 +1,13 @@
# Description: Telemetry Configuration
---
version: 1
connection:
  id: duckdb_0
  driver: org.duckdb.DuckDBDriver
  url: jdbc:duckdb:./telemetry-spark-3.3.1
execute_ddl: true
ddl_file: 'src/main/resources/scripts/logging/duckdb/ddl.sql'
insert_file: 'src/main/resources/scripts/logging/duckdb/insert.sql'
# The following parameter values will be used to replace the variables in the logging statements.
parameter_values:
  data_path: ''