Introduce configurations and workflow automation necessary to execute LST-Bench on Spark 3.3.1 in Azure (#229)
jcamachor committed Feb 21, 2024
1 parent a7a17a2 commit 3e2dcc7
Showing 47 changed files with 1,523 additions and 183 deletions.
65 changes: 0 additions & 65 deletions .azure-pipelines/workflows/periodic_reporting.yml

This file was deleted.

46 changes: 46 additions & 0 deletions run/README.md
@@ -0,0 +1,46 @@
<!--
{% comment %}
Copyright (c) Microsoft Corporation.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# LST-Bench: Configurations and Results
This folder contains configurations for running LST-Bench on the various systems displayed in the [LST-Bench dashboard](/metrics/app), along with details about the setups used to generate those results.

## Systems Included
- [x] Apache Spark 3.3.1
  - [x] Delta Lake 2.2.0
  - [x] Apache Hudi 0.12.2
  - [x] Apache Iceberg 1.1.0
- [ ] Trino 420
  - [ ] Delta Lake
  - [ ] Apache Iceberg

## Folder Structure
While the folder for each engine may have a slightly different structure, it generally contains the following (an illustrative layout is sketched after this list):

- `scripts/`:
This directory contains SQL files used to execute LST-Bench workloads on the respective engine.
These SQL files may vary slightly across engines and LSTs depending on the supported SQL dialect.
- `config/`:
This directory houses LST-Bench configuration files required to execute the workload.
It includes LST-Bench phase/session/task libraries that reference the aforementioned SQL scripts.
- Additional infrastructure and configuration automation folders, e.g., `azure-pipelines/`:
These folders contain scripts or files facilitating automation for running the benchmark on a specific infrastructure/engine.
For instance, Azure Pipelines scripts to deploy an engine with different LSTs and execute LST-Bench.
Generally, these folders should include an additional README.md file offering further details.
- `results/`:
This folder stores the results of the LST-Bench runs as captured by LST-Bench telemetry using DuckDB.
These results are processed and visualized in the [LST-Bench dashboard](/metrics/app).
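
As an illustration, the layout of an engine folder such as `spark-3.3.1` might look roughly as follows. This is a sketch based on the description above; the actual contents vary by engine.

```
run/
└── spark-3.3.1/
    ├── scripts/            # SQL workload scripts in the dialect supported by the engine/LSTs
    ├── config/             # LST-Bench phase/session/task libraries and configuration files
    ├── azure-pipelines/    # automation for a specific infrastructure (see its README.md)
    └── results/            # DuckDB files with the telemetry captured during the runs
```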
50 changes: 50 additions & 0 deletions run/spark-3.3.1/azure-pipelines/README.md
@@ -0,0 +1,50 @@
<!--
{% comment %}
Copyright (c) Microsoft Corporation.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# Azure Pipelines Deployment for LST-Bench on Apache Spark 3.3.1
This directory comprises the necessary tooling for executing LST-Bench on Apache Spark 3.3.1 with different LSTs using Azure Pipelines. The included tooling consists of:
- `run-lst-bench.yml`:
An Azure Pipelines script designed to deploy Apache Spark with various LSTs and execute LST-Bench.
- `sh/`:
A directory containing shell scripts and engine configuration files supporting the deployment of Spark with different LSTs and the execution of experiments.
- `config/`:
A directory with the LST-Bench configuration files needed to execute the experiments included in the reported results.

## Prerequisites
- Automation for deploying the infrastructure in Azure to run LST-Bench is not implemented. As a result, the Azure Pipelines script expects the following setup:
  - A VM named 'lst-bench-client' connected to the pipeline environment to run the LST-Bench client.
  - A VM named 'lst-bench-head' to run the head node of the Spark cluster, also connected to the pipeline environment.
  - A VMSS cluster that will serve as the Spark worker nodes, deployed within the same VNet as the head node.
  - An Azure Storage Account accessible by both the VMSS and the head node.
  - An Azure SQL Database (or SQL Server-flavored RDBMS) that will host the Hive Metastore.
    The Hive Metastore schema for version 2.3.0 should already be installed in the instance.
- Prior to running the pipeline, the following variables need to be defined in your Azure Pipeline (a sketch of one way to declare them follows this list):
  - `data_storage_account`: Name of the Azure Blob Storage account where the source data for the experiment is stored.
  - `data_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the source data for the experiment is stored.
  - `hms_jdbc_driver`: JDBC driver for the Hive Metastore.
  - `hms_jdbc_url`: JDBC URL for the Hive Metastore.
  - `hms_jdbc_user`: Username for the Hive Metastore.
  - `hms_jdbc_password` (secret): Password for the Hive Metastore.
  - `hms_storage_account`: Name of the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog (can be the same as the `data_storage_account`).
  - `hms_storage_account_shared_key` (secret): Shared key for the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
  - `hms_storage_account_container`: Name of the container in the Azure Blob Storage account where the Hive Metastore will store data associated with the catalog.
- The versions and configurations of the LSTs to run can be modified via pipeline input parameters, either in the Azure Pipelines YAML file or from the Web UI.
  Default values are assigned to these parameters.
  Parameters also include experiment scale factor, machine type, and cluster size.
  Note that these parameters are not used to deploy the data or the infrastructure, as this process is not automated in the pipeline.
  Instead, they are recorded in the experiment telemetry for proper categorization and visualization of results later on.
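
As a rough sketch of how these values could be wired up, the experiment knobs might be exposed as runtime parameters and the non-secret values as pipeline variables. The fragment below is hypothetical and is not the `run-lst-bench.yml` shipped in this directory; parameter names, defaults, and placeholder values are assumptions.

```yaml
# Hypothetical Azure Pipelines fragment; names, defaults, and values are
# illustrative placeholders, not the contents of run-lst-bench.yml.
parameters:
  - name: exp_scale_factor      # experiment scale factor (recorded in telemetry only)
    type: string
    default: '1000'
  - name: exp_machine           # machine type, recorded for categorization of results
    type: string
    default: 'Standard_E8s_v5'
  - name: exp_cluster_size      # number of Spark worker nodes, recorded only
    type: number
    default: 8

variables:
  - name: data_storage_account
    value: 'mydatastorageaccount'                         # placeholder
  - name: hms_jdbc_driver
    value: 'com.microsoft.sqlserver.jdbc.SQLServerDriver' # placeholder
  - name: hms_jdbc_url
    value: 'jdbc:sqlserver://myserver.database.windows.net:1433;database=hms'  # placeholder
  - name: hms_jdbc_user
    value: 'hmsuser'                                      # placeholder
  # Secrets such as hms_jdbc_password, data_storage_account_shared_key, and
  # hms_storage_account_shared_key should be defined as secret variables in the
  # pipeline (or a variable group) rather than committed to the YAML file.
```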
@@ -0,0 +1,7 @@
# Description: Connections Configuration
---
version: 1
connections:
- id: spark_0
  driver: org.apache.hive.jdbc.HiveDriver
  url: jdbc:hive2://${SPARK_MASTER_HOST}:10000
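
Since `connections` is a list, additional endpoints can be registered in the same file. A minimal sketch, where the id and host below are hypothetical:

```yaml
# Hypothetical second entry appended under `connections:`; id and host are illustrative.
- id: spark_1
  driver: org.apache.hive.jdbc.HiveDriver
  url: jdbc:hive2://another-spark-head:10000
```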
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: delta
  table_format_version: 2.2.0
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: delta
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/delta/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ''
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: hudi
  table_format_version: 0.12.2
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: hudi
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/hudi/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "type"="cow"'
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: iceberg
  table_format_version: 1.1.0
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: cow
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: iceberg
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/iceberg/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "format-version"="2", "write.delete.mode"="copy-on-write", "write.update.mode"="copy-on-write", "write.merge.mode"="copy-on-write"'
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: hudi
  table_format_version: 0.12.2
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: mor
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: hudi
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/hudi/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "type"="mor"'
@@ -0,0 +1,29 @@
# Description: Experiment Configuration
---
version: 1
id: "${EXP_NAME}"
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: iceberg
  table_format_version: 1.1.0
  scale_factor: "${EXP_SCALE_FACTOR}"
  mode: mor
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: "${EXP_NAME}"
  table_format: iceberg
  data_path: 'abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/run/iceberg/sf_${EXP_SCALE_FACTOR}/'
  options_suffix: ''
  tblproperties_suffix: ', "format-version"="2", "write.delete.mode"="merge-on-read", "write.update.mode"="merge-on-read", "write.merge.mode"="merge-on-read"'
@@ -0,0 +1,20 @@
# Description: Experiment Configuration
---
version: 1
id: setup_experiment
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
metadata:
  system: spark
  system_version: 3.3.1
  scale_factor: "${EXP_SCALE_FACTOR}"
  machine: "${EXP_MACHINE}"
  cluster_size: "${EXP_CLUSTER_SIZE}"
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: "external_tpcds_sf_${EXP_SCALE_FACTOR}"
  external_table_format: csv
  external_data_path: "abfss://${DATA_STORAGE_ACCOUNT_CONTAINER}@${DATA_STORAGE_ACCOUNT}.dfs.core.windows.net/tpc-ds/csv/sf_${EXP_SCALE_FACTOR}/"
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
13 changes: 13 additions & 0 deletions run/spark-3.3.1/azure-pipelines/config/telemetry_config.yaml
@@ -0,0 +1,13 @@
# Description: Telemetry Configuration
---
version: 1
connection:
  id: duckdb_0
  driver: org.duckdb.DuckDBDriver
  url: jdbc:duckdb:./telemetry-spark-3.3.1
execute_ddl: true
ddl_file: 'src/main/resources/scripts/logging/duckdb/ddl.sql'
insert_file: 'src/main/resources/scripts/logging/duckdb/insert.sql'
# The following parameter values will be used to replace the variables in the logging statements.
parameter_values:
  data_path: ''