From 4b60dae0f02d6a2ecd984483af93ca7cedebaf08 Mon Sep 17 00:00:00 2001 From: "xiyu.zk" Date: Tue, 24 Dec 2024 11:40:18 +0800 Subject: [PATCH] [CELEBORN-1789][DOC] Document on Java Columnar Shuffle ### What changes were proposed in this pull request? Introduction to Celeborn's Java Columnar Shuffle ### Why are the changes needed? Introduction to Celeborn's Java Columnar Shuffle ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI Closes #3010 from kerwin-zk/CELEBORN-1789. Authored-by: xiyu.zk Signed-off-by: mingji --- docs/developers/java-columnar-shuffle.md | 56 ++++++++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 57 insertions(+) create mode 100644 docs/developers/java-columnar-shuffle.md diff --git a/docs/developers/java-columnar-shuffle.md b/docs/developers/java-columnar-shuffle.md new file mode 100644 index 00000000000..c01df0bbc09 --- /dev/null +++ b/docs/developers/java-columnar-shuffle.md @@ -0,0 +1,56 @@ +--- +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + https://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +# Introduction to Celeborn's Java Columnar Shuffle + +## Overview + +Celeborn presents a Java Columnar Shuffle designed to enhance performance and efficiency in SparkSQL and DataFrame operations. This innovative approach leverages a columnar format for shuffle operations, achieving a higher compression rate than traditional Row-based Shuffle methods. This improvement leads to significant savings in disk space usage during shuffle operations. + +## Benefits + +- **High Compression Rate**: By organizing data into a columnar format, this feature significantly increases the compression ratio, reducing the disk space required for Shuffle data. +However, enabling this optimization incurs overhead for row-to-column and column-to-row transformations. If disk space is a higher priority, it is recommended to enable this feature. +If performance is a higher priority, it is advisable to weigh the trade-offs. + +## Configuration + +To leverage Celeborn's Java Columnar Shuffle, you need to apply a patch and configure certain settings in Spark 3.x. Follow the steps below for implementation: + +### Step 1: Apply this patch to obtain the schema information for shuffle + +1. Obtain the `https://github.com/apache/celeborn/tree/main/assets/spark-patch/Celeborn_Columnar_Shuffle_spark3.patch` file that contains the modifications needed for enabling Columnar Shuffle in Spark 3.x. +2. Navigate to your Spark source directory. +3. Apply the patch. + +### Step 2: Configure Celeborn Settings + +To enable Columnar Shuffle, adjust the following configurations in your Spark application: +Open the Spark configuration file or set these parameters in your Spark application. +Add the following configuration settings: + +``` +spark.celeborn.columnarShuffle.enabled true +spark.celeborn.columnarShuffle.encoding.dictionary.enabled true +``` + +If you require further performance optimization, consider enabling code generation with: + +``` +spark.celeborn.columnarShuffle.codegen.enabled true +``` diff --git a/mkdocs.yml b/mkdocs.yml index c3b34c48c49..a61f17ba1bc 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -95,6 +95,7 @@ nav: - Overview: developers/client.md - LifecycleManager: developers/lifecyclemanager.md - ShuffleClient: developers/shuffleclient.md + - JavaColumnarShuffle: developers/java-columnar-shuffle.md - Configuration: developers/configuration.md - Fault Tolerant: developers/faulttolerant.md - Worker Exclusion: developers/workerexclusion.md