GetStarted_YARN

Getting Started with TensorFlowOnSpark on a Hadoop Cluster

Before you start, you should already be familiar with TensorFlow and have access to a Hadoop grid with Spark installed. If your grid has GPU nodes, CUDA must be installed locally on those nodes.

Install Python 2.7

From your grid gateway, download/install Python into a local folder. This installation of Python will be distributed to the Spark executors, so that any custom dependencies, including TensorFlow, will be available to the executors.

# download and extract Python 2.7
export PYTHON_ROOT=~/Python
curl -O https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
tar -xvf Python-2.7.12.tgz
rm Python-2.7.12.tgz

# compile into local PYTHON_ROOT
pushd Python-2.7.12
./configure --prefix="${PYTHON_ROOT}" --enable-unicode=ucs4
make
make install
popd
rm -rf Python-2.7.12

# install pip
pushd "${PYTHON_ROOT}"
curl -O https://bootstrap.pypa.io/get-pip.py
bin/python get-pip.py
rm get-pip.py

# install pydoop (a Python API for Hadoop/HDFS) and any other custom dependencies
${PYTHON_ROOT}/bin/pip install pydoop
# Note: add any extra dependencies here; TensorFlow itself is built and installed in the next step
popd
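# optional: sanity-check the freshly built interpreter and the installed packages
${PYTHON_ROOT}/bin/python --version
${PYTHON_ROOT}/bin/pip list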

Install and compile TensorFlow w/ RDMA Support

git clone git@github.com:yahoo/tensorflow.git
# follow build instructions to install into ${PYTHON_ROOT}
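# rough sketch of the build (assuming the standard Bazel-based TensorFlow source build
# of that era; the repo's own instructions are authoritative):
pushd tensorflow
./configure
bazel build -c opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
${PYTHON_ROOT}/bin/pip install /tmp/tensorflow_pkg/tensorflow-*.whl
popd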

Install and compile Hadoop InputFormat/OutputFormat for TFRecords

git clone https://github.com/tensorflow/ecosystem.git
# follow build instructions to generate tensorflow-hadoop-1.0-SNAPSHOT.jar
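# the Hadoop module is a Maven project; assuming its layout is unchanged, the build is roughly:
pushd ecosystem/hadoop
mvn clean package
cp target/tensorflow-hadoop-1.0-SNAPSHOT.jar ../..
popd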
# copy jar to HDFS for easier reference
hadoop fs -put tensorflow-hadoop-1.0-SNAPSHOT.jar

Create a Python w/ TensorFlow zip package for Spark

pushd "${PYTHON_ROOT}"
zip -r Python.zip *
popd

# copy this Python distribution into HDFS
hadoop fs -put ${PYTHON_ROOT}/Python.zip
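# optional: confirm both uploads are in your HDFS home directory
hadoop fs -ls Python.zip tensorflow-hadoop-1.0-SNAPSHOT.jar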

Install TensorFlowOnSpark

Next, clone this repo and build a zip package for Spark:

git clone git@github.com:yahoo/TensorFlowOnSpark.git
pushd TensorFlowOnSpark/src
zip -r ../tfspark.zip *
popd
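# optional: list the package contents to confirm the zip was built from src/
unzip -l TensorFlowOnSpark/tfspark.zip | head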

Run MNIST example

Download/zip the MNIST dataset

mkdir ${HOME}/mnist
pushd ${HOME}/mnist >/dev/null
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
zip -r mnist.zip *
popd >/dev/null

Convert the MNIST zip files into HDFS files

# set environment variables (if not already done)
export PYTHON_ROOT=~/Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
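# "Python/bin/python" below is resolved inside each YARN container, where --archives
# unpacks Python.zip under the alias "Python" (see the spark-submit commands below)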
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=gpu

# for CPU mode:
# export QUEUE=default
# remove --conf spark.executorEnv.LD_LIBRARY_PATH \
# remove --driver-library-path \

# save images and labels as CSV files
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,mnist/mnist.zip#mnist \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64" \
--driver-library-path="/usr/local/cuda-7.5/lib64" \
TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/csv \
--format csv
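# once the job completes, the CSV data should appear under mnist/csv in HDFS, with
# train/ and test/ subdirectories for images and labels (referenced by the commands below):
hadoop fs -ls -R mnist/csv | head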

# save images and labels as TFRecords
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,mnist/mnist.zip#mnist \
--jars hdfs:///user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64" \
--driver-library-path="/usr/local/cuda-7.5/lib64" \
TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/tfr \
--format tfr

Run distributed MNIST training (using feed_dict)

# for CPU mode:
# export QUEUE=default
# set --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
# remove --driver-library-path \

# hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 27G \
--py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$JAVA_HOME/jre/lib/amd64/server" \
--driver-library-path="/usr/local/cuda-7.5/lib64" \
TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--mode train \
--model mnist_model
# to use InfiniBand, add --rdma
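# once training completes, the checkpoint files should appear under mnist_model in HDFS:
hadoop fs -ls mnist_model
# executor logs (including TensorFlow output) can be retrieved after the application finishes:
yarn logs -applicationId <applicationId>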

Run distributed MNIST inference (using feed_dict)

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 27G \
--py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$JAVA_HOME/jre/lib/amd64/server" \
--driver-library-path="/usr/local/cuda-7.5/lib64" \
TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/test/images \
--labels mnist/csv/test/labels \
--mode inference \
--model mnist_model \
--output predictions
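# the inference results are written to the predictions directory in HDFS; the part files
# can be inspected directly (the record format is defined by the example script):
hadoop fs -ls predictions
hadoop fs -cat predictions/part-* | head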

Run distributed MNIST training (using QueueRunners)

# for CPU mode:
# export QUEUE=default
# set --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
# remove --driver-library-path \

# hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 27G \
--py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$JAVA_HOME/jre/lib/amd64/server" \
--driver-library-path="/usr/local/cuda-7.5/lib64" \
TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/train \
--format tfr \
--mode train \
--model mnist_model
# to use InfiniBand, add --rdma

Run distributed MNIST inference (using QueueRunners)

# hadoop fs -rm -r predictions
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 27G \
--py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$JAVA_HOME/jre/lib/amd64/server" \
--driver-library-path="/usr/local/cuda-7.5/lib64" \
TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/test \
--mode inference \
--model mnist_model \
--output predictions