- set the necessary environment variables

  ```bash
  export KAFKA_EXTERNAL_IP=<KAFKA.VM.EXTERNAL.IP>
  export GCS_BUCKET_NAME=<GCS BUCKET NAME>
  export GOOGLE_APPLICATION_CREDENTIALS="/home/tope/musicaly-project/gcp/google_credentials.json"
  ```
Please note that you have to copy the service account credentials file onto the VM. Also, these exports only last for the current shell session, so you will have to repeat them in every new session, and because the external IP is ephemeral you will need to look it up again after every VM restart.
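For reference, here is one illustrative way to handle both chores with gcloud; the instance name and zone below are placeholders you would substitute, not values from this project:

```bash
# copy the service account key from your local machine to the VM
gcloud compute scp ~/google_credentials.json \
    <VM_NAME>:~/musicaly-project/gcp/google_credentials.json --zone=<ZONE>

# look up the VM's current external IP after a restart
gcloud compute instances describe <VM_NAME> --zone=<ZONE> \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)'
```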
- navigate to the spark_streaming directory and submit the Spark streaming job

  ```bash
  cd ~/musicaly-project/spark_streaming && \
  spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 spark_stream.py
  ```
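Note that the connector coordinates must line up with your installation: the `_2.12` suffix is the Scala build and the trailing `3.1.2` should match the Spark version installed on the VM, so adjust both if your setup runs a different release.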
spark executes this job in three basic steps (a minimal sketch follows the list):
- it reads the stream data from the Kafka broker on port 9092
- it carries out data transformation on the stream data, which includes:
  - schema specification,
  - datetime transformation, and
  - string encoding
- then, every 120 seconds, it writes the transformed stream data in Parquet format to the specified GCS bucket and path.
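For orientation, here is a minimal sketch of what a job like spark_stream.py might look like; the topic name, schema fields, and output subpath are illustrative assumptions, not the project's actual values:

```python
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

KAFKA_IP = os.environ["KAFKA_EXTERNAL_IP"]
GCS_BUCKET = os.environ["GCS_BUCKET_NAME"]

spark = SparkSession.builder.appName("musicaly-stream").getOrCreate()

# step 1: read the stream from the Kafka broker on port 9092
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", f"{KAFKA_IP}:9092")
    .option("subscribe", "listen_events")  # assumed topic name
    .load()
)

# step 2: transform the stream -- decode the value bytes to a string,
# apply a schema, and convert the epoch timestamp to a proper datetime
schema = StructType([
    StructField("song", StringType()),
    StructField("artist", StringType()),
    StructField("ts", LongType()),  # epoch milliseconds (assumed)
])

parsed = (
    raw.select(F.col("value").cast("string").alias("json"))  # string encoding
    .select(F.from_json("json", schema).alias("data"))       # schema specification
    .select("data.*")
    .withColumn("event_time", (F.col("ts") / 1000).cast("timestamp"))  # datetime transformation
)

# step 3: every 120 seconds, write the micro-batch to GCS as Parquet
query = (
    parsed.writeStream.format("parquet")
    .option("path", f"gs://{GCS_BUCKET}/listen_events")  # assumed output path
    .option("checkpointLocation", f"gs://{GCS_BUCKET}/checkpoint/listen_events")
    .trigger(processingTime="120 seconds")
    .start()
)
query.awaitTermination()
```

One detail worth noting: Structured Streaming requires a checkpoint directory for file sinks such as Parquet, so the real job must also set `checkpointLocation` (or a `spark.sql.streaming.checkpointLocation` default) for the write to start.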