This mini project showcases how to build and debug an Apache Spark application using Python

Last update: Dec 29, 2021

Spark Python

by Denny Imanuel

This mini project showcases how to build and debug an Apache Spark application using the Python programming language. There is also an option to run the Spark application in a Spark container.

Spark on Localhost

Requirement

PyCharm IDE - You need to install PyCharm IDE
Java JDK - You need to install the Java JDK and set the JAVA_HOME environment variable
Python - You need to install Python and set the PYTHONPATH environment variable
Spark Hadoop - You need to install Spark with Hadoop and set the HADOOP_HOME and SPARK_HOME environment variables
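
The environment variables named in the list above can be checked before launching anything. The following is a minimal sketch, not part of the project itself; the helper name missing_env_vars is hypothetical, and it only tests that each variable is set, not that it points to a valid installation.

```python
import os

# Variable names taken from the requirement list above.
REQUIRED_VARS = ("JAVA_HOME", "PYTHONPATH", "HADOOP_HOME", "SPARK_HOME")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running the script directly prints which variables, if any, still need to be configured.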

For more info: https://dotnet.microsoft.com/en-us/learn/data/spark-tutorial/install-spark

Run Config

To run the Spark app, run the Spark Submit command or create a new 'Run Config' under Shell Script as follows:

set PYSPARK_PYTHON "\SparkPython\venv\Scripts\python.exe"
spark-submit --class SparkPython SparkPython.py
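
The same launch can also be driven from Python, which avoids shell-specific syntax for setting PYSPARK_PYTHON. This is a rough sketch under assumptions: the helper names build_spark_submit and run_spark_app are hypothetical, the interpreter path is whatever your virtual environment uses, and spark-submit is assumed to be on PATH. The --class option is left out here because Spark documents it as the entry point for Java and Scala applications.

```python
import os
import shutil
import subprocess


def build_spark_submit(script, python_exe, extra_args=()):
    """Return (command, environment) for submitting a PySpark script."""
    env = dict(os.environ)
    # PYSPARK_PYTHON tells Spark which Python interpreter to use.
    env["PYSPARK_PYTHON"] = python_exe
    cmd = ["spark-submit", *extra_args, script]
    return cmd, env


def run_spark_app(script, python_exe, extra_args=()):
    """Submit the script and raise if spark-submit exits with an error."""
    cmd, env = build_spark_submit(script, python_exe, extra_args)
    # shutil.which resolves spark-submit.cmd on Windows as well.
    launcher = shutil.which(cmd[0])
    if launcher is None:
        raise FileNotFoundError("spark-submit was not found on PATH")
    return subprocess.run([launcher, *cmd[1:]], env=env, check=True)
```

Calling run_spark_app("SparkPython.py", path_to_venv_python) would then behave like the two shell lines above.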

Build Config

To build the Spark app, run the Spark Submit command or create a new 'Build Config' under Python Debug Server as follows:

venv\Scripts\activate
pip install pydevd-pycharm~=

Debug Config

To debug the Spark app, create a 'Debug Config' using the standard Python configuration and then insert the following code. To debug, first run the 'Build Config' above, set a breakpoint, and then run this 'Debug Config':

import pydevd_pycharm
pydevd_pycharm.settrace('localhost', port=8888, stdoutToServer=True, stderrToServer=True)
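
The cluster instructions later ask for these debug lines to be commented out before submitting. One possible alternative, sketched here as an assumption rather than the project's own approach, is to guard the call behind an environment flag. The flag name SPARK_PYTHON_DEBUG and the helper attach_debugger_if_requested are hypothetical.

```python
import os


def attach_debugger_if_requested(environ=None):
    """Attach to the PyCharm debug server only when the flag is set.

    Returns True if settrace was called, False otherwise.
    """
    env = os.environ if environ is None else environ
    # SPARK_PYTHON_DEBUG is a hypothetical flag name chosen for this sketch.
    if env.get("SPARK_PYTHON_DEBUG") != "1":
        return False
    # Imported lazily so runs without the flag do not need pydevd-pycharm.
    import pydevd_pycharm
    pydevd_pycharm.settrace("localhost", port=8888,
                            stdoutToServer=True, stderrToServer=True)
    return True
```

With this guard, the same script could run locally under the debugger and on the cluster without edits, provided the flag is only set on the development machine.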

Spark on Docker

Requirement

Rider IDE / Visual Studio - You need to install Rider IDE or Visual Studio
Docker Desktop - You need to install Docker Desktop to run Docker
Spark Image - Make sure you pull the same version of the Spark image as your local Spark installation:

docker pull bitnami/spark:3.1.2

Spark Clusters

Docker Compose runs a Spark cluster with a master node and a worker node. First comment out the debug lines (6 and 7), then pack the venv folder into venv.tar.gz, and finally submit both the SparkPython.py file and venv.tar.gz to the Spark cluster.

docker-compose up
spark-submit --master spark://localhost:7070 --class SparkPython --archives venv.tar.gz SparkPython.py
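
The packing step can be done with Python's standard tarfile module. This is a minimal sketch assuming the virtual environment lives in a folder named venv next to the script; the helper name pack_directory is hypothetical. A dedicated tool may be a better fit in practice, since a plain archive does not make an environment relocatable, and an environment created on Windows will not run inside a Linux container.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir, archive_path):
    """Pack the contents of src_dir into a gzip-compressed tar archive.

    Entries are stored relative to src_dir, so the archive root holds
    the environment's own folders rather than a wrapping directory.
    """
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in sorted(src.iterdir()):
            # tar.add recurses into directories by default.
            tar.add(item, arcname=item.name)
    return Path(archive_path)


if __name__ == "__main__":
    pack_directory("venv", "venv.tar.gz")
```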

Output Result

If the Spark application is built successfully, it should print a result table.
