Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Last update: Jan 02, 2023

Overview

Spark Python Notebooks

This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language.

If Python is not your language, and it is R, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if your are interested in being introduced to some basic Data Science Engineering, you might find these series of tutorials interesting. There we explain different concepts and applications using Python and R.

Instructions

A good way of using these notebooks is by first cloning the repo, and then starting your own IPython notebook/Jupyter in pySpark mode. For example, if we have a standalone Spark installation running in our localhost with a maximum of 6Gb per node assigned to IPython:

MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark

Notice that the path to the pyspark command will depend on your specific installation. So as requirement, you need to have Spark installed in the same machine you are going to start the IPython notebook server.

For more Spark options see here. In general it works the rule of passing options described in the form spark.executor.memory as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.

Datasets

We will be using datasets from the KDD Cup 1999. The results of this competition can be found here.

References

The reference book for these and other Spark related topics is:

Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.

Notebooks

The following notebooks can be examined individually, although there is a more or less linear 'story' when followed in sequence. By using the same dataset they try to solve a related set of tasks with it.

RDD creation

About reading files and parallelize.

RDDs basics

A look at map, filter, and collect.

Sampling RDDs

RDD sampling methods explained.

RDD set operations

Brief introduction to some of the RDD pseudo-set operations.

Data aggregations on RDDs

RDD actions reduce, fold, and aggregate.

Working with key/value pair RDDs

How to deal with key/value pairs in order to aggregate and explore data.

MLlib: Basic Statistics and Exploratory Data Analysis

A notebook introducing Local Vector types, basic statistics in MLlib for Exploratory Data Analysis and model selection.

MLlib: Logistic Regression

Labeled points and Logistic Regression classification of network attacks in MLlib. Application of model selection techniques using correlation matrix and Hypothesis Testing.

MLlib: Decision Trees

Use of tree-based methods and how they help explaining models and feature selection.

Spark SQL: structured processing for Data Analysis

In this notebook a schema is inferred for our network interactions dataset. Based on that, we use Spark's SQL DataFrame abstraction to perform a more structured exploratory data analysis.

Applications

Beyond the basics. Close to real-world applications using Spark and other technologies.

Olssen: On-line Spectral Search ENgine for proteomics

Same tech stack this time with an AngularJS client app.

An on-line movie recommendation web service

This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, that is also publicly available since 2014 at Spark Summit.

There I've added with minor modifications to use a larger dataset and also code about how to store and reload the model for later use. On top of that we build a Flask web service so the recommender can be use to provide movie recommendations on-line.

KDD Cup 1999

My try using Spark with this classic dataset and Knowledge Discovery competition.

Contributing

Contributions are welcome! For bug reports or requests please submit an issue.

Contact

Feel free to contact me to discuss any issues, questions, or comments.

Twitter: @ja_dianes
GitHub: jadianes
LinkedIn: jadianes
Website: jadianes.me

License

This repository contains a variety of content; some developed by Jose A. Dianes, and some from third-parties. The third-party content is distributed under the license provided by those parties.

The content developed by Jose A. Dianes is distributed under the following license:

Copyright 2016 Jose A Dianes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Related tags

Overview

Spark Python Notebooks

Instructions

Datasets

References

Notebooks

Applications

Contributing

Contact

License

Owner

Jose A Dianes

Nixtla is an open-source time series forecasting library.

Used Logistic Regression, Random Forest, and XGBoost to predict the outcome of Search & Destroy games from the Call of Duty World League for the 2018 and 2019 seasons.

This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. ⚡️🧑‍🔧

Penguins species predictor app is used to classify penguins species created using python's scikit-learn, fastapi, numpy and joblib packages.

This is a Machine Learning model which predicts the presence of Diabetes in Patients

The MLOps is the process of continuous integration and continuous delivery of Machine Learning artifacts as a software product, keeping it inside a loop of Design, Model Development and Operations.

Adaptive: parallel active learning of mathematical functions

Hypernets: A General Automated Machine Learning framework to simplify the development of End-to-end AutoML toolkits in specific domains.

Machine Learning Study 혼자 해보기

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams

A basic Ray Tracer that exploits numpy arrays and functions to work fast.

Python Research Framework

Book Recommender System Using Sci-kit learn N-neighbours

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

Data science, Data manipulation and Machine learning package.

Ml based project which uses regression technique to predict the price.

This repository contains the code to predict house price using Linear Regression Method

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and A* Search (Manhattan Distance Heuristic)

Decision tree is the most powerful and popular tool for classification and prediction