INFO-H515 - Big Data Scalable Analytics

Overview

INFO-H515 - Big Data Scalable Analytics

Jacopo De Stefani, Giovanni Buroni, Théo Verhelst and Gianluca Bontempi - Machine Learning Group

Exercise classes - Overview

This repository contains the material for the exercise classes of the ULB/VUB Big Data Analytics master course (first semester 2022) - Advanced analytics part.

These hands-on sessions provide:

  • Session 1 : An introduction to Spark and its Machine Learning (ML) library. The case study for the first session is a churn prediction problem: How to predict which customers will quit a subscription to a given service? The session covers the basics for loading and formatting a dataset for training an ML algorithm using Spark ML library, and illustrates the use of different Spark ML algorithms and accuracy metrics to address the prediction problem.

  • Sessions 2 and 4: An in-depth coverage of the use of the Map/Reduce programming model for distributing machine learning algorithms, and their implementation in Spark. Sessions 2, 3, and 4 cover, respectively, the Map/Reduce implementations from scratch of

    • Session 2: Linear regression (ordinary least squares and stochastic gradient descent). The algorithms are applied on an artificial dataset, and illustrate the numpy and Map/Reduce implementations for OLS and SGD.
    • Session 3: Streaming analytics with Recursive Least Squares and model racing. The algorithms are implemented using Spark Streaming, on a data stream coming from a Kafka broker. The RLS approach is then compared with established ML approaches.
    • Session 4: Recommender system with alternating least squares, using as a case study a movie recommendation problem.

    After detailing the Map/Reduce techniques for solving these problems, each session ends with an example on how to use the corresponding algorithm with Spark ML, and get insights into how Spark distributes the task using the Spark user interface.

  • Session 5: An overview of a deep learning framework (Keras/Tensorflow), and its use for image classification using convolutional neural networks.

The material is available as a set of Jupyter notebooks.

Clone this repository

From the command line, use

git clone https://github.com/Yannael/BigDataAnalytics_INFOH515

If using the course cluster, you will have to use SFTP to send this folder to the cluster.

Environment setup

These notebooks rely on different technologies and frameworks for Big Data and machine learning (Spark, Kafka, Keras and Tensorflow). We summarize below different ways to have your environment set up.

Local setup (Linux)

Python

Install Anaconda Python (see https://www.anaconda.com/download/, choose the latest Linux distribution (Python 3.9 at the writing of these instructions).

Make sure the binaries are in your PATH. Anaconda installer proposes to add them at the end of the installation process. If you decline, you may later add

export ANACONDA_HOME=where_you_installed_anaconda
export PATH=$ANACONDA_HOME/bin:$PATH

to your .bash_rc.

Spark

Download from https://spark.apache.org/downloads.html (Use version 3.2.0 (October 2020), prebuilt for Apache Hadoop 3.3). Untar and add executables to your PATH, as well as Python libraries to PYTHONPATH

export SPARK_HOME=where_you_untarred_spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYTHONPATH="$SPARK_HOME/python/lib/pyspark.zip:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip"

Kafka

Download from https://kafka.apache.org/downloads, and untar archive. Start with

export KAFKA_HOME=where_you_untarred_kafka
nohup $KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties  > $HOME/zookeeper.log 2>&1 &
nohup $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties > $HOME/kafka.log 2>&1 &

Keras and tensorflow

Install with pip

pip install tensorflow
pip install keras

Notebook

The notebook is part of Anaconda. Start Jupyter notebook with

jupyter notebook

and open in the browser at 127.0.0.1:8888

Docker

In order to ease the setting-up of the environment, we also prepared a Docker container that provides a ready-to-use environment. See docker folder for installing Docker, downloading the course container, and get started with it.

Note that the Dockerfile script essentially follows the steps for the 'local' installation.

Check if your setup is working

After setting up your environment (either in a Docker or your own machine) you should be able to run the notebook and scripts in Check_Setup

Spark - Test with Check_Setup notebook

  • Open notebook from Check_Setup/Demo_RDD_local.ipynb
  • Run all cells

Follow instructions in Check_Setup/Demo_RDD_local.ipynb to have access to Spark UI.

Kafka - Test with Check_Setup scripts

  1. Run the script Check_Setup/0_kafka_startup.sh to start Zookeeper and Kafka.
  2. Run the script Check_Setup/1_kafka_test_topic.sh to check whether a topic can be created and deleted successfully.
  3. In two separate terminals:
    1. Start first Check_Setup/2_kafka_test_sender.sh, and try sending some messages, by entering some text and concluding the message with the Enter key.
    2. Start first Check_Setup/3_kafka_test_receiver.sh, and check that the messages sent by the sender are correctly received.

FAQ

Owner
Yann-Aël Le Borgne
Postdoc @ Machine Learning Group - Computer Science Department - Université Libre de Bruxelles - Belgium
Yann-Aël Le Borgne
Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data.

Hatchet Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data. It is intended for analyzing

Lawrence Livermore National Laboratory 14 Aug 19, 2022
BasstatPL is a package for performing different tabulations and calculations for descriptive statistics.

BasstatPL is a package for performing different tabulations and calculations for descriptive statistics. It provides: Frequency table constr

Angel Chavez 1 Oct 31, 2021
Jupyter notebooks for the book "The Elements of Statistical Learning".

This repository contains Jupyter notebooks implementing the algorithms found in the book and summary of the textbook.

Madiyar 369 Dec 30, 2022
Single machine, multiple cards training; mix-precision training; DALI data loader.

Template Script Category Description Category script comparison script train.py, loader.py for single-machine-multiple-cards training train_DP.py, tra

2 Jun 27, 2022
An ETL framework + Monitoring UI/API (experimental project for learning purposes)

Fastlane An ETL framework for building pipelines, and Flask based web API/UI for monitoring pipelines. Project structure fastlane |- fastlane: (ETL fr

Dan Katz 2 Jan 06, 2022
DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN

DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN. Allowing for both categorical and numerical data, DenseClus makes it possible to incorporate all features in cluste

Amazon Web Services - Labs 53 Dec 08, 2022
Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

1 Feb 11, 2022
Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World"

Damast This repository contains code developed for the digital humanities project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval

University of Stuttgart Visualization Research Center 2 Jul 01, 2022
CPSPEC is an astrophysical data reduction software for timing

CPSPEC manual Introduction CPSPEC is an astrophysical data reduction software for timing. Various timing properties, such as power spectra and cross s

Tenyo Kawamura 1 Oct 20, 2021
A set of tools to analyse the output from TraDIS analyses

QuaTradis (Quadram TraDis) A set of tools to analyse the output from TraDIS analyses Contents Introduction Installation Required dependencies Bioconda

Quadram Institute Bioscience 2 Feb 16, 2022
Repository created with LinkedIn profile analysis project done

EN/en Repository created with LinkedIn profile analysis project done. The datase

Mayara Canaver 4 Aug 06, 2022
General Assembly's 2015 Data Science course in Washington, DC

DAT8 Course Repository Course materials for General Assembly's Data Science course in Washington, DC (8/18/15 - 10/29/15). Instructor: Kevin Markham (

Kevin Markham 1.6k Jan 07, 2023
A Python and R autograding solution

Otter-Grader Otter Grader is a light-weight, modular open-source autograder developed by the Data Science Education Program at UC Berkeley. It is desi

Infrastructure Team 93 Jan 03, 2023
statDistros is a Python library for dealing with various statistical distributions

StatisticalDistributions statDistros statDistros is a Python library for dealing with various statistical distributions. Now it provides various stati

1 Oct 03, 2021
AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures.

AptaMAT Purpose AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures. The method is based on the compa

GEC UTC 3 Nov 03, 2022
Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 01, 2023
A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

TennisBusinessIntelligenceProject - A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

carlo paladino 1 Jan 02, 2022
simple way to build the declarative and destributed data pipelines with python

unipipeline simple way to build the declarative and distributed data pipelines. Why you should use it Declarative strict config Scaffolding Fully type

aliaksandr-master 0 Jan 26, 2022
Senator Trades Monitor

Senator Trades Monitor This monitor will grab the most recent trades by senators and send them as a webhook to discord. Installation To use the monito

Yousaf Cheema 5 Jun 11, 2022
The Master's in Data Science Program run by the Faculty of Mathematics and Information Science

The Master's in Data Science Program run by the Faculty of Mathematics and Information Science is among the first European programs in Data Science and is fully focused on data engineering and data a

Amir Ali 2 Jun 17, 2022