End-to-end Data Science project

This repo contains the notebooks, code, and additional material used in ITI's workshop. The goal of the sessions was to illustrate the end-to-end process of a real project.

Additional material

In addition to the notebooks and code, the following material is also available:

Video recordings of the sessions are uploaded to YouTube
Slide decks are also included in this repo

Problem statement

Our (fictional) client is an IT educational institute. They have reached out to us with the following: “IT jobs and technologies keep evolving quickly. This makes our field one of the most interesting out there. On the other hand, such fast development confuses our students. They do not know which skills they need to learn for which job. ‘Do I need to learn C++ to be a Data Scientist?’ ‘Do DevOps and System admins use the same technologies?’ ‘I really like JavaScript; can I use it in Data Analytics?’ Those are some of the questions that our students ask. Could you please develop a data-driven solution for our students to answer such questions? They mostly want to understand the relationships between the jobs and the technologies.”
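
As a rough illustration of what "relationships between the jobs and the technologies" could mean in practice, the sketch below computes, for each job, the share of respondents who use each technology. The survey rows, the function name `tech_share_by_job`, and the choice of a simple usage share are all assumptions made for this example; they are not the workshop's data or method.

```python
from collections import Counter, defaultdict

# Hypothetical survey responses: (job title, technologies used).
# These rows are invented for illustration and are not the workshop data.
responses = [
    ("Data Scientist", ["Python", "SQL"]),
    ("Data Scientist", ["Python", "R"]),
    ("DevOps", ["Bash", "Docker", "Python"]),
    ("System admin", ["Bash", "Docker"]),
    ("Data Analyst", ["SQL", "JavaScript"]),
]


def tech_share_by_job(rows):
    """Return {job: {technology: share of that job's respondents using it}}."""
    respondents = Counter()
    usage = defaultdict(Counter)
    for job, techs in rows:
        respondents[job] += 1
        # Count each technology at most once per respondent.
        usage[job].update(set(techs))
    return {
        job: {tech: n / respondents[job] for tech, n in counts.items()}
        for job, counts in usage.items()
    }


shares = tech_share_by_job(responses)

# Technologies shared by two jobs, e.g. for the DevOps vs. System admin question.
common = set(shares["DevOps"]) & set(shares["System admin"])
print(sorted(common))
```

With real data, a student's question such as "Do I need to learn C++ to be a Data Scientist?" would then reduce to looking up one entry in a table like `shares`.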

Level guide

	Basic	Intermediate	Advanced
Business case		Decide on the KPIs that you will positively influence	Calculate the expected financial returns
Data collection	Decide on and collect a suitable data source for your business case	Decide on, collect and connect multiple data sources for better performance
Legal review		Get basic information about the local data privacy law	Study the local data privacy law
Cookie Cutter	Create the standard directory structure
Git	Use Git's GUI to track changes on the master branch	Use Git's CLI to track changes on a dev branch and merge back to master	Decide on a branching strategy and solve merge conflicts
Environments	Install Python packages using conda	Create a dedicated conda environment	Share your environment and install it on a different machine
Data cleaning	Use basic statistics to filter out nonsense entries	Use advanced statistics and unsupervised learning to filter out nonsense entries	Calculate a 'sanity probability value' for each data point and use it later as the weight
Descriptive analytics	Calculate summary statistics to provide data insights	Produce visualizations to provide deeper understanding	Apply unsupervised learning to provide even deeper understanding
Predictive analytics	Create a single baseline model	Create multiple hyperparameter-tuned models and benchmark their performance	Combine the chosen models via ensemble and provide prediction confidence
Prescriptive analytics			Recommend the action that the user should take
Software Engineering	Refactor your notebooks to simple Python scripts	Create a production OOP class for predictions	Expose your model using an API
MLOps	Export and load models from pickle files	Track your models using MLflow	Create and run a Docker image for your project
Product	Create a Web App / GUI to expose prediction functionality	Add the relevant historical insights, predictions and optimization results	Collect users' feedback and retrain your model accordingly
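
To make one cell of the guide concrete, here is a minimal sketch of the Basic level of "Data cleaning": using basic statistics to filter out nonsense entries. It assumes Tukey's interquartile-range rule as the statistic, which is only one possible choice; the guide does not prescribe a method. The function name `iqr_filter` and the sample values are hypothetical.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical "years of experience" answers; 250 is an implausible entry.
years = [1, 2, 3, 3, 4, 5, 6, 8, 10, 250]
cleaned = iqr_filter(years)
print(cleaned)
```

The multiplier `k` controls how aggressive the filter is; a larger value keeps more borderline entries.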

Owner

Deena Gergis
