A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

Last update: Nov 10, 2022


Overview

Note: This repository is currently a work in progress. If you are joining for any given tutorial, please make sure to clone or pull the repository 2 hours before the tutorial begins.

Material for any given tutorial will be in the notebooks directory. For example, material for the Data Umbrella & PyLadies NYC tutorial on October 27, 2020 is in a subdirectory of /notebooks called /data-umbrella-2020-10-27.

Data Science At Scale

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

Prerequisites

Not a lot. It would help if you knew

programming fundamentals and the basics of the Python programming language (e.g., variables, for loops);
a bit about pandas, numpy, and scikit-learn (although not strictly necessary);
a bit about Jupyter Notebooks;
your way around the terminal/shell.

However, I have always found that the most important and beneficial prerequisite is a will to learn new things, so if you have this quality, you'll definitely get something out of this code-along session.

If you'd rather watch than code along, you'll still have a great time, and the notebooks will be downloadable afterwards.

If you are going to code along and use the Anaconda distribution of Python 3 (see below), I ask that you install it before the session.

Getting set up computationally

The first option is to click on the Binder badge above. This will spin up the necessary computational environment for you so you can write and execute Python code from the comfort of your browser. Because Binder is a free service, its resources are not guaranteed, though it usually works well. If you want as close to a guarantee as possible, follow the instructions below to set up your computational environment locally (that is, on your own computer). Note that Binder will not work for all of the notebooks, particularly those in which we spin up Coiled Cloud. For these, you can follow along or set up your local environment as detailed below.

1. Clone the repository

To get set up for this live coding session, clone this repository. You can do so by executing the following in your terminal:

git clone https://github.com/coiled/data-science-at-scale

Alternatively, you can download the zip file of the repository at the top of the main page of the repository. If you prefer not to use git or don't have experience with it, this is a good option.

2. Download Anaconda (if you haven't already)

If you do not already have the Anaconda distribution of Python 3, go get it. (N.b., you can also do this without Anaconda by using pip to install the required packages; however, Anaconda is great for data science and I encourage you to use it.)
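Before moving on, it can help to confirm that the command-line tools used in these steps are reachable. The following is a minimal sketch and not part of the repository: locate_tools is a hypothetical helper name, and it only inspects the PATH of the shell that launched Python (on Windows, conda may be visible only from the Anaconda Prompt).

```python
import shutil


def locate_tools(names=("git", "conda")):
    """Map each tool name to its full path, or None if it is not on PATH.

    Hypothetical helper for illustration; the default names are the
    commands used in the clone and environment steps of this README.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in locate_tools().items():
        status = path if path is not None else "not found on PATH"
        print(f"{tool}: {status}")
```

If git is reported as missing, the zip download described in step 1 is still available.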

3. Create your conda environment for this session

Navigate to the relevant directory data-science-at-scale and install required packages in a new conda environment:

conda env create -f binder/environment.yml

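This will create a new environment called data-science-at-scale. With conda 4.4 or later, conda activate data-science-at-scale works on every platform; the older platform-specific commands follow. To activate the environment on macOS/Linux, execute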

source activate data-science-at-scale

On Windows, execute

activate data-science-at-scale
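Once the environment is activated, a quick check from Python can confirm that you are in the right place. This is a sketch under stated assumptions: the module list below is guessed from the PyData tools named in this README plus Dask (the authoritative list is binder/environment.yml), and the helper names are hypothetical. It relies on the CONDA_DEFAULT_ENV variable, which conda sets when an environment is activated.

```python
import importlib.util
import os

EXPECTED_ENV = "data-science-at-scale"
# Assumed import names; check binder/environment.yml for the real package list.
EXPECTED_MODULES = ("numpy", "pandas", "sklearn", "dask")


def active_conda_env():
    """Return the name of the active conda environment, or None if unset."""
    return os.environ.get("CONDA_DEFAULT_ENV")


def missing_modules(modules=EXPECTED_MODULES):
    """Return the top-level modules that cannot be found, without importing them."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    env = active_conda_env()
    if env != EXPECTED_ENV:
        print(f"Active environment is {env!r}, expected {EXPECTED_ENV!r}")
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules were found.")
```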

Then execute the following to get all the great Jupyter, Bokeh, and Dask dashboarding tools:

jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
jupyter labextension install dask-labextension
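To confirm that the three extensions were picked up, you can compare their names against the output of jupyter labextension list. The sketch below is an illustration only: the helper names are hypothetical, and the check is a plain substring match, so it does not tell you whether an extension is enabled or built correctly. Both stdout and stderr are captured because the listing may be written to either stream depending on the JupyterLab version.

```python
import subprocess

# Extension names taken from the install commands above.
REQUIRED_EXTENSIONS = (
    "@jupyter-widgets/jupyterlab-manager",
    "@bokeh/jupyter_bokeh",
    "dask-labextension",
)


def read_extension_listing():
    """Return the text printed by `jupyter labextension list`, or None if
    the jupyter command is not available on PATH."""
    try:
        result = subprocess.run(
            ["jupyter", "labextension", "list"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None
    return result.stdout + result.stderr


def missing_extensions(listing, required=REQUIRED_EXTENSIONS):
    """Return the required extension names that do not appear in the listing."""
    return [name for name in required if name not in listing]
```

Calling missing_extensions(read_extension_listing() or "") from the activated environment returns an empty list when all three names appear in the listing.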

4. Open your Jupyter Lab

In the terminal, execute jupyter lab.

Then open the notebook 0-overview.ipynb in the relevant subdirectory of /notebooks and we're ready to get coding. Enjoy.
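If you are unsure which subdirectory to open, the following sketch lists every 0-overview.ipynb found one level below /notebooks. It assumes the layout described in the Overview (one subdirectory per tutorial); find_overview_notebooks is a hypothetical helper name, not something shipped with the repository.

```python
from pathlib import Path


def find_overview_notebooks(repo_root="."):
    """Return sorted paths to notebooks/<tutorial>/0-overview.ipynb.

    Returns an empty list if the notebooks directory does not exist.
    """
    notebooks_dir = Path(repo_root) / "notebooks"
    return sorted(notebooks_dir.glob("*/0-overview.ipynb"))


if __name__ == "__main__":
    # Run from the root of the cloned repository.
    for notebook in find_overview_notebooks():
        print(notebook)
```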


Owner

Coiled
