In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

Overview

ETL Pipeline for AWS

Project Description

In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 to stagging tables on Redshift and SQL queries are written to create analytics tables from staging tables.

Dataset Structure

The dataset is composed of two files the Songs data and Logs data that is present in S3 bucket.

Song Data

The song data is dataset with million of entries. Each file is in JSON format that contains the data about song, artist of that song. Moreover, the files are partitioned by the first three letters of song ID. The single entry of the song dataset looks like

  • {
       "num_songs":1,
       "artist_id":"ARJIE2Y1187B994AB7",
       "artist_latitude":null,
       "artist_longitude":null,
       "artist_location":"",
       "artist_name":"Line Renaud",
       "song_id":"SOUPIRU12A6D4FA1E1",
       "title":"Der Kleine Dompfaff",
       "duration":152.92036,
       "year":0
    }
    

The second dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.

Logs Data

The logs dataset is also in the JSON formatted, which is formed by the event simulator based on the songs dataset. The logs dataset is the activity logs from the music app.

  • {
        "artist": "Pavement",
        "auth": "Logged in",
        "firstName": "Sylvie",
        "gender": "F",
        "iteminSession": 0,
        "lastName": "Cruz",
        "length": 99.16036,
        "level": "free",
        "location": "Kiamath Falls, OR",
        "method": "PUT",
        "page": "NextSong",
        "registration": 1.540266e+12,
        "sessionId": 345,
        "song": "Mercy: The Laundromat",
        "status": 200,
        "ts": 1541990258796,
        "userAgent": "Mozzilla/5.0...",
        "userId": 10
    }
    

Data Warehouse schema

There are two staging tables: Event table: artist VARCHAR, auth VARCHAR, firstName VARCHAR, gender VARCHAR, itemInSession INT, lastName VARCHAR, length DOUBLE PRECISION, level VARCHAR, location VARCHAR, method VARCHAR , page VARCHAR, registration VARCHAR, sessionid INT, song VARCHAR, status INT, ts VARCHAR, userAgent VARCHAR, userId INT*

Song table* num_songs INTEGER,* artist_id VARCHAR, artist_latitude VARCHAR, artist_longitude VARCHAR, artist_location VARCHAR , artist_name VARCHAR, song_id VARCHAR, title VARCHAR, duration NUMERIC NOT NULL, year integer*

These staging tables helps forming dimension tables and fact tables:

Dimension Tables:
users:
*user_id, first_name, last_name, gender, level*
songs:
*song_id, title, artist_id, year, duration*
artists:
*artist_id, name, location, latitude, longitude*
time:
*start_time, hour, day, week, month, year, weekday*
Fact tables:
Songplays:
*songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*

All the tables contains Primary Key as there should be something unique to identify the rows in the table.

ETL Process

The ETL process is comprises of two steps:

  • Getting data from S3 bucket to staging table
  • Insert the data in dimension and fact table from staging tables using Star Schema

Files Description

- create_tables.py: When create_tables.py run, it will first create tables and drop if table already exists. 
- etl.py: read and process data files
- dwh.cfg: File contains the data warehouse settings for AWS. It contains CLUSTER, IAM_ROLE and S3 settings for the ETL pipeline
- sql_queries: Contains the sql queries for dropping, creation, selection data from tables.
Owner
Mobeen Ahmed
Mobeen Ahmed
Python package for analyzing sensor-collected human motion data

Python package for analyzing sensor-collected human motion data

Simon Ho 71 Nov 05, 2022
ToeholdTools is a Python package and desktop app designed to facilitate analyzing and designing toehold switches, created as part of the 2021 iGEM competition.

ToeholdTools Category Status Repository Package Build Quality A library for the analysis of toehold switch riboregulators created by the iGEM team Cit

0 Dec 01, 2021
A simple and efficient tool to parallelize Pandas operations on all available CPUs

Pandaral·lel Without parallelization With parallelization Installation $ pip install pandarallel [--upgrade] [--user] Requirements On Windows, Pandara

Manu NALEPA 2.8k Dec 31, 2022
Projeto para realizar o RPA Challenge . Utilizando Python e as bibliotecas Selenium e Pandas.

RPA Challenge in Python Projeto para realizar o RPA Challenge (www.rpachallenge.com), utilizando Python. O objetivo deste desafio é criar um fluxo de

Henrique A. Lourenço 1 Apr 12, 2022
Exploratory data analysis

Exploratory data analysis An Exploratory data analysis APP TAPIWA CHAMBOKO 🚀 About Me I'm a full stack developer experienced in deploying artificial

tapiwa chamboko 1 Nov 07, 2021
Finding project directories in Python (data science) projects, just like there R rprojroot and here packages

Find relative paths from a project root directory Finding project directories in Python (data science) projects, just like there R here and rprojroot

Daniel Chen 102 Nov 16, 2022
Analyzing Earth Observation (EO) data is complex and solutions often require custom tailored algorithms.

eo-grow Earth observation framework for scaled-up processing in Python. Analyzing Earth Observation (EO) data is complex and solutions often require c

Sentinel Hub 18 Dec 23, 2022
Programmatically access the physical and chemical properties of elements in modern periodic table.

API to fetch elements of the periodic table in JSON format. Uses Pandas for dumping .csv data to .json and Flask for API Integration. Deployed on "pyt

the techno hack 3 Oct 23, 2022
Predictive Modeling & Analytics on Home Equity Line of Credit

Predictive Modeling & Analytics on Home Equity Line of Credit Data (Python) HMEQ Data Set In this assignment we will use Python to examine a data set

Dhaval Patel 1 Jan 09, 2022
Pipeline and Dataset helpers for complex algorithm evaluation.

tpcp - Tiny Pipelines for Complex Problems A generic way to build object-oriented datasets and algorithm pipelines and tools to evaluate them pip inst

Machine Learning and Data Analytics Lab FAU 3 Dec 07, 2022
A collection of learning outcomes data analysis using Python and SQL, from DQLab.

Data Analyst with PYTHON Data Analyst berperan dalam menghasilkan analisa data serta mempresentasikan insight untuk membantu proses pengambilan keputu

6 Oct 11, 2022
Convert tables stored as images to an usable .csv file

Convert an image of numbers to a .csv file This Python program aims to convert images of array numbers to corresponding .csv files. It uses OpenCV for

711 Dec 26, 2022
MotorcycleParts DataAnalysis python

We work with the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.

NASEEM A P 1 Jan 12, 2022
PyStan, a Python interface to Stan, a platform for statistical modeling. Documentation: https://pystan.readthedocs.io

PyStan PyStan is a Python interface to Stan, a package for Bayesian inference. Stan® is a state-of-the-art platform for statistical modeling and high-

Stan 229 Dec 29, 2022
This is an analysis and prediction project for house prices in King County, USA based on certain features of the house

This is a project for analysis and estimation of House Prices in King County USA The .csv file contains the data of the house and the .ipynb file con

Amit Prakash 1 Jan 21, 2022
Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python This project is a good starting point for those who have little

Himanshu Kumar singh 2 Dec 04, 2021
Data Science Environment Setup in single line

datascienv is package that helps your to setup your environment in single line of code with all dependency and it is also include pyforest that provide single line of import all required ml libraries

Ashish Patel 55 Dec 16, 2022
BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

Vo Cong Thanh 1 Jan 06, 2022
Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data

Statistical_Modelling Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data Statistical Methods for Decision Ma

Avnika Mehta 1 Jan 27, 2022
Data exploration done quick.

Pandas Tab Implementation of Stata's tabulate command in Pandas for extremely easy to type one-way and two-way tabulations. Support: Python 3.7 and 3.

W.D. 20 Aug 27, 2022