Projects that implement various aspects of Data Engineering.

Last update: Oct 14, 2021

Overview

DATAWAREHOUSE ON AWS

The purpose of this project is to build a data warehouse to accommodate user activity data for the music streaming application 'Sparkify'. The data model is implemented on AWS cloud infrastructure with the following components -

AWS S3 - Source datasets.

AWS Redshift
>for staging extracted data
>for storing the resultant data model (facts and dimensions)

The data model designed for this project is a star schema.

Table and attribute details are -

Fact Table
songplays: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables
users: user_id, first_name, last_name, gender, level
songs: song_id, title, artist_id, year, duration
artists: artist_id, name, location, lattitude, longitude
time: start_time, hour, day, week, month, year, weekday
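
The tables above could be declared in sql_queries.py roughly as follows. This is only a sketch: the column types, the IDENTITY and PRIMARY KEY choices, and the helper name create_table_sql are assumptions, since the original only lists table and column names.

```python
# Hypothetical sketch of sql_queries.py. Column names come from the lists
# above; every type and constraint here is an assumption.
TABLES = {
    "songplays": [
        ("songplay_id", "INTEGER IDENTITY(0,1) PRIMARY KEY"),
        ("start_time", "TIMESTAMP NOT NULL"),
        ("user_id", "INTEGER NOT NULL"),
        ("level", "VARCHAR"),
        ("song_id", "VARCHAR"),
        ("artist_id", "VARCHAR"),
        ("session_id", "INTEGER"),
        ("location", "VARCHAR"),
        ("user_agent", "VARCHAR"),
    ],
    "users": [
        ("user_id", "INTEGER PRIMARY KEY"),
        ("first_name", "VARCHAR"),
        ("last_name", "VARCHAR"),
        ("gender", "VARCHAR"),
        ("level", "VARCHAR"),
    ],
    "songs": [
        ("song_id", "VARCHAR PRIMARY KEY"),
        ("title", "VARCHAR"),
        ("artist_id", "VARCHAR"),
        ("year", "INTEGER"),
        ("duration", "FLOAT"),
    ],
    "artists": [
        ("artist_id", "VARCHAR PRIMARY KEY"),
        ("name", "VARCHAR"),
        ("location", "VARCHAR"),
        ("lattitude", "FLOAT"),
        ("longitude", "FLOAT"),
    ],
    "time": [
        ("start_time", "TIMESTAMP PRIMARY KEY"),
        ("hour", "INTEGER"),
        ("day", "INTEGER"),
        ("week", "INTEGER"),
        ("month", "INTEGER"),
        ("year", "INTEGER"),
        ("weekday", "INTEGER"),
    ],
}


def create_table_sql(name, columns):
    """Build one CREATE TABLE statement from (column, type) pairs."""
    cols = ",\n    ".join(f"{col} {ctype}" for col, ctype in columns)
    return f"CREATE TABLE IF NOT EXISTS {name} (\n    {cols}\n);"


# Lists that a script such as create_tables.py could iterate over.
create_table_queries = [create_table_sql(n, c) for n, c in TABLES.items()]
drop_table_queries = [f"DROP TABLE IF EXISTS {n};" for n in TABLES]
```

Keeping the schema in one dictionary means the create and drop lists cannot drift apart when a table is added or renamed.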

The source datasets to be extracted into the dimensional model are two sets of JSON files -

Song data: s3://udacity-dend/song_data - Data for all songs with their respective artists available in application library.

Log data: s3://udacity-dend/log_data - Data for user events and activity on the application.
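
Loading these S3 paths into Redshift staging tables is usually done with the COPY command. The sketch below builds such a statement as a Python string. The staging table name, the IAM role ARN and the region in the usage lines are placeholders, and the 'auto' JSON option is an assumption; the log data may need a JSONPaths file or a different option if its keys do not match the staging column names.

```python
# Sketch of a Redshift COPY statement builder. Only the two S3 paths come
# from the project description; everything else is a placeholder.
SONG_DATA = "s3://udacity-dend/song_data"
LOG_DATA = "s3://udacity-dend/log_data"


def build_copy_sql(table, s3_path, iam_role, region, json_option="auto"):
    """Return a COPY statement that loads JSON files from S3 into a table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


# Placeholder values for illustration only.
example_copy = build_copy_sql(
    table="staging_songs",  # hypothetical staging table name
    s3_path=SONG_DATA,
    iam_role="arn:aws:iam::000000000000:role/example-role",  # placeholder ARN
    region="us-west-2",  # assumed bucket region
)
```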

The data warehouse runs on AWS Redshift, which uses a PostgreSQL-based SQL dialect.

The ETL pipeline that extracts and loads data from source to target is implemented in Python.
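
The stage-then-load flow could be organised roughly as below. This is a minimal sketch, not the project's actual etl.py: run_queries and run_etl are hypothetical helpers, and conn is assumed to be any Python DB-API connection (for Redshift it would typically come from a PostgreSQL driver such as psycopg2).

```python
def run_queries(conn, queries):
    """Execute each SQL statement in order, committing after each one."""
    cur = conn.cursor()
    for query in queries:
        cur.execute(query)
        conn.commit()
    cur.close()


def run_etl(conn, copy_queries, insert_queries):
    """Stage the raw data first, then populate the fact and dimension tables."""
    run_queries(conn, copy_queries)    # S3 -> staging tables
    run_queries(conn, insert_queries)  # staging tables -> star schema
```

Committing after every statement keeps already-loaded tables in place if a later statement fails, at the cost of not rolling the whole run back.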

TODO steps:

Create sql_queries.py - to design and build tables for the proposed data model

Run create_tables.py - to create the tables by executing the queries from sql_queries.py

Run etl.py - to run the data pipeline, which extracts, stages and loads data from AWS S3 into the data warehouse on AWS Redshift

Design and run analytical queries on the populated data model to gain insights into user events in the streaming application
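
As an illustration of the last step, one possible analytical query counts song plays per subscription level. The query and the fetch_rows helper are hypothetical examples, not part of the original project; conn is again assumed to be a DB-API connection.

```python
# Hypothetical analytical query: number of song plays per subscription level.
PLAYS_BY_LEVEL = """
SELECT level, COUNT(*) AS plays
FROM songplays
GROUP BY level
ORDER BY plays DESC;
"""


def fetch_rows(conn, query):
    """Run a read-only query and return all result rows."""
    cur = conn.cursor()
    cur.execute(query)
    rows = cur.fetchall()
    cur.close()
    return rows
```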

Used for data processing in machine learning, and help us to construct ML model more easily from scratch

Creating a statistical model to predict 10 year treasury yields

Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)

Finding project directories in Python (data science) projects, just like the R rprojroot and here packages

[CVPR2022] This repository contains code for the paper "Nested Collaborative Learning for Long-Tailed Visual Recognition", published at CVPR 2022

Stitch together Nanopore tiled amplicon data without polishing a reference

Python dataset creator to construct datasets composed of OpenFace extracted features and Shimmer3 GSR+ Sensor data

An easy-to-use feature store

ETL pipeline on movie data using Python and postgreSQL

Monitor the stability of a pandas or spark dataframe ⚙︎

A utility for functional piping in Python that allows you to access any function in any scope as a partial.

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.

Titanic data analysis for python

Senator Trades Monitor

This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP.

The Spark Challenge Student Check-In/Out Tracking Script

Python Kalman filtering and optimal estimation library. Implements Kalman filter, particle filter, Extended Kalman filter, Unscented Kalman filter, g-h (alpha-beta), least squares, H Infinity, smoothers, and more. Has companion book 'Kalman and Bayesian Filters in Python'.

A model checker for verifying properties in epistemic models

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python