The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system.

Overview

Sparkify

Songplays User activity datamart



The following document describes the model used to build the songplays datamart table and the respective ETL process.


About

The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system.

This document describes the model of the songplays datamart table in the sparkify_app schema, which lives in the sparkify_postgres container, and the Python code that loads new data. The production directories and data must be similar to those under the mnt/data/log_data and mnt/data/song_data paths in this repository.

🏁 Getting Started

First, you need permission to read the source files and to write to the sparkify_app tables that feed the songplays datamart table. Contact the owners or your team leader for more information.

Data Model and Schema


songplays datamart

Source files and owners

| File or table | Description | Directory | Owner |
|---|---|---|---|
| YYYY-MM-DD-events.json | User events. | mnt/data/log_data/YYYY/11 | Person 1 |
| .json | Song data. | mnt/data/song_data/a | Person 2 |
| songplays | Datamart for the recommendation system. | sparkify_app.songplays | Person 3 |
| artists | Dimension table for artists. | sparkify_app.artists | Person 1 |
| songs | Dimension table for songs. | sparkify_app.songs | Person 1 |
| time | Dimension table for the streaming start time of a given song. | sparkify_app.time | Person 2 |
| users | Dimension table for users. | sparkify_app.users | Person 3 |

Prerequisites


To run this project, first install Docker Engine for your operating system, along with Docker Compose.

After installing and configuring the Docker tools, download this repository and create a folder named postgres to store all sparkify_postgres service data. To build the images and start the services, run the following command inside this repository:

docker-compose up

If the service runs successfully you should see something like this:

...
sparkify_python      | 28/30 files processed.
sparkify_python      | 29/30 files processed.
sparkify_python      | 30/30 files processed.
sparkify_python exited with code 0
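
The progress lines above suggest that the loader walks the source directories one file at a time. The following is a minimal sketch of that pattern, not the project's actual ETL code: the function names `find_json_files` and `process_all` and the `handler` callback are hypothetical, and only the directory paths and the progress message format come from this README.

```python
from pathlib import Path
from typing import Callable, List


def find_json_files(root: Path) -> List[Path]:
    """Return every .json file under root, sorted for a stable order."""
    return sorted(p for p in root.rglob("*.json") if p.is_file())


def process_all(root: Path, handler: Callable[[Path], None]) -> int:
    """Apply handler to each JSON file under root and print progress."""
    files = find_json_files(root)
    total = len(files)
    for done, path in enumerate(files, start=1):
        handler(path)  # hypothetical per-file load step
        print(f"{done}/{total} files processed.")
    return total


if __name__ == "__main__":
    # Directory paths are taken from this README; the handler is a
    # placeholder that does nothing with each file.
    for data_dir in ("mnt/data/song_data", "mnt/data/log_data"):
        process_all(Path(data_dir), handler=lambda path: None)
```

Passing the per-file step as a callback keeps the directory walk separate from whatever parsing and inserting the real scripts perform.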

You can also check the job by following these steps:

  • Open your browser and go to localhost:16543 (screenshot: pga1).

    • Enter the login credentials to authenticate.
  • After you log in, click the Servers option in the upper-left corner (screenshot: pga2).

    • You will be asked for the PostgreSQL credentials:
      • User: sparkifypsql
      • Password: p4ssw0rd
  • Select Query Tool under the Tools menu (screenshot: pga3).

  • Under the Query Editor, run the following query:

    SELECT * FROM sparkify_app.songplays WHERE song_id IS NOT NULL AND artist_id IS NOT NULL;
    • You should get only 5 rows (screenshot: pga3).
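
The same check can be scripted instead of run through pgAdmin. The sketch below is an illustration under stated assumptions: `count_matched_songplays` and `demo_connection` are hypothetical helpers, and the demo uses an in-memory SQLite database with only the two columns the query touches, standing in for the real PostgreSQL table so the example runs without the containers.

```python
import sqlite3

# Same filter as the query above, but counting rows instead of listing them.
VERIFY_QUERY = (
    "SELECT COUNT(*) FROM sparkify_app.songplays "
    "WHERE song_id IS NOT NULL AND artist_id IS NOT NULL"
)


def count_matched_songplays(conn) -> int:
    """Count songplays rows that have both song_id and artist_id set.

    conn can be any DB-API 2.0 connection.
    """
    cur = conn.cursor()
    try:
        cur.execute(VERIFY_QUERY)
        return cur.fetchone()[0]
    finally:
        cur.close()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory stand-in for the real table (illustration only)."""
    conn = sqlite3.connect(":memory:")
    # Attaching a second in-memory database gives us the schema-qualified name.
    conn.execute("ATTACH DATABASE ':memory:' AS sparkify_app")
    conn.execute(
        "CREATE TABLE sparkify_app.songplays (song_id TEXT, artist_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO sparkify_app.songplays VALUES (?, ?)",
        [("s1", "a1"), (None, "a2"), ("s3", None), (None, None)],
    )
    return conn


if __name__ == "__main__":
    print(count_matched_songplays(demo_connection()))
```

Against the real database, pass a connection opened with a PostgreSQL driver and the credentials listed above; the host, port, and database name are not stated in this README, so they would need to be taken from the project's Docker Compose configuration. With the sample data, the count should match the 5 rows mentioned above.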

Microservice architecture

The following image represents the microservice architecture for this project (image: topology).

Where:

  • sparkify_python: runs all Python scripts and stores raw data.
  • sparkify_postgres: runs PostgreSQL and stores the database.
  • sparkify_pgadmin: runs the pgAdmin tool to monitor the sparkify_postgres service.

⛏️ Built Using

✍️ Authors

Owner
Leandro Kellermann de Oliveira