The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system.

Last update: Jul 13, 2021

Songplays User activity datamart

The following document describes the model used to build the songplays datamart table and the respective ETL process.

About
Getting Started
Data Model and Schema
Deployment
Built Using
Authors

About

The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system.

This document describes the model of the songplays datamart table in the sparkify_app schema, which runs inside the sparkify_postgres container, and the Python code that loads new data. The production directories and data must be similar to those in the mnt/data/log_data and mnt/data/song_data paths in this repository.
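As a minimal sketch of how a loader might discover its input, assuming the source data is stored as JSON files nested under the two directories named above. The helper name `find_json_files` is hypothetical and is not taken from this repository.

```python
from pathlib import Path


def find_json_files(root):
    """Return every *.json file under root, sorted for a stable order."""
    return sorted(Path(root).rglob("*.json"))


if __name__ == "__main__":
    # The directory names come from this document; the files that are
    # found depend on your local checkout.
    for directory in ("mnt/data/log_data", "mnt/data/song_data"):
        files = find_json_files(directory)
        print(f"{directory}: {len(files)} JSON files found")
```

Sorting the paths keeps the processing order reproducible between runs, which makes the load easier to debug.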

🏁 Getting Started

First, you need the right permissions to read the source files and to write to the sparkify_app tables that feed the songplays datamart table. Contact the owners or your team leader for more information.

Data Model and Schema

Source files and owners

File or table	Description	Directory	Owner
YYYY-MM-DD-events.json	User events.	mnt/data/log_data/YYYY/11	Person 1
.json	Song data.	mnt/data/song_data/a	Person 2
`songplays`	Datamart for recommendation system.	`sparkify_app.songplays`	Person 3
`artists`	Dimension table for artists.	`sparkify_app.artists`	Person 1
`songs`	Dimension table for songs.	`sparkify_app.songs`	Person 1
`time`	Dimension table for streaming start time for a given song.	`sparkify_app.time`	Person 2
`users`	Dimension table for users.	`sparkify_app.users`	Person 3
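To illustrate what a row of the `time` dimension could hold, here is a small sketch that breaks a streaming start time into calendar parts. It rests on assumptions that this document does not confirm: that the event timestamp is given in epoch milliseconds, and that the dimension stores the parts shown. The function name and dictionary keys are hypothetical.

```python
from datetime import datetime, timezone


def time_row_from_ms(ts_ms):
    """Break an epoch-milliseconds timestamp into calendar parts (UTC)."""
    start = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start,
        "hour": start.hour,
        "day": start.day,
        "week": start.isocalendar()[1],  # ISO 8601 week number
        "month": start.month,
        "year": start.year,
        "weekday": start.weekday(),  # Monday is 0, Sunday is 6
    }
```

Check the column list against the actual `sparkify_app.time` definition before relying on this shape.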

Prerequisites

To run this project, first install Docker Engine for your operating system, along with Docker Compose.

After installing and configuring the Docker tools, download this repository and create a folder named postgres, which will store all sparkify_postgres service data. To build the images and run the services, execute the following command inside this repository:

docker-compose up

If the services run successfully, you should see output like this:

...
sparkify_python      | 28/30 files processed.
sparkify_python      | 29/30 files processed.
sparkify_python      | 30/30 files processed.
sparkify_python exited with code 0
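The progress lines above could come from a loop like the following. This is only a sketch of the pattern, not the repository's actual code; `process_files` and the `handler` callback are hypothetical names.

```python
def process_files(paths, handler):
    """Call handler on each path and report progress after every file."""
    total = len(paths)
    for done, path in enumerate(paths, start=1):
        handler(path)
        print(f"{done}/{total} files processed.")
    return total
```

For example, `process_files(files, load_one_file)` would print one line per file, where `load_one_file` stands for whatever function parses a file and inserts its rows.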

You can also check the job by following these steps:

Open your browser and go to localhost:16543:
- Enter the following credentials to authenticate:
  - e-mail: [email protected]
  - password: sp4rk1fy
After you log in, click the Servers option in the upper-left corner:
- You will be asked to enter the PostgreSQL credentials:
  - User: sparkifypsql
  - Password: p4ssw0rd
Select Query Tool under the Tools menu.

Under the Query Editor, run the following query:

SELECT * FROM sparkify_app.songplays WHERE song_id is NOT NULL and artist_id is NOT NULL;

You should get only 5 rows.
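The same check can be run from Python instead of pgAdmin. The sketch below counts the matching rows through any DB-API 2.0 connection, such as one returned by `psycopg2.connect()`. The helper name `count_matched_songplays` is hypothetical.

```python
CHECK_QUERY = """
    SELECT COUNT(*)
    FROM sparkify_app.songplays
    WHERE song_id IS NOT NULL AND artist_id IS NOT NULL
"""


def count_matched_songplays(conn):
    """Count songplays rows that have both song_id and artist_id set.

    conn is any DB-API 2.0 connection, for example a psycopg2 connection.
    """
    cur = conn.cursor()
    try:
        cur.execute(CHECK_QUERY)
        return cur.fetchone()[0]
    finally:
        cur.close()
```

With psycopg2, the connection would be opened with the user and password listed above. The host, port, and database name are not stated in this document, so take them from your docker-compose settings.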

Microservice architecture

The following image represents the microservice architecture for this project:

Where:

sparkify_python: runs all Python scripts and stores raw data.
sparkify_postgres: runs PostgreSQL and stores the database.
sparkify_pgadmin: runs the pgAdmin tool to monitor the sparkify_postgres service.

⛏️ Built Using

DBeaver - Database tool.
Docker Compose - Tool to run multi-container applications.
Docker Engine - Container engine.
pandas - Data analysis and data wrangling tool.
pgAdmin - PostgreSQL tool.
psycopg2 - Database adapter for Python.
PostgreSQL - Relational database management system.

✍️ Authors

@lkellermann - Idea & Initial work

Owner

Leandro Kellermann de Oliveira
