ETL pipeline on movie data using Python and PostgreSQL

Last update: Jul 07, 2021

Movies-ETL

Overview

This project consisted of an automated extract, transform, and load (ETL) pipeline. The pipeline extracted movie data from Wikipedia, Kaggle, and MovieLens, then cleaned, transformed, and merged it using Pandas. The product was a merged table of movies and ratings loaded into PostgreSQL.

Resources

Data sources:
- movies_metadata.csv
- ratings.csv
- wikipedia_movies.json
Software:
- Python
- PostgreSQL
- Pandas
- SQLAlchemy
- Regular Expressions

Results

Final output table: FINAL_Merged_Movies_and_Ratings.csv
The datasets were uploaded to PostgreSQL so that other users could analyze the movie data (hackathon).
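
The ratings file is large (more than 26 million entries), so one way to upload it is in chunks rather than in a single read. The following is a minimal sketch of that idea, not the project's actual loading code. The function name `load_in_chunks`, the table name, and the chunk size are hypothetical. To keep the example runnable without a database server, it writes to an in-memory SQLite connection as a stand-in; for PostgreSQL, a SQLAlchemy engine from `sqlalchemy.create_engine` (with a connection URL that this README does not specify) would be passed instead.

```python
import io
import sqlite3

import pandas as pd


def load_in_chunks(csv_source, table_name, connection, chunk_size=1_000_000):
    """Append a large CSV to a SQL table one chunk at a time.

    Returns the total number of rows written.
    """
    rows_written = 0
    # read_csv with chunksize yields DataFrames of at most chunk_size rows,
    # so the whole file never has to fit in memory at once.
    for chunk in pd.read_csv(csv_source, chunksize=chunk_size):
        chunk.to_sql(table_name, connection, if_exists="append", index=False)
        rows_written += len(chunk)
    return rows_written


# Tiny stand-in for ratings.csv (hypothetical rows, assumed column names).
sample_csv = io.StringIO(
    "userId,movieId,rating\n"
    "1,862,5.0\n"
    "2,862,3.0\n"
    "3,949,4.0\n"
)

# Assumption: SQLite in memory replaces the PostgreSQL target for this sketch.
connection = sqlite3.connect(":memory:")
total = load_in_chunks(sample_csv, "ratings", connection, chunk_size=2)
print(f"{total} rows loaded")
```

Because the sketch uses `if_exists="append"`, running it twice against the same table would duplicate rows; dropping or replacing the table before the first chunk is one way to avoid that.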

Summary

The pipeline was created under the following assumptions:

- I was able to join the Wikipedia, Kaggle, and ratings movie data on the IMDb ID column.
- The Wikipedia dataset didn't have an IMDb ID column, so I had to extract the ID from the URL given for each movie.
- Each dataset had to be cleaned on its own because of overlapping columns (such as 'Director' and 'Directed By'), unnecessary columns, many null values, TV shows, outliers, duplicates, incorrect data types, inconsistent formatting, and other errors.
- The Wikipedia movie data was in JSON format.
- Not every movie had a rating for each rating level.
- The ratings dataset had more than 26 million entries, which created a time constraint and a data-processing challenge.
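
The points above can be illustrated with a small Pandas sketch: pulling the IMDb ID out of a URL with a regular expression, joining on that ID, and filling in zero for rating levels a movie never received. This is an illustration under assumptions, not the project's code. The sample rows and the column names (`imdb_link`, `imdb_id`, `id`, `movieId`) are hypothetical, and the sketch assumes the ratings are keyed by the numeric movie ID found in the Kaggle metadata.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
wiki = pd.DataFrame({
    "title": ["Toy Story", "Heat", "Jumanji"],
    "imdb_link": [
        "https://www.imdb.com/title/tt0114709/",
        "https://www.imdb.com/title/tt0113277/",
        "https://www.imdb.com/title/tt0113497/",
    ],
})
kaggle = pd.DataFrame({
    "id": [862, 949, 8844],
    "imdb_id": ["tt0114709", "tt0113277", "tt0113497"],
})
ratings = pd.DataFrame({
    "movieId": [862, 862, 862, 949],
    "rating": [5.0, 5.0, 3.0, 4.0],
})

# IMDb title IDs are "tt" followed by at least seven digits.
wiki["imdb_id"] = wiki["imdb_link"].str.extract(r"(tt\d{7,})", expand=False)

# Join the Wikipedia and Kaggle rows on the extracted IMDb ID.
movies = wiki.merge(kaggle, on="imdb_id", how="inner")

# Count ratings per movie and rating level: one column per level,
# with 0 where a movie with ratings has none at that level.
rating_counts = (
    ratings.groupby(["movieId", "rating"]).size().unstack(fill_value=0)
)
rating_counts.columns = [f"rating_{level}" for level in rating_counts.columns]

# A left join keeps movies with no ratings at all; their counts become 0.
merged = movies.merge(rating_counts, left_on="id", right_index=True, how="left")
rating_cols = list(rating_counts.columns)
merged[rating_cols] = merged[rating_cols].fillna(0).astype(int)

print(merged[["title", "imdb_id"] + rating_cols])
```

The left join is a deliberate choice here: an inner join would silently drop any movie that has no ratings, while the left join keeps it with zero counts.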

Owner

Juan Nicolas Serrano
