Processing NYC Taxi Data using a PySpark ETL pipeline

Description

This project extracts, transforms, and loads a large amount of data from the NYC Taxi Rides dataset (hosted on AWS S3). It reads large CSV files (~2 GB per month) and applies transformations such as data type conversions and dropping rows and columns that are not useful. Finally, the data is written back in Parquet format, which saves time in downstream tasks such as machine learning. Parquet also takes far less space (~97% reduction compared with CSV), making the data easier to store for downstream use.
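The extract, transform, and load flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names follow the public yellow-taxi CSV schema, while the choice of casts, the dropped column, the row filter, and the function names build_select_exprs and run_etl are all assumptions.

```python
# Which columns to cast or drop is an assumption made for illustration;
# the real choices live in the project's own config and job modules.
CASTS = {
    "tpep_pickup_datetime": "timestamp",
    "tpep_dropoff_datetime": "timestamp",
    "passenger_count": "int",
    "trip_distance": "double",
    "fare_amount": "double",
}
DROP_COLUMNS = ["store_and_fwd_flag"]


def build_select_exprs(columns, casts, drop):
    """Return Spark SQL expressions that drop, cast, or pass through each column."""
    exprs = []
    for name in columns:
        if name in drop:
            continue  # dropped columns are simply left out of the projection
        if name in casts:
            exprs.append(f"CAST(`{name}` AS {casts[name]}) AS `{name}`")
        else:
            exprs.append(f"`{name}`")
    return exprs


def run_etl(src, dst):
    """Read a CSV file, apply the casts and drops, and write Parquet."""
    # Imported here so the helper above can be exercised without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nyc-taxi-etl-sketch").getOrCreate()
    # Without a schema or inferSchema, every column is read as a string.
    raw = spark.read.csv(src, header=True)
    typed = raw.selectExpr(*build_select_exprs(raw.columns, CASTS, DROP_COLUMNS))
    # Hypothetical row filter: keep trips with a pickup time and a positive distance.
    cleaned = typed.dropna(subset=["tpep_pickup_datetime", "trip_distance"]).filter(
        "trip_distance > 0"
    )
    cleaned.write.mode("overwrite").parquet(dst)
```

Keeping the projection logic in a plain function means the cast and drop rules can be checked without starting a Spark session; run_etl then only wires the reader, the projection, and the Parquet writer together.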

How to use it (Using GCP as the cloud service of choice)

Set up a bucket on Google Cloud Storage (GCS)
Use get_raw_data.sh to download the raw CSV files from S3 to the GCS bucket
Set up a GCP Dataproc cluster
SSH into the master node and copy the entire project folder to the persistent disk
Edit the application configuration file
Submit the job: spark-submit main.py --filename [raw_data_filename], or run submit_job.sh with the appropriate arguments
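For the final step, an entry point that accepts the --filename argument might look like the sketch below. Only the --filename flag comes from the command shown above; the helper names parse_args and build_paths, and the raw/ and parquet/ prefixes inside the bucket, are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments passed after main.py."""
    parser = argparse.ArgumentParser(description="NYC Taxi ETL job (illustrative sketch)")
    parser.add_argument(
        "--filename",
        required=True,
        help="name of the raw CSV file in the GCS bucket",
    )
    return parser.parse_args(argv)


def build_paths(bucket, filename):
    """Derive input and output URIs; the raw/ and parquet/ layout is an assumption."""
    stem = filename.rsplit(".", 1)[0]  # strip the extension, if there is one
    return f"gs://{bucket}/raw/{filename}", f"gs://{bucket}/parquet/{stem}"
```

Calling parse_args() with no arguments reads from sys.argv, so the same function serves both the spark-submit invocation and direct calls with an explicit argument list.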

Project structure

root/
|---bash/
    |---create_cluster.sh
    |---install.sh
|---configs/
    |---app_config.json
    |---cols_config.json
|---jobs/
    |---etl_tasks.py
    |---transformations.py
|   get_raw_data.sh
|   main.py
|   requirements.txt
|   submit_job.sh
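The contents of the files under configs/ are not shown here, so the following is only a guess at how a job could consume them. It assumes a hypothetical cols_config.json layout with a "cast" mapping (column name to Spark type) and a "drop" list; the key names and both function names are illustrative.

```python
import json
from pathlib import Path


def load_config(path):
    """Read a JSON config file and return its contents as a dict."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def split_columns(cols_config):
    """Return (casts, drops) from the assumed {"cast": {...}, "drop": [...]} layout."""
    casts = cols_config.get("cast", {})
    drops = cols_config.get("drop", [])
    # A column cannot sensibly be both cast and dropped, so fail early.
    overlap = set(casts) & set(drops)
    if overlap:
        raise ValueError(f"columns both cast and dropped: {sorted(overlap)}")
    return casts, drops
```

A job would then call something like split_columns(load_config("configs/cols_config.json")) once at startup and hand the result to the transformation code.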

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Owner

Unnikrishnan
