PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Overview

PrimaryBid

Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift This project is composed of two parts: Part1 and Part2

Part1

This part involves ingesting an application lifecycle raw data in .csv formats (“CC Application Lifecycle.csv”). The data is transformed to return various Application stages as column names, and the time of stage completion, as values against each customer ID via python.

Files included in this section include:

  • Solution Directory:
    • application_etl.py (Contains transformation class for application lifecycle raw data)
    • run_application_etl.py (Ingest and executes transformations for application lifecycle raw data)
  • Test Directory:
    • test_application_etl.py (runs a series of test for objects in the transformation class)
    • Input Directory (Contains all the input test files)
    • Output Directory (Contains all the output test files)

Execution:

  1. Execute run_application_etl.py to obtain output file for transformed application lifecycle data.

Modifications:

  1. Extra transformation, bug fixes and other modification can be added in application_etl.py as an object.
  2. For new transformations (new functions), add a test for the function in test_application_etl.py and execute it with pytest -vv.
  3. Call the object in run_application_etl.py after test passes to return desired output.

Part2

This part presents an architectural design to ingest data from a MongoDB database - into a Redshift data platform. The solution accomodates the addition of more data sources in the near future. The DDL scripts which form part of the solution is resusable for ingesting and loading data into redshift.

Files included in this section establishes the creation of target tables for the data ingestion process:

  • dwh.cfg (Infrastucture parameters and configuration)
  • DDL_queries.py (DDL queries to drop, creat, copy/insert data into Redshift)
  • table_setup_load.py (Class to manage the establish connection to database setup and teardown of tables in Redshift)
  • execute_ddl_process.py (script to execute processes in table_setup_load class)
  • test_execute_ddl_process.py (script to test the setup and teardown of resources.)
  • requirement.txt (key libraries needed to execute .py scripts)
  • makefile (file to automate process of installing and testing libraries and .py scripts respectively.)

Execution:

  1. Execute execute_ddl_process.py to create and load data into target tables from S3.

Modifications:

  1. Bucket file sources and other config paramters can be added in dwh.cfg
  2. New DDl queries which includes ingesting data from multiple tables from aggregations/joins can be added in DDL_queries.py.
  3. For other functions not captured in this section work, custom functions can be added in table_setup_load.py
  4. Before executing scripts for production environments, test the modifications by executing test_execute_ddl_process.py

The architecture below highlights the processes involved in ingesting data from various data sources into redshift

  • Architeture

Data Architecture

Owner
Emmanuel Boateng Sifah
Computer scientist, Doctoral researcher, Solutions engineer, Data scientist, Data analyst and Data engineer
Emmanuel Boateng Sifah
My solution to the book A Collection of Data Science Take-Home Challenges

DS-Take-Home Solution to the book "A Collection of Data Science Take-Home Challenges". Note: Please don't contact me for the dataset. This repository

Jifu Zhao 1.5k Jan 03, 2023
PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

1 Feb 07, 2022
Wafer Fault Detection - Wafer circleci with python

Wafer Fault Detection Problem Statement: Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor, such as a crystal

Avnish Yadav 14 Nov 21, 2022
API>local_db>AWS_RDS - Disclaimer! All data used is for educational purposes only.

APIlocal_dbAWS_RDS Disclaimer! All data used is for educational purposes only. ETL pipeline diagram. Aim of project By creating a fully working pipe

0 Apr 25, 2022
OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere.

opendrift OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere. Do

OpenDrift 167 Dec 13, 2022
Kennedy Institute of Rheumatology University of Oxford Project November 2019

TradingBot6M Kennedy Institute of Rheumatology University of Oxford Project November 2019 Run Change api.txt to binance api key: https://www.binance.c

Kannan SAR 2 Nov 16, 2021
Larch: Applications and Python Library for Data Analysis of X-ray Absorption Spectroscopy (XAS, XANES, XAFS, EXAFS), X-ray Fluorescence (XRF) Spectroscopy and Imaging

Larch: Data Analysis Tools for X-ray Spectroscopy and More Documentation: http://xraypy.github.io/xraylarch Code: http://github.com/xraypy/xraylarch L

xraypy 95 Dec 13, 2022
WaveFake: A Data Set to Facilitate Audio DeepFake Detection

WaveFake: A Data Set to Facilitate Audio DeepFake Detection This is the code repository for our NeurIPS 2021 (Track on Datasets and Benchmarks) paper

Chair for Sys­tems Se­cu­ri­ty 27 Dec 22, 2022
This tool parses log data and allows to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

AECID 32 Nov 27, 2022
A 2-dimensional physics engine written in Cairo

A 2-dimensional physics engine written in Cairo

Topology 38 Nov 16, 2022
Python library for creating data pipelines with chain functional programming

PyFunctional Features PyFunctional makes creating data pipelines easy by using chained functional operators. Here are a few examples of what it can do

Pedro Rodriguez 2.1k Jan 05, 2023
Stock Analysis dashboard Using Streamlit and Python

StDashApp Stock Analysis Dashboard Using Streamlit and Python If you found the content useful and want to support my work, you can buy me a coffee! Th

StreamAlpha 27 Dec 09, 2022
CubingB is a timer/analyzer for speedsolving Rubik's cubes, with smart cube support

CubingB is a timer/analyzer for speedsolving Rubik's cubes (and related puzzles). It focuses on supporting "smart cubes" (i.e. bluetooth cubes) for recording the exact moves of a solve in real time.

Zach Wegner 5 Sep 18, 2022
Datashredder is a simple data corruption engine written in python. You can corrupt anything text, images and video.

Datashredder is a simple data corruption engine written in python. You can corrupt anything text, images and video. You can chose the cha

2 Jul 22, 2022
For making Tagtog annotation into csv dataset

tagtog_relation_extraction for making Tagtog annotation into csv dataset How to Use On Tagtog 1. Go to Project Downloads 2. Download all documents,

hyeong 4 Dec 28, 2021
A highly efficient and modular implementation of Gaussian Processes in PyTorch

GPyTorch GPyTorch is a Gaussian process library implemented using PyTorch. GPyTorch is designed for creating scalable, flexible, and modular Gaussian

3k Jan 02, 2023
Python package for processing UC module spectral data.

UC Module Python Package How To Install clone repo. cd UC-module pip install . How to Use uc.module.UC(measurment=str, dark=str, reference=str, heade

Nicolai Haaber Junge 1 Oct 20, 2021
Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

WeRateDogs Twitter Data from 2015 to 2017 Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data Table of Contents Introduction Proj

Keenan Cooper 1 Jan 12, 2022
University Challenge 2021 With Python

University Challenge 2021 This repository contains: The TeX file of the technical write-up describing the University / HYPER Challenge 2021 under late

2 Nov 27, 2021
A simplified prototype for an as-built tracking database with API

Asbuilt_Trax A simplified prototype for an as-built tracking database with API The purpose of this project is to: Model a database that tracks constru

Ryan Pemberton 1 Jan 31, 2022