A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
The Fastest multi spambot of Telegram ๐Ÿคž ๐Ÿคž

Revil Spam Bot The Fastest multi spambot of Telegram ๐Ÿคž ๐Ÿคž ๐š‚๐š„๐™ฟ๐™ฟ๐™พ๐š๐šƒ ๐Ÿ–ค แด„ส€แด‡แด€แด›แดส€ ๐Ÿ–ค โšก ๐“ก๐“ฎ๐“ฟ๐“ฒ๐“ต ๐“—๐“พ๐“ท๐“ฝ๐“ฎ๐“ป ๐”๐”ฒ๐”ฉ๐”ฑ๐”ฆ แบžรธโœž๏ธŽ โšก ๐“ ๐•พะผฮฟฮฟฯ„ะฝ ๐“ะธโˆ‚ ๐•ฑ

REVIL HUNTER 4 Dec 08, 2021
Brute force instagram account / actonetor, 2021

Brute force instagram account / actonetor, 2021

actonetor 6 Nov 16, 2022
Telegram Group Manager Bot + Userbot Written In Python Using Pyrogram.

Telegram Group Manager Bot + Userbot Written In Python Using PyrogramTelegram Group Manager Bot + Userbot Written In Python Using Pyrogram

1 Nov 11, 2021
Terraform Cloud CLI for Managing Workspace Terraform Versions

Terraform Cloud Version Manager This tiny script makes it easy to update the Terraform Version on all of the Workspaces inside Terraform Cloud. It wil

Robert Hafner 1 Jan 07, 2022
Its Is A Telegram Maths Basic Calculator Bot

Its Is A Telegram Maths Basic Calculator Bot

ANKIT KUMAR 1 Dec 26, 2021
Nasdaq Cloud Data Service (NCDS) provides a modern and efficient method of delivery for realtime exchange data and other financial information. This repository provides an SDK for developing applications to access the NCDS.

Nasdaq Cloud Data Service (NCDS) Nasdaq Cloud Data Service (NCDS) provides a modern and efficient method of delivery for realtime exchange data and ot

Nasdaq 8 Dec 01, 2022
Easily report Instagram pages and close the page

Program Features - ๐Ÿ“Œ Delete target post on Instagram. - ๐Ÿ“Œ Delete Media Target post on Instagram - ๐Ÿ“Œ Complete deletion of the target account on Inst

hack4lx 11 Nov 25, 2022
Automatically commits and pushes changes from a specified directory to remote repository

autopush a simple python program that checks a directory for updates and automatically commits any updated files (and optionally pushes them) installa

carreb 1 Jan 16, 2022
A simple Discord Mass-Ban that's still working with Member Scraper.

Mass-Ban [!] This was made for education / you can use for revenge. Please don't skid it. [!] If you want to use it, please use member scraper before

WoahThatsHot 1 Nov 20, 2021
Blankly - ๐Ÿš€ ๐Ÿ’ธ Trade stocks, cryptos, and forex w/ one package. Easily build, backtest, trade, and deploy across exchanges in a few lines of code.

๐Ÿ’จ Rapidly build and deploy quantitative models for stocks, crypto, and forex ๐Ÿš€ View Docs ยท Our Website ยท Join Our Newsletter ยท Getting Started Why B

Blankly Finance 1.4k Jan 03, 2023
๐ŸŽฅ Stream your favorite movie from the terminal!

Stream-Cli stream-cli is a Python scrapping CLI that combine scrapy and webtorrent in one command for streaming movies from your terminal. Installatio

R E D O N E 379 Dec 24, 2022
A simple Python TDLib wrapper

Telegram Forwarder App Description pywtdlib (Python Wrapper TDLib) is a simple synchronous Python wrapper that makes you easy to create new Python Tel

รlvaro Fernรกndez 2 Jan 04, 2023
A Telegram bot to download posts, videos, reels, IGTV and a user profile picture from Instagram!

Telegram Bot A telegram bot to download media from Instagram! No API Key or Login Needed! Requirements You must have python installed (of course) You

Simon Farah 2 Apr 10, 2022
Based on falcondai and fenhl's Python snowflake tool, but with documentation and simliarities to Discord.

python-snowflake-2 Based on falcondai and fenhl's Python snowflake tool, but with documentation and simliarities to Discord. Docs make_snowflake This

2 Mar 19, 2022
This is a bot which you can use in telegram to spam without flooding and enjoy being in the leaderboard

Telegram-Count-spamming-Bot This is a bot which you can use in telegram to spam without flooding and enjoy being in the leaderboard You can avoid the

Lalan Kumar 1 Oct 23, 2021
Find people to play tennis with.

40Love 40Love is a full-stack web application that helps tennis players find hits at public tennis courts. Players can select public courts on the map

Tanner Schmutte 27 Jun 08, 2022
Materials for the AMS 2022 Student Conference Python Workshop.

AMS 2022 Student Conference Python Workshop Let's talk MetPy! Here you will find a collection of notebooks we will be demonstrating and working throug

Unidata 4 Dec 13, 2022
Unofficial GoPro API Library for Python - connect to GoPro via WiFi.

GoPro API for Python Unofficial GoPro API Library for Python - connect to GoPro cameras via WiFi. Compatibility: HERO3 HERO3+ HERO4 (including HERO Se

Konrad Iturbe 1.3k Jan 01, 2023
Create a Neo4J graph of users and roles trust policies within an AWS Organization.

AWS_ORG_MAPPER This tool uses sso-oidc to authenticate to the AWS organization. Once authenticated the tool will attempt to enumerate all users and ro

Ruse 24 Jul 28, 2022