A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
Anti-league-discordbot - Harrasses imbeciles for playing league of legends

anti-league-discordbot harrasses imbeciles for playing league of legends Running

Chris Clem 2 Feb 12, 2022
A discord bot written in discord.py to manage custom roles assigned to boosters of your server.

BBotty A discord bot written in discord.py to manage custom roles assigned to boosters of your server. v0.0.1-alpha released! This version is incomple

Oui002 1 Nov 27, 2021
PyDiscord, a maintained fork of discord.py, is a python wrapper for the Discord API.

discord.py A modern, easy to use, feature-rich, and async ready API wrapper for Discord written in Python. The Future of discord.py Please read the gi

Omkaar 1 Jan 16, 2022
Telegram-Discord Bridge

imperial-toilet Скрипт, пересылающий сообщения из нескольких каналов Telegram в один/несколько каналов Discord. Технически это Telegram-юзербот и Disc

1 Jan 17, 2022
A simple and modular Discord bot with various functionalities.

All-In-Bot for Discord A simple and modular Discord bot with various functionalities. How to use the bot? Simple! Just invite the bot to your server u

Th3J0nny 3 Jan 29, 2022
New discord token grabber, password and general information

New discord token grabber, password and general information

Monstered 6 Nov 09, 2022
This Lambda will Pull propagated routes from TGW and update VPC route table

AWS-Transitgateway-Route-Propagation This Lambda will Pull propagated routes from TGW and update VPC route table. Tested on python 3.8 Lambda AWS INST

4 Jan 20, 2022
BiliBili-live-barrage-transceiver - A simple python program for sending and receiving barrage in bilibili live room

BiliBili-live-barrage-transceiver - A simple python program for sending and receiving barrage in bilibili live room

zeroy 2 Jan 18, 2022
SUPPORTS 500 GROUPS NO NEED OF BOT 😉

LOVELY RADIO SUPPORTS 500 GROUPS NO NEED OF BOT 😉 Requirements Telegram API_ID , API_HASH and SESSION_NAME HEROKU Get YouTube live stream link instal

6 Nov 24, 2021
Telegram bot for logistic - Telegram bot for logistic

Демонстрационный телеграм-бот для нужд транспортной компании Цель проекта Реализ

M1chigun 1 Feb 05, 2022
Python SDK for accessing the Hanko Authentication API

Hanko Authentication SDK for Python This package is maintained by Hanko. Contents Introduction Documentation Installation Usage Prerequisites Create a

Hanko.io 3 Mar 08, 2022
Dns-Client-Server - Dns Client Server For Python

Dns-client-server DNS Server: supporting all types of queries and replies. Shoul

Nishant Badgujar 1 Feb 15, 2022
A Simple Voice Music Player

📀 𝐕𝐂𝐔𝐬𝐞𝐫𝐁𝐨𝐭 √𝙏𝙚𝙖𝙢✘𝙊𝙘𝙩𝙖𝙫𝙚 NOTE JUST AN ENGLISH VERSION OF OUR PRIVATE SOURCE WAIT FOR LATEST UPDATES JOIN @𝐒𝐔𝐏𝐏𝐎𝐑𝐓 JOIN @𝐂?

TeamOctave 8 May 08, 2022
Video-Player - Telegram Music/ Video Streaming Bot Using Pytgcalls

Video Player 🔥 ᴢᴀɪᴅ ᴠᴄ ᴘʟᴀyᴇʀ ɪꜱ ᴀ ᴛᴇʟᴇɢʀᴀᴍ ᴘʀᴏᴊᴇᴄᴛ ʙᴀꜱᴇᴅ ᴏɴ ᴘʏʀᴏɢʀᴀᴍ ꜰᴏʀ ᴘʟᴀʏ

Zaid 16 Nov 30, 2022
Simple Discord bot for snekbox (sandboxed Python code execution), self-host or use a global instance

snakeboxed Simple Discord bot for snekbox (sandboxed Python code execution), self-host or use a global instance

0 Jun 25, 2022
Discord bots that update their status to the price of any coin listed on x.vite.net

Discord bots that update their status to the price of any coin listed on x.vite.net

5am 3 Nov 27, 2022
Bendford analysis of Ethereum transaction

Bendford analysis of Ethereum transaction The python script script.py extract from already downloaded archive file the ethereum transaction. The value

sleepy ramen 2 Dec 18, 2021
A Python package that can be used to download post and comment data from Reddit.

Reddit Data Collector Reddit Data Collector is a Python package that allows a user to collect post and comment data from Reddit. It is built on top of

Nico Van den Hooff 3 Jul 26, 2022
Attempting to create a framework for Discord Slash commands... yes

discord_slash.py Attempting to create a framework for Discord Slash commands... yes Installation pip install slashpy Documentation Coming soon™ Why is

AlexFlipnote 11 Mar 24, 2021
StringSessionGenerator - A Telegram bot to generate pyrogram and telethon string session

⭐️ String Session Generator ⭐️ Genrate String Session Using this bot. Made by TeamUltronX 🔥 String Session Demo Bot: Environment Variables Mandatory

TheUltronX 1 Dec 31, 2021