Overview

Project Scopes

The scope of this project is to build a data warehouse on Google Cloud Platform that helps answer common business questions and powers dashboards. To do that, a conceptual data model and a data pipeline will be defined.

Architecture

Data is uploaded to a Google Cloud Storage bucket. GCS acts as the data lake where all raw files are stored. Data is then loaded into staging tables on BigQuery. The ETL process takes data from those staging tables and creates data mart tables. An Airflow instance can be deployed on a Google Compute Engine instance or locally to orchestrate the pipeline.

Here are the justifications for the technologies used:

  • Google Cloud Storage: acts as the data lake; object storage that scales seamlessly as data volume grows.
  • Google BigQuery: acts as the database engine for the data warehouse, data marts, and ETL processes. BigQuery is a serverless solution that can easily and efficiently process petabyte-scale datasets.
  • Apache Airflow: orchestrates the workflow by issuing commands to load data to BigQuery and SQL queries for the ETL process. Airflow does not process any data itself, which allows the architecture to scale.
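
To make the GCS-to-BigQuery staging load concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The project, bucket, and dataset names are taken from later sections of this README; the staging table name is an assumption:

from google.cloud import bigquery

client = bigquery.Client(project='cloud-data-lake')

# Immigration data sits in the lake as Parquet; overwrite the staging table on each run.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    'gs://cloud-data-lake-gcp/immigration_data/*.parquet',
    'cloud-data-lake.IMMIGRATION_DWH_STAGING.immigration_data',
    job_config=job_config,
)
load_job.result()  # block until the load job completes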

Data Model

The database is designed following the star-schema principle, with 1 fact table and 7 dimension tables.

[Image: star-schema data model]

  • F_IMMIGRATION_DATA: contains immigration information such as arrival date, departure date, visa type, gender, country of origin, etc.
  • D_TIME: contains dimension attributes for the date columns
  • D_PORT: contains port_id and port_name
  • D_AIRPORT: contains airports within a state
  • D_STATE: contains state_id and state_name
  • D_COUNTRY: contains country_id and country_name
  • D_WEATHER: contains average weather for a state
  • D_CITY_DEMO: contains demographic information for a city
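
As an illustration of the ETL step that builds these tables, a dimension such as D_STATE can be created from the staging data with a single query run through the BigQuery client. The staging column names here are assumptions and may differ from the real schema:

from google.cloud import bigquery

client = bigquery.Client(project='cloud-data-lake')

# Build the D_STATE dimension from the I94ADDR master data in staging.
client.query('''
CREATE OR REPLACE TABLE IMMIGRATION_DWH.D_STATE AS
SELECT DISTINCT state_id, state_name
FROM IMMIGRATION_DWH_STAGING.I94ADDR
''').result()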

Data pipeline

This project uses Airflow for orchestration.

[Image: Airflow DAG graph view of the pipeline]

A DummyOperator, start_pipeline, kicks off the pipeline, followed by 4 load operations. Those operations load data from the GCS bucket to BigQuery tables. The immigration data is loaded as Parquet files while the others are CSV formatted. There are operations to check row counts after loading to BigQuery.

Next, the pipeline loads 3 master data objects from the I94 data dictionary. Then the F_IMMIGRATION_DATA table is created and checked to make sure there are no duplicates. The other dimension tables are also created, and the pipeline finishes. A sketch of one load/check pair follows.
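
Here is a hedged sketch of what one load/check pair might look like with the Google provider operators for Airflow 2.x. Task names mirror the graph above, but the exact arguments are assumptions; the real DAG lives at dags/cloud-data-lake-pipeline.py (see Project Instructions):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryCheckOperator

with DAG(
    'cloud-data-lake-pipeline',
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually; see the Scenarios section for a daily schedule
    catchup=False,
) as dag:
    start_pipeline = DummyOperator(task_id='start_pipeline')

    # One of the 4 load operations: GCS Parquet files -> BigQuery staging table.
    load_immigration_data = GCSToBigQueryOperator(
        task_id='load_immigration_data',
        bucket='cloud-data-lake-gcp',
        source_objects=['immigration_data/*.parquet'],
        destination_project_dataset_table='IMMIGRATION_DWH_STAGING.immigration_data',
        source_format='PARQUET',
        write_disposition='WRITE_TRUNCATE',
    )

    # Row check: the task fails if the staging table is empty (COUNT(*) = 0 is falsy).
    check_immigration_data = BigQueryCheckOperator(
        task_id='check_immigration_data',
        sql='SELECT COUNT(*) FROM `IMMIGRATION_DWH_STAGING.immigration_data`',
        use_legacy_sql=False,
    )

    start_pipeline >> load_immigration_data >> check_immigration_data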

Scenarios

Data increases by 100x

The current infrastructure can easily support a 100x increase in data size. GCS and BigQuery can handle petabyte-scale data. Airflow is not a bottleneck since it only issues commands to other services.

Pipelines run daily at 7 AM. How do dashboards get updated? Would it still work?

Schedule the DAG to run daily at 7 AM. Set up DAG retries and email/Slack notifications on failures.
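
A minimal configuration sketch, assuming Airflow 2.x; the alert address is a placeholder, and Slack notifications can be wired in through an on_failure_callback:

from datetime import datetime, timedelta

from airflow import DAG

# Retry and notification policy; the email address is a placeholder.
default_args = {
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['data-alerts@example.com'],
}

dag = DAG(
    'cloud-data-lake-pipeline',
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval='0 7 * * *',  # every day at 7 AM
    catchup=False,
)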

Make it available to 100+ people

BigQuery is auto-scaling, so it can easily handle 100+ people accessing the warehouse. If more people or services need access to the data, we can add steps to write to a NoSQL database like Datastore, Bigtable, or Cassandra, or to a SQL database that supports horizontal scaling like Cloud Spanner.

Project Instructions

GCP setup

Follow these steps:

  • Create a project on GCP
  • Enable billing by adding a credit card (you have free credits worth $300)
  • Navigate to IAM and create a service account
  • Grant the account the Project Owner role. This is convenient for this project, but it is not recommended for a production system. Create a JSON key for the account and keep it somewhere safe. A quick credential check follows this list.
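
To verify the key works before wiring up Airflow, here is a minimal check with the BigQuery client library, assuming google-cloud-bigquery is installed; 'key.json' stands in for wherever you saved the downloaded key:

import os

# Point the Google client libraries at the downloaded service-account key.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'key.json'

from google.cloud import bigquery

client = bigquery.Client()
print(client.project)  # prints your project id if the credentials are valid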

Create a bucket on your project and upload the data with the following structure:

gs://cloud-data-lake-gcp/airports/:
gs://cloud-data-lake-gcp/airports/airport-codes_csv.csv
gs://cloud-data-lake-gcp/airports/airport_codes.json

gs://cloud-data-lake-gcp/cities/:
gs://cloud-data-lake-gcp/cities/us-cities-demographics.csv
gs://cloud-data-lake-gcp/cities/us_cities_demo.json

gs://cloud-data-lake-gcp/immigration_data/:
gs://cloud-data-lake-gcp/immigration_data/part-00000-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00001-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00002-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00003-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00004-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00005-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00006-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00007-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00008-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00009-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00010-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00011-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00012-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00013-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet

gs://cloud-data-lake-gcp/master_data/:
gs://cloud-data-lake-gcp/master_data/I94ADDR.csv
gs://cloud-data-lake-gcp/master_data/I94CIT_I94RES.csv
gs://cloud-data-lake-gcp/master_data/I94PORT.csv

gs://cloud-data-lake-gcp/weather/:
gs://cloud-data-lake-gcp/weather/GlobalLandTemperaturesByCity.csv
gs://cloud-data-lake-gcp/weather/temperature_by_city.json

You can copy the data to your own bucket by running the following:

gsutil cp -r gs://cloud-data-lake-gcp/ gs://{your_bucket_name}

Local setup

Clone the project, then set up the local environment and required tooling as follows:

Install Docker if it's not already installed. You can find the resources to do that here.

Install the Astronomer CLI following the instructions here.

Run the following command to bring up the Airflow instance:

astro d start

You can look at the logs by running make logs if you need to debug something. You can access and manage the pipeline by entering the following address in a browser:

localhost:8080/admin/

If everything is set up correctly, you will see the following screen:

[Image: Airflow UI showing the DAG]

Navigate to Admin -> Connections and paste in the credentials for the following two connections: bigquery_default and google_cloud_default

[Image: Airflow connection setup]

Navigate to the main DAG at dags/cloud-data-lake-pipeline.py and change the following parameters to match your own setup:

project_id = 'cloud-data-lake'               # your GCP project id
staging_dataset = 'IMMIGRATION_DWH_STAGING'  # BigQuery dataset for staging tables
dwh_dataset = 'IMMIGRATION_DWH'              # BigQuery dataset for warehouse tables
gs_bucket = 'cloud-data-lake-gcp'            # your GCS bucket name

You can then trigger the DAG and the pipeline will run.

The data warehouse

The final data warehouse looks like this:

[Image: the final data warehouse tables in BigQuery]
