Full ELT process on GCP environment.

Overview

Rent Houses Germany - GCP Pipeline

gcp_pipeline

Project:

  • The goal of the project is to extract data about house rentals in Germany, store, process and analyze it using GCP tools. The focus here is to practice and get used to the GCP environment.

Main Tools:

Python

Cloud Storage

BigQuery

Dataprep

Data Studio

Looker

Crontab

Bash

Data Extraction and Storage:

Source: https://www.immonet.de/

  • The data extraction is done in 3 steps where first the quantity of offers for each city is collected, them the ID's for each offers and finaly the raw information about each rent offer is extracted.

  • The first script is responsible to scrape the number of offers in each city and save the information as a CSV file in Cloud Storage. The second script gets the previous CSV file from Cloud Storage and uses it to scrape all ID's from each offers in each city and load the information back to Cloud Storage as a new CSV file. The third script gets the rent offer's ID info from Cloud Storage and perform a web-scraper to collect all information for each ID and save it back to Cloud Storage, again as a CSV file containing all raw infos about the offers.

  • All the extractions steps are scheduled though a Crontab Job to run everyday at 0h.

cronjob

Data Preprocessing.

  • As the last CSV file contains all the RAW information about each offer grouped in only two columns, a preprocessing step is needed. The preprocessor script gets the CSV file with the raw information from Cloud Storage, separates the data into the appropriate columns already performing some cleaning like excluding not needed characters. Again, the preprocessed CSV file is stored in Cloud Storage.

all_offers_infos_raw.csv:

raw_infos

all_offers_infos_pp.csv:

raw_infos

Data Cleaning and Preparation.

  • Here is used Cloud Dataprep to clean and prepare the data for further use. To transform the rent data into useble information first we need to clean and prepare it. Dataprep is a realy good tool where we can look inside the data and can perform all kind of filtering, removing and preparations. Dataprep gets the preprocessed csv file from Cloud Storage and runs a "recipe" tranforming the data to be analyzed. Dataprep saves the cleaned and final csv file both into Data Storage (a backup) and into a BigQuery warehouse.

dataprepJob

  • The Dataproc job was scheduled to run everyday 7 A.M and update the data source for the reports.

Data Analysis - Data Studio Report.

  • With the data cleaned and loaded into BigQuery it's time to display the information. The GCP tools used to display the data was Data Studio and Looker. First I used Data Studio to make a simple report summaring all the rent houses main informantion and schedule to send an e-mail with the updated report avery day at 8 A.M.

    data_studio_dashboard

German Rent Report - 27.11.21

Data Analysis - Looker Dashboard.

  • I'm still working on it.

Conclusion.

  • The tools available on Google Cloud Platform are simply amazing. As in all Cloud platforms, the tools are available and are arranged in a way to make the user's life easier, it is really cool and very practical to build an entire ETL/ELT process using the available tools and it makes everything much easier and agile. The fact that you don't have to deal with hardware fiscally, the automated scalability, the advanced security controls, the availability of virtually all the necessary tools in one place, the integration between the tools, and all the other characteristics of cloud environments contribute greatly to the considerable increase in productivity, in environments like these we only need to focus on doing the main part of our job, on delivering the result, and that is amazing. For me it has been a very pleasant experience to work and experience these features, the next steps now are to continue learning and applying them and in the future to seek certifications.
Owner
Felipe Demenech Vasconcelos
In a constant learning path...
Felipe Demenech Vasconcelos
A Python Tools to imaging the shallow seismic structure

ShallowSeismicImaging Tools to imaging the shallow seismic structure, above 10 km, based on the ZH ratio measured from the ambient seismic noise, and

Xiao Xiao 9 Aug 09, 2022
Full ELT process on GCP environment.

Rent Houses Germany - GCP Pipeline Project: The goal of the project is to extract data about house rentals in Germany, store, process and analyze it u

Felipe Demenech Vasconcelos 2 Jan 20, 2022
Projeto para realizar o RPA Challenge . Utilizando Python e as bibliotecas Selenium e Pandas.

RPA Challenge in Python Projeto para realizar o RPA Challenge (www.rpachallenge.com), utilizando Python. O objetivo deste desafio é criar um fluxo de

Henrique A. Lourenço 1 Apr 12, 2022
A fast, flexible, and performant feature selection package for python.

linselect A fast, flexible, and performant feature selection package for python. Package in a nutshell It's built on stepwise linear regression When p

88 Dec 06, 2022
cLoops2: full stack analysis tool for chromatin interactions

cLoops2: full stack analysis tool for chromatin interactions Introduction cLoops2 is an extension of our previous work, cLoops. From loop-calling base

YaqiangCao 25 Dec 14, 2022
The Spark Challenge Student Check-In/Out Tracking Script

The Spark Challenge Student Check-In/Out Tracking Script This Python Script uses the Student ID Database to match the entries with the ID Card Swipe a

1 Dec 09, 2021
yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data.

The yt Project yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data. yt supports structured, varia

The yt project 367 Dec 25, 2022
🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

🧪📈 🐍. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python a

Marc Skov Madsen 97 Dec 08, 2022
Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Trung-Duy Nguyen 27 Nov 01, 2022
A Python package for Bayesian forecasting with object-oriented design and probabilistic models under the hood.

Disclaimer This project is stable and being incubated for long-term support. It may contain new experimental code, for which APIs are subject to chang

Uber Open Source 1.6k Dec 29, 2022
WithPipe is a simple utility for functional piping in Python.

A utility for functional piping in Python that allows you to access any function in any scope as a partial.

Michael Milton 1 Oct 26, 2021
Flenser is a simple, minimal, automated exploratory data analysis tool.

Flenser Have you ever been handed a dataset you've never seen before? Flenser is a simple, minimal, automated exploratory data analysis tool. It runs

John McCambridge 79 Sep 20, 2022
A set of functions and analysis classes for solvation structure analysis

SolvationAnalysis The macroscopic behavior of a liquid is determined by its microscopic structure. For ionic systems, like batteries and many enzymes,

MDAnalysis 19 Nov 24, 2022
Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions.

About Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions. The tool provides rich data and a summary g

9 Nov 16, 2022
Randomisation-based inference in Python based on data resampling and permutation.

Randomisation-based inference in Python based on data resampling and permutation.

67 Dec 27, 2022
Fit models to your data in Python with Sherpa.

Table of Contents Sherpa License How To Install Sherpa Using Anaconda Using pip Building from source History Release History Sherpa Sherpa is a modeli

134 Jan 07, 2023
A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

SymPy 9.9k Dec 31, 2022
Kennedy Institute of Rheumatology University of Oxford Project November 2019

TradingBot6M Kennedy Institute of Rheumatology University of Oxford Project November 2019 Run Change api.txt to binance api key: https://www.binance.c

Kannan SAR 2 Nov 16, 2021
Snakemake workflow for converting FASTQ files to self-contained CRAM files with maximum lossless compression.

Snakemake workflow: name A Snakemake workflow for description Usage The usage of this workflow is described in the Snakemake Workflow Catalog. If

Algorithms for reproducible bioinformatics (Koesterlab) 1 Dec 16, 2021
This repository contains some analysis of possible nerdle answers

Nerdle Analysis https://nerdlegame.com/ This repository contains some analysis of possible nerdle answers. Here's a quick overview: nerdle.py contains

0 Dec 16, 2022