Fully automated data pipeline using Docker images

Overview

Create postgres tables from CSV files

This first section relates only to creating tables from CSV files using a Postgres container alone; it is just one of my experiments. If you are interested, you can follow these steps (only if your working environment supports bash):

sh scripts/prep.sh

The prep.sh script will handle everything for you by doing the following:

# Start postgres db container
docker-compose -f postgres.yaml up -d
# Sleep to make sure the container is fully up and running
sleep 3 

# I had a problem mounting the CSV files via docker-compose, so here we go
# Copy csv and setup.sql to create required tables
docker cp ./csv/ my_postgres:
docker cp ./scripts/setup.sql my_postgres:setup.sql

# Execute the script in postgres db
docker exec -it my_postgres psql -p5432 --dbname=postgres --username=postgres --file=setup.sql 

# Shut down the container
docker-compose -f postgres.yaml down --remove-orphans
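
setup.sql itself is not shown here, but the idea is to create each table and then bulk-load the matching CSV. A minimal sketch of what one table might look like (the table name, columns, and CSV filename are hypothetical, not the repo's actual schema):

-- Hypothetical table; the real setup.sql defines the project's actual tables.
CREATE TABLE IF NOT EXISTS my_table (
    id   INTEGER,
    name TEXT
);

-- Bulk-load the CSV that was copied into the container
COPY my_table FROM '/csv/my_table.csv' WITH (FORMAT csv, HEADER true);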

I had a problem with volume mounting: I couldn't mount the files under the csv and scripts folders via docker-compose. This could still be improved with a proper mount, but let's skip it for now to save time.
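
If you do want to try a proper mount instead of docker cp, a minimal sketch of what the volumes section in postgres.yaml might look like (the service name and mount paths are assumptions based on the commands above, not the repo's actual file):

# In postgres.yaml, under the Postgres service (names assumed):
services:
  postgres:
    container_name: my_postgres
    volumes:
      - ./csv:/csv                      # CSV files available inside the container
      - ./scripts/setup.sql:/setup.sql  # setup script, replacing the docker cp step

With this in place, the two docker cp lines in prep.sh would no longer be needed.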

Initial Setup/Start Airflow container

This section uses a separate docker-compose.yaml from the test above. It is still related, though, because we want Airflow to schedule the tasks above (create tables and load data). To do so, first prepare the folders. You can create a new folder specifically for this if you want.

# (optional) mkdir airflow && cd airflow
mkdir ./dags ./logs ./plugins

Next, we need the official Airflow docker-compose.yaml in our airflow directory:

curl -O https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml

Next, make sure we have the proper permissions to initialize Airflow:

echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

Then we must initialize the Airflow instance:

docker-compose up airflow-init

Wait until the initialization has finished, then start everything (you can use -d to detach if you want):

docker-compose up

Now you will be able to connect to the Airflow GUI via http://localhost:8080/. With the official docker-compose setup, the default login is airflow / airflow, unless you overrode it in .env.

Create Airflow DAG task

First, you need to set up a connection for the Postgres database. Go to Admin > Connections > +, then fill in the details of the connection:

Connection Id: postgres_default
Connection Type: Postgres
Host: the host of your Postgres container (depends on where it runs relative to the Airflow containers)
Schema: postgres (default)
Login: postgres (as used in the psql command above)
Password: the password configured for your Postgres container
Port: 5432

Click "Test" button to check your connection then save. Now click at the Airflow icon to return to home page. You should see task name "create_postgres_tables". Try to run by clicking start button select "Trigger DAG".
