Data Orchestration Platform

Overview

What is DOP

Design Concept

DOP is designed to simplify the orchestration effort across many connected components using a configuration file without the need to write any code. We have a vision to make orchestration easier to manage and more accessible to a wider group of people.

Here are some of the key design concepts behind DOP:

  • Built on top of Apache Airflow - Utilises its DAG capabilities with an interactive GUI
  • DAGs without code - YAML + SQL
  • Native capabilities (SQL) - Materialisation, Assertion and Invocation
  • Extensible via plugins - DBT job, Spark job, Egress job, Triggers, etc
  • Easy to set up and deploy - fully automated dev environment and simple deployment
  • Open Source - open sourced under the MIT license

Please note that this project is heavily optimised to run with GCP (Google Cloud Platform) services, which is our current focus. Focusing on one cloud provider allows us to improve the end-user experience through automation.

A Typical DOP Orchestration Flow

[Diagram: Typical DOP Flow]

Prerequisites - Run in Docker

Note that all the IAM related prerequisites will be available as a Terraform template soon!

For DOP Native Features

  1. Download and install Docker https://docs.docker.com/get-docker/ (if you are on Windows, please follow the instructions here, as some additional steps are required for it to work: https://docs.docker.com/docker-for-windows/install/)
  2. Download and install Google Cloud Platform (GCP) SDK following instructions here https://cloud.google.com/sdk/docs/install.
  3. Create a dedicated service account for Docker with limited permissions in the development GCP project. The Docker instance is not designed to be connected to the production environment.
    1. Call it dop-docker-user@<your GCP project id> and create it in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser roles to the service account under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  4. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and roles/iam.serviceAccountTokenCreator roles on the development project, just for the dop-docker-user service account, in order to enable Service Account Impersonation.
  5. Authenticate with your GCP environment by running gcloud auth application-default login in your terminal and following the instructions. Make sure you proceed to the stage where application_default_credentials.json is created on your machine (Windows users should make a note of the path, as it will be required at a later stage).
  6. Clone this repository to your machine.
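Steps 3 and 4 above can also be scripted instead of clicked through in the console. The following is a minimal sketch, not part of DOP: it only builds the corresponding gcloud command lines so they can be reviewed before being run. The function names, the example project id and the example user are hypothetical, and the full service account address is assumed to follow the standard GCP form <name>@<project id>.iam.gserviceaccount.com.

```python
from typing import List


def service_account_email(project_id: str, name: str) -> str:
    """Return the full address GCP assigns to a service account."""
    return f"{name}@{project_id}.iam.gserviceaccount.com"


def setup_commands(project_id: str, developer: str,
                   name: str = "dop-docker-user") -> List[List[str]]:
    """Build (but do not run) the gcloud commands for steps 3 and 4.

    `developer` is an IAM member string such as "user:me@example.com"
    or "group:data-team@example.com" (hypothetical values).
    """
    sa = service_account_email(project_id, name)
    commands = [
        # Step 3.1: create the dedicated service account.
        ["gcloud", "iam", "service-accounts", "create", name,
         f"--project={project_id}"],
    ]
    # Step 3.2: BigQuery roles for the service account on the project.
    for role in ("roles/bigquery.dataEditor", "roles/bigquery.jobUser"):
        commands.append(
            ["gcloud", "projects", "add-iam-policy-binding", project_id,
             f"--member=serviceAccount:{sa}", f"--role={role}"])
    # Step 4: allow the developer to impersonate this service account only.
    for role in ("roles/iam.serviceAccountUser",
                 "roles/iam.serviceAccountTokenCreator"):
        commands.append(
            ["gcloud", "iam", "service-accounts", "add-iam-policy-binding",
             sa, f"--project={project_id}",
             f"--member={developer}", f"--role={role}"])
    return commands


if __name__ == "__main__":
    for cmd in setup_commands("my-dev-project", "user:me@example.com"):
        print(" ".join(cmd))
```

Because the step 4 bindings are attached to the service account itself rather than to the project, the developer can impersonate dop-docker-user without gaining that right over any other service account. Passing name="dop-dbt-user" produces the equivalent commands for the DBT service account.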

For DBT

  1. Set up a service account for your GCP project called dop-dbt-user in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
  2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser roles to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  3. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and roles/iam.serviceAccountTokenCreator roles on the development project, just for the dop-dbt-user service account, in order to enable Service Account Impersonation.

Instructions for Setting things up

Run Airflow with DOP in Docker - Mac

See README in the service project setup and follow instructions.

Once it is set up, you should see example DOP DAGs such as dop__example_covid19 in the Airflow UI.

Run Airflow with DOP in Docker - Windows

This is currently a work in progress; however, the instructions on what needs to be done are in the Makefile.

Run on Composer

Prerequisites

  1. Create a dedicated service account for Composer and call it dop-composer-user, with the following roles at project level
    • roles/bigquery.dataEditor
    • roles/bigquery.jobUser
    • roles/composer.worker
    • roles/compute.viewer
  2. Create a dedicated service account for DBT with limited permissions.
    1. [Already done under "For DBT" above if it’s DEV] Call it dop-dbt-user@<GCP project id> and create it in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. [Already done under "For DBT" above if it’s DEV] Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser roles to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
    3. The dop-composer-user will need to be given the roles/iam.serviceAccountUser and roles/iam.serviceAccountTokenCreator roles, just for the dop-dbt-user service account, in order to enable Service Account Impersonation.

Create Composer Cluster

  1. Use the service account already created (dop-composer-user) instead of the default service account
  2. Use the following environment variables
    DOP_PROJECT_ID : {REPLACE WITH THE GCP PROJECT ID WHERE DOP WILL PERSIST ALL DATA TO}
    DOP_LOCATION : {REPLACE WITH THE GCP REGION LOCATION WHERE DOP WILL PERSIST ALL DATA TO}
    DOP_SERVICE_PROJECT_PATH : {REPLACE WITH THE ABSOLUTE PATH OF THE Service Project, i.e. /home/airflow/gcs/dags/dop_{service project name}}
    DOP_INFRA_PROJECT_ID : {REPLACE WITH THE GCP INFRASTRUCTURE PROJECT ID WHERE BUILD ARTIFACTS ARE STORED, i.e. a DBT docker image stored in GCR}
    
    and optionally
    DOP_GCR_PULL_SECRET_NAME : {This may be needed if the project storing the GCR images is not the same as the one where Cloud Composer runs; however, this might be a better alternative: https://medium.com/google-cloud/using-single-docker-repository-with-multiple-gke-projects-1672689f780c}
    
  3. Add the following Python Packages
    dataclasses==0.7
    
  4. Finally, create a new node pool with the following Kubernetes label
    key: cloud.google.com/gke-nodepool
    value: kubernetes-task-pool
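
A missing or misspelled variable from step 2 is easy to overlook when a Composer environment is created by hand. The snippet below is a small, hypothetical sanity check and not DOP's own code: it reads the variables named in step 2 and fails early if a required one is absent. The function name and the choice to treat an empty value as missing are assumptions.

```python
import os
import posixpath
from typing import Dict, Mapping

# Variable names taken from step 2 above.
REQUIRED = ("DOP_PROJECT_ID", "DOP_LOCATION",
            "DOP_SERVICE_PROJECT_PATH", "DOP_INFRA_PROJECT_ID")
OPTIONAL = ("DOP_GCR_PULL_SECRET_NAME",)


def read_dop_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the DOP_* variables, raising ValueError if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    # Optional variables are included only when they are set.
    settings.update({name: env[name] for name in OPTIONAL if env.get(name)})
    # Composer workers run on Linux, so check for a POSIX absolute path.
    if not posixpath.isabs(settings["DOP_SERVICE_PROJECT_PATH"]):
        raise ValueError("DOP_SERVICE_PROJECT_PATH must be an absolute path")
    return settings


if __name__ == "__main__":
    example = {
        "DOP_PROJECT_ID": "my-data-project",
        "DOP_LOCATION": "EU",
        "DOP_SERVICE_PROJECT_PATH": "/home/airflow/gcs/dags/dop_demo",
        "DOP_INFRA_PROJECT_ID": "my-infra-project",
    }
    print(read_dop_settings(example))
```

Passing the environment in as a mapping keeps the check easy to try locally with a plain dictionary before any real environment exists.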
    

Deployment

See Service Project README

Misc

Service Account Impersonation

Impersonation is a GCP feature that allows a user / service account to impersonate another service account.
This is a very useful feature and offers the following benefits:

  • When doing development locally, especially with automation involved (i.e. using Docker), it is very risky to interact with GCP services using your user account directly, because it may have a lot of permissions. Impersonating a service account with fewer permissions is a lot safer (least privilege).
  • No credentials need to be downloaded; all permissions are linked to the user account. If an employee leaves the company, access to GCP is revoked immediately because the impersonation process is no longer possible.

The following diagram explains how we use Impersonation in DOP when it runs in Docker.

[Diagram: DOP Docker Account Impersonation]

When running DBT jobs in production, we use the same technique: the Composer service account impersonates the dop-dbt-user service account, so that service account keys are not required.
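
To make the mechanism concrete, here is a minimal sketch of impersonation using the google-auth Python library. It is illustrative only and not taken from DOP's code base: the function names, the cloud-platform scope and the one-hour lifetime are assumptions, and it relies on application default credentials (for example from gcloud auth application-default login) already being available.

```python
def service_account_email(project_id: str, name: str) -> str:
    """Return the full address GCP assigns to a service account."""
    return f"{name}@{project_id}.iam.gserviceaccount.com"


def impersonated_credentials_for(project_id: str, name: str,
                                 lifetime: int = 3600):
    """Return short-lived credentials for the named service account.

    The caller's own credentials are used only to request a token for
    the target service account, so no service account key is downloaded.
    """
    # Imported lazily so the helper above works without google-auth installed.
    import google.auth
    from google.auth import impersonated_credentials

    # The identity doing the impersonating (a user locally, or the
    # Composer service account in production).
    source_credentials, _ = google.auth.default()
    return impersonated_credentials.Credentials(
        source_credentials=source_credentials,
        target_principal=service_account_email(project_id, name),
        target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
        lifetime=lifetime,  # seconds
    )


if __name__ == "__main__":
    print(service_account_email("my-dev-project", "dop-docker-user"))
```

The returned object can be passed as the credentials argument of Google Cloud client libraries, so subsequent calls run with the permissions of the service account rather than those of the caller. The token request only succeeds while the caller still holds roles/iam.serviceAccountTokenCreator on the target service account.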

There are two very good articles explaining how impersonation works and why to use it.


Comments
  • Release DOP v0.3.0

    Release DOP v0.3.0

    A number of new features were added in this version.

    DOP v0.3.0 — 2021-08-11

    Features

    • Support for "generic" airflow operators: you can now use regular python operators as part of your config files.

    • Support for “dbt docs” command to generate documentation for all dbt tasks: Users can now add “docs generate” as a target in their DOP configuration and additionally specify a GCS bucket with the --bucket and --bucket-path options where documents are copied to.

    • Serve dbt docs: Documents generated by dbt can be served as a web page by deploying the provided app on GAE. Note that deploying is an additional step that needs to be carried out after docs have been generated. See infrastructure/dbt-docs/README.md for details.

    • dbt run_results artifacts saved to BigQuery: This JSON file, created by dbt tasks, contains information on completed dbt invocations and is saved in the BQ table “run_results” for analysis and debugging.

    • Add support for Airflow v1.10.14 and v1.10.15 local environments: Users can specify which version they want to use by setting the AIRFLOW_VERSION environment variable.

    • Pre-commit linters: added pre-commit hooks to ensure consistent formatting and style for Python, YAML and (to some extent) plain text files throughout the DOP codebase.

    Changes

    • Ensure DAGs using the same DBT project do not run concurrently: A safety feature that allows selective execution of workflows by calling specific commands or tags (e.g. dbt run --m) within a single dbt project. It removes the need to create interdependent workflows just to stop them overriding each other's artifacts, since they share the same target location (within the dbt container).

    • Test time-partitioning: Time-partitioning on datetime columns is now properly validated as part of schema validation.

    • Use Python 3.7 and dbt 0.19.1 in Composer K8s Operator

    • Add Dataflow example task: With the introduction of "regular" Airflow operators in the YAML config, it is now possible to run compute-intensive Dataflow jobs. Check example_dataflow_template for an example of how to implement a Dataflow pipeline.

    opened by dinigo
Releases(v0.3.0)
  • v0.3.0(Aug 17, 2021)

  • v0.2.0(Mar 30, 2021)

Owner
Datatonic
We accelerate business impact through Machine Learning and Analytics