Introduction

This repository shows how to integrate Zeppelin with Airflow. The philosophy behind the integration is to make the transition from the development stage to the production stage as smooth as possible.
Zeppelin is good at data pipeline development (Spark, Flink, Hive, Python, Shell, etc.), while Airflow is the de facto standard for job orchestration.

How to run it

Step 1. Initialize the environment

Run the following commands to initialize the environment. The initialization does two things:

Download Spark, which is used by Zeppelin
Download the Zeppelin Airflow plugins

git clone https://github.com/zjffdu/zeppelin_airflow.git
cd zeppelin_airflow
./init.sh

Step 2. Start Zeppelin + Airflow via docker-compose

docker-compose -f docker-compose-LocalExecutor.yml up -d

Step 3. Use Zeppelin + Airflow

Open http://localhost:8085 for Zeppelin and http://localhost:8083 for Airflow.
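
The containers can take a little while to come up after docker-compose returns. A minimal sketch of a readiness check is shown here, assuming only the two ports listed above. The function names (is_up, wait_for_services, main) and the retry settings are illustrative and are not part of this repository.

```python
import http.client
import time
import urllib.error
import urllib.request

# Ports taken from Step 3; the dictionary keys are just labels.
SERVICES = {
    "Zeppelin": "http://localhost:8085",
    "Airflow": "http://localhost:8083",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers an HTTP request at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a 2xx status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a broken response: not ready yet.
        return False


def wait_for_services(services, probe=is_up, attempts=30, delay=2.0,
                      sleep=time.sleep):
    """Poll each service until all respond or the attempts run out.

    Returns a dict mapping each service name to True or False.
    """
    status = {name: False for name in services}
    for attempt in range(attempts):
        for name, url in services.items():
            if not status[name]:
                status[name] = probe(url)
        if all(status.values()):
            break
        if attempt < attempts - 1:
            sleep(delay)
    return status


def main():
    """Print the reachability of both services."""
    for name, ok in wait_for_services(SERVICES).items():
        print(f"{name}: {'up' if ok else 'not reachable'}")
```

Calling main() after Step 2 prints one line per service. The probe and sleep parameters exist only so the polling loop can be exercised without a running stack.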

There is one DAG, zeppelin_example, in Airflow. This DAG runs three Zeppelin notes:

Python Tutorial/01. IPython Basics
Spark Tutorial/02. Spark Basics Features
Spark Tutorial/03. Spark SQL (PySpark)

You can enable it, and Airflow will then run these Zeppelin notes.

Zeppelin does not actually run these notes directly; instead, it clones each note and runs the cloned copy.
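
The clone-then-run behavior can be illustrated with Zeppelin's notebook REST API. This is only a sketch under stated assumptions: it is not the plugin's actual implementation, the helper names (post_json, clone_and_run) are hypothetical, and the two endpoints used are the ones documented for Zeppelin's notebook REST API, which should be checked against the Zeppelin version started by the compose file.

```python
import json
import urllib.request

ZEPPELIN_URL = "http://localhost:8085"  # port from Step 3


def post_json(url, payload=None):
    """POST an optional JSON payload and return the decoded JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else b""
    request = urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def clone_and_run(note_id, clone_name, base_url=ZEPPELIN_URL, post=post_json):
    """Clone a note, run all paragraphs of the clone, return the clone's id."""
    # POST /api/notebook/{noteId} clones the note; the new id is in "body".
    cloned = post(f"{base_url}/api/notebook/{note_id}", {"name": clone_name})
    cloned_id = cloned["body"]
    # POST /api/notebook/job/{noteId} runs every paragraph of that note.
    post(f"{base_url}/api/notebook/job/{cloned_id}")
    return cloned_id
```

Running the clone rather than the original keeps the note you are developing untouched by scheduled runs. The note id passed to clone_and_run is whatever id Zeppelin assigned to the source note; no real ids are assumed here.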

More features are coming soon, stay tuned.

Owner

Jeff Zhang
