Integrate bus data from a variety of sources, using both batch and real-time processing.

Last update: Nov 25, 2021

Overview

Purpose: Integrate bus data from a variety of sources (CSV files, JSON APIs, sensor data, and others) into a relational database, using both batch and real-time processing.

Technique:

Python
Application: Kafka, MQTT Explorer, Grafana, InfluxDB, MS Visual Studio 2019, MS SQL Server, Power BI Desktop
Framework: kafka-python, numpy, paho-mqtt, pandas, pyodbc, pyspark
Database: SQL (MS SQL Server must be installed)
Environment: Windows 10, 64-bit
Editor: cmd

Workflow:

Import raw data offline from CSV and TXT files into the data lake (stored in MS SQL Server) with Python. Then run ETL (Extract, Transform, Load) from the data lake into the data warehouse (DWH) with SSIS (SQL Server Integration Services).
Set up a schedule for the ETL pipeline.
Model and visualize the data from the DWH.
Crawl the online General Transit Feed Specification (GTFS) feed. Convert it from Protobuf to JSON or CSV, then save it to the database with Python and Kafka streaming. Source: https://developer.nationaltransport.ie/
Stream sensor data with paho-mqtt (or kafka-python) and plot it on a Grafana dashboard to show performance.
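
The batch part of the workflow (raw files into the data lake, then ETL into the warehouse) can be sketched roughly as below. This is only an illustration under stated assumptions: the column names, the table names `lake_bus_raw` and `dwh_fact_arrival`, and the sample rows are hypothetical, and the standard-library `sqlite3` module stands in for MS SQL Server so the sketch runs without a server. In the project itself the transform step is done with SSIS, not Python.

```python
import csv
import io
import sqlite3

# Hypothetical sample of a raw bus CSV export; the real columns are not given in the source.
RAW_CSV = """trip_id,route_id,stop_id,arrival_time,delay_seconds
T1,46A,S100,2021-11-25 08:00:00,30
T1,46A,S101,2021-11-25 08:05:00,
T2,39,S200,2021-11-25 08:10:00,-15
"""


def load_raw(conn, csv_text):
    """Extract: copy every CSV row, untouched, into the data-lake staging table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lake_bus_raw ("
        "trip_id TEXT, route_id TEXT, stop_id TEXT, arrival_time TEXT, delay_seconds TEXT)"
    )
    rows = [
        (r["trip_id"], r["route_id"], r["stop_id"], r["arrival_time"], r["delay_seconds"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO lake_bus_raw VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def transform_to_warehouse(conn):
    """Transform and load: cast the delay to an integer and skip rows without a reading."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dwh_fact_arrival ("
        "trip_id TEXT, route_id TEXT, stop_id TEXT, arrival_time TEXT, delay_seconds INTEGER)"
    )
    conn.execute(
        "INSERT INTO dwh_fact_arrival "
        "SELECT trip_id, route_id, stop_id, arrival_time, CAST(delay_seconds AS INTEGER) "
        "FROM lake_bus_raw WHERE delay_seconds <> ''"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the MS SQL Server connection
loaded = load_raw(conn, RAW_CSV)
transform_to_warehouse(conn)
fact_rows = conn.execute(
    "SELECT route_id, delay_seconds FROM dwh_fact_arrival ORDER BY trip_id, stop_id"
).fetchall()
print(loaded, fact_rows)
```

Keeping the staging table as plain text and casting only in the second step mirrors the data lake / data warehouse split described above: the raw rows stay available even when a value fails validation. pyodbc cursors also accept `?` parameter markers and provide `executemany`, so the insert logic should carry over to SQL Server with a different connection object.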

Output:

Data pipeline from the data sources to the target database.
Data stored in the data warehouse for analysis.
Raw data crawled from the online General Transit Feed Specification (GTFS) feed.
Real-time dashboard built on stream processing.
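
For the streaming output, a message handler along the following lines could sit between the broker and the dashboard. It is a minimal sketch, not the project's actual code: the topic layout `bus/sensors/<sensor_id>`, the JSON payload with a `speed_kmh` field, and the window size are all assumptions. The handler uses the same `(client, userdata, msg)` signature as a paho-mqtt `on_message` callback, and the messages are simulated with a stand-in object so the example runs without a broker.

```python
import json
from collections import defaultdict, deque
from types import SimpleNamespace

WINDOW = 3  # number of recent readings kept per sensor (arbitrary choice)
recent = defaultdict(lambda: deque(maxlen=WINDOW))


def on_message(client, userdata, msg):
    """Decode one sensor message and remember its reading.

    Assumes topics shaped like 'bus/sensors/<sensor_id>' and a JSON payload
    with a 'speed_kmh' field; both are hypothetical.
    """
    sensor_id = msg.topic.rsplit("/", 1)[-1]
    reading = json.loads(msg.payload.decode("utf-8"))
    recent[sensor_id].append(float(reading["speed_kmh"]))


def rolling_mean(sensor_id):
    """Mean of the last WINDOW readings, or None if nothing has arrived yet."""
    values = recent[sensor_id]
    return sum(values) / len(values) if values else None


# Simulate four messages from one sensor instead of connecting to a broker.
for speed in (40, 50, 60, 70):
    fake_msg = SimpleNamespace(
        topic="bus/sensors/bus42",
        payload=json.dumps({"speed_kmh": speed}).encode("utf-8"),
    )
    on_message(None, None, fake_msg)

mean_bus42 = rolling_mean("bus42")
print(mean_bus42)
```

With a real broker, the function would be assigned to `client.on_message` on a paho-mqtt client after subscribing to the sensor topics, and the rolling value would be written to the store that Grafana reads from (InfluxDB in the application list above).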

Next Steps:

Analyze the data in the DWH.
Build a real-time dashboard for the raw data crawled from the online GTFS feed.
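
Before the crawled GTFS data can back a dashboard, the nested feed usually has to be flattened into rows. The sketch below shows one way this might look once the Protobuf feed has been converted to JSON. The field names follow the GTFS Realtime `TripUpdate` message (`trip_update`, `stop_time_update`, `arrival.delay`), but the exact JSON casing returned by a given converter or API is an assumption, and the sample feed and output column names are hypothetical.

```python
import csv
import io
import json

# Hypothetical feed snippet; real feeds carry many more entities and fields.
FEED_JSON = """
{
  "header": {"gtfs_realtime_version": "1.0", "timestamp": 1637827200},
  "entity": [
    {"id": "1",
     "trip_update": {
       "trip": {"trip_id": "T1", "route_id": "46A"},
       "stop_time_update": [
         {"stop_sequence": 1, "stop_id": "S100", "arrival": {"delay": 30}},
         {"stop_sequence": 2, "stop_id": "S101", "departure": {"delay": 45}}
       ]}},
    {"id": "2", "vehicle": {"trip": {"trip_id": "T2"}}}
  ]
}
"""

FIELDS = [
    "feed_timestamp", "trip_id", "route_id",
    "stop_sequence", "stop_id", "arrival_delay", "departure_delay",
]


def flatten_trip_updates(feed):
    """Yield one flat dict per stop time update; entities without a trip update are skipped."""
    feed_timestamp = feed.get("header", {}).get("timestamp")
    for entity in feed.get("entity", []):
        update = entity.get("trip_update")
        if update is None:
            continue
        trip = update.get("trip", {})
        for stop in update.get("stop_time_update", []):
            yield {
                "feed_timestamp": feed_timestamp,
                "trip_id": trip.get("trip_id"),
                "route_id": trip.get("route_id"),
                "stop_sequence": stop.get("stop_sequence"),
                "stop_id": stop.get("stop_id"),
                "arrival_delay": stop.get("arrival", {}).get("delay"),
                "departure_delay": stop.get("departure", {}).get("delay"),
            }


def to_csv(rows):
    """Serialize the flat rows as CSV text; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = list(flatten_trip_updates(json.loads(FEED_JSON)))
csv_text = to_csv(rows)
print(csv_text)
```

The same flat dictionaries could be sent as Kafka messages or inserted into a database table instead of being written to CSV; only the final step would change.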
