PyEmits, a Python package for easy manipulation of time-series data.

Overview

PyEmits is a Python package for easy manipulation of time-series data. Time-series data is common across many domains:

  • Engineering
  • FSI (Financial Services Industry)
  • FMCG (Fast-Moving Consumer Goods)

A data scientist's work consists of:

  • forecasting
  • prediction/simulation
  • data preparation
  • cleansing
  • anomaly detection
  • descriptive data analysis/exploratory data analysis

Each new business unit ends up rebuilding the same wheels again and again:

  1. data pipeline
    1. extraction
    2. transformation
      1. cleansing
      2. feature engineering
      3. remove outliers
      4. AI deployment for prediction and forecasting
    3. writing results back to the database
  2. ml framework
    1. multiple model training
    2. multiple model prediction
    3. kfold validation
    4. anomaly detection
    5. forecasting
    6. deep learning models made easy
    7. ensemble modelling
  3. exploratory data analysis
    1. descriptive data analysis
    2. ...

That's why I created this project (and for fun, haha).

This project is under active development and free to use (Apache 2.0). I am happy to see anyone contribute to advance its features.

Install

pip install pyemits

Feature highlights

  1. Easy training
import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel

X = np.random.randint(1, 100, size=(1000, 10))
y = np.random.randint(1, 100, size=(1000, 1))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()
  2. Accepts neural networks as models
import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

keras_lstm_model = KerasWrapper.from_simple_lstm_model((10, 10), 4)
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

It also keeps flexibility for customized models:

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

model = Sequential()
model.add(LSTM(128,
               activation='softmax',
               input_shape=(10, 10),
               ))
model.add(Dropout(0.1))
model.add(Dense(4))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

keras_lstm_model = KerasWrapper(model, nickname='LSTM')
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

Or attach the model definition via an algorithm config:

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper
from pyemits.common.config_model import KerasSequentialConfig

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

keras_lstm_model = KerasWrapper(nickname='LSTM')
config = KerasSequentialConfig(layer=[LSTM(128,
                                           activation='softmax',
                                           input_shape=(10, 10),
                                           ),
                                      Dropout(0.1),
                                      Dense(4)],
                               compile=dict(loss='mse', optimizer='adam', metrics=['mse']))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model],
                     [config],
                     raw_data_model, 
                     {'fit_config' : [dict(epochs=10, batch_size=32)]})
trainer.fit()

PyTorch and MXNet support is under development; you can leave me a message if you want to contribute.

  3. MultiOutput training
import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, MultiOutputRegTrainer
from pyemits.core.preprocessing.splitting import SlidingWindowSplitter

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

# when using an auto-regressive approach like MultiOutput, set ravel=True
# set ravel=False when using an LSTM, which supports multiple dimensions
splitter = SlidingWindowSplitter(24, 24, ravel=True)
X, y = splitter.split(X, y)

raw_data_model = RegressionDataModel(X,y)
trainer = MultiOutputRegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()
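
For intuition, a window of 24 with a horizon of 24 turns a univariate series into lagged feature/target matrices. Below is a minimal pure-NumPy sketch of what such a split conceptually produces (an illustration only, not the pyemits implementation):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(100)                  # toy univariate series
window, horizon = 24, 24

# each contiguous run of window + horizon points becomes one sample
runs = sliding_window_view(series, window + horizon)
X = runs[:, :window]                     # past 24 points -> features
y = runs[:, window:]                     # next 24 points -> targets
print(X.shape, y.shape)                  # (53, 24) (53, 24)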
  4. Parallel training
    • provides fast training using parallel jobs
    • uses RegTrainer as a base, but adds parallel execution
import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, ParallelRegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = ParallelRegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

Or you can use RegTrainer with multiple models, but it will not run as a parallel job:

import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, RegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()
  5. KFold training
    • KFoldConfig is a global config; it applies to all models
import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, KFoldCVTrainer
from pyemits.common.config_model import KFoldConfig

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = KFoldCVTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model, {'kfold_config':KFoldConfig(n_splits=10)})
trainer.fit()
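
For reference, a hand-rolled equivalent of 10-fold training with scikit-learn (an assumption about the mechanics, not pyemits internals) looks roughly like this:

import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

X = np.random.randint(1, 100, size=(10000, 1)).astype(float)
y = np.random.randint(1, 100, size=(10000,)).astype(float)

scores = []
for train_idx, val_idx in KFold(n_splits=10).split(X):
    model = XGBRegressor()
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))  # R^2 on the held-out fold
print(np.mean(scores))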
  6. Easy prediction
import numpy as np 
from pyemits.core.ml.regression.trainer import RegressionDataModel, RegTrainer
from pyemits.core.ml.regression.predictor import RegPredictor

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

predictor = RegPredictor(trainer.clf_models, 'RegTrainer')
predictor.predict(RegressionDataModel(X))
  7. Forecast at scale
  8. Data Model
from pyemits.common.data_model import RegressionDataModel
import numpy as np
X = np.random.randint(1, 100, size=(1000,10,10))
y = np.random.randint(1, 100, size=(1000, 1))

data_model = RegressionDataModel(X, y)

# attach an attribute directly (internal helper) and read it back
data_model._update_variable('X_shape', (1000, 10, 10))
data_model.X_shape

# or record it as metadata
data_model.add_meta_data('X_shape', (1000, 10, 10))
data_model.meta_data
  9. Anomaly detection (under development)
  10. Evaluation (under development)
    • see module: evaluation
    • backtesting
    • model evaluation
  11. Ensemble (under development)
    • blending
    • stacking
    • voting
    • via the combo package (see the sketch after this list)
      • moa
      • aom
      • average
      • median
      • maximization
  12. IO
    • db connection
    • local
  13. dashboard ???
  14. other miscellaneous features
    • continuous evaluation
    • aggregation
    • dimensionality reduction
    • data profiling (intensive data overview)
  15. to be confirmed
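
As a taste of the planned combo-based ensembling, the combo package already exposes the score-combination functions listed above. A minimal sketch, assuming combo is installed (pip install combo); how pyemits will wrap these is still an open design question:

import numpy as np
from combo.models.score_comb import average, median, maximization, aom, moa

# fake per-model scores: 100 samples x 4 base models
scores = np.random.rand(100, 4)

print(average(scores))             # mean score per sample
print(median(scores))              # median score per sample
print(maximization(scores))        # max score per sample
print(aom(scores, n_buckets=2))    # average of maximum
print(moa(scores, n_buckets=2))    # maximum of average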

References

The following libraries gave me ideas and insight:

  1. greykite
    1. changepoint detection
    2. model summary
    3. seasonality
  2. pytorch-forecasting
  3. darts
  4. pyaf
  5. orbit
  6. Kats / Prophet by Facebook
  7. sktime
  8. GluonTS
  9. tslearn
  10. pyts
  11. luminaire
  12. tods
  13. autots
  14. pyodds
  15. scikit-hts