DimReductionClustering - Dimensionality Reduction + Clustering + Unsupervised Score Metrics

Overview

Dimensionality Reduction + Clustering + Unsupervised Score Metrics

  1. Introduction
  2. Installation
  3. Usage
  4. Hyperparameters matter
  5. BayesSearch example

1. Introduction

DimReductionClustering is a sklearn estimator that reduces the dimension of your data and then applies an unsupervised clustering algorithm. The quality of the clusters can be assessed with different metrics. The pipeline runs the following steps:

  • Reduce the dimension of the data using UMAP
  • Numerically find the best epsilon parameter for DBSCAN
  • Perform density-based clustering with DBSCAN
  • Estimate cluster quality using the silhouette score or DBCV
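
The last two steps can be sketched with scikit-learn alone. This is a minimal illustration, not the estimator's own code: the `embedding` array is synthetic data standing in for a 2D UMAP output, and `eps` is set by hand here, whereas the estimator finds it numerically.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Stand-in for a 2D UMAP embedding: two well-separated synthetic blobs (assumption).
rng = np.random.default_rng(0)
embedding = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(100, 2)),
])

# eps is fixed by hand for this sketch; min_samples plays the role of min_pts.
labels = DBSCAN(eps=1.0, min_samples=4).fit_predict(embedding)

# DBSCAN marks noise points with -1; leave them out of the count and the score.
clustered = labels != -1
n_clusters = len(set(labels[clustered]))
score = silhouette_score(embedding[clustered], labels[clustered])
print(n_clusters, round(score, 3))
```

Excluding noise points before scoring is a choice made for this sketch; whether the package does the same is not stated here.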

2. Installation

Use the package manager pip to install DimReductionClustering and its dependencies as shown below. Rerun these commands to check for and install updates.

!pip install umap-learn
!pip install git+https://github.com/christopherjenness/DBCV.git

!pip install git+https://github.com/MathieuCayssol/DimReductionClustering.git

3. Usage

Example on MNIST data.

  • Import the data
import numpy as np
from sklearn.model_selection import train_test_split
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Flatten each 28x28 image into a 784-dimensional vector
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1]*x_train.shape[2]))
# Keep a stratified 10% subsample (X, Y) to fit the model
X, X_test, Y, Y_test = train_test_split(x_train, y_train, stratify=y_train, test_size=0.9)
  • Instantiate and fit the model (same interface as a sklearn estimator)
model = DimReductionClustering(n_components=2, min_dist=0.000001, score_metric='silhouette', knn_topk=8, min_pts=4).fit(X)

The epsilon is determined during the fit using the elbow method.

  • Show the 2D plot:
model.display_plotly()

  • Get the score (Silhouette coefficient here)
model.score()

4. Hyperparameters matter

4.1 UMAP (dim reduction)

  • n_neighbors: global/local tradeoff (default: 15; typical range: 2 to 1/4 of the data size)

    → low value (glues small chains, more local)

    → high value (glues big chains, more global)

  • min_dist (0 to 0.99): the minimum distance apart that points are allowed to be in the low dimensional representation. Low values of min_dist result in clumpier embeddings, which can be useful if you are interested in clustering or in finer topological structure. Larger values of min_dist prevent UMAP from packing points together and focus on preserving the broad topological structure instead.

  • n_components: dimension of the low dimensional space, 2 or 3.

  • metric: 'euclidean' by default. For NLP, 'cosine' is a good choice, as infrequent and frequent words will have different magnitudes.
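
To illustrate the remark on the metric, here is a small sketch with made-up vectors: `rare_word` and `frequent_word` are hypothetical embeddings that point in the same direction but differ in magnitude. The Euclidean distance between them is large, while the cosine distance is close to zero.

```python
import numpy as np

def euclidean_distance(u, v):
    """Straight-line distance; sensitive to vector magnitude."""
    return float(np.linalg.norm(u - v))

def cosine_distance(u, v):
    """One minus cosine similarity; depends only on the angle between vectors."""
    return float(1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors: same direction, magnitudes differ by a factor of 10.
rare_word = np.array([1.0, 2.0, 0.0])
frequent_word = 10.0 * rare_word

euclid = euclidean_distance(rare_word, frequent_word)
cosine = cosine_distance(rare_word, frequent_word)
print(euclid, cosine)
```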

4.2 DBSCAN (clustering)

  • min_pts: MinPts ≥ 3. Rule of thumb: 2 × dimension (4 for 2D and 6 for 3D). Use a higher value for noisy data.

  • Epsilon: the maximum distance between two samples for one to be considered in the neighborhood of the other. It is estimated from the k-distance graph: compute the distance from each point to its k-th nearest neighbor, sort the distances in descending order, then find the elbow by orthogonally projecting each point of the graph onto the line joining its first and last points. The optimal epsilon is the y-coordinate of the point (x, y) that maximizes d((x, y), Proj(x, y)).

! There is no epsilon hyperparameter in the implementation, only the k of the k-th nearest neighbor (knn_topk).

  • knn_topk: the k of the k-th nearest neighbor. Between 3 and 20.
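
The elbow search described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the package's implementation: the function names `kth_neighbor_distances` and `elbow_epsilon` are hypothetical, neighbors are found by brute force, and the axes of the k-distance graph are not rescaled.

```python
import numpy as np

def kth_neighbor_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point itself (distance 0).
    return np.sort(dists, axis=1)[:, k]

def elbow_epsilon(k_distances):
    """Return the k-distance of the point farthest from the first-to-last line."""
    y = np.sort(k_distances)[::-1]  # descending order
    x = np.arange(len(y), dtype=float)
    start = np.array([x[0], y[0]])
    end = np.array([x[-1], y[-1]])
    direction = (end - start) / np.linalg.norm(end - start)
    rel = np.column_stack([x, y]) - start
    # Orthogonal projection of each point onto the line, then the residual length.
    proj = np.outer(rel @ direction, direction)
    gaps = np.linalg.norm(rel - proj, axis=1)
    return float(y[np.argmax(gaps)])

# Synthetic 2D points standing in for an embedding (assumption).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),
    rng.normal(5.0, 0.3, size=(100, 2)),
])
k_dist = kth_neighbor_distances(points, k=8)
eps = elbow_epsilon(k_dist)
print(eps)
```

Because the x-axis counts points and the y-axis measures distances, the location of the elbow depends on their relative scale; normalizing both axes before projecting is a possible variation.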

4.3 Score metric
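
As a reminder of what the silhouette coefficient measures, here is a minimal NumPy sketch of its standard definition: for each point, a is the mean distance to the other points of its cluster, b is the smallest mean distance to the points of any other cluster, and the score is (b - a) / max(a, b). The function name `silhouette_coefficient` is hypothetical, and the sketch makes no attempt to handle DBSCAN noise labels.

```python
import numpy as np

def silhouette_coefficient(points, labels):
    """Mean silhouette over all points; labels must contain at least 2 clusters."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a point alone in its cluster scores 0
        # Mean distance to the other members (the self-distance is 0).
        a = dists[i, same].sum() / (n_same - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Two tight pairs of points, far apart from each other.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_coefficient(pts, np.array([0, 0, 1, 1]))
bad = silhouette_coefficient(pts, np.array([0, 1, 0, 1]))
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters; negative values indicate points that sit closer to another cluster than to their own.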

5. BayesSearch example

!pip install scikit-optimize

from skopt.space import Integer
from skopt.space import Real
from skopt.space import Categorical
from skopt.utils import use_named_args
from skopt import BayesSearchCV

search_space = list()
#UMAP Hyperparameters
search_space.append(Integer(5, 200, name='n_neighbors', prior='uniform'))
search_space.append(Real(0.0000001, 0.2, name='min_dist', prior='uniform'))
#Search epsilon with KNN Hyperparameters
search_space.append(Integer(3, 20, name='knn_topk', prior='uniform'))
#DBSCAN Hyperparameters
search_space.append(Integer(4, 15, name='min_pts', prior='uniform'))


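# Map each hyperparameter name to its search dimension
params = {dim.name: dim for dim in search_space}

# Unsupervised setting: the model is fitted and scored on the same samples,
# so the single CV split uses identical train and test indices
train_indices = list(range(X.shape[0]))  # indices for training
test_indices = list(range(X.shape[0]))  # indices for testing

cv = [(train_indices, test_indices)]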

clf = BayesSearchCV(estimator=DimReductionClustering(), search_spaces=params, n_jobs=-1, cv=cv)

clf.fit(X)

clf.best_params_

clf.best_score_