t-SNE and hierarchical clustering are popular methods of exploratory data analysis, particularly in biology.

Related tags

Data Analysistreesne
Overview

tree-SNE

t-SNE and hierarchical clustering are popular methods of exploratory data analysis, particularly in biology. Building on recent advances in speeding up t-SNE and obtaining finer-grained structure, we combine the two to create tree-SNE, a hierarchical clustering and visualization algorithm based on stacked one-dimensional t-SNE embeddings. We also introduce alpha-clustering, which recommends the optimal cluster assignment, without foreknowledge of the number of clusters, based off of the cluster stability across multiple scales. We demonstrate the effectiveness of tree-SNE and alpha-clustering on images of handwritten digits, mass cytometry (CyTOF) data from blood cells, and single-cell RNA-sequencing (scRNA-seq) data from retinal cells. Furthermore, to demonstrate the validity of the visualization, we use alpha-clustering to obtain unsupervised clustering results competitive with the state of the art on several image data sets.

ArXiv preprint: https://arxiv.org/abs/2002.05687

Prerequisites

Install Fit-SNE from https://github.com/KlugerLab/FIt-SNE and add the FIt-SNE directory that you cloned to your PYTHONPATH environmental variable. This lets tree-SNE access the Python file used to interface with FIt-SNE. This can be done one of several ways:

  • run export PYTHONPATH="$PYTHONPATH":/path/to/FIt-SNE in your terminal before running your Python script using tree-SNE
  • add export PYTHONPATH="$PYTHONPATH":/path/to/FIt-SNE to your .bash_profile
  • add the line import sys; sys.path.append('/path/to/FIt-SNE/') to your Python script before calling import tree_sne

Also make sure to have Numpy, Scipy, Sklearn, and Matplotlib installed.

We've tested with Python 3.6+.

Test/Example

Run example.py to make sure everything is set up right. This will run tree-SNE on the USPS handwritten digit dataset, run alpha-clustering, calculate the NMI, and display the tree. You can refer to this file for calling conventions. Note the top line adding FIt-SNE to the Python path.

Sample Usage

Assuming you have a 2D Numpy array containing your data in a variable X. To build a tree-SNE plot with 30 layers, cluster on each layer, and determine the optimal clustering via alpha-clustering (note does not require preknowledge of the number of clusters):

from tree_sne import TreeSNE

tree = TreeSNE()
embeddings, layer_clusters, best_clusters = tree.fit(X, n_layers = 30)

The embeddings variable will contain each data point's embedding in each layer, with embeddings.shape of (n_points, n_layers, n_features). For now, n_features will always be 1, as we haven't yet implemented stacked 2D t-SNE embeddings. The variable layer_clusters will contain cluster assignments for each point in each layer of the embedding, and best_clusters will contain optimal cluster assignments for the data.

To display the tree using our code with cluster labels, run:

from display_tree import display_tree_mnist
import numpy as np

display_tree_mnist(embeddings, true_labels = best_clusters, legend_labels = list(np.unique(best_clusters)), distinct = True)

Alternatively, some labels you provide can be used instead of best_clusters. We realize this is messy but until we refactor this is what we have. We're sorry. You don't have to use our display code if you don't want to, and we'll improve it soon.

If your data has more clusters, reduce the conservativeness parameter to TreeSNE. Typical values range from 1 to 2. It should never drop below 1 according to our theory motivation for its implementation, and we've only had to decrease it when trying to find 100 clusters, in which case we set it to 1.3. n_layers and conservativeness are the only two parameters that we think users may want to adjust, at least for the time being. Once we've refactored we'll write more documentation. Note that conservativeness only effects alpha-clustering and does not actually change the tree-SNE embedding itself.

MNIST tree-SNE example plot

Authors

Acknowledgments

The authors thank Stefan Steinerberger for inspiration, support, and advice; George Linderman for enabling one-dimensional t-SNE with degrees of freedom < 1 in the FIt-SNE package; Scott Gigante for data pre-processing and helpful discussions of visualizations and alpha-clustering; Smita Krishnaswamy for encouragement and feedback; and Ariel Jaffe for discussing the Nyström method and its relationship to subsampled spectral clustering.

Owner
Isaac Robinson
Yale computer science and math major interested in entrepreneurship
Isaac Robinson
A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and lo

Coiled 102 Nov 10, 2022
A distributed block-based data storage and compute engine

Nebula is an extremely-fast end-to-end interactive big data analytics solution. Nebula is designed as a high-performance columnar data storage and tabular OLAP engine.

Columns AI 131 Dec 26, 2022
The Master's in Data Science Program run by the Faculty of Mathematics and Information Science

The Master's in Data Science Program run by the Faculty of Mathematics and Information Science is among the first European programs in Data Science and is fully focused on data engineering and data a

Amir Ali 2 Jun 17, 2022
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.

Please consider citing the manuscript if you use apricot in your academic work! You can find more thorough documentation here. apricot implements subm

Jacob Schreiber 457 Dec 20, 2022
AWS Glue ETL Code Samples

AWS Glue ETL Code Samples This repository has samples that demonstrate various aspects of the new AWS Glue service, as well as various AWS Glue utilit

AWS Samples 1.2k Jan 03, 2023
A collection of learning outcomes data analysis using Python and SQL, from DQLab.

Data Analyst with PYTHON Data Analyst berperan dalam menghasilkan analisa data serta mempresentasikan insight untuk membantu proses pengambilan keputu

6 Oct 11, 2022
Python reader for Linked Data in HDF5 files

Linked Data are becoming more popular for user-created metadata in HDF5 files.

The HDF Group 8 May 17, 2022
cLoops2: full stack analysis tool for chromatin interactions

cLoops2: full stack analysis tool for chromatin interactions Introduction cLoops2 is an extension of our previous work, cLoops. From loop-calling base

YaqiangCao 25 Dec 14, 2022
Common bioinformatics database construction

biodb Common bioinformatics database construction 1.taxonomy (Substance classification database) Download the database wget -c https://ftp.ncbi.nlm.ni

sy520 2 Jan 04, 2022
The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

About The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficien

ROOT 2k Dec 29, 2022
A simple and efficient tool to parallelize Pandas operations on all available CPUs

Pandaral·lel Without parallelization With parallelization Installation $ pip install pandarallel [--upgrade] [--user] Requirements On Windows, Pandara

Manu NALEPA 2.8k Dec 31, 2022
Functional tensors for probabilistic programming

Funsor Funsor is a tensor-like library for functions and distributions. See Functional tensors for probabilistic programming for a system description.

208 Dec 29, 2022
Desafio 1 ~ Bantotal

Challenge 01 | Bantotal Please read the instructions for the challenge by selecting your preferred language below: Español Português License Copyright

Maratona Behind the Code 44 Sep 28, 2022
A set of tools to analyse the output from TraDIS analyses

QuaTradis (Quadram TraDis) A set of tools to analyse the output from TraDIS analyses Contents Introduction Installation Required dependencies Bioconda

Quadram Institute Bioscience 2 Feb 16, 2022
yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data.

The yt Project yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data. yt supports structured, varia

The yt project 367 Dec 25, 2022
In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

Mobeen Ahmed 1 Nov 01, 2021
A CLI tool to reduce the friction between data scientists by reducing git conflicts removing notebook metadata and gracefully resolving git conflicts.

databooks is a package for reducing the friction data scientists while using Jupyter notebooks, by reducing the number of git conflicts between different notebooks and assisting in the resolution of

dataroots 86 Dec 25, 2022
Analysiscsv.py for extracting analysis and exporting as CSV

wcc_analysis Lichess page documentation: https://lichess.org/page/world-championships Each WCC has a study, studies are fetched using: https://lichess

32 Apr 25, 2022
Office365 (Microsoft365) audit log analysis tool

Office365 (Microsoft365) audit log analysis tool The header describes it all WHY?? The first line of code was written long time before other colleague

Anatoly 1 Jul 27, 2022
Pandas and Spark DataFrame comparison for humans

DataComPy DataComPy is a package to compare two Pandas DataFrames. Originally started to be something of a replacement for SAS's PROC COMPARE for Pand

Capital One 259 Dec 24, 2022