GeneDisco is a benchmark suite for evaluating active learning algorithms for experimental design in drug discovery.

Overview

GeneDisco: A benchmark for active learning in drug discovery

Python version Library version

In vitro cellular experimentation with genetic interventions, using for example CRISPR technologies, is an essential step in early-stage drug discovery and target validation that serves to assess initial hypotheses about causal associations between biological mechanisms and disease pathologies. With billions of potential hypotheses to test, the experimental design space for in vitro genetic experiments is extremely vast, and the available experimental capacity - even at the largest research institutions in the world - pales in relation to the size of this biological hypothesis space.

GeneDisco (published at ICLR-22) is a benchmark suite for evaluating active learning algorithms for experimental design in drug discovery. GeneDisco contains a curated set of multiple publicly available experimental data sets as well as open-source i mplementations of state-of-the-art active learning policies for experimental design and exploration.

GeneDisco ICLR-22 Challenge

Learn more about the GeneDisco challenge for experimental design for optimally exploring the vast genetic intervention space here.

Install

pip install genedisco

Use

How to Run the Full Benchmark Suite?

Experiments (all baselines, acquisition functions, input and target datasets, multiple seeds) included in GeneDisco can be executed sequentially for e.g. acquired batch size 64, 8 cycles and a bayesian_mlp model using:

run_experiments \
  --cache_directory=/path/to/genedisco_cache  \
  --output_directory=/path/to/genedisco_output  \
  --acquisition_batch_size=64  \
  --num_active_learning_cycles=8  \
  --max_num_jobs=1

Results are written to the folder at /path/to/genedisco_cache, and processed datasets will be cached at /path/to/genedisco_cache (please replace both with your desired paths) for faster startup in future invocations.

Note that due to the number of experiments being run by the above command, we recommend execution on a compute cluster.
The GeneDisco codebase also supports execution on slurm compute clusters (the slurm command must be available on the executing node) using the following command and using dependencies in a Python virtualenv available at /path/to/your/virtualenv (please replace with your own virtualenv path):

run_experiments \
  --cache_directory=/path/to/genedisco_cache  \
  --output_directory=/path/to/genedisco_output  \
  --acquisition_batch_size=64  \
  --num_active_learning_cycles=8  \
  --schedule_on_slurm \
  --schedule_children_on_slurm \
  --remote_execution_virtualenv_path=/path/to/your/virtualenv

Other scheduling systems are currently not supported by default.

How to Run A Single Isolated Experiment (One Learning Cycle)?

To run one active learning loop cycle, for example, with the "topuncertain" acquisition function, the "achilles" feature set and the "schmidt_2021_ifng" task, execute the following command:

active_learning_loop  \
    --cache_directory=/path/to/genedisco/genedisco_cache \
    --output_directory=/path/to/genedisco/genedisco_output \
    --model_name="bayesian_mlp" \
    --acquisition_function_name="topuncertain" \
    --acquisition_batch_size=64 \
    --num_active_learning_cycles=8 \
    --feature_set_name="achilles" \
    --dataset_name="schmidt_2021_ifng" 

How to Evaluate a Custom Acquisition Function?

To run a custom acquisition function, set --acquisition_function_name="custom" and --acquisition_function_path to the file path that contains your custom acquisition function.

active_learning_loop  \
    --cache_directory=/path/to/genedisco/genedisco_cache \
    --output_directory=/path/to/genedisco/genedisco_output \
    --model_name="bayesian_mlp" \
    --acquisition_function_name="custom" \
    --acquisition_function_path=/path/to/custom_acquisition_function.py \
    --acquisition_batch_size=64 \
    --num_active_learning_cycles=8 \
    --feature_set_name="achilles" \
    --dataset_name="schmidt_2021_ifng" 

...where "/path/to/custom_acquisition_function.py" contains code for your custom acquisition function corresponding to the BaseBatchAcquisitionFunction interface, e.g.:

import numpy as np
from typing import AnyStr, List
from slingpy import AbstractDataSource
from slingpy.models.abstract_base_model import AbstractBaseModel
from genedisco.active_learning_methods.acquisition_functions.base_acquisition_function import \
    BaseBatchAcquisitionFunction

class RandomBatchAcquisitionFunction(BaseBatchAcquisitionFunction):
    def __call__(self,
                 dataset_x: AbstractDataSource,
                 batch_size: int,
                 available_indices: List[AnyStr], 
                 last_selected_indices: List[AnyStr] = None, 
                 model: AbstractBaseModel = None,
                 temperature: float = 0.9,
                 ) -> List:
        selected = np.random.choice(available_indices, size=batch_size, replace=False)
        return selected

Note that the last class implementing BaseBatchAcquisitionFunction is loaded by GeneDisco if there are multiple valid acquisition functions present in the loaded file.

Citation

Please consider citing, if you reference or use our methodology, code or results in your work:

@inproceedings{mehrjou2022genedisco,
    title={{GeneDisco: A Benchmark for Experimental Design in Drug Discovery}},
    author={Mehrjou, Arash and Soleymani, Ashkan and Jesson, Andrew and Notin, Pascal and Gal, Yarin and Bauer, Stefan and Schwab, Patrick},
    booktitle={{International Conference on Learning Representations (ICLR)}},
    year={2022}
}

License

License

Authors

Patrick Schwab, GlaxoSmithKline plc
Arash Mehrjou, GlaxoSmithKline plc
Andrew Jesson, University of Oxford
Ashkan Soleymani, MIT

Acknowledgements

PS and AM are employees and shareholders of GlaxoSmithKline plc.

Manifold Alignment for Semantically Aligned Style Transfer

Manifold Alignment for Semantically Aligned Style Transfer [Paper] Getting Started MAST has been tested on CentOS 7.6 with python = 3.6. It supports

35 Nov 14, 2022
Distributing reference energies for SMIRNOFF implementations

Warning: This code is currently experimental and under active development. Is it not yet suitable for distribution or use as reference implementation.

Open Force Field Initiative 1 Dec 07, 2021
Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

NeX: Real-time View Synthesis with Neural Basis Expansion Project Page | Video | Paper | COLAB | Shiny Dataset We present NeX, a new approach to novel

538 Jan 09, 2023
Automatic voice-synthetised summaries of latest research papers on arXiv

PaperWhisperer PaperWhisperer is a Python application that keeps you up-to-date with research papers. How? It retrieves the latest articles from arXiv

Valerio Velardo 124 Dec 20, 2022
Removing Inter-Experimental Variability from Functional Data in Systems Neuroscience

Removing Inter-Experimental Variability from Functional Data in Systems Neuroscience This repository is the official implementation of [https://www.bi

Eulerlab 6 Oct 09, 2022
Label Mask for Multi-label Classification

LM-MLC 一种基于完型填空的多标签分类算法 1 前言 本文主要介绍本人在全球人工智能技术创新大赛【赛道一】设计的一种基于完型填空(模板)的多标签分类算法:LM-MLC,该算法拟合能力很强能感知标签关联性,在多个数据集上测试表明该算法与主流算法无显著性差异,在该比赛数据集上的dev效果很好,但是由

52 Nov 20, 2022
GLNet for Memory-Efficient Segmentation of Ultra-High Resolution Images

GLNet for Memory-Efficient Segmentation of Ultra-High Resolution Images Collaborative Global-Local Networks for Memory-Efficient Segmentation of Ultra-

VITA 298 Dec 12, 2022
pytorch implementation of "Contrastive Multiview Coding", "Momentum Contrast for Unsupervised Visual Representation Learning", and "Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination"

Unofficial implementation: MoCo: Momentum Contrast for Unsupervised Visual Representation Learning (Paper) InsDis: Unsupervised Feature Learning via N

Zhiqiang Shen 16 Nov 04, 2020
The Dual Memory is build from a simple CNN for the deep memory and Linear Regression fro the fast Memory

Simple-DMA a simple Dual Memory Architecture for classifications. based on the paper Dual-Memory Deep Learning Architectures for Lifelong Learning of

1 Jan 27, 2022
BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches

BLEND is a mechanism that can efficiently find fuzzy seed matches between sequences to significantly improve the performance and accuracy while reducing the memory space usage of two important applic

SAFARI Research Group at ETH Zurich and Carnegie Mellon University 19 Dec 26, 2022
Independent and minimal implementations of some reinforcement learning algorithms using PyTorch (including PPO, A3C, A2C, ...).

PyTorch RL Minimal Implementations There are implementations of some reinforcement learning algorithms, whose characteristics are as follow: Less pack

Gemini Light 4 Dec 31, 2022
PFENet: Prior Guided Feature Enrichment Network for Few-shot Segmentation (TPAMI).

PFENet This is the implementation of our paper PFENet: Prior Guided Feature Enrichment Network for Few-shot Segmentation that has been accepted to IEE

DV Lab 230 Dec 31, 2022
WSDM‘2022: Knowledge Enhanced Sports Game Summarization

Knowledge Enhanced Sports Game Summarization Cooming Soon! :) Data will be released after approval process. Code will be published once the author of

Jiaan Wang 14 Jul 13, 2022
'Solving the sampling problem of the Sycamore quantum supremacy circuits

solve_sycamore This repo contains data, contraction code, and contraction order for the paper ''Solving the sampling problem of the Sycamore quantum s

Feng Pan 29 Nov 28, 2022
Official PyTorch Implementation of "Self-supervised Auxiliary Learning with Meta-paths for Heterogeneous Graphs". NeurIPS 2020.

Self-supervised Auxiliary Learning with Meta-paths for Heterogeneous Graphs This repository is the implementation of SELAR. Dasol Hwang* , Jinyoung Pa

MLV Lab (Machine Learning and Vision Lab at Korea University) 48 Nov 09, 2022
Churn prediction

Churn-prediction Churn-prediction Data preprocessing:: Label encoder is used to normalize the categorical variable Data Transformation:: For each data

1 Sep 28, 2022
MusicYOLO framework uses the object detection model, YOLOx, to locate notes in the spectrogram.

MusicYOLO MusicYOLO framework uses the object detection model, YOLOX, to locate notes in the spectrogram. Its performance on the ISMIR2014 dataset, MI

Xianke Wang 2 Aug 02, 2022
Parameterized Explainer for Graph Neural Network

PGExplainer This is a Tensorflow implementation of the paper: Parameterized Explainer for Graph Neural Network https://arxiv.org/abs/2011.04573 NeurIP

Dongsheng Luo 89 Dec 12, 2022
A ssl analyzer which could analyzer target domain's certificate.

ssl_analyzer A ssl analyzer which could analyzer target domain's certificate. Analyze the domain name ssl certificate information according to the inp

vincent 17 Dec 12, 2022
A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

CLEVR Dataset Generation This is the code used to generate the CLEVR dataset as described in the paper: CLEVR: A Diagnostic Dataset for Compositional

Facebook Research 503 Jan 04, 2023