Codebase of deep learning models for inferring stability of mRNA molecules

Last update: Dec 29, 2022

Overview

Kaggle OpenVaccine Models

This codebase contains deep learning models for inferring the stability of mRNA molecules. It corresponds to the Kaggle OpenVaccine Challenge and the accompanying manuscript "Predictive models of RNA degradation through dual crowdsourcing", Wayment-Steele et al. (2021) (full citation when available).

The models contained here are:

"Nullrecurrent": A reconstruction of the winning solution by Jiayang Gao. Links to the original notebooks are provided below.

"DegScore-XGBoost": A model based on the original DegScore model and XGBoost.

NB on other historic names for models

The Nullrecurrent model was called the "OV" model in some instances, and the .h5 model files for the Nullrecurrent model are labeled "ov".
The DegScore-XGBoost model was called the "BT" model in Eterna analysis.

Organization

scripts: Python scripts to perform inference.

notebooks: Python notebooks to perform inference.

model_files: Stored .h5 model files used at inference time.

data: Data corresponding to Kaggle challenge and to subsequent tests on mRNAs.

data/Kaggle_RYOS_data

This directory contains the training and test sets in .csv and .json form.

Kaggle_RYOS_trainset_prediction_output_Sep2021.txt contains predictions from the Nullrecurrent code in this repository.

Model MCRMSEs were evaluated by uploading submissions to the Kaggle competition website at https://www.kaggle.com/c/stanford-covid-vaccine.
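
For readers who want a local sanity check before uploading, here is a minimal sketch of MCRMSE (mean columnwise root mean squared error): the RMSE is computed separately for each target column and the results are averaged. The function name mcrmse and the toy numbers are illustrative only; the scores reported for this work come from the Kaggle website, not from this snippet.

```python
import math


def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE.

    Both arguments are lists of rows; each row holds one value per
    target column (hypothetical layout chosen for this sketch).
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    column_rmse = []
    for j in range(n_cols):
        # Sum of squared errors for target column j over all rows.
        squared_error = sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows))
        column_rmse.append(math.sqrt(squared_error / n_rows))
    # Average the per-column RMSE values.
    return sum(column_rmse) / n_cols


# Toy data: two positions, two target columns.
truth = [[1.0, 2.0], [3.0, 4.0]]
pred = [[1.0, 2.0], [3.0, 6.0]]
score = mcrmse(truth, pred)
print(score)
```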

data/mRNA_233x_data

This directory contains the original data and the scripts to reproduce the model analysis from the manuscript.

Because the original formats all differ slightly, the reformat_*.py scripts read in the original files and rewrite each prediction in two forms, "FULL" and "PCR", in the directory formatted_predictions.

"FULL" contains per-nucleotide predictions for all nucleotides. In "PCR", positions outside the region covered by RT-PCR sequencing are set to NaN.

Running python collate_predictions.py reads in all the data and writes all_predictions_233x.csv.
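
The difference between the "FULL" and "PCR" forms described above can be sketched as follows. This is only an illustration: the function name mask_outside_window and the window coordinates are hypothetical, and the reformat_*.py scripts may define the sequenced region differently.

```python
import math


def mask_outside_window(full, start, end):
    """Return a copy of `full` with positions outside [start, end) set to NaN.

    `start` and `end` are 0-based indices of a hypothetical sequenced window.
    """
    return [value if start <= i < end else float("nan")
            for i, value in enumerate(full)]


# "FULL": one prediction per nucleotide (made-up values).
full = [0.5, 0.7, 0.9, 1.1, 0.3]

# "PCR": the same predictions, with everything outside the window masked.
pcr = mask_outside_window(full, 1, 4)
print(pcr)
```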

RegenerateFigure5.ipynb reproduces the final scatterplot comparisons.

posthoc_code_predictions contains predictions from the Nullrecurrent model code contained in this repository. To generate these predictions, use the sequence file in the mRNA_233x_data folder and run the following commands:

python scripts/nullrecurrent_inference.py -d deg_Mg_pH10 -i 233_sequences.txt -o 233x_nullrecurrent_output_Oct2021_deg_Mg_50C.txt,

etc.
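
A minimal sketch of how the full set of commands could be generated, one per degradation condition. It assumes each output file is named after its condition with the prefix shown above; that naming pattern and the helper build_commands are assumptions for illustration, not part of the repository.

```python
# The four condition names accepted by the -d option.
CONDITIONS = ["deg_Mg_pH10", "deg_pH10", "deg_Mg_50C", "deg_50C"]


def build_commands(input_file="233_sequences.txt",
                   prefix="233x_nullrecurrent_output_Oct2021"):
    """Build one inference command (as an argument list) per condition."""
    commands = []
    for condition in CONDITIONS:
        # Assumed naming pattern: <prefix>_<condition>.txt
        output_file = f"{prefix}_{condition}.txt"
        commands.append([
            "python", "scripts/nullrecurrent_inference.py",
            "-d", condition,
            "-i", input_file,
            "-o", output_file,
        ])
    return commands


commands = build_commands()
for command in commands:
    print(" ".join(command))
```

Each argument list can be passed to subprocess.run(command, check=True) from the repository root to actually run the inference.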

Dependencies

Install via pip install -r requirements.txt or conda install --file requirements.txt.

EternaFold, ViennaRNA, and Arnie are not pip-installable; see Setup below.

Setup

Install git-lfs (best to do before git-cloning this KaggleOpenVaccine repo).
Install EternaFold (the nullrecurrent model uses this), available for free noncommercial use here.
Install ViennaRNA (the DegScore-XGBoost model uses this), available here.
Git clone Arnie, which wraps EternaFold in python and allows RNA thermodynamic calculations across many packages. Follow instructions here to link EternaFold to it.
Add the path to this repository as KOV_PATH (so that the scripts can find the stored model files):

export KOV_PATH='/path/to/KaggleOpenVaccine'
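
As a rough illustration of why the variable is needed, a script can resolve the model_files directory from KOV_PATH like this. The helper model_dir is hypothetical; the inference scripts in this repository may read the variable differently.

```python
import os


def model_dir(env=None):
    """Return the model_files directory under the path given by KOV_PATH.

    `env` defaults to the process environment; passing a dict makes the
    function easy to try out without exporting anything.
    """
    env = os.environ if env is None else env
    root = env.get("KOV_PATH")
    if not root:
        raise RuntimeError(
            "KOV_PATH is not set; export it to point at the repository root"
        )
    return os.path.join(root, "model_files")


print(model_dir({"KOV_PATH": "/path/to/KaggleOpenVaccine"}))
```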

Usage

To run the nullrecurrent winning solution on one construct, given in example.txt:

CGC

Run

python scripts/nullrecurrent_inference.py [-d deg] -i example.txt -o predict.txt

where deg is one of the following options:

deg_Mg_pH10
deg_pH10
deg_Mg_50C
deg_50C

Similarly, for the DegScore-XGBoost model:

python scripts/degscore-xgboost_inference.py -i example.txt -o predict.txt

This writes a text file of output predictions to predict.txt:

(Nullrecurrent output)

2.1289976365, 2.650808962, 2.1869660805000004

(DegScore-XGBoost output)

0.2697107, 0.37091506, 0.48528114
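
A minimal sketch for reading such an output back into Python. It assumes the file holds comma-separated values on one line, as displayed above, with one value per nucleotide of the input (three values for CGC); the helper parse_predictions is hypothetical.

```python
def parse_predictions(text):
    """Parse a comma-separated line of predictions into a list of floats."""
    return [float(token) for token in text.split(",") if token.strip()]


sequence = "CGC"
nullrecurrent = parse_predictions("2.1289976365, 2.650808962, 2.1869660805000004")

# Under the one-value-per-nucleotide assumption, the lengths should match.
if len(nullrecurrent) != len(sequence):
    raise ValueError("expected one prediction per nucleotide")

for nucleotide, value in zip(sequence, nullrecurrent):
    print(nucleotide, value)
```

To parse a real run, pass the contents of predict.txt to parse_predictions instead of the literal string.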

A note on energy model versions

The predictions in the Kaggle competition and for the manuscript were performed with EternaFold parameters and CONTRAfold-SE code. The currently available EternaFold code will result in slightly different values. For more on the difference, see the EternaFold README.

Individual Kaggle Solutions

This code is based on the winning solution for the Open Vaccine Kaggle Competition Challenge. The competition can be found here:

https://www.kaggle.com/c/stanford-covid-vaccine/overview

This code is also the supplementary material for the Kaggle Competition Solution Paper. The individual Kaggle writeups for the top solutions that have been featured in that paper can be found in the following table:

Team Name	Team Members	Rank	Link to the solution
Nullrecurrent	Jiayang Gao	1	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189620
Kazuki ** 2	Kazuki Onodera, Kazuki Fujikawa	2	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189709
Striderl	Hanfei Mao	3	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189574
FromTheWheel & Dyed & StoneShop	Gilles Vandewiele, Michele Tinti, Bram Steenwinckel	4	https://www.kaggle.com/group16/covid-19-mrna-4th-place-solution
tito	Takuya Ito	5	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189691
nyanp	Taiga Noumi	6	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189241
One architecture	Shujun He	7	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189564
ishikei	Keiichiro Ishi	8	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/190314
Keep going to be GM	Youhan Lee	9	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189845
Social Distancing Please	Fatih Öztürk, Anthony Chiu, Emin Ozturk	11	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189571
The Machine	Karim Amer, Mohamed Fares	13	https://www.kaggle.com/c/stanford-covid-vaccine/discussion/189585

Comments

HW edits

Changes:

Remove hardcoded paths in scripts

Remove tmp csv output files for nullrecurrent

Rename to reflect model naming in paper "nullrecurrent"

Reorganize example inputs and outputs

Update README

Add requirements file

Opened by HWaymentSteele.

Releases(v1.0)

v1.0(Sep 30, 2022)

Release to accompany Wayment-Steele et al. (2022) "Deep learning models for predicting RNA degradation via dual crowdsourcing".

Owner

Eternagame