Uni-Fold: Training your own deep protein-folding models.

This package provides an implementation of a trainable, Transformer-based deep protein folding model. We modified the open-source code of DeepMind AlphaFold v2.0 and added code to train the model from scratch. See the reference and the repository of DeepMind AlphaFold v2.0. To train your own Uni-Fold models, follow the steps below:

1. Install the environment.

Run the following code to install the dependencies of Uni-Fold:

  conda create -n unifold python=3.8.10 -y
  conda activate unifold
  ./install_dependencies.sh

Uni-Fold has been tested with Python 3.8.10, CUDA 11.1 and OpenMPI 4.1.1. We recommend Conda >= 4.10 for installing the environment; older versions of Conda may cause package conflicts.

2. Prepare data before training.

Before training your own folding models, prepare the features and labels of the training proteins. The features mainly include the amino acid sequence, MSAs and templates of each protein. This information should be stored in a pickle file named features.pkl, one per training protein, inside that protein's own directory. Uni-Fold provides scripts that process input FASTA files into these features, relying on several external databases and tools. Labels are CIF files containing the structures of the proteins.

2.1 Datasets and external tools.

Uni-Fold adopts the same data processing pipeline as AlphaFold2. We kept the AlphaFold2 scripts that download the databases used to search for homologous sequences and templates. Use the command

  bash scripts/download_all_data.sh /path/to/database/directory

to download all required databases of Uni-Fold.

If you successfully installed the Conda environment in Section 1, the external tools for searching homologous sequences and templates should already be installed. Alternatively, you can customize the parameters of the feature preparation script to point to your own databases and tools.
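
As a quick sanity check before running the pipeline, the following minimal sketch reports which search tools are not visible on PATH. The function name find_missing_tools is hypothetical, and the default tool list only contains hhblits and jackhmmer because those are the two tools named in this document; your setup may need others.

```python
import shutil


def find_missing_tools(tools=("hhblits", "jackhmmer")):
    """Return the names in `tools` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling find_missing_tools() inside the activated unifold environment should return an empty list if both tools were installed.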

2.2 Run the preparation code.

An example command for running the feature preparation pipeline would be

  python generate_pkl_features.py \
    --fasta_dir ./example_data/fasta \
    --output_dir ./out \
    --data_dir /path/to/database/directory \
    --num_workers 1

This command automatically processes all FASTA files under fasta_dir and dumps the results to output_dir. Note that each FASTA file should contain only one sequence. By default, hhblits and jackhmmer use 4 and 8 CPUs, respectively. You can change these values in unifold/data/tools/hhblits.py and unifold/data/tools/jackhmmer.py.
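
Because each FASTA file should contain only one sequence, it can help to check the input directory first. The sketch below is not part of Uni-Fold; the function names and the accepted file suffixes (.fasta, .fa) are assumptions made for illustration. It counts header lines (those starting with ">") and reports files that do not hold exactly one record.

```python
from pathlib import Path


def count_fasta_records(fasta_path):
    """Count records in a FASTA file by counting lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def find_invalid_fasta_files(fasta_dir, suffixes=(".fasta", ".fa")):
    """Return {file name: record count} for files without exactly one record."""
    invalid = {}
    for path in sorted(Path(fasta_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            n_records = count_fasta_records(path)
            if n_records != 1:
                invalid[path.name] = n_records
    return invalid
```

For example, find_invalid_fasta_files("./example_data/fasta") returns an empty dictionary when every file holds a single sequence.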

2.3 Organize your training data.

Uni-Fold uses the class DataSystem to automatically sample and load the training entries. For this to work, the training data must be organized as follows. Create two directories, one with input features (features.pkl files, referred to as features_dir) and the other with labels (*.cif files, referred to as mmcif_dir). Files in the feature directory should be named [pdb_id]_[model_id]_[chain_id]/features.pkl, and files in the label directory should be named [pdb_id].cif. Make sure that every protein used for training has a corresponding label. See ./example_data/features and ./example_data/mmcif for examples of features_dir and mmcif_dir.
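
The requirement that every training protein has a label can be checked with a short script. This is a minimal sketch rather than Uni-Fold code: the function name is hypothetical, and it assumes that feature directories are named [pdb_id]_[model_id]_[chain_id] (the pattern given in Section 3.1) and that labels are named [pdb_id].cif.

```python
from pathlib import Path


def find_missing_labels(features_dir, mmcif_dir):
    """Return feature entries whose features.pkl or label CIF file is missing."""
    cif_ids = {p.stem.lower() for p in Path(mmcif_dir).glob("*.cif")}
    missing = []
    for entry in sorted(Path(features_dir).iterdir()):
        if not entry.is_dir():
            continue
        # Assumption: the PDB ID is the part of the name before the first "_".
        pdb_id = entry.name.split("_")[0].lower()
        if not (entry / "features.pkl").is_file() or pdb_id not in cif_ids:
            missing.append(entry.name)
    return missing
```

An empty result from find_missing_labels("./example_data/features", "./example_data/mmcif") would indicate that the example data is consistent under these assumptions.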

3. Train Uni-Fold.

3.1 Configuration.

Before training, make sure the code is configured correctly. Modify the training configurations in unifold/train/train_config.py; the default configurations that reproduce AlphaFold are annotated in that script. Specifically, modify the data setup:

"data": {
  "train": {
    "features_dir": "where/training/protein/features/are/stored/",
    "mmcif_dir": "where/training/mmcif/files/are/stored/",
    "sample_weights": "which/specifies/proteins/for/training.json"
  },
  "eval": {
    "features_dir": "where/validation/protein/features/are/stored/",
    "mmcif_dir": "where/validation/mmcif/files/are/stored/",
    "sample_weights": "which/specifies/proteins/for/validation.json"
  }
}

The specified data should be contained in two folders, namely a features_dir and an mmcif_dir. The organization of the two directories is described in Section 2.3. If you want to train on a subset of the data under these directories, or assign customized sample weights to each protein, write a JSON file and pass its path to sample_weights. This is optional: if you leave it as None, the program will attempt to use all entries under features_dir with uniform weights. The JSON file should be a dictionary that maps the basename of each protein feature directory ([pdb_id]_[model_id]_[chain_id]) to the sample weight of that protein in the training process (integer or float), such as:

{"1am9_1_C": 82, "1amp_1_A": 291, "1aoj_1_A": 60, "1aoz_1_A": 552}

or for uniform sampling, simply using a list of protein entries suffices:

["1am9_1_C", "1amp_1_A", "1aoj_1_A", "1aoz_1_A"]
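
Either form of the sample_weights file can be generated from the contents of features_dir. The sketch below is an illustration under assumptions, not part of Uni-Fold: the function names are hypothetical, and an entry is taken to be any subdirectory that contains a features.pkl file.

```python
import json
from pathlib import Path


def list_feature_entries(features_dir):
    """Return sorted names of subdirectories that contain a features.pkl file."""
    return sorted(
        p.name for p in Path(features_dir).iterdir()
        if p.is_dir() and (p / "features.pkl").is_file()
    )


def write_sample_weights(features_dir, output_json, weights=None):
    """Write a list of entries (uniform sampling) or a dict of entry weights."""
    entries = list_feature_entries(features_dir)
    if weights is None:
        payload = entries  # a plain list suffices for uniform sampling
    else:
        # Keep only weighted entries that actually exist under features_dir.
        payload = {name: weights[name] for name in entries if name in weights}
    with open(output_json, "w") as handle:
        json.dump(payload, handle)
    return payload
```

Passing weights such as {"1am9_1_C": 82, "1amp_1_A": 291} produces the dictionary form; omitting weights produces the list form.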

Users who want to customize their own folding models can also edit the model configurations in unifold/model/config.py.

3.2 Run the training code!

To train the model on a single node without MPI, run

python train.py

You can also train the model using MPI (or workload managers that support MPI, such as PBS or Slurm) by running:

mpirun -n <number_of_gpus> python train.py

Either way, make sure the option use_mpi in unifold/train/train_config.py is configured accordingly.

4. Inference with trained models.

4.1 Inference from features.pkl.

We provide the run_from_pkl.py script to predict protein structures from features.pkl inputs. A demo command would be

python run_from_pkl.py \
  --pickle_dir ./example_data/features \
  --model_names model_2 \
  --model_paths /path/to/model_2.npz \
  --output_dir ./out

or

python run_from_pkl.py \
  --pickle_paths ./example_data/features/1ak0_1_A/features.pkl \
  --model_names model_2 \
  --model_paths /path/to/model_2.npz \
  --output_dir ./out

The command generates, for each input model, the predicted structures of the input features (in PDB format), the running time of each component, and the corresponding residue-wise confidence scores (predicted LDDT, or pLDDT).
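
To summarize the confidence of a predicted structure, one option is to read per-residue scores from the output PDB file. The sketch below assumes that pLDDT is written to the B-factor column of the PDB file, as AlphaFold does; this document does not confirm where Uni-Fold stores the scores, so verify this against your own outputs. The function names are hypothetical.

```python
def read_plddt_from_pdb(pdb_path):
    """Return {(chain_id, residue_number): B-factor} taken from CA atoms.

    Assumption: the B-factor column holds the per-residue pLDDT.
    """
    scores = {}
    with open(pdb_path) as handle:
        for line in handle:
            # Fixed-width PDB columns: atom name 13-16, chain 22,
            # residue number 23-26, B-factor 61-66 (1-indexed).
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                chain_id = line[21]
                residue_number = int(line[22:26])
                scores[(chain_id, residue_number)] = float(line[60:66])
    return scores


def mean_plddt(scores):
    """Average the per-residue scores returned by read_plddt_from_pdb."""
    if not scores:
        raise ValueError("no CA atoms found")
    return sum(scores.values()) / len(scores)
```

Using only CA atoms gives one value per residue, which matches the residue-wise nature of the score.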

4.2 Inference from FASTA files.

Essentially, predicting structures from FASTA files takes two steps: generating the pickled features and predicting structures from them. We provide a script, run_from_fasta.py, as a friendlier user interface. An example usage would be

python run_from_fasta.py \
  --fasta_paths ./example_data/fasta/1ak0_1_A.fasta \
  --model_names model_2 \
  --model_paths /path/to/model_2.npz \
  --data_dir /path/to/database/directory \
  --output_dir ./out

4.3 Generate MSA with MMseqs2.

Generating MSAs can take hours and a large amount of memory, especially for long sequences. In such cases, MMseqs2 may be a more efficient option. Once installed, it can be used as follows:

# download and build database
mkdir mmseqs_db && cd mmseqs_db
wget http://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2103.tar.gz
wget http://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz
tar xzvf uniref30_2103.tar.gz
tar xzvf colabfold_envdb_202108.tar.gz
mmseqs tsv2exprofiledb uniref30_2103 uniref30_2103_db
mmseqs tsv2exprofiledb colabfold_envdb_202108 colabfold_envdb_202108_db
mmseqs createindex uniref30_2103_db tmp
mmseqs createindex colabfold_envdb_202108_db tmp
cd ..

# MSA search
./scripts/colabfold_search.sh mmseqs "query.fasta" "mmseqs_db/" "result/" "uniref30_2103_db" "" "colabfold_envdb_202108_db" "1" "0" "1"

5. Changes from AlphaFold to Uni-Fold.

  • We implemented classes and methods for training and inference pipelines by adding scripts under unifold/train and unifold/inference.
  • We added scripts for installing the environment, training and inference.
  • Files under unifold/common, unifold/data and unifold/relax are minimally altered for re-structuring the repository.
  • Files under unifold/model are moderately altered to allow mixed-precision training.
  • We removed scripts that are unused in training the AlphaFold model.

6. License and disclaimer.

6.1 Uni-Fold code license.

Copyright 2021 Beijing DP Technology Co., Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

6.2 Use of third-party software.

Use of the third-party software, libraries or code may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code is subject to any such terms and you should check that you can comply with any applicable restrictions or terms and conditions before use.

6.3 Contributing to Uni-Fold.

Uni-Fold is an ongoing project. Our goal is to design better protein folding models and to apply them in real scenarios. We welcome the community to join us in developing the repository, including but not limited to 1) bug reports and fixes, 2) new features and 3) better interfaces. Please refer to CONTRIBUTING.md for more information.
