OntoProtein: Protein Pretraining With Ontology Embedding

Last update: Dec 14, 2022

Overview

OntoProtein

This is the implement of the paper "OntoProtein: Protein Pretraining With Ontology Embedding". OntoProtein is an effective method that make use of structure in GO (Gene Ontology) into text-enhanced protein pre-training model.

Quick links

Overview
Requirements
Data preparation
- Pre-training data
- Downstream task data
Protein pre-training model
Usage for protein-related tasks
Citation

Overview

In this work we present OntoProtein, a knowledge-enhanced protein language model that jointly optimize the KE and MLM objectives, which bring excellent improvements to a wide range of protein tasks. And we introduce ProteinKG25, a new large-scale KG dataset, promting the research on protein language pre-training.

Requirements

To run our code, please install dependency packages for related steps.

Environment for pre-training data generation

python3.8 / biopython 1.37 / goatools

Environment for OntoProtein pre-training

python3.8 / pytorch 1.9 / transformer 4.5.1+ / deepspeed 0.5.1/ lmdb /

Environment for protein-related tasks

python3.8 / pytorch 1.9 / transformer 4.5.1+ / lmdb

Note: environments configurations of some baseline models or methods in our experiments, e.g. BLAST, DeepGraphGO, we provide related links to configurate as follows:

BLAST / Interproscan / DeepGraphGO / GNN-PPI

Data preparation

For pretraining OntoProtein, fine-tuning on protein-related tasks and inference, we provide acquirement approach of related data.

Pre-training data

To incorporate Gene Ontology knowledge into language models and train OntoProtein, we construct ProteinKG25, a large-scale KG dataset with aligned descriptions and protein sequences respectively to GO terms and protein entities. There have two approach to acquire the pre-training data: 1) download our prepared data ProteinKG25, 2) generate your own pre-training data.

Download released data

We have released our prepared data ProteinKG25 in Google Drive.

The whole compressed package includes following files:

go_def.txt: GO term definition, which is text data. We concatenate GO term name and corresponding definition by colon.
go_type.txt: The ontology type which the specific GO term belong to. The index is correponding to GO ID in go2id.txt file.
go2id.txt: The ID mapping of GO terms.
go_go_triplet.txt: GO-GO triplet data. The triplet data constitutes the interior structure of Gene Ontology. The data format is < h r t>, where h and t are respectively head entity and tail entity, both GO term nodes. r is relation between two GO terms, e.g. is_a and part_of.
protein_seq.txt: Protein sequence data. The whole protein sequence data are used as inputs in MLM module and protein representations in KE module.
protein2id.txt: The ID mapping of proteins.
protein_go_train_triplet.txt: Protein-GO triplet data. The triplet data constitutes the exterior structure of Gene Ontology, i.e. Gene annotation. The data format is <h r t>, where h and t are respectively head entity and tail entity. It is different from GO-GO triplet that a triplet in Protein-GO triplet means a specific gene annotation, where the head entity is a specific protein and tail entity is the corresponding GO term, e.g. protein binding function. r is relation between the protein and GO term.
relation2id.txt: The ID mapping of relations. We mix relations in two triplet relation.

Generate your own pre-training data

For generating your own pre-training data, you need download following raw data:

go.obo: the structure data of Gene Ontology. The download link and detailed format see in Gene Ontology`
uniprot_sprot.dat: protein Swiss-Prot database. [link]
goa_uniprot_all.gpa: Gene Annotation data. [link]

When download these raw data, you can excute following script to generate pre-training data:

python tools/gen_onto_protein_data.py

Downstream task data

Our experiments involved with several protein-related downstream tasks. [Download datasets]

Protein pre-training model

You can pre-training your own OntoProtein based above pretraining dataset. We provide the script bash script/run_pretrain.sh to run pre-training. And the detailed arguments are all listed in src/training_args.py, you can set pre-training hyperparameters to your need.

Usage for protein-related tasks

Running examples

The shell files of training and evaluation for every task are provided in script/ , and could directly run.

Also, you can utilize the running codes run_downstream.py , and write your shell files according to your need:

run_downstream.py: support {ss3, ss8, contact, remote_homology, fluorescence, stability} tasks;

Training models

Running shell files: bash script/run_{task}.sh, and the contents of shell files are as follow:

sh run_main.sh \
    --model ./model/ss3/ProtBertModel \
    --output_file ss3-ProtBert \
    --task_name ss3 \
    --do_train True \
    --epoch 5 \
    --optimizer AdamW \
    --per_device_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --eval_step 100 \
    --eval_batchsize 4 \
    --warmup_ratio 0.08 \
    --frozen_bert False

You can set more detailed parameters in run_main.sh. The details of main.sh are as follows:

LR=3e-5
SEED=3
DATA_DIR=data/datasets
OUTPUT_DIR=data/output_data/$TASK_NAME-$SEED-$OI

python run_downstream.py \
  --task_name $TASK_NAME \
  --data_dir $DATA_DIR \
  --do_train $DO_TRAIN \
  --do_predict True \
  --model_name_or_path $MODEL \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $EB \
  --gradient_accumulation_steps $GS \
  --learning_rate $LR \
  --num_train_epochs $EPOCHS \
  --warmup_ratio $WR \
  --logging_steps $ES \
  --eval_steps $ES \
  --output_dir $OUTPUT_DIR \
  --seed $SEED \
  --optimizer $OPTIMIZER \
  --frozen_bert $FROZEN_BERT \
  --mean_output $MEAN_OUTPUT \

Notice: the best checkpoint is saved in OUTPUT_DIR/.

OntoProtein: Protein Pretraining With Ontology Embedding

Related tags

Overview

OntoProtein

Quick links

Overview

Requirements

Environment for pre-training data generation

Environment for OntoProtein pre-training

Environment for protein-related tasks

Data preparation

Pre-training data

Download released data

Generate your own pre-training data

Downstream task data

Protein pre-training model

Usage for protein-related tasks

Running examples

Training models

Owner

ZJUNLP

A pytorch-version implementation codes of paper: "BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation"

Repository for MDPGT

Official implementation of "A Unified Objective for Novel Class Discovery", ICCV2021 (Oral)

[CVPR 2021] 'Searching by Generating: Flexible and Efficient One-Shot NAS with Architecture Generator'

PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition, CVPR 2018

Research code for CVPR 2021 paper "End-to-End Human Pose and Mesh Reconstruction with Transformers"

Predictive AI layer for existing databases.

Most popular metrics used to evaluate object detection algorithms.

Decision Transformer: A brand new Offline RL Pattern

Tensorflow2.0 🍎🍊 is delicious, just eat it! 😋😋

Code for Boundary-Aware Segmentation Network for Mobile and Web Applications

NLP made easy

Code for our CVPR 2021 paper "MetaCam+DSCE"

用强化学习DQN算法，训练AI模型来玩合成大西瓜游戏，提供Keras版本和PARL（paddle）版本

Ganilla - Official Pytorch implementation of GANILLA

The implementation of "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer"

[ICRA 2022] An opensource framework for cooperative detection. Official implementation for OPV2V.

Rendering Point Clouds with Compute Shaders

The Official TensorFlow Implementation for SPatchGAN (ICCV2021)

RIM: Reliable Influence-based Active Learning on Graphs.