STonKGs is a Sophisticated Transformer that can be jointly trained on biomedical text and knowledge graphs

Overview

STonKGs

Tests PyPI PyPI - Python Version PyPI - License Documentation Status DOI Code style: black

STonKGs is a Sophisticated Transformer that can be jointly trained on biomedical text and knowledge graphs. This multimodal Transformer combines structured information from KGs with unstructured text data to learn joint representations. While we demonstrated STonKGs on a biomedical knowledge graph ( i.e., from INDRA), the model can be applied other domains. In the following sections we describe the scripts that are necessary to be run to train the model on any given dataset.

💪 Getting Started

Data Format

Since STonKGs is operating on both text and KG data, it's expected that the respective data files include columns for both modalities. More specifically, the expected data format is a pandas dataframe (or a pickled pandas dataframe for the pre-training script), in which each row is containing one text-triple pair. The following columns are expected:

  • source: Source node in the triple of a given text-triple pair
  • target: Target node in the triple of a given text-triple pair
  • evidence: Text of a given text-triple pair
  • (optional) class: Class label for a given text-triple pair in fine-tuning tasks (does not apply to the pre-training procedure)

Note that both source and target nodes are required to be in the Biological Expression Langauge (BEL) format, more specifically, they need to be contained in the INDRA KG. For more details on the BEL format, see for example the INDRA documentation for BEL processor and PyBEL.

Pre-training STonKGs

Once you have installed STonKGs as a Python package (see below), you can start training the STonKGs on your dataset by running:

$ python3 -m stonkgs.models.stonkgs_pretraining

The configuration of the model can be easily modified by altering the parameters of the pretrain_stonkgs method. The only required argument to be changed is PRETRAINING_PREPROCESSED_POSITIVE_DF_PATH, which should point to your dataset.

Downloading the pre-trained STonKGs model on the INDRA KG

We released the pre-trained STonKGs models on the INDRA KG for possible future adaptations, such as further pre-training on other KGs. Both STonKGs150k as well as STonKGs300k are accessible through Hugging Face's model hub.

The easiest way to download and initialize the pre-trained STonKGs model is to use the from_default_pretrained() class method (with STonKGs150k being the default):

from stonkgs import STonKGsForPreTraining

# Download the model from the model hub and initialize it for pre-training 
# using from_default_pretrained
stonkgs_pretraining = STonKGsForPreTraining.from_default_pretrained()

Alternatively, since our code is based on Hugging Face's transformers package, the pre-trained model can be easily downloaded and initialized using the .from_pretrained() function:

from stonkgs import STonKGsForPreTraining

# Download the model from the model hub and initialize it for pre-training 
# using from_pretrained
stonkgs_pretraining = STonKGsForPreTraining.from_pretrained(
    'stonkgs/stonkgs-150k',
)

Extracting Embeddings

The learned embeddings of the pre-trained STonKGs models (or your own STonKGs variants) can be extracted in two simple steps. First, a given dataset with text-triple pairs (a pandas DataFrame, see Data Format) needs to be preprocessed using the preprocess_file_for_embeddings function. Then, one can obtain the learned embeddings using the preprocessed data and the get_stonkgs_embeddings function:

import pandas as pd

from stonkgs import get_stonkgs_embeddings, preprocess_df_for_embeddings

# Generate some example data
# Note that the evidence sentences are typically longer than in this example data
rows = [
    [
        "p(HGNC:1748 ! CDH1)",
        "p(HGNC:2515 ! CTNND1)",
        "Some example sentence about CDH1 and CTNND1.",
    ],
    [
        "p(HGNC:6871 ! MAPK1)",
        "p(HGNC:6018 ! IL6)",
        "Another example about some interaction between MAPK and IL6.",
    ],
    [
        "p(HGNC:3229 ! EGF)",
        "p(HGNC:4066 ! GAB1)",
        "One last example in which Gab1 and EGF are mentioned.",
    ],
]
example_df = pd.DataFrame(rows, columns=["source", "target", "evidence"])

# 1. Preprocess the text-triple data for embedding extraction
preprocessed_df_for_embeddings = preprocess_df_for_embeddings(example_df)

# 2. Extract the embeddings 
embedding_df = get_stonkgs_embeddings(preprocessed_df_for_embeddings)

Fine-tuning STonKGs

The most straightforward way of fine-tuning STonKGs on the original six classfication tasks is to run the fine-tuning script (note that this script assumes that you have a mlflow logger specified, e.g. using the --logging_dir argument):

$ python3 -m stonkgs.models.stonkgs_finetuning

Moreover, using STonKGs for your own fine-tuning tasks (i.e., sequence classification tasks) in your own code is just as easy as initializing the pre-trained model:

from stonkgs import STonKGsForSequenceClassification

# Download the model from the model hub and initialize it for fine-tuning
stonkgs_model_finetuning = STonKGsForSequenceClassification.from_default_pretrained(
    num_labels=number_of_labels_in_your_task,
)

# Initialize a Trainer based on the training dataset
trainer = Trainer(
    model=model,
    args=some_previously_defined_training_args,
    train_dataset=some_previously_defined_finetuning_data,
)

# Fine-tune the model to the moon 
trainer.train()

Using STonKGs for Inference

You can generate new predictions for previously unseen text-triple pairs (as long as the nodes are contained in the INDRA KG) based on either 1) the fine-tuned models used for the benchmark or 2) your own fine-tuned models. In order to do that, you first need to load/initialize the fine-tuned model:

from stonkgs.api import get_species_model, infer

model = get_species_model()

# Next, you want to use that model on your dataframe (consisting of at least source, target
# and evidence columns, see **Data Format**) to generate the class probabilities for each
# text-triple pair belonging to each of the specified classes in the respective fine-tuning task:
example_data = ...

# See Extracting Embeddings for the initialization of the example data
# This returns both the raw (transformers) PredictionOutput as well as the class probabilities 
# for each text-triple pair
raw_results, probabilities = infer(model, example_data)

ProtSTonKGs

It is possible to download the extension of STonKGs, the pre-trained ProtSTonKGs model, and initialize it for further pre-training on text, KG and amino acid sequence data:

from stonkgs import ProtSTonKGsForPreTraining

# Download the model from the model hub and initialize it for pre-training 
# using from_pretrained
protstonkgs_pretraining = ProtSTonKGsForPreTraining.from_pretrained(
    'stonkgs/protstonkgs',
)

Moreover, analogous to STonKGs, ProtSTonKGs can be used for fine-tuning sequence classification tasks as well:

from stonkgs import ProtSTonKGsForSequenceClassification

# Download the model from the model hub and initialize it for fine-tuning
protstonkgs_model_finetuning = ProtSTonKGsForSequenceClassification.from_default_pretrained(
    num_labels=number_of_labels_in_your_task,
)

# Initialize a Trainer based on the training dataset
trainer = Trainer(
    model=model,
    args=some_previously_defined_training_args,
    train_dataset=some_previously_defined_finetuning_data,
)

# Fine-tune the model to the moon 
trainer.train()

⬇️ Installation

The most recent release can be installed from PyPI with:

$ pip install stonkgs

The most recent code and data can be installed directly from GitHub with:

$ pip install git+https://github.com/stonkgs/stonkgs.git

To install in development mode, use the following:

$ git clone git+https://github.com/stonkgs/stonkgs.git
$ cd stonkgs
$ pip install -e .

Warning: Because stellargraph doesn't currently work on Python 3.9, this software can only be installed on Python 3.8.

Artifacts

The pre-trained models are hosted on HuggingFace The fine-tuned models are hosted on the STonKGs community page on Zenodo along with the other artifacts (node2vec embeddings, random walks, etc.)

Acknowledgements

⚖️ License

The code in this package is licensed under the MIT License.

📖 Citation

Balabin H., Hoyt C.T., Birkenbihl C., Gyori B.M., Bachman J.A., Komdaullil A.T., Plöger P.G., Hofmann-Apitius M., Domingo-Fernández D. STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs (2021), bioRxiv, 2021.08.17.456616v1.

🎁 Support

This project has been supported by several organizations (in alphabetical order):

💰 Funding

This project has been funded by the following grants:

Funding Body Program Grant
DARPA Automating Scientific Knowledge Extraction (ASKE) HR00111990009

🍪 Cookiecutter

This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.

🛠️ Development

The final section of the README is for if you want to get involved by making a code contribution.

Testing

After cloning the repository and installing tox with pip install tox, the unit tests in the tests/ folder can be run reproducibly with:

$ tox

Additionally, these tests are automatically re-run with each commit in a GitHub Action.

📦 Making a Release

After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:

$ tox -e finish

This script does the following:

  1. Uses BumpVersion to switch the version number in the setup.cfg and src/stonkgs/version.py to not have the -dev suffix
  2. Packages the code in both a tar archive and a wheel
  3. Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
  4. Push to GitHub. You'll need to make a release going with the commit where the version was bumped.
  5. Bump the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion minor after.
Comments
  • Add default loading function

    Add default loading function

    This PR uses the top-level __init__.py to make imports a bit less verbose and adds a class method that wraps loading from huggingface's hub so users don't need so many constants

    opened by cthoyt 3
  • Add code for using fine-tuned models

    Add code for using fine-tuned models

    • [x] Upload fine-tuned models to Zenodo under CC 0 license each in their own record (@cthoyt)
      • [x] species (https://zenodo.org/record/5205530)
      • [x] location (https://zenodo.org/record/5205553)
      • [x] disease (https://zenodo.org/record/5205592)
      • [x] correct (multiclass; https://zenodo.org/record/5206139)
      • [x] correct (binary; https://zenodo.org/record/5205989)
      • [x] cell line (https://zenodo.org/record/5205915)
    • [x] Automate download of fine-tuned models with PyStow (@cthoyt; see example in 660920d, done in 5dfecde)
    • [x] Show how to load model into python object (@helena-balabin)
    • [x] Show how to make a prediction on a given text triple, similarly to the README example that's on the general pre-trained model (@helena-balabin)
    opened by cthoyt 2
  • Prepare code for black

    Prepare code for black

    black is an automatic code formatter and linter. After all of the flake8 hell, it's a solution we can use to automate most of the pain away. This PR prepares the project for applying black by updating the flake8 config (adding a specific ignore) and adding a tox environment for applying black itself.

    After accepting this PR, do two things in order:

    1. Uncomment the flake8-black plugin in the flake8 environment of the tox.ini
    2. Run tox -e lint
    opened by cthoyt 1
  • Stream, yo

    Stream, yo

    I got a sigkill from lack of memory when I did this on the full COVID19 emmaa model, so I'm switching everything over to iterables and streams rather than building up full pandas dataframes each time.

    opened by cthoyt 0
  • Improve label to identifier mapping

    Improve label to identifier mapping

    Make sure that during inference, the predictions are correctly mapped to any kind of class labels when saving the dataframe. (This requires accessing/saving the id2tag attribute in each of the fine-tuning model classes.)

    opened by helena-balabin 0
Releases(v0.1.5)
Owner
STonKGs
Multimodal Transformers for biomedical text and Knowledge Graph data
STonKGs
LCG T-TEST USING EUCLIDEAN METHOD

This project has been created for statistical usage, purposing for determining ATL takers and nontakers using LCG ttest and Euclidean Method, especially for internal business case in Telkomsel.

2 Jan 21, 2022
Fully featured implementation of Routing Transformer

Routing Transformer A fully featured implementation of Routing Transformer. The paper proposes using k-means to route similar queries / keys into the

Phil Wang 246 Jan 02, 2023
A Python script that compares files in directories

compare-files A Python script that compares files in different directories, this is similar to the command filecmp.cmp(f1, f2). I made this script in

Colvin 1 Oct 15, 2021
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 435 Jan 06, 2023
This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Speech-Backbones This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab. Grad-TTS Official implementation of the Grad-

HUAWEI Noah's Ark Lab 295 Jan 07, 2023
Jarvis is a simple Chatbot with a GUI capable of chatting and retrieving information and daily news from the internet for it's user.

J.A.R.V.I.S Kindly consider starring this repository if you like the program :-) What/Who is J.A.R.V.I.S? J.A.R.V.I.S is an chatbot written that is bu

Epicalable 50 Dec 31, 2022
nlpcommon is a python Open Source Toolkit for text classification.

nlpcommon nlpcommon, Python Text Tool. Guide Feature Install Usage Dataset Contact Cite Reference Feature nlpcommon is a python Open Source

xuming 3 May 29, 2022
A high-level yet extensible library for fast language model tuning via automatic prompt search

ruPrompts ruPrompts is a high-level yet extensible library for fast language model tuning via automatic prompt search, featuring integration with Hugg

Sber AI 37 Dec 07, 2022
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Kundan Krishna 6 Jun 04, 2021
Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Megagon Labs 37 Dec 25, 2022
Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

Shreya Shankar 190 Dec 21, 2022
Sample data associated with the Aurora-BP study

The Aurora-BP Study and Dataset This repository contains sample code, sample data, and explanatory information for working with the Aurora-BP dataset

Microsoft 16 Dec 12, 2022
🌐 Translation microservice powered by AI

Dot Translate 🌐 A microservice for quick and local translation using A.I. This service starts a local webserver used for neural machine translation.

Dot HQ 48 Nov 22, 2022
Huggingface Transformers + Adapters = ❤️

adapter-transformers A friendly fork of HuggingFace's Transformers, adding Adapters to PyTorch language models adapter-transformers is an extension of

AdapterHub 1.2k Jan 09, 2023
端到端的长本文摘要模型(法研杯2020司法摘要赛道)

端到端的长文本摘要模型(法研杯2020司法摘要赛道)

苏剑林(Jianlin Su) 334 Jan 08, 2023
Spokestack is a library that allows a user to easily incorporate a voice interface into any Python application with a focus on embedded systems.

Welcome to Spokestack Python! This library is intended for developing voice interfaces in Python. This can include anything from Raspberry Pi applicat

Spokestack 133 Sep 20, 2022
Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers an

Parv Bhatt 1 Jan 01, 2022
An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations

FantasyBert English | 中文 Introduction An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations. You can imp

Fan 137 Oct 26, 2022
Chinese Pre-Trained Language Models (CPM-LM) Version-I

CPM-Generate 为了促进中文自然语言处理研究的发展,本项目提供了 CPM-LM (2.6B) 模型的文本生成代码,可用于文本生成的本地测试,并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告] 若您想使用CPM-1进行推理,我们建议使用高效推理工具BMI

Tsinghua AI 1.4k Jan 03, 2023
The RWKV Language Model

RWKV-LM We propose the RWKV language model, with alternating time-mix and channel-mix layers: The R, K, V are generated by linear transforms of input,

PENG Bo 877 Jan 05, 2023