Precision Medicine Knowledge Graph (PrimeKG)

Overview

Precision Medicine Knowledge Graph (PrimeKG) presents a holistic view of diseases. PrimeKG integrates 20 high-quality biomedical resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, considerably expanding previous efforts in disease-rooted knowledge graphs. We accompany PrimeKG’s graph structure with text descriptions of clinical guidelines for drugs and diseases to enable multimodal analyses.

Unique Features of PrimeKG

  • Diverse coverage of diseases: PrimeKG contains over 17,000 diseases, including rare diseases. Disease nodes in PrimeKG are densely connected to other nodes in the graph and have been optimized for clinical relevance in downstream precision medicine tasks.
  • Heterogeneous knowledge graph: PrimeKG contains over 100,000 nodes distributed over various biological scales as depicted below. PrimeKG also contains over 4 million relationships between these nodes distributed over 29 types of edges.
  • Multimodal integration of clinical knowledge: Disease and drug nodes in PrimeKG are augmented with clinical descriptors that come from medical authorities such as Mayo Clinic, Orphanet, and DrugBank, among others.
  • Ready-to-use datasets: PrimeKG is minimally dependent on external packages. Our knowledge graph can be retrieved in a ready-to-use format from Harvard Dataverse.
  • Data functions: PrimeKG provides extensive data functions, including processors for primary resources and scripts to build an updated knowledge graph.
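
As a rough illustration of what "heterogeneous" means in practice, the sketch below tallies edge types and node types from an edge-list CSV using only the Python standard library. The column names (`relation`, `x_type`, `x_name`, `y_type`, `y_name`) and the three sample rows are assumptions made for this example; check the header of the file you download before relying on them.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for the real edge list.
# Column names are assumed for illustration, not taken from the release.
SAMPLE = """relation,x_type,x_name,y_type,y_name
disease_protein,disease,disease A,gene/protein,gene 1
drug_protein,drug,drug X,gene/protein,gene 1
indication,drug,drug X,disease,disease A
"""

def summarize_edges(handle):
    """Count edges per relation type and distinct nodes per node type."""
    relation_counts = Counter()
    nodes = set()
    for row in csv.DictReader(handle):
        relation_counts[row["relation"]] += 1
        # A node is identified here by its (type, name) pair.
        nodes.add((row["x_type"], row["x_name"]))
        nodes.add((row["y_type"], row["y_name"]))
    node_counts = Counter(node_type for node_type, _ in nodes)
    return relation_counts, node_counts

relation_counts, node_counts = summarize_edges(io.StringIO(SAMPLE))
print(dict(relation_counts))
print(dict(node_counts))
```

To run the same summary on a downloaded file, pass an open file handle (for example, one opened with `newline=""`) instead of the in-memory sample.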

[Figure: overview of PrimeKG]

[Figure: PrimeKG example]

Environment setup

Using pip

To install the dependencies required to run the PrimeKG code, use pip:

pip install -r requirements.txt

Using conda

conda env create --name PrimeKG --file=environments.yml

Building an updated PrimeKG

Downloading primary data resources

All persistent identifiers and weblinks needed to download the 20 primary data resources used to build PrimeKG are provided in the Data Records section of our article. We also list the exact filenames downloaded from each resource so that they can be verified easily.

Curating primary data resources

We provide the scripts used to process all primary data resources and the names of the resulting output files generated by those scripts. We would be happy to share the intermediate processing datasets that were used to create PrimeKG on request.

Database | Processing scripts | Expected script output
Bgee | bgee.py | anatomy_gene.csv
Comparative Toxicogenomics Database | ctd.py | exposure_data.csv
DisGeNET | - | curated_gene_disease_associations.tsv
DrugBank | drugbank_drug_drug.py | drug_drug.csv
DrugBank | parsexml_drugbank.ipynb, Parsed_feature.ipynb | 12 drug feature files
DrugBank | drugbank_drug_protein.py | drug_protein.csv
Drug Central | drugcentral_queries.txt | drug_disease.csv
Drug Central | drugcentral_feature.Rmd | dc_features.csv
Entrez Gene | ncbigene.py | protein_go_associations.csv
Gene Ontology | go.py | go_terms_info.csv, go_terms_relations.csv
Human Phenotype Ontology | hpo.py, hpo_obo_parser.py | hp_terms.csv, hp_parents.csv, hp_references.csv
Human Phenotype Ontology | hpoa.py | disease_phenotype_pos.csv, disease_phenotype_neg.csv
MONDO | mondo.py, mondo_obo_parser.py | mondo_terms.csv, mondo_parents.csv, mondo_references.csv, mondo_subsets.csv, mondo_definitions.csv
Reactome | reactome.py | reactome_ncbi.csv, reactome_terms.csv, reactome_relations.csv
SIDER | sider.py | sider.csv
UBERON | uberon.py | uberon_terms.csv, uberon_rels.csv, uberon_is_a.csv
UMLS | umls.py, map_umls_mondo.py | umls_mondo.csv
UMLS | umls.ipynb | umls_def_disorder_2021.csv, umls_def_disease_2021.csv
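
Before moving on to the harmonization step, it can help to confirm that the processing scripts actually produced their expected files. The sketch below is one possible check, not part of the PrimeKG codebase: the function name `missing_outputs`, the idea of a single output directory, and the subset of scripts listed are all assumptions for illustration. The filenames themselves are copied from the table above.

```python
import tempfile
from pathlib import Path

# A few script -> output pairs copied from the table above; extend as needed.
EXPECTED_OUTPUTS = {
    "bgee.py": ["anatomy_gene.csv"],
    "ctd.py": ["exposure_data.csv"],
    "go.py": ["go_terms_info.csv", "go_terms_relations.csv"],
    "sider.py": ["sider.csv"],
}

def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Return {script: [missing files]} for scripts whose outputs are absent."""
    output_dir = Path(output_dir)
    missing = {}
    for script, filenames in expected.items():
        absent = [name for name in filenames if not (output_dir / name).is_file()]
        if absent:
            missing[script] = absent
    return missing

# Demonstration on a temporary directory holding only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("anatomy_gene.csv", "go_terms_info.csv"):
        (Path(tmp) / name).touch()
    report = missing_outputs(tmp)
    print(report)
```

An empty report means every listed output was found; any script that appears in the report should be rerun before building the graph.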

Harmonizing datasets into PrimeKG

The code to harmonize datasets and construct PrimeKG is available in build_graph.ipynb. Run this Jupyter notebook to construct the knowledge graph from the outputs of the processing scripts listed above. The notebook produces all three versions of PrimeKG: kg_raw.csv, kg_giant.csv, and the complete version, kg.csv.
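
The notebook is the authoritative description of how the three files differ. The name kg_giant.csv suggests a version restricted to the largest connected component of the graph; this README does not confirm that, so treat it as an assumption. Under that assumption, the sketch below shows one generic way to keep only the edges of the largest connected component, using breadth-first search over an undirected view of the edge list. The function name and the toy edges are hypothetical.

```python
from collections import defaultdict, deque

def largest_component_edges(edges):
    """Return the edges that lie in the largest connected component.

    `edges` is a list of (source, relation, target) tuples, treated as undirected.
    """
    adjacency = defaultdict(set)
    for source, _, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component

    # Both endpoints of an edge share a component, so checking one is enough.
    return [edge for edge in edges if edge[0] in best]

# Hypothetical toy edges: one three-node component and one separate pair.
toy_edges = [
    ("drug X", "indication", "disease A"),
    ("disease A", "disease_protein", "gene 1"),
    ("drug Y", "drug_protein", "gene 2"),
]
giant = largest_component_edges(toy_edges)
print(giant)
```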

Feature extraction

The code required to engineer features can be found at engineer_features.ipynb and mapping_mayo.ipynb.

Cite Us

If you find PrimeKG useful, cite our work:

@article{chandak2022building,
  title={Building a knowledge graph to enable precision medicine},
  author={Chandak, Payal and Huang, Kexin and Zitnik, Marinka},
  journal={bioRxiv},
  doi={10.1101/2022.05.01.489928},
  URL={https://www.biorxiv.org/content/early/2022/05/01/2022.05.01.489928},
  year={2022}
}

Data Server

PrimeKG is hosted on Harvard Dataverse under the persistent identifier https://doi.org/10.7910/DVN/IXA7BM. PrimeKG datasets cannot be retrieved while Dataverse is under maintenance. This happens rarely; if a download fails, check the status on the Dataverse website.
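
For scripted access, the dataset metadata can in principle be looked up through the Dataverse native API using the persistent identifier. The sketch below is a minimal example under stated assumptions: the server hostname `dataverse.harvard.edu` and the helper names are not given in this README, and the structure of the returned JSON is not shown here, so inspect the response before relying on any field.

```python
import json
import urllib.parse
import urllib.request

# Assumed hostname for Harvard Dataverse; the DOI comes from the text above.
DATAVERSE_SERVER = "https://dataverse.harvard.edu"
PRIMEKG_DOI = "doi:10.7910/DVN/IXA7BM"

def dataset_metadata_url(server=DATAVERSE_SERVER, persistent_id=PRIMEKG_DOI):
    """Build the native-API URL that returns dataset metadata as JSON."""
    query = urllib.parse.urlencode({"persistentId": persistent_id})
    return f"{server}/api/datasets/:persistentId/?{query}"

def fetch_dataset_metadata(timeout=30):
    """Fetch and decode the metadata; raises URLError if the server is unreachable."""
    with urllib.request.urlopen(dataset_metadata_url(), timeout=timeout) as response:
        return json.load(response)

# Only the URL is built here; call fetch_dataset_metadata() when online.
print(dataset_metadata_url())
```

Wrapping the fetch in a try/except for `urllib.error.URLError` is a simple way to handle the maintenance windows mentioned above.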

License

The PrimeKG codebase is released under the MIT license. For individual dataset usage, please refer to the dataset licenses listed on the website.

Owner
Machine Learning for Medicine and Science @ Harvard