Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

Overview

MOF-Water-Affinity-Prediction-

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge Structural Database and the CoRE_MOF 2019 dataset.

Prediction Model

The prediction model determines whether a given MOF is hydrophobic or hydrophilic. It uses a Random Forest model from the XGBoost library through its scikit-learn interface. The model reads a .csv file of training data and then predicts the water affinity of a user-supplied MOF. The user can specify which input parameters the model uses.
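As a rough sketch of this workflow (not the project's own code), the example below trains a random forest on a small .csv and classifies one new MOF. It assumes scikit-learn's RandomForestClassifier as a stand-in for the XGBoost random forest. The column names, refcodes, and values are invented purely for illustration.

```python
import csv
import io

from sklearn.ensemble import RandomForestClassifier

# Toy training data. Every name and number here is made up for illustration;
# the real training set has different columns and real MOF refcodes.
TOY_CSV = """refcode,feature_a,feature_b,feature_c,hydrophilic
TOY001,0.10,0.20,5.0,0
TOY002,0.15,0.10,4.0,0
TOY003,0.20,0.15,6.0,0
TOY004,0.05,0.20,5.5,0
TOY005,0.90,0.80,5.0,1
TOY006,0.85,0.95,4.5,1
TOY007,0.80,0.90,6.5,1
TOY008,0.95,0.85,5.0,1
"""


def load_training_data(csv_text, feature_names, label_name="hydrophilic"):
    """Return (X, y) using only the columns the user selected."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    X = [[float(row[name]) for name in feature_names] for row in rows]
    y = [int(row[label_name]) for row in rows]
    return X, y


def train_model(csv_text, feature_names):
    """Fit a random forest on the selected input parameters."""
    X, y = load_training_data(csv_text, feature_names)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X, y)
    return model


# The user chooses which input parameters to use.
selected = ["feature_a", "feature_b"]
model = train_model(TOY_CSV, selected)

# Classify a user-supplied MOF described by the same selected parameters.
prediction = model.predict([[0.05, 0.10]])[0]
print("hydrophilic" if prediction == 1 else "hydrophobic")
```

Selecting columns by name keeps the choice of input parameters separate from the model itself, so the same training file can be reused with different parameter sets.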

Overfitting/Underfitting

This script investigates how the prediction model’s accuracy and precision vary with the number and combination of inputs. It lets a user compare how different combinations of inputs affect the score and the standard deviation of the model’s results.

It operates by reading in a .csv file of training data containing 13 input parameters. It then generates a list of all possible combinations of input parameters for the lengths specified by the user. For example, if the user asks for lengths 3, 4, and 10, the program generates every combination of 3, 4, and 10 parameters and uses each one as the input set for the model. Each combination goes through the same process as in the prediction model above, and its results are added to a .csv file for later analysis. Finally, a plot with filters is created for visualization.
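The combination step can be sketched with the standard library's itertools.combinations. This is a minimal illustration rather than the script's actual code; the function name input_combinations and the placeholder parameter names are assumptions.

```python
from itertools import combinations


def input_combinations(parameters, lengths):
    """Yield every combination of the parameters for each requested length."""
    for length in lengths:
        yield from combinations(parameters, length)


# Placeholder names standing in for the 13 input parameters.
parameters = [f"param_{i}" for i in range(1, 14)]

combos = list(input_combinations(parameters, [3, 4, 10]))
print(len(combos))  # 286 + 715 + 286 = 1287
```

Because the number of combinations grows quickly with the number of requested lengths, each one implies a separate model fit, so the run time scales with the total count printed above.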

.cif to .csv Converter

To create a training set for the prediction model, a .csv file must be created with all the available data points. This includes the MOFs and their crystallographic data. The data is collected from three different sources: WebCSD, the CoRE_MOF 2019 dataset, and the MOFs’ .cif files. Additional values are then calculated from the information in the .cif files.

The code works by reading a .txt file of refcodes, a folder of .cif files, or both. The refcodes are those assigned to each MOF by the Cambridge Structural Database. It then searches for these refcodes in the CoRE_MOF 2019 dataset and retrieves the crystallographic data attached to them. Additionally, it uses the .cif files of the MOFs to calculate the atomic mass percentage of the metals contained in each MOF. These calculations are stored in columns 14-17, but are treated as one input parameter in the models in an attempt to relate them to each other. It also labels each MOF in the training set as hydrophobic or hydrophilic based on previously collected information from the literature. Finally, it produces a .csv file ready for use in the prediction model.
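The metal mass percentage calculation can be illustrated as follows. This sketch assumes the element counts have already been extracted from a .cif file; the function name, the small atomic-mass table, the set of metals, and the example composition are illustrative assumptions, not the project's implementation.

```python
# Standard atomic weights (g/mol) for the few elements used in this example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999, "Zn": 65.38}

# Assumption: which elements count as metals (only Zn is needed here).
METALS = {"Zn"}


def metal_mass_percentages(element_counts):
    """Return the mass percentage of each metal in the given composition."""
    total_mass = sum(ATOMIC_MASS[el] * n for el, n in element_counts.items())
    return {
        el: 100.0 * ATOMIC_MASS[el] * n / total_mass
        for el, n in element_counts.items()
        if el in METALS
    }


# Illustrative composition: element symbol -> number of atoms.
composition = {"Zn": 4, "O": 13, "C": 24, "H": 12}
percentages = metal_mass_percentages(composition)
print({el: round(pct, 2) for el, pct in percentages.items()})
```

A MOF containing several metals would simply yield one entry per metal, which is one way the values could be spread over several columns.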

.cif folders

Three different folders are used to store .cif files.

  1. cif: these are hydrophobic MOFs received from Dr. Z. Qiao.
  2. manual hydrophobic: these are hydrophobic MOFs collected from the literature.
  3. manual hydrophilic: these are hydrophilic MOFs collected from the literature.

To add additional .cif files:

Add additional .cif files into either the manual hydrophobic folder or the manual hydrophilic folder. Make sure the file names represent the CCDC refcodes (including or excluding the CoRE_MOF 2019 name extensions). Finally, add these refcodes into the .txt file available in each folder so that the .cif files can be read by the cif Reader program.
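Before running the reader, it may help to check that every .cif file in a folder has its refcode listed in the .txt file. The sketch below is a hypothetical helper, not part of the project. It assumes a CSD refcode is six letters optionally followed by two digits, treats a suffix such as _clean as an example of a CoRE_MOF 2019 name extension, and uses made-up file names.

```python
import re
from pathlib import Path

# Six letters, optionally two digits, not followed by another letter or digit.
REFCODE_PATTERN = re.compile(r"^([A-Za-z]{6}(?:\d{2})?)(?![A-Za-z0-9])")


def refcode_from_filename(filename):
    """Extract the refcode from a .cif file name, with or without a suffix."""
    match = REFCODE_PATTERN.match(Path(filename).stem)
    return match.group(1).upper() if match else None


def unlisted_cifs(filenames, listed_refcodes):
    """Return the .cif files whose refcode is missing from the refcode list."""
    listed = {code.strip().upper() for code in listed_refcodes}
    return sorted(
        name for name in filenames
        if name.lower().endswith(".cif")
        and refcode_from_filename(name) not in listed
    )


# Made-up file names and refcodes, for illustration only.
files = ["ABCDEF_clean.cif", "GHIJKL01.cif", "MNOPQR.cif"]
refcodes_in_txt = ["ABCDEF", "GHIJKL01"]
print(unlisted_cifs(files, refcodes_in_txt))
```

Any file name printed by the check still needs its refcode added to the .txt file in that folder.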

This project is licensed under the terms of the GNU General Public License v3.0
