Authors: Ada Madejska, MCDB, UCSB (contact: [email protected]) Nick Noll, UCSB This pipeline takes error-prone Nanopore reads and tries to increase the percentage identity of the results of identifying species with BLAST. The reads in fastq format are put through the pipeline which includes the following steps. 1. Quality control - very short and very long reads (reads that highly deviate from the usual length of the 16S sequence) are dropped. 2. Kmer frequency matrix - make a kmer frequency matrix based on the reads from the quality control step. The value of k can be changed (k=5 or 6 is recommended) 3. UMAP projection and HDBSCAN clustering - the kmer frequency matrix is used to create a UMAP projection. The default parameters for UMAP and HDBSCAN functions have been chosen based on mock dataset but can be changed. 4. Refinement - based on our tests on mock datasets, sometimes reads from different species can cluster together. To prevent that, we include a refinement step based on MSA of Clustal Omega on each cluster. The alignment outputs a guide tree which is used for dividing the cluster into smaller subclusters. The distance threshold can be changed to suit each dataset. 5. Consensus making - lastly, based on the defined clusters, the last step creates a consensus sequence based on majority calling. The direction of the reads is fixed using minimap2, the alignment is performed by MAFFT, and the consensus is created using em_cons. The reads are run through BLASTN to check for identity of each cluster. Software Dependencies: To successfully run the pipeline, certain software need to be installed. 1. Minimap2 - for the consensus making step (https://github.com/lh3/minimap2) 2. MAFFT - for alignment in the consensus making step (https://mafft.cbrc.jp/alignment/software/) 3. EM_CONS - for creating the consensus (http://emboss.sourceforge.net/apps/cvs/emboss/apps/cons.html) 4. NCBIN - for identification of the consensus sequences in the database (https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/) (a 16S database is also required) 5. CLUSTALO - for the refinement step (http://www.clustal.org/omega/) Specifications: This pipeline runs in python3.8.10 and julia v"1.4.1". The following Python libraries are also required: BioPython hdbscan matplotlib pandas sklearn umap Following Julia packages are required: Pkg DataFrames CSV
A pipeline that creates consensus sequences from a Nanopore reads. I
Overview
Python Implementation of Scalable In-Memory Updatable Bitmap Indexing
PyUpBit CS490 Large Scale Data Analytics β Implementation of Updatable Compressed Bitmap Indexing Paper Table of Contents About The Project Usage Cont
Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle.
2019-indian-election-eda Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle. This project is a part of the Cou
Churn prediction with PySpark
It is expected to develop a machine learning model that can predict customers who will leave the company.
Orchest is a browser based IDE for Data Science.
Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you donβt have to. The application is easy to use and can run on your laptop as well
Developed for analyzing the covariance for OrcVIO
about This repo is developed for analyzing the covariance for OrcVIO environment setup platform ubuntu 18.04 using conda conda env create --file envir
Single machine, multiple cards training; mix-precision training; DALI data loader.
Template Script Category Description Category script comparison script train.py, loader.py for single-machine-multiple-cards training train_DP.py, tra
EOD Historical Data Python Library (Unofficial)
EOD Historical Data Python Library (Unofficial) https://eodhistoricaldata.com Installation python3 -m pip install eodhistoricaldata Note Demo API key
VevestaX is an open source Python package for ML Engineers and Data Scientists.
VevestaX Track failed and successful experiments as well as features. VevestaX is an open source Python package for ML Engineers and Data Scientists.
Finding project directories in Python (data science) projects, just like there R rprojroot and here packages
Find relative paths from a project root directory Finding project directories in Python (data science) projects, just like there R here and rprojroot
π Python Flask game that consolidates data from Nasdaq, allowing the user to practice buying and selling stocks.
Web Trader Web Trader is a trading website that consolidates data from Nasdaq, allowing the user to search up the ticker symbol and price of any stock
Hydrogen (or other pure gas phase species) depressurization calculations
HydDown Hydrogen (or other pure gas phase species) depressurization calculations This code is published under an MIT license. Install as simple as: pi
Statistical Analysis π focused on statistical analysis and exploration used on various data sets for personal and professional projects.
Statistical Analysis π This repository focuses on statistical analysis and the exploration used on various data sets for personal and professional pr
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.
Please consider citing the manuscript if you use apricot in your academic work! You can find more thorough documentation here. apricot implements subm
A meta plugin for processing timelapse data timepoint by timepoint in napari
napari-time-slicer A meta plugin for processing timelapse data timepoint by timepoint. It enables a list of napari plugins to process 2D+t or 3D+t dat
Fast, flexible and easy to use probabilistic modelling in Python.
Please consider citing the JMLR-MLOSS Manuscript if you've used pomegranate in your academic work! pomegranate is a package for building probabilistic
PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system
h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark PySpark bindings for the H3 core library. For available functions,
4CAT: Capture and Analysis Toolkit
4CAT: Capture and Analysis Toolkit 4CAT is a research tool that can be used to analyse and process data from online social platforms. Its goal is to m
Udacity-api-reporting-pipeline - Udacity api reporting pipeline
udacity-api-reporting-pipeline In this exercise, you'll use portions of each of
An Indexer that works out-of-the-box when you have less than 100K stored Documents
U100KIndexer An Indexer that works out-of-the-box when you have less than 100K stored Documents. U100K means under 100K. At 100K stored Documents with
Repositori untuk menyimpan material Long Course STMKGxHMGI tentang Geophysical Python for Seismic Data Analysis
Long Course "Geophysical Python for Seismic Data Analysis" Instruktur: Dr.rer.nat. Wiwit Suryanto, M.Si Dipersiapkan oleh: Anang Sahroni Waktu: Sesi 1