A tool for batch processing large fasta files and accompanying metadata table to upload to repositories via API

Overview

Fasta Uploader

A tool for batch processing large fasta files and accompanying metadata table to repositories via API

The python fasta_uploader.py script breaks large fasta files (e.g. 500mb) and related (one-to-one) tab-delimited sample contextual data into smaller batches of 1000 or some specified # of records which can then be uploaded to a given sequence repository if an API endpoint is selected. Currently there is one option for the API interface: VirusSeq.

This tool is developed by the SFU Centre for Infectious Disease Epidemiology and One Health in conjunction with VirusSeq and it works well with DataHarmonizer!

Authors: Damion Dooley, Nithu Sara John

Details

Given a fasta file and a sample metadata file with a column that matches to fasta file record identifiers, break both into respective sets of smaller batches of records which are submitted to an API for processing.

Processing is three step:

  1. Construct batches of files. Since only two files are read and parsed in one go, processing of them is reliable after that point, so no further error reporting is needed during the batch file generation process.

    1. Importantly, if rerunning fasta_batch_submit.py, this step will be skipped unless -f --force parameter is run. Currently input files are still required in this case.
  2. IF API option is included, submit each *.queued.fasta batch to API, wait for it to finish or error out (capture error report) and proceed to next batch.

    1. Some types of error trigger sudden death, i.e. sys.exit() because they would also occur in subsequent API batch calls. For example missing tabular data column names will trigger an exit. Once resolved, rerun with -r to force regeneration of output files.
    2. There is an option to just try submitting one of the batches, e.g. the first one, via "-n 0" parameter. This allows error debugging of just the first batch. Once error patterns are determined, those that apply to remaining source contextual data can be applied, and first batch removed from source fasta and contextual data files, and the whole batching can be redone using -r reset, or by manually deleting the output files and rerunning.
  3. The processing status of existing API requests is reported from the API server end. Some may be queued by the API server, others may have been processed successfully, and others may have line-by-line errors in field content that are converted by fasta_uploader.py into new [output file batch.#].queued.fasta and [output batch.#].queued.tsv files which can be edited and then submitted back to the API by rerunning the program with the same command line parameters.

Requires Biopython and Requests modules

  • "pip install biopython"
  • "pip install requests"

Usage

Run the command in a folder with the appropriate input files, and output files can be generated there too. Rerun it in the same folder to incrementally fix any submission errors and then restart submission.

python fasta_uploader.py [options]

Options:

-h, --help
  Show this help message and exit.
-f FASTA_FILE, --fasta=FASTA_FILE
  Provide a fasta file name.
-m METADATA_FILE, --metadata=METADATA_FILE
  Provide a COMMA .csv or TAB .tsv delimited sample contextual data file name.
-b BATCH, --batch=BATCH
  Provide number of fasta records to include in each batch. Default is 1000.
-o OUTPUT_FILE, --output=OUTPUT_FILE
  Provide an output file name/path.
-k KEY_FIELD, --key=KEY_FIELD
  Provide the metadata field name to match to fasta record identifier.
-n BATCH_NUMBER, --number=BATCH_NUMBER
  Process only given batch number to API instead of all batches.

Parameters involved in optional API call:

-a API, --api=API     
  Provide the target API to send data too.  A batch submission job will be initiated for it. Default is "VirusSeq_Portal".
-u API_TOKEN, --user=API_TOKEN
  An API user token is required for API access.
-d, --dev
  Test against a development server rather than live one.  Provide an API endpoint URL.
 -s, --short
  Report up to given # of fasta record related errors for each batch submission.  Useful for taking care of repeated errors first based on first instance.
 -r, --reset
  Regenerate all batch files and begin API resubmission process even if batch files already exist under given output file pattern.

For example:

python fasta_uploader.py -f "consensus_final.fasta" -m "final set 1.csv" -k "fasta header name" -a VirusSeq_Portal -u ENTER_API_KEY_HERE

This will convert consensus_final.fasta and related final set 1.csv contextual data records into batches of 1000 records by default, and will begin submitting each batch to the VirusSeq portal.

python fasta_uploader.py -f "consensus_final.fasta" -m "final set 1.csv" -n 0 -k "fasta header name" -a VirusSeq_Portal -u ENTER_API_KEY_HERE

Like the above but only first batch is submitted so that one can see any errors, and if they apply to all batches, can fix them in original "final set 1.csv" file. Once batch 0 is fixed, all its records can be removed from the consensus_final.fasta and final set 1.csv.csv files, and the whole job can be resubmitted.

Owner
Centre for Infectious Disease and One Health
Hsiao Laboratory at Simon Fraser University
Centre for Infectious Disease and One Health
ZipFly is a zip archive generator based on zipfile.py

ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without m

Buzon 506 Jan 04, 2023
Object-oriented file system path manipulation

path (aka path pie, formerly path.py) implements path objects as first-class entities, allowing common operations on files to be invoked on those path

Jason R. Coombs 1k Dec 28, 2022
This simple python script pcopy reads a list of file names and copies them to a separate folder

pCopy This simple python script pcopy reads a list of file names and copies them to a separate folder. Pre-requisites Python 3 (ver. 3.6) How to use

Madhuranga Rathnayake 0 Sep 03, 2021
Organize the files into the relevant sub-folders

This program can be used to organize files in a directory by their file extension. And move duplicate files to a duplicates folder.

Thushara Thiwanka 2 Dec 15, 2021
Ini adalah program python untuk mengubah background foto dalam 1 folder, tidak perlu satu satu

Myherokuapp my web drive You can see my web drive and can request film/Application do you want in here my blog you can visit my blog RemBg ini adalah

XnuxersXploitXen 13 Dec 01, 2022
Copy only text-like files from the folder

copy-only-text-like-files-from-folder-python copy only text-like files from the folder This project is for those who want to copy only source code or

1 May 17, 2022
Transforme rapidamente seu arquivo CSV (de qualquer tamanho) para SQL de forma rápida.

Transformador de CSV para SQL Transforme rapidamente seu arquivo CSV (de qualquer tamanho) para SQL de forma rápida, e com isso insira seus dados usan

William Rodrigues 4 Oct 17, 2022
LightCSV - This CSV reader is implemented in just pure Python.

LightCSV Simple light CSV reader This CSV reader is implemented in just pure Python. It allows to specify a separator, a quote char and column titles

Jose Rodriguez 6 Mar 05, 2022
This is just a GUI that detects your file's real extension using the filetype module.

Real-file.extnsn This is just a GUI that detects your file's real extension using the filetype module. Requirements Python 3.4 and above filetype modu

1 Aug 08, 2021
Swiss army knife for Apple's .tbd file manipulation

Description Inspired by tbdswizzler, this simple python tool for manipulating Apple's .tbd format. Installation python3 -m pip install --user -U pytbd

10 Aug 31, 2022
Uncompress DEFLATE streams in pure Python

stream-inflate Uncompress DEFLATE streams in pure Python. Installation pip install stream-inflate Usage from stream_inflate import stream_inflate impo

Michal Charemza 7 Oct 13, 2022
Python function to stream unzip all the files in a ZIP archive: without loading the entire ZIP file or any of its files into memory at once

Python function to stream unzip all the files in a ZIP archive: without loading the entire ZIP file or any of its files into memory at once

Department for International Trade 206 Jan 02, 2023
CredSweeper is a tool to detect credentials in any directories or files.

CredSweeper is a tool to detect credentials in any directories or files. CredSweeper could help users to detect unwanted exposure of credentials (such as personal information, token, passwords, api k

Samsung 54 Dec 13, 2022
Python library for reading and writing tabular data via streams.

tabulator-py A library for reading and writing tabular data (csv/xls/json/etc). [Important Notice] We have released Frictionless Framework. This frame

Frictionless Data 231 Dec 09, 2022
Python package to read and display segregated file names present in a directory based on type of the file

tpyfilestructure Python package to read and display segregated file names present in a directory based on type of the file. Installation You can insta

Tharun Kumar T 2 Nov 28, 2021
Small-File-Explorer - I coded a small file explorer with several options

Petit explorateur de fichier / Small file explorer Pour la première option (création de répertoire) / For the first option (creation of a directory) e

Xerox 1 Jan 03, 2022
This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it

This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it. In the current state, it outputs a PrettyTable to txt file as

Joshua Wren 1 Nov 09, 2021
Uproot is a library for reading and writing ROOT files in pure Python and NumPy.

Uproot is a library for reading and writing ROOT files in pure Python and NumPy. Unlike the standard C++ ROOT implementation, Uproot is only an I/O li

Scikit-HEP Project 164 Dec 31, 2022
Maltego transforms to pivot between PE files based on their VirusTotal codeblocks

VirusTotal Codeblocks Maltego Transforms Introduction These Maltego transforms allow you to pivot between different PE files based on codeblocks they

Ariel Jungheit 18 Feb 03, 2022
Find potentially sensitive files

find_files Find potentially sensitive files This script searchs for potentially sensitive files based off of file name or string contained in the file

4 Aug 20, 2022