orfipy is a tool written in python/cython to extract ORFs in an extremely fast and flexible manner

Last update: Nov 21, 2022

Overview

Introduction

orfipy is a tool written in python/cython to extract ORFs in an extremely fast and flexible manner. Other popular ORF searching tools are OrfM and getorf. Compared to OrfM and getorf, orfipy provides the most options to fine-tune ORF searches. orfipy uses multiple CPU cores and is particularly fast on data containing many small fasta sequences, such as de-novo transcriptome assemblies. Please read the paper here.

Please cite as: Urminder Singh, Eve Syrkin Wurtele, orfipy: a fast and flexible tool for extracting ORFs, Bioinformatics, 2021;, btab090, https://doi.org/10.1093/bioinformatics/btab090

Installation

Install latest stable version

pip install orfipy

Or install via conda

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

conda create -n orfipy -c bioconda orfipy

Install the development version from source

git clone https://github.com/urmi-21/orfipy.git
cd orfipy
pip install .

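Or install the development version directly with pip (GitHub no longer serves the unauthenticated git:// protocol, so use https)

pip install git+https://github.com/urmi-21/orfipy.git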

Examples

Details of the orfipy algorithm are in the paper. Please go through the SI if you want to know the differences between orfipy and other ORF finder tools, and how to set orfipy parameters to match the output of other tools.

Below are some usage examples for orfipy

To see the full list of options, use the command:

orfipy -h

Input

orfipy version 0.0.3 and above supports sequences in Fasta/Fastq format (orfipy uses pyfastx). Input files can be in .gz format.

Extract ORF sequences and write them to the file orfs.fa

orfipy input.fasta --dna orfs.fa --min 10 --max 10000 --procs 4 --table 1 --outdir orfs_out

Use standard codon table but use only ATG as start codon

orfipy input.fa.gz --dna orfs.fa --start ATG

Note: Users can also provide their own translation table to orfipy, as a .json file, using the --table option. An example of a json file containing a valid translation table is here

See available codon tables

orfipy --show-table

Extract ORFs to a BED file

orfipy input.fasta --bed orfs.bed --min 50 --procs 4
or
orfipy input.fasta --min 50 --procs 4 > orfs.bed
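
The BED output can be loaded back into Python for downstream filtering. The following is a minimal sketch, not part of orfipy: it assumes the six-column layout (sequence name, start, end, ORF description, score, strand) seen in the sample output quoted in a user report further down this page, and it assumes sequence names contain no whitespace. The names BedRecord and parse_bed_line are hypothetical.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: str
    strand: str


def parse_bed_line(line: str) -> BedRecord:
    """Split one whitespace-delimited BED line into its first six fields."""
    # Assumption: columns are name, start, end, description, score, strand.
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedRecord(chrom, int(start), int(end), name, score, strand)


# One line of orfipy BED output, as quoted in a user report on this page.
sample = (
    "region_of_interest\t26\t938\t"
    "ID=region_of_interest_ORF.3;ORF_type=complete;ORF_len=912;"
    "ORF_frame=3;Start:CTG;Stop:TAA\t0\t+"
)
record = parse_bed_line(sample)
# The interval length matches the ORF_len field in the description.
print(record.chrom, record.start, record.end, record.strand, record.end - record.start)
```

Keeping the description as a raw string here is deliberate; it can be split further on ';' when individual fields such as ORF_type are needed.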

Extract ORFs to a BED12 file

Note: Add --include-stop to make orfipy output consistent with the .bed file produced by Transdecoder.Predict.

orfipy testseq.fa --min 100 --bed12 of.bed --partial-5 --partial-3 --include-stop

Extract ORF peptide sequences using the default translation table

orfipy input.fasta --pep orfs_peptides.fa --min 50 --procs 4

API

Users can directly import the ORF search algorithm, written in cython, into their own python code.

>>> import orfipy_core 
>>> seq='ATGCATGACTAGCATCAGCATCAGCAT'
>>> for start,stop,strand,description in orfipy_core.orfs(seq,minlen=3,maxlen=1000):
...     print(start,stop,strand,description)
... 
0 9 + ID=Seq_ORF.1;ORF_type=complete;ORF_len=9;ORF_frame=1;Start:ATG;Stop:TAG

The orfipy_core.orfs function takes the following arguments (defaults in square brackets)

seq [required] Input sequence (str)
name ['Seq'] Name (str)
minlen [0] min length (int)
maxlen [1000000] max length (int)
strand ['b'] Strand to use, (b)oth, (f)wd or (r)ev (char)
starts [['TTG','CTG','ATG']] Start codons to use (list)
stops [['TAA','TAG','TGA']] Stop codons to use (list)
include_stop [False] Include stop codon in ORF (bool)
partial3 [False] Report ORFs without a stop (bool)
partial5 [False] Report ORFs without a start (bool)
between_stops [False] Report ORFs defined as between stops (bool)
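
The tuples returned by orfipy_core.orfs carry coordinates rather than sequences, so a small helper can be used to slice the ORF out of the input. This is a sketch under stated assumptions: coordinates are taken to be 0-based and end-exclusive, as the example output above (0 9 with ORF_len=9) suggests, and minus-strand coordinates are assumed to refer to the forward strand, which this page does not confirm for the API. The helper names are hypothetical and the code does not import orfipy.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orf_sequence(seq: str, start: int, stop: int, strand: str) -> str:
    """Slice an ORF out of seq given (start, stop, strand).

    Assumption: start/stop are 0-based, end-exclusive, forward-strand
    coordinates; '-' strand ORFs are returned reverse-complemented.
    """
    sub = seq[start:stop]
    return reverse_complement(sub) if strand == "-" else sub


seq = "ATGCATGACTAGCATCAGCATCAGCAT"
# Coordinates taken from the API example output above: 0 9 +
print(orf_sequence(seq, 0, 9, "+"))  # ATGCATGAC
```

With real results, the helper would be called once per tuple, for example orf_sequence(seq, start, stop, strand) inside the for loop shown in the API example.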

Comparison with getorf and OrfM

Comparison of orfipy features and performance with getorf and OrfM. Tools were run on different data, and ORFs were written to both nucleotide and peptide Fasta files (fasta), to peptide Fasta only (peptide), and to BED (bed). For details, see the publication and SI.

orfipy is the most flexible and is particularly fast for data containing many smaller fasta sequences, such as de-novo transcriptome assemblies or collections of microbial genomes.
OrfM is fast (faster for Fastq) and uses less memory, but its ORF search options are limited.
getorf is memory efficient but slower and has no Fastq support. It provides some flexibility in ORF searches.

Funding

This work is funded in part by the National Science Foundation award IOS 1546858, "Orphan Genes: An Untapped Genetic Reservoir of Novel Traits". This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562 (Bridges HPC environment through allocations TG-MCB190098 and TG-MCB200123 awarded from XSEDE and HPC Consortium).

Comments

Compatibility with lower-case fasta sequences (A weird bug)

Hello Urminder!

Something strange happens to me when I try to run orfipy with a particular genome.

At the end of the program, it does not return the predicted ORFs.

$ orfipy cdiff.fasta
orfipy version 0.0.3
Using translation table: Standard (transl_table=1)
start: ['TTG', 'CTG', 'ATG']
stop: ['TAA', 'TAG', 'TGA']
Setting chunk size 714 MB. Procs 45
Logs will be saved to: orfipy_cdiff.fasta_out/orfipy_2021_03_04_13_16_06.031643.log
Processing 8597268 bytes
Processed 1 sequences in 0.39 seconds

I tested using prodigal and it works without problems. This is the only genome for which I have this problem, and I can't understand why.

Can you reproduce the error? Best regards! Enzo.

cdiff.fasta.gz

opened by EnzoAndree 7

cannot install from bioconda

I created a new environment to install orfipy, but I still get a conflict error.

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: |
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError:

opened by lijing28101 5

conda installation issue

Hi,

Small thing, when running orfipy on a gzip fastq file I ran into this:

Traceback (most recent call last):
  File "/home/ben/e/orfipy-0.0.2/bin/orfipy", line 11, in <module>
    sys.exit(main())
  File "/home/ben/e/orfipy-0.0.2/lib/python3.8/site-packages/orfipy/__main__.py", line 338, in main
    orfipy.findorfs.main(infile,
  File "/home/ben/e/orfipy-0.0.2/lib/python3.8/site-packages/orfipy/findorfs.py", line 479, in main
    seqs = Fasta(infasta)
  File "/home/ben/e/orfipy-0.0.2/lib/python3.8/site-packages/pyfaidx/__init__.py", line 996, in __init__
    self.faidx = Faidx(
  File "/home/ben/e/orfipy-0.0.2/lib/python3.8/site-packages/pyfaidx/__init__.py", line 354, in __init__
    raise ImportError(
ImportError: BioPython >= 1.73 must be installed to read block gzip files.

I suppose this can be fixed by specifying a biopython version constraint in the conda definition? Thanks.

opened by wwood 4

Overlap ORFs density threshold

Hello, have you developed a way to allow a certain size of overlaps between ORFs in order to maximize the density of longest ORFs along scaffolds?

Thank you very much for this software, by the way; it is very efficient and easy to use.

All the best

Ben

opened by BenjaminGuinet 3
Not all ORFs found?

I'm trying to understand how this tool works.

Here is a sequence of ~1 kb length:

region_of_interest GACTCGGTGCTATGTTCTGAATATTTCTGACTTGCATTTTTAATGGAGATAAAATGAAGCATTTAATACATGACGTAGATGAAGACATGAATGAAACTACAGACAAACTTAACTCTTCTCTCATTCTTCCTTTCAGTAAGGACTATGAGTTCTGTTCAAATGGCGTTTATTTCTATTGTGGAAAGATGGGTTCAGGTAAGACATTTAATTTAATTCGTCATATACTCATAACAGAACGTTTAGGAAATGACTCATATTATGACCAAATCATTATATCAGCAACTTCAGACTCTATGGACTCAACAGCGAAAACATTTATGTCAAAAGTTCAAGCCTCTGTCGTTAAAGTTCCAGACAGTGAACTCATTGAATTTCTTCAACGTTACATTCGACGTAAGAGGAAATATTATGCCATCGTTGAATTTATACAGTCAGGAATGCAAAAGACTTCTGAGGAGATGGAAAGAATTATTGACAAACACCACTTACGTCAGTACTCAGGAGTTTACGATATGAAACGACTGACAAACTACATTCTATCAAAACTTTCAAAATACCCCTTCAAAAAATATCCTTCAAACACTCTGCTCGTTTGCGACGACTTCGCTGGTAAAGGTTTAGTGTCAAAACCAGACTCACCATTAGCTAATATCATTACTAAAGTCAGACATTACCACTTAACTGTAGCAATACTTATGCAAACATGGAGGTTTTTAGCTTTAAACATAAAACGTCTCATAACTGACTTCGTTATCTTTCAAGGTTTCTCACGTTATGATATTGAACTCATTTGGAAACAGTCAGGTATAACATTACCTTTTGAAGAAATTTGGGAAGCATATAAGTCTCTCATCTCTCCTCGTTCATACCTTGAGATTCATATCATGACTAATACCATTAAAGTCAAAAATATTCCATGGGAACGACCAACATTGTTTTAAAGTTTAACCTTCAATTGACTGA

In the IGV genome viewer, where ATG = green bar and stop codon = red, the forward sequence appears thus with 3-frame translation:

I note 10 start sites in frame 3, following the first STOP. Each of these, I thought, would constitute an alternate ORF, all ending in the same downstream STOP

But the output of orfipy is:

$ orfipy new_seq.fa --min 100 --max 100000 --procs 4 | sort -k2,2n
orfipy version 0.0.4
Using translation table: Standard (transl_table=1)
start: ['TTG', 'CTG', 'ATG']
stop: ['TAA', 'TAG', 'TGA']
Setting chunk size 12053 MB. Procs 4
Logs will be saved to: orfipy_new_seq.fa_out/orfipy_2021_09_01_15_35_34.526156.log
Processed 1 sequences in 0.02 seconds

region_of_interest 26 938 ID=region_of_interest_ORF.3;ORF_type=complete;ORF_len=912;ORF_frame=3;Start:CTG;Stop:TAA 0 +
region_of_interest 90 210 ID=region_of_interest_ORF.1;ORF_type=complete;ORF_len=120;ORF_frame=1;Start:ATG;Stop:TAA 0 +
region_of_interest 246 513 ID=region_of_interest_ORF.2;ORF_type=complete;ORF_len=267;ORF_frame=1;Start:ATG;Stop:TGA 0 +
region_of_interest 323 431 ID=region_of_interest_ORF.4;ORF_type=complete;ORF_len=108;ORF_frame=-2;Start:CTG;Stop:TGA 0 -
region_of_interest 532 667 ID=region_of_interest_ORF.5;ORF_type=complete;ORF_len=135;ORF_frame=-3;Start:CTG;Stop:TAG 0 -

Can you explain why orfipy excluded so many potential ORFs here? And is there an option to force it to report them?

opened by krabapple 3
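
For comparison with the expectation described in this report, the brute-force sketch below lists every in-frame start codon upstream of each stop on the forward strand. It is independent of orfipy and makes no claim about how orfipy itself selects ORFs. The function name all_start_orfs is hypothetical, and the default start codon set (ATG only) is an assumption chosen for the illustration.

```python
def all_start_orfs(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA")):
    """Return (start, stop) pairs for every in-frame start before each stop.

    Forward strand only. Coordinates are 0-based and exclude the stop codon.
    Starts that are never followed by an in-frame stop are not reported.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        open_starts = []  # start positions seen since the previous stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in stops:
                # Every pending start shares this stop codon.
                found.extend((s, pos) for s in open_starts)
                open_starts = []
            elif codon in starts:
                open_starts.append(pos)
    return sorted(found)


# Two ATG codons in the same frame share one downstream TAA.
print(all_start_orfs("ATGAAAATGCCCTAA"))  # [(0, 12), (6, 12)]
```

A minimum-length filter can be applied to the returned pairs afterwards, for example by keeping only pairs where stop - start is at least 100.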
Python 3.9 support using conda

Hello Urminder!

Amazing work! I didn't think that Cython could reach such speed. I will keep it in mind for my next projects.

I wanted to report that conda fails to install orfipy when you have Python 3.9 installed. I strongly believe that orfipy should not have problems in Python 3.9.

Do you plan to enable Python 3.9 support in the conda recipe?

Best regards! Enzo.

opened by EnzoAndree 3
ImportError: undefined symbol: PySlice_Adjustindices

Hello @urmi-21 Thanks for the tool. I installed it successfully. However, it does not work. I get the following output for the command:- orfipy -h

Can you please suggest a solution?

opened by VJ-Ulaganathan 1
Is there a tool to update gtf/gff file according to orfipy results?

Hi,

Is there a tool to update a gtf/gff file (generated by stringtie2 or scallop2) according to orfipy results, for example to add splice sites, UTRs, and CDSs to the existing gtf/gff file?

Best, Kun

opened by xiekunwhy 1
Full length ORF

Hi, I am using orfipy. It looks like a great tool for my recent work. Is it possible to get only full-length ORFs and not the partial ORFs? Please let me know if this is possible.

opened by apoosakkannu 1

Raises IndexError if no match along the specified strand is found

You can reproduce the problem by running the following code

import orfipy_core
seq = '''GTATCGCTGGAGTCGGGTGATCTCCACGGAGACTCGAGTGGTCTCTTCTTGCCGGGAGCCGTCTTCGCCGGGGTTTCCTCTACCAGACCAAAGGGCTCTAGGACCCTCTTTTTGGCCTGGAAAACCGCCTTACCGAGGTTTCCGCCCCAAGACTTATCGTCCTGGAGCTTTTCCTGAAACTCGGAATCGGCGTGGTTGTACTTGAGGTAAGGATTATCCCCCGCCTCAAGTAGCTTGTTGTATTCGAGATCGTGCTCTCGCGCGACCTCGTCCGCCTTATTGACGGGCTGGCCTTTATCAAGGCCGTTGAAGGGTCCGAGGTATTTGTACCCCGGAACTACTAGACCGCGTTGGTCCTGTTTTTGCTGGTTAGCCTTAGGCCGAGGCGCACCTATGGGCGATGCACAACAGGGTTCCGACGGAGTGGGCAATGCCTCGGGAGATTGGCATTGCGATTCCCAGTGGATGGGCGACCGAGTCATCACCAAGTCCACCCGAACCTGGGTGCTGCCCAGCTACAACAACCACATCTACAAGGAAATCAACTCCACCGGCAACGGACTCAACGGCAGCGCCTACTTTGGATACAGTACTCCCTGGGGATATTTCGACTTTAACCGCTTCCACAGCCACTGGAGCCCCCGAGATTGGCAGCGACTCATCAACAACCACTGGGGCTTCAGACCCAAGGCCATGCACGTCAAAATCTTCAACATCCAAGTCAAAGAAGTCACCACCCAGGACCAGACCACCACCGTCGCCTACTTTGGATACAGTACTCCCTGGGGATATTTCGACTTTAACCGCTTCCAC'''
orfipy_core.orfs(seq, starts=['ATG'])
orfipy_core.orfs(seq, starts=['ATG'], strand='f')

The statement orfipy_core.orfs(seq, starts=['ATG']) runs without any errors. However, orfipy_core.orfs(seq, starts=['ATG'], strand='f') throws an IndexError. The stack trace is below.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "orfipy/orfipy_core.pyx", line 28, in orfipy_core.orfs
  File "orfipy/orfipy_core.pyx", line 46, in orfipy_core.orfs
IndexError: list index out of range

Kindly look into this. Thanks in advance :)

opened by Prakash2403 1

Run Multiple Codon Table Numbers

Hello, I'm using orfipy for viral detection. I have a range of codon table numbers I'd like to run. Looks like --table only takes a single integer, and the json file only accepts a single . Please correct me if I'm wrong about this! Would be convenient to add an option to search multiple codon tables at the same time. In its current state, I have to run orfipy multiple times and combine the results. Brett

opened by brettyout 0
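
The multi-run workaround described in this request can be scripted. The sketch below builds one orfipy command per codon table using only flags shown in the usage examples on this page (--table, --min, --dna, --outdir). The output file and directory names, the table numbers, and the function name orfipy_commands are hypothetical; check orfipy --show-table for the tables actually available.

```python
import shlex


def orfipy_commands(infile, tables, min_len=50):
    """Build one orfipy command line (as an argument list) per codon table."""
    commands = []
    for table in tables:
        commands.append([
            "orfipy", infile,
            "--table", str(table),
            "--min", str(min_len),
            # Hypothetical naming scheme: one output file and directory per table.
            "--dna", f"orfs_table{table}.fa",
            "--outdir", f"orfs_table{table}_out",
        ])
    return commands


for cmd in orfipy_commands("input.fasta", [1, 4, 11]):
    print(shlex.join(cmd))
    # To actually run orfipy: import subprocess; subprocess.run(cmd, check=True)
```

Writing each run to its own output directory keeps the results separate so they can be combined afterwards.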
Suggestion: deterministic ORF IDs

Right now ORFs are numbered, which causes problems on subsequent runs with slightly changed parameters such as a different minimum length. Maybe encoding the start and stop position would be a better approach so that the order doesn't affect anything. Thanks!
enhancement

opened by Benjamin-Lee 0
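
The suggestion can be approximated on the user side today. This is a minimal sketch of a position-based identifier, assuming the (start, stop, strand) values returned by the API; the function name positional_orf_id and the ID layout are hypothetical and are not an orfipy feature.

```python
def positional_orf_id(seq_name, start, stop, strand):
    """Build an ORF ID from its coordinates instead of a running counter."""
    return f"{seq_name}_{start}_{stop}_{strand}"


# Two runs with different minimum lengths report overlapping sets of ORFs.
run_min_0 = [(0, 9, "+"), (20, 140, "-")]
run_min_100 = [(20, 140, "-")]

ids_a = {positional_orf_id("Seq", *orf) for orf in run_min_0}
ids_b = {positional_orf_id("Seq", *orf) for orf in run_min_100}
# The shared ORF keeps the same ID in both runs.
print(sorted(ids_a & ids_b))  # ['Seq_20_140_-']
```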

Description dictionary

Hello,

I believe that the description of each ORF would be more accessible as a dictionary, instead of a string that is delimited with ;, :, and =. I am able to convert the string into a desirable dictionary

{'ID': '1', 'ORF_type': 'complete', 'ORF_len': '912', 'ORF_frame': '1', 'Start': 'TTG', 'Stop': 'TAA'}

through the following code

import orfipy_core

# mers_sequence is a placeholder for the nucleotide sequence (str) to search
for start, stop, strand, description in orfipy_core.orfs(mers_sequence.upper()):
    descriptions = {}
    for info in description.split(';'):
        if '=' in info:
            # key=value fields such as ORF_type=complete
            name, content = info.split('=', 1)
            if name == 'ID':
                # keep only the ORF number that follows the final '.'
                content = content.rsplit('.', 1)[-1]
        else:
            # key:value fields such as Start:ATG
            name, content = info.split(':', 1)
        descriptions[name] = content

However, this functionality would be more conveniently integrated into the basic code of orfipy.

Thank you, Andrew

enhancement

opened by freiburgermsu 0

Releases(v0.0.4)

v0.0.4(Jul 20, 2021)

Release notes for orfipy release v0.0.4: orfipy now supports soft-masked sequences via the --ignore-case option. Some bugs fixed.
v0.0.3(Dec 31, 2020)
Major changes

Switch to pyfastx from pyfaidx

Index free strategy to iterate over the inputs

Added support for Fastq and gzipped files

Better handle large sequences such as whole chromosomes

Added basic API for python users

Multiple refactors

Overall, improved performance

v0.0.2(Nov 7, 2020)
Added support for logs

Some fixes and code cleanup

v0.0.1(Oct 15, 2020)

First release

orfipy v0.0.1

Owner

Urminder Singh

PhD candidate at Iowa State University
