Analysis of Antarctica sequencing samples contaminated with SARS-CoV-2

Overview

Analysis of SARS-CoV-2 reads in sequencing of 2018-2019 Antarctica samples in PRJNA692319

The samples analyzed here are described in this preprint, which is a pre-print by Istvan Csabai and co-workers that describes SARS-CoV-2 reads in samples from Antarctica sequencing in China. I was originally alerted to the pre-print by Carl Zimmer on Dec-23-2021. Istvan Csabai and coworkers subsequently posted a second pre-print that also analyzes the host reads.

Repeating key parts of the analysis

The code in this repo independently repeats some of the analyses.

To run the analysis, build the conda environment in environment.yml and then run the analysis using Snakefile. To do this on the Hutch cluster, using run.bash:

sbatch -c 16 run.bash

The results are placed in the ./results/ subdirectory. Most of the results files are not tracked due to file-size limitations, but the following key files are tracked:

  • results/alignment_counts.csv gives the number of reads aligning to SARS-CoV-2 for each sample. This confirms that three accessions (SRR13441704, SRR13441705, and SRR13441708) have most of the SARS-CoV-2 reads, although a few other samples also have some.
  • results/variant_analysis.csv reports all variants found in the samples relative to Wuhan-Hu-1.
  • results/variant_analysis_to_outgroup.csv reports the variants found in the samples that represent mutations from Wuhan-Hu-1 towards the two closest bat coronavirus relatives, RaTG13 and BANAL-20-52. Note that some of the reads contain three key mutations relative to Wuhan-Hu-1 (C8782T, C18060T, and T28144C) that move the sequence closer to the bat coronavirus relatives. These mutations define one of the two plausible progenitors for all currently known human SARS-CoV-2 sequences (see Kumar et al (2021) and Bloom (2021)).

Archived links after initially hearing about pre-print

I archived the following links on Dec-23-2021 after hearing about the pre-print from Carl Zimmer:

Deletion of some samples from SRA

On Jan-3-2022, I received an e-mail one of the pre-print authors, Istvan Csabai, saying that three of the samples (appearing to be the ones with the most SARS-CoV-2 reads) had been removed from the SRA. He also noted that bioRxiv had refused to publish their pre-print without explanation; the file he attached indicates the submission ID was BIORXIV-2021-472446v1. I confirmed that three of the accessions had indeed been removed from the SRA as shown in the following archived links:

I also e-mailed Richard Sever at bioRxiv to ask why the pre-print was rejected, and explained I had repeated and validated the key findings. Richard Sever said he could not give details about the pre-print review process, but that in the future the authors could appeal if they thought the rejection was unfounded.

Details from Istvan Csabai

On Jan-4-2022, I chatted with Istvan Csabai. He had contacted the authors of the pre-print, and shared their reply to him. The authors had prepped the samples in early 2019, and submitted to Sangon BioTech for sequencing in December, getting the results back in early January.

Second pre-print from Csabai and restoration of deleted files

Istvan Csabai then worked on a second pre-print that analyzed host reads and made various findings, including co-contamination with African green monkey (Vero?) and human DNA. He sent me pre-print drafts on Jan-16-2022 and on Jan-24-2022, and I provided comments on both drafts and agreed to be listed in the Acknowledgments.

On Feb-3-2022, Istvan Csabai told me that the second pre-print had also been rejected from bioRxiv. Because I had previously contacted Richard Sever when I heard the first pre-print was rejected, I suggested Istvan could CC me on an e-mail to Richard Sever appealing the rejection, which he did. Unfortunately, Richard Sever declined the appeal, so instead Istvan posted the pre-print on Resarch Square.

At that point on Feb-3-2022, I also re-checked the three deletion accessions (SRR13441704, SRR13441705, and SRR13441708). To my surprise, all three were now again available by public access. Here are archived links demonstrating that they were again available:

I confirmed that the replaced accessions were identical to the deleted ones.

Inquiry to authors of PRJNA692319

On Feb-8-2022, I e-mailed the Chinese authors of the paper to ask about the sample deletion and restoration. They e-mailed back almost immediately. They confirmed what they had told Istvan: they had sequenced the samples with Sangon Biotech (Shanghai) after extracting the DNA in December 2019 from their samples. The suspect that contamination of the samples happened at Sangon Biotech. They deleted the three most contaminated samples from the Sequence Read Archive. They do not know why the samples were then "un-deleted."

Owner
Jesse Bloom
I research the evolution of viruses and proteins.
Jesse Bloom
Unofficial pytorch implementation of the paper "Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution"

DFSA Unofficial pytorch implementation of the ICCV 2021 paper "Dynamic High-Pass Filtering and Multi-Spectral Attention for Image Super-Resolution" (p

2 Nov 15, 2021
An open source app to help calm you down when needed.

By: Seanpm2001, Et; Al. Top README.md Read this article in a different language Sorted by: A-Z Sorting options unavailable ( af Afrikaans Afrikaans |

Sean P. Myrick V19.1.7.2 2 Oct 24, 2022
When BERT Plays the Lottery, All Tickets Are Winning

When BERT Plays the Lottery, All Tickets Are Winning Large Transformer-based models were shown to be reducible to a smaller number of self-attention h

Sai 16 Nov 10, 2022
Ensembling Off-the-shelf Models for GAN Training

Data-Efficient GANs with DiffAugment project | paper | datasets | video | slides Generated using only 100 images of Obama, grumpy cats, pandas, the Br

MIT HAN Lab 1.2k Dec 26, 2022
Attention-based CNN-LSTM and XGBoost hybrid model for stock prediction

Attention-based CNN-LSTM and XGBoost hybrid model for stock prediction Requirements The code has been tested running under Python 3.7.4, with the foll

zshicode 84 Jan 01, 2023
piSTAR Lab is a modular platform built to make AI experimentation accessible and fun. (pistar.ai)

piSTAR Lab WARNING: This is an early release. Overview piSTAR Lab is a modular deep reinforcement learning platform built to make AI experimentation a

piSTAR Lab 0 Aug 01, 2022
Unofficial Implementation of Oboe (SIGCOMM'18').

Oboe-Reproduce This is the unofficial implementation of the paper "Oboe: Auto-tuning video ABR algorithms to network conditions, Zahaib Akhtar, Yun Se

Tianchi Huang 13 Nov 04, 2022
Code for technical report "An Improved Baseline for Sentence-level Relation Extraction".

RE_improved_baseline Code for technical report "An Improved Baseline for Sentence-level Relation Extraction". Requirements torch = 1.8.1 transformers

Wenxuan Zhou 74 Nov 29, 2022
This project is a loose implementation of paper "Algorithmic Financial Trading with Deep Convolutional Neural Networks: Time Series to Image Conversion Approach"

Stock Market Buy/Sell/Hold prediction Using convolutional Neural Network This repo is an attempt to implement the research paper titled "Algorithmic F

Asutosh Nayak 136 Dec 28, 2022
A best practice for tensorflow project template architecture.

A best practice for tensorflow project template architecture.

Mahmoud Gamal Salem 3.6k Dec 22, 2022
MCMC samplers for Bayesian estimation in Python, including Metropolis-Hastings, NUTS, and Slice

Sampyl May 29, 2018: version 0.3 Sampyl is a package for sampling from probability distributions using MCMC methods. Similar to PyMC3 using theano to

Mat Leonard 304 Dec 25, 2022
Deep learning operations reinvented (for pytorch, tensorflow, jax and others)

This video in better quality. einops Flexible and powerful tensor operations for readable and reliable code. Supports numpy, pytorch, tensorflow, and

Alex Rogozhnikov 6.2k Jan 01, 2023
Make your AirPlay devices as TTS speakers

Apple AirPlayer Home Assistant integration component, make your AirPlay devices as TTS speakers. Before Use 2021.6.X or earlier Apple Airplayer compon

George Zhao 117 Dec 15, 2022
Multiwavelets-based operator model

Multiwavelet model for Operator maps Gaurav Gupta, Xiongye Xiao, and Paul Bogdan Multiwavelet-based Operator Learning for Differential Equations In Ne

Gaurav 33 Dec 04, 2022
CVAT is free, online, interactive video and image annotation tool for computer vision

Computer Vision Annotation Tool (CVAT) CVAT is free, online, interactive video and image annotation tool for computer vision. It is being used by our

OpenVINO Toolkit 8.6k Jan 04, 2023
SANet: A Slice-Aware Network for Pulmonary Nodule Detection

SANet: A Slice-Aware Network for Pulmonary Nodule Detection This paper (SANet) has been accepted and early accessed in IEEE TPAMI 2021. This code and

Jie Mei 39 Dec 17, 2022
Toontown House CT Edition

Toontown House: Classic Toontown House Classic source that should just work. ❓ W

Open Source Toontown Servers 5 Jan 09, 2022
Discord-Protect is a simple discord bot allowing you to have some security on your discord server by ordering a captcha to the user who joins your server.

Discord-Protect Discord-Protect is a simple discord bot allowing you to have some security on your discord server by ordering a captcha to the user wh

Tir Omar 2 Oct 28, 2021
Line-level Handwritten Text Recognition (HTR) system implemented with TensorFlow.

Line-level Handwritten Text Recognition with TensorFlow This model is an extended version of the Simple HTR system implemented by @Harald Scheidl and

Hoàng Tùng Lâm (Linus) 72 May 07, 2022
Facial Expression Detection In The Realtime

The human's facial expressions is very important to detect thier emotions and sentiment. It can be very efficient to use to make our computers make interviews. Furthermore, we have robots now can det

Adel El-Nabarawy 4 Mar 01, 2022