Smoking guns everywhere…
A few weeks ago, former journalist Nicholas Wade penned a long conspiracy “theory” pointing fingers at the Wuhan Institute of Virology for a possible lab leak causing the COVID-19 pandemic.
Wade quoted the Nobel-prize-winning virologist Baltimore in his piece:
“When I first saw the furin cleavage site in the viral sequence, with its arginine codons, I said to my wife it was the smoking gun for the origin of the virus,” said David Baltimore, an eminent virologist and former president of CalTech. “These features make a powerful challenge to the idea of a natural origin for SARS2,” he said.
Let’s reholster that smoking gun…
Kristian Andersen is a Ph.D. virologist who early in 2020 co-wrote an important Nature Medicine paper showing that the SARS-CoV-2 virus was most likely a naturally derived virus and not engineered in a lab. Andersen responded to Baltimore’s smoking gun comment with some very specific tweets firmly debunking the smoke from any guns:
The furin cleavage site (FCS) / polybasic cleavage site is present in SARS-CoV-2 at the S1/S2 junction of the spike protein where it mediates the cutting (by the host protease furin, among others) of the spike, which is required for infection of cells.
This introductory tweet basically says that the FCS is like a flag within the virus’s spike protein, telling a pair of human protein scissors called furin to come and cut the spike. Cutting the spike at that location makes it much easier for the virus to then get into our cells.
Andersen provided a detailed map of that flag, shown below:
Andersen continued his scientific tweetise by explaining how the virus’ flag was made and what it does:
The FCS was created by an out-of-frame insertion of “CTCCTCGGCGGG” creating the “(P)RRAR” amino acid sequence, which constitutes a suboptimal polybasic cleavage site that is important for expanding SARS-CoV-2 host range, it’s transmission and pathogenesis, etc.
There’s a lot in that short sentence, so let’s unpack it a bit.
Andersen’s comment about an out-of-frame insertion means that the nucleotide sequence CTCCTCGGCGGG (written here in DNA letters) is not read in the normal in-frame triplets CTC-CTC-GGC-GGG. If we consult a codon table, we see that those codons would translate into the amino acid sequence L-L-G-G (leucine-leucine-glycine-glycine). However, this short sequence was inserted into the virus genome out-of-frame, so it is actually read as nCT-CCT-CGG-CGG-Gnn. The first and last triplets are each completed by letters (n) from an existing codon in the genome. The middle three codons translate to P-R-R, giving x-P-R-R-x. That is only part of the FCS Andersen mentioned, (P)RRAR, so where does the rest come from?
These dozen letters of genetic code were inserted within a previously existing codon, not between codons the way a scientist would engineer an insertion.
The previous amino acid was a serine. The codons for serine include: TCT, TCC, TCA, and TCG.
Since the first out-of-frame codon is nCT and the serine codons listed above all start with T, the n must be T. In other words, the insertion happened right after the first T of the serine codon, ahead of its last two letters.
In the published SARS-CoV-2 genome this stretch reads TCT-CCT-CGG-CGG-GCA, which implies the interrupted serine codon was TCA.
Now if we translate that, we end up with the following: S-P-R-R-A.
The original amino acid next to the serine is an arginine, R. Therefore the full sequence including the original S-R becomes: S-P-R-R-A-R.
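The frame-shift argument above can be checked with a few lines of Python. This is only a minimal sketch: the codon table is trimmed to the codons that appear in this example, and the function names are made up for illustration. It translates the insert in-frame, then drops it after the first letter of each of the four serine codons listed above.

```python
# Trimmed codon table: only the codons needed for this example.
CODONS = {
    "CTC": "L", "GGC": "G", "GGG": "G",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "CCT": "P", "CGG": "R",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
}

INSERT = "CTCCTCGGCGGG"  # the 12 letters quoted in Andersen's tweet


def translate(seq):
    """Translate a nucleotide string codon by codon, starting at position 0."""
    return "-".join(CODONS[seq[i:i + 3]] for i in range(0, len(seq), 3))


def insert_after_first_base(codon, insert=INSERT):
    """Split a codon after its first letter and place the insert in the gap."""
    return codon[0] + insert + codon[1:]


# Read in-frame, the insert alone would give leucines and glycines.
in_frame = translate(INSERT)

# Read out-of-frame, i.e. inserted inside a serine codon.
out_of_frame = {
    codon: translate(insert_after_first_base(codon))
    for codon in ("TCT", "TCC", "TCA", "TCG")
}

print("in-frame:", in_frame)
for codon, peptide in out_of_frame.items():
    print(codon, "->", peptide)
```

Whichever of the four serine codons is interrupted, the result is the same five amino acids, because the leftover two letters always complete a GCn codon, and every GCn codon encodes alanine.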
Sub-optimal cleavage site…
Next, after explaining how the FCS was encoded by an out-of-frame insertion of a dozen nucleotides (genetic letters), Andersen then calls the result a “suboptimal polybasic cleavage site”. What does he mean by this?
Furin is a protease, a protein that evolved to cut other proteins. Proteases, which act as protein scissors, typically recognize very specific targets. In our case, furin recognizes a short amino acid sequence and cuts next to it. The sequence furin recognizes is R-X-[K/R]-R↓, where R = arginine, K = lysine, and X = any amino acid. The arrow marks where furin cuts the target protein.
The FCS in the SARS-CoV-2 virus is R-R-A-R, which differs from that simple R-X-[K/R]-R motif: the third position holds an alanine where the motif calls for a lysine or an arginine. Furthermore, efficient cleavage requires more than four amino acids. Optimal furin cleavage sites actually span about 20 amino acids. The following illustrates what is known as optimal furin cleavage sites:
For example, the viral sequence contains two isoleucines in the P3′ and P4′ region, which calls for small hydrophilic amino acids, whereas isoleucines are aliphatic amino acids.
It is clear that the SARS-CoV-2 site works as a furin target, but is far from optimal.
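To make the comparison with the R-X-[K/R]-R motif concrete, here is a small Python sketch. It only encodes the four-residue motif as it is written in this post, not the roughly 20-residue context that real furin recognition depends on, and the helper names are invented for illustration.

```python
import re

# The four-residue motif as quoted in this post: R, any residue, K or R, R.
MINIMAL_MOTIF = re.compile(r"R.[KR]R")


def has_minimal_motif(peptide):
    """True if the peptide contains R-X-[K/R]-R anywhere."""
    return MINIMAL_MOTIF.search(peptide) is not None


def mismatched_positions(site):
    """Positions (1-based, counted from the left) where a four-residue
    site departs from R-X-[K/R]-R."""
    allowed = ["R", None, "KR", "R"]  # None means any residue is accepted
    return [
        i + 1
        for i, (residue, ok) in enumerate(zip(site, allowed))
        if ok is not None and residue not in ok
    ]


print(has_minimal_motif("RRKR"))     # a site that fits the motif
print(has_minimal_motif("PRRAR"))    # the SARS-CoV-2 site
print(mismatched_positions("RRAR"))  # which position is off
```

Under this strict reading, RRAR misses the motif at exactly one position, the alanine in third place, which is one way to see why the site is described as suboptimal rather than absent.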
Transmission and pathogenesis…
Finally, Andersen concludes his second tweet in this series by saying: “…polybasic cleavage site …is important for expanding SARS-CoV-2 host range, it’s transmission and pathogenesis…”.
Studies of coronaviruses without the furin cleavage site showed that this short insertion is essential for infection of human cells and is key to the transmission (ability to infect new hosts) and pathogenicity (ability to cause disease in the host) of this virus.
Importantly, the furin cleavage site may have been the key in the spread of this virus from the original bat host to humans, possibly through an intermediate species.
Where did the FCS come from…
Andersen then beautifully addressed the question of where the FCS came from. I’ll quote most of his thread here. He tweeted:
FCSs are abundant, including being highly prevalent in coronaviruses. While SARS-CoV-2 is the first example of a SARSr virus with an FCS, other betacoronaviruses (the genus for SARS-CoV-2) have FCSs, including MERS and HKU1.
He’s saying that evolution has solved the “problem” of putting an FCS into viral genomes many times. This is not unusual or difficult for viruses.
There is nothing mysterious about having a “first example” of a virus with an FCS. Viruses sampled to date only give us a teeny-tiny fraction of all the viruses circulating in the wild. Fragments — such as the CTCCTCGGCGGG — come and go all the time.
How did SARS-CoV-2 acquire the FCS? We don’t know, however, we know four main mechanisms often lead to insertions: (1) mutation, (2) polymerase slippage, (3) template switching, (4) recombination. All of which play key roles in coronavirus (incl. SARS-CoV-2) evolution.
While we don’t know for sure how SARS-CoV-2 acquired the FCS, template switching is a very likely explanation with a plausible mechanism: https://link.springer.com/article/10.1007%2Fs00705-020-04750-z … We also find insertions — albeit not FCSs (yet) — in highly related viruses, e.g., RmYN02:
Template switching likely also play an important role during the ongoing evolution of SARS-CoV-2: https://www.biorxiv.org/content/10.1101/2021.04.23.441209v1 …. We need to see this in the context of the decades of evolution of the SARS-CoV-2 ancestor and related viruses in bats. It’s safe to say indels come and go.
The FCS itself, (P)RRAR, is not an optimal site (for cleavage) and has never previously been used in CoV experiments to the best of my knowledge — unlike more optimal sites, which have been inserted into SARSr CoVs for basic research:
The exact same (P)RRAR FCS found in SARS-CoV-2 can be found in different viruses, including Feline coronavirus (FCoV), which is an alphacoronavirus. Note, site not present in all closely related viruses and plenty of indels around the site — like SARS-CoV-2 vs SARSr CoVs.
If we zoom in on the (P)RRAR site in SARS-CoV-2 and compare it to the one found in (some) FCoV sequences, we can see there’s a fair bit of homology outside the FCS too — including likely O-linked glycans being conserved.
The (P)RRAR FCS isn’t optimal and while it’s ‘sufficient’ for SARS-CoV-2s ‘success’ as a pandemic virus, it’s not an ideal site as defined by the canonical R‐X‐K/R‐R FCS seen in many proteins (viral and otherwise).
The “P” from the (P)RRAR insert isn’t directly part of the cleavage site itself, but, intriguingly, may regulate it via the nearby O-linked glycans. This is seen in host proteins: https://www.jbc.org/article/S0021-9258(20)32890-8/fulltext …, but also in SARS-CoV-2:
Importantly, however, in recent month we have started seeing the “P” mutating towards residues creating more optimal furin sites — P681H and, especially, P681R, which can be found in B.1.1.7 and B.1.617.x, suggesting the virus may evolve towards more efficient usage of the site.
So Baltimore’s first point — that the FCS found in SARS-CoV-2 is somehow unusual — is simply incorrect. FCSs are found in a multitude of different coronaviruses, indels come and go frequently, and the exact (P)RRAR can be found in other coronaviruses.
Now, the codons. Here, Baltimore is talking about the two codons coding for the first two arginines (R) following the P — CGG. The CGG codon is rare in viruses because it’s an example of an unmethylated “CpG” site that can be bound by TLR9, leading to immune cell activation.
Despite being rare, however, CGG codons *are* found in all coronaviruses, albeit at low frequency. Specifically, of all arginine codons, CGG is used at these frequencies in these viruses: SARS: 5%, SARS2: 3%, SARSr: 2%, ccCoVs: 4%, HKU9: 7%, FCoV: 2%. Nothing unusual here.
Furthermore, if we go back to the FCoV sequences and compare them to SARS-CoV-2 at the nucleotide level you’ll see that FCoV also uses CGG to code for R immediately following the P. The next R is CGA (non-CpG) in FCoV, while it’s CGG in SARS-CoV-2 — one nucleotide difference.
We see CGG multiple times in different ways — here’s an example comparing another “PR” stretch between SARS-CoV-2, RaTG13, and SARS-CoV in the N gene. Note how SARS-CoV-2 and RaTG13 both use CGG, while SARS-CoV uses CGC for the first R, while later R’s are coded by CGT or AGA.
One final point about the CGG codons in the FCS — if they were somehow “unnatural”, we’d see SARS-CoV-2 evolve away from “CGG” during the ongoing pandemic. We have more than a million genomes to analyze, so what do we find if we look at synonymous mutations at the “CGG_CGG” site?
Remarkably stable. Specifically, CGG is 99.87% conserved in the first codon and 99.84% conserved in the second. This is *very* strong evidence that SARS-CoV-2 ‘prefers’ CGG in these positions.
R is coded by six different codons, yet the simple single transition “CGA” is only observed in ~0.02% of sequences. The second most ‘popular’ codon at these sites is “CGT” (a transversion) at 0.11% frequency. In other words — there is nothing unusual about the codons either.
So Baltimore’s second point is also false, invalidating his hypothesis that the “FCS […] with its arginine codons […] was the smoking gun for the origin of the virus”. Baltimore does not provide any evidence to support his hypothesis and the data support a natural origin.
Does this disprove a lab leak? No. However, it disproves there being a “smoking gun” in the FCS and lends further evidence to natural emergence — but it also does not *prove* that scenario. To this day, we have yet to see any scientific evidence supporting a lab leak.
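Andersen’s codon argument in the thread above leans on the difference between transitions and transversions. As a rough illustration (the helper names are invented here, and this says nothing about the actual mutation frequencies he reports), the following Python sketch lists the arginine codons that sit one nucleotide away from CGG and labels each change.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# The six codons that encode arginine (DNA letters).
ARG_CODONS = ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"]


def substitution_type(a, b):
    """Transition = purine<->purine or pyrimidine<->pyrimidine;
    transversion = purine<->pyrimidine."""
    if a == b:
        return "none"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def single_step_synonyms(codon="CGG"):
    """Arginine codons exactly one nucleotide away from `codon`,
    mapped to the type of change needed to reach them."""
    result = {}
    for other in ARG_CODONS:
        diffs = [(x, y) for x, y in zip(codon, other) if x != y]
        if len(diffs) == 1:
            result[other] = substitution_type(*diffs[0])
    return result


for other, kind in single_step_synonyms("CGG").items():
    print("CGG ->", other, ":", kind)
```

Only CGA is reachable from CGG by a single transition, which matches the tweet’s description; CGT, CGC and AGG each need a transversion, and AGA needs two changes, so it is left out.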
Baltimore backs down…
“should have softened the phrase ‘smoking gun’ because I don’t believe that it proves the origin of the furin cleavage site but it does sound that way. I believe that the question of whether the sequence was put in naturally or by molecular manipulation is very hard to determine but I wouldn’t rule out either origin.”
Where does that leave us…
Let’s get back to quoting Andersen, who has been the most reliable and expert voice so far in all of these back-and-forth comments. Andersen emailed the LAT:
“We cannot prove that SARS-CoV-2 has a natural origin and we cannot prove that its emergence was not the result of a lab leak.
However, while both scenarios are possible, they are not equally likely,” Andersen wrote. “Precedence, data and other evidence strongly favor natural emergence as a highly likely scientific theory for the emergence of SARS-CoV-2, while the lab leak remains a speculative incomplete hypothesis with no credible evidence.”