Episode 2: The Role Of High Quality Databases In Primer Design

Hello, and welcome to the DNA Software Podcast. In today’s episode, our second, we’re going to talk about the role of high-quality databases in primer design. I’m joined by our CEO and President, Dr. John SantaLucia.

John’s going to give us some more context around what to consider when building inclusivity, exclusivity, and background playlists; what’s important in bringing these sequences into databases and PanelPlex; how to best leverage the databases that you’ve curated; and other important considerations and best practices.

This is a question, John, that we get from our customers quite often: which sequences to include, which they maybe shouldn’t, and best practices on which NCBI accessions are relevant or not. If you want to start at a 30,000-foot level, we can then drill down on how best to proceed with building high-quality databases in PanelPlex.

Sure. I think our software is probably unique in the world in actually utilizing full databases as part of primer design at all. Most programs utilize some simple Tm calculations. They might have some accounting for simple primer dimers and simple hairpin formation, but they completely bypass the genome resources that are available for understanding the variants that are important to characterize for different pathogens.

One of the reasons why most software doesn’t do it is that it doesn’t have the computing power, the memory, or the disk drive to store those large databases. If you’re doing primer design on a laptop, you’re just not in the 21st century. We’ve been taking a three-pronged approach to our primer design solutions: one is understanding the science of PCR, two is understanding the databases and modeling, and three is high-performance computing. All three of those work off of each other.

For the databases we’re talking about today, let’s take the Zika virus as an example. Zika is a pathogenic RNA virus, and there are many known variants of it; in any design you would do for a clinical diagnostic, you’d want to detect all of those Zika variants. Routinely, what a user of PanelPlex, our software, would do is go to a database like NCBI GenBank, or go to one of the curated databases like the Los Alamos National Lab database, or the ViPR database out of California, or a number of other curated databases that are out there that are helpful.

For example, for SARS-CoV-2 there’s the GISAID database. They’d go to one of those databases and download as many accessions as they could get for their inclusivity. However, there’s immediately a problem, which is that not all of the accessions and sequences you download are equally high quality. Some are low quality.

Not to interrupt you, but what does high or low quality mean? If I’m a PanelPlex user trying to discern whether a sequence is high quality or not, is it sequence length, ambiguity? What dictates high quality?

Yeah, you’re on the right track with all of that. First of all, is the genome complete? For our background and exclusivity databases, we don’t care if a sequence is fragmented or just a small partial piece; that’s no problem. But for the inclusivity database, it’s really important that all of the members of the database cover the region you’re interested in, or the whole genome.

For example, say you’re doing a diagnostic for SARS-CoV-2, and some genomes are complete, some sequences are just one particular gene, the S gene, others in the database are for the N gene, and others are for ORF8. Well, if you’re trying to do a consensus design, the program is going to be biased by the number of occurrences of a given sequence. The one that’s just the S gene does not have the N gene, so there’s no way to have a single assay that covers all of those. What happens is that partial sequences end up biasing the design not toward the region that’s best for design, but toward the region for which you have the most sequencing information.

Another level of quality is based on the number of ambiguity codes. Sequences that are less reliable will often have a number of N’s, or one of the other IUPAC ambiguity codes, and the more of those there are, the fewer regions you have left to design to. That’s a problem in a large database. Let’s say you’re doing influenza, where we have tens of thousands of full-length segment genomes: if some of them have N’s at the 5′-end, some have N’s in the middle, and some near the 3′-end, again you start getting this bias in your avoid regions, and you might have very few regions left where there’s full sequencing available.

Number of N’s, ambiguity codes, and then another one is depth of coverage. Some sequencing methods, like the shotgun-based Illumina sequencing methods, can sometimes have blind spots where the depth of coverage is either zero or low, so the reliability is not good. Some sequencing methods have a higher error rate for each individual base call, so they need even higher redundancy in the coverage. Those three metrics, coverage, the amount of ambiguity codes, and whether the genome is full length, are the number one quality metrics we look at, particularly for the inclusivity.
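As a rough illustration of the kind of pre-filter John is describing, here is a minimal sketch that screens candidate inclusivity sequences on two of those metrics: completeness (length) and fraction of IUPAC ambiguity codes. The thresholds, the tiny FASTA parser, and the toy accessions are assumptions for demonstration, not DNA Software’s actual pipeline (which would also factor in depth of coverage from the sequencing run).

```python
# Sketch: pre-filter candidate inclusivity sequences by length and ambiguity.
# Thresholds and parser are illustrative assumptions, not PanelPlex internals.

IUPAC_AMBIGUITY = set("RYSWKMBDHVN")  # every nucleotide code except A, C, G, T

def parse_fasta(text):
    """Yield (header, sequence) pairs from a FASTA-formatted string."""
    header, chunks = None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip().upper())
    if header is not None:
        yield header, "".join(chunks)

def passes_quality(seq, min_length, max_ambiguous_frac=0.01):
    """Keep sequences that are near full length and mostly unambiguous."""
    if len(seq) < min_length:
        return False  # partial sequence: would bias a consensus design
    ambiguous = sum(1 for base in seq if base in IUPAC_AMBIGUITY)
    return ambiguous / len(seq) <= max_ambiguous_frac

fasta = """>acc1 complete
ACGTACGTACGTACGTACGT
>acc2 partial
ACGTACG
>acc3 heavily ambiguous
ACGTNNNNNNGTACGTACGT
"""

kept = [h for h, s in parse_fasta(fasta) if passes_quality(s, min_length=15)]
print(kept)  # only the complete, unambiguous accession survives
```

The same idea extends to real downloads: run every accession through a filter like this before it goes into the inclusivity playlist, while letting partial sequences through for the exclusivity and background sets.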

When you mentioned influenza having tens of thousands of variants, that’s a lot of sequence information to bring into an inclusivity playlist. What are some limitations to consider around the number of sequences we’re bringing in? Could we bring in 10,000 or 20,000?

For most viruses, we haven’t found the limit; SARS-CoV-2 would be one major exception to that. For influenza, we’ve done up to 15,000 accessions without reaching any limits on database size, but those are relatively short: the segments are usually between 1,000 and 2,000 nucleotides long.

A genome like SARS-CoV-2, on the other hand, is 30,000 nucleotides, and now there are over a million, probably closer to 2 million, COVID genomes that have been sequenced. You’re talking on the order of 50 to 100 billion nucleotides’ worth of genome. That starts getting hard for us to deal with.

As a rule of thumb, I know for sure that we’ve done, for example, around 30,000 COVID genomes. Anything less than the human genome in total nucleotides is something we easily handle. If you have databases larger than that, we can probably handle them internally at DNA Software, because we have power-user access to cloud resources. We could throw a larger hard drive at a particular run, or more memory at a particular run, those sorts of things.

That’s an important consideration, because I’ve heard customers think more is better, and that’s not necessarily the case. You want to favor high quality over quantity. Sometimes they feel that the more sequence information they put into a playlist, the better the outcome, but that’s not necessarily going to be the case.

As long as the quality is high, more is generally better, up to a certain point. Of course, the more you put into that database, the longer the runtime will be. Usually a better strategy is this: if you have a lot of partial sequences, that’s fine for the exclusivity and background. For the inclusivity, there’s a premium on high-quality genomes, and fortunately, nowadays there are a lot of high-quality genomes for virtually every known pathogen.

That’s a good point to consider. You can include that in the exclusivity, you can include that in the background, but you want to be more selective with the inclusivity.

There are some organisms where there is still spotty coverage in GenBank. I’m thinking specifically of some of the fungi. Those can be challenging for putting together a high-quality database. Sometimes you have to work with scaffold-level and contig-level genome assemblies, and those are not as reliable as complete or full genomes.

And we still say a good rule of thumb is about 90% sequence similarity, or homology, within that inclusivity grouping?

Yeah, that’s another good point. Sometimes users will come to us with a virus or bacteria that has a highly diverse phylogeny.

An example of that would be something like the Human Papillomavirus or the Human Rhinovirus. These viruses are exceptionally diverse. Oftentimes you cannot detect them with a single assay; you need to break them up into separate tests to cover the full genomic space.

A useful rule of thumb is that within the group you’re trying to do a consensus design for, there should be about 90% similarity among all the sequences.
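One simple way to check that rule of thumb is to compute pairwise percent identity across the group. The sketch below assumes the sequences have already been aligned to equal length (a real workflow would run a multiple sequence aligner first), and the toy variants are invented for demonstration:

```python
# Sketch: pairwise percent identity over pre-aligned sequences, to check the
# ~90% similarity rule of thumb. Toy sequences are assumptions for demo only.
from itertools import combinations

def percent_identity(a, b):
    """Fraction of aligned positions that match, ignoring gap-only columns."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    matches = sum(1 for x, y in pairs if x == y)
    return matches / len(pairs)

aligned = {
    "var1": "ACGTACGTAC",
    "var2": "ACGTACGTAT",  # one difference from var1
    "var3": "ACGAACGTAT",  # diverges further
}

for (n1, s1), (n2, s2) in combinations(aligned.items(), 2):
    print(n1, n2, f"{percent_identity(s1, s2):.0%}")
```

If many pairs fall well below ~90%, that is a signal the group may need to be split into separate assays, as with HPV or Rhinovirus.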

Yeah. I really appreciate you outlining what to consider, because this is one of the most frequently asked questions from customers: what goes into that inclusivity, and how to filter what’s worth putting in or keeping out.

Let me just follow up with one more point that I just thought of. If you think about what 90% similarity means, it means there’s a variation on average every 10 nucleotides. Well, if you have a primer that’s 25 nucleotides, that means you have two and a half mutations on average for every sequence in that inclusivity.

Now, you may have a region that’s highly conserved, but on average you have a lot of variation even at 90%. If your forward primer has two and a half mutations, your reverse primer two and a half mutations, and your probe likewise, you’re going to have so much variation that it’s going to be hard to have a test that covers all of them, unless you get lucky and there is such a region.

For example, in the Human Rhinovirus, while there’s a lot of variation in the rest of the genome, there’s one good, juicy region that is conserved. Those are considerations that users should be thinking about.
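The back-of-the-envelope arithmetic John walks through above can be written out explicitly (primer length and similarity here are just the figures from the discussion):

```python
# The 90%-similarity arithmetic from the discussion: variation on average
# every 10 nucleotides means ~2.5 expected mismatches across a 25-nt primer.
similarity = 0.90
primer_len = 25
expected_mismatches = round(primer_len * (1 - similarity), 2)
print(expected_mismatches)  # 2.5
```

The same expression makes it easy to see why a more diverse group hurts: at 80% similarity the same 25-nt primer would span about 5 expected mismatches.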

Yeah, thanks for answering these questions, John, because they get asked frequently and I think this is going to give some good guidance to our customer group on what to consider in how to create a high quality database, which will drive your primer design.

That’s one of the three prongs: this database-driven approach to excellence in design.


Yeah. Thank you everyone for joining our podcast today, I can always be reached with questions directly at joe@dnasoftware.com.

Please reach out to us if there’s a podcast topic that maybe you’d like to hear, and have us go into more depth about. Thank you for joining us today.

Take care, everyone.