Welcome, everyone, to today’s webinar. My name is John SantaLucia, I’m the founder of DNA Software, and it’s a pleasure for me today to be able to give you a webinar on PanelPlex Best Practices.
Before I begin, I want to thank all of the users who’ve sent us their comments, critiques, and questions about PanelPlex and PanelPlex Consensus. I have designed this webinar today to sort of amalgamate the bulk of the questions that we’ve got, and requests, and sort of deal with some of the features and questions that people have.
The talk today is going to be divided into two main parts. The first half of the webinar will be based on best practices about PanelPlex, and then the second half of the talk will be about PanelPlex Consensus. So, there are two halves to it, each about half an hour, and we’ll deal with the topics that are shown here: setting up junctions and regions, and how that should be done optimally; advanced parameters in PanelPlex; how to set up a high-level multiplex; batch-mode singleplexes; and two case studies, one on a cancer gene panel and another one on mRNA splice junction design.
Then we’ll do a section on PanelPlex Consensus, where we will focus a lot on making high-quality, optimal inclusivity, exclusivity, and background panels. We’ll talk about some of the challenges for doing designs for bacteria, and talk about dealing with cases where viruses in particular have a high degree of diversity and require splitting the inclusivity into multiple groups.
I will talk about the challenge of choosing the best keystone for your panel, and then also give a demo of the software and show some of the advanced parameters. The first major question is, “Which version of PanelPlex should I use?” There are two different software packages. The first one is PanelPlex Consensus. PanelPlex Consensus is software for detection of infectious diseases, usually viruses, bacteria, parasites; we’ve done all of those in the past.
This software is meant to do singleplex or multiplex detection of a single-pathogen panel. So, if you want to detect all the variants, for example, of Influenza A, that’s what you would use: PanelPlex Consensus, to be able to detect all of those variants of the virus or bacteria without detecting any of the things that are in the exclusivity or background.
PanelPlex, regular PanelPlex, is meant for doing high-level multiplexing. There’s a variety of applications for that. One application is for genome enrichment, or next-gen sequencing, where you want to target many different genes within the same genome; that’s one application of PanelPlex. It also allows you to do multiplex tiling assays where you want to cover a full exon, let’s say, with amplicons tiled throughout the exon. It allows you to do batch-mode singleplex. That is, if you have 100 different targets, you want to get singleplex designs for each one. Instead of running the program 100 times with a single target, you can run 100 targets in one run.
Detection of splice junctions is in this section of PanelPlex, and also, PanelPlex can be used to do multiplex detection of multiple pathogens, but in that case, it’s not a consensus design. So, those are the two major flavors of PanelPlex, and we’ll be talking about both of them. We’re going to start right now talking about regular PanelPlex.
To talk about that, we’re going to use a case study that we did for two cancer genes, the HER2 and HER3 genes, the human epidermal growth factor receptors type two and type three. These are located on chromosomes 17 and 12 of the human genome, and within chromosome 17, there are 19 SNP sites that are of interest, and for HER3, in this case, there are 10 different SNP sites that we’re interested in.
We’d like to develop a multiplex that will detect all of them. We want to make sure that whatever design we do does not hit any of the other low-frequency SNPs that are found outside of these regions, outside of these cancer-causing SNPs, the other minor variations in the human genome. We don’t want to hit those. So, there’s a bit of a challenge here: we use dbSNP to avoid the nearby SNPs and yet have our primers be able to detect all 29 of these SNPs. As we’ll see, the program will whittle this down from a 29-plex to a 16-plex by combining the different [inaudible 000429].
Now, in this case, we’re dealing with assays for circulating tumor DNA, so those are relatively short pieces of DNA, around 140 nucleotides or shorter, so we’d like to have amplicons that are shorter than that, and that consideration goes into our design. The most challenging part of setting up the runs for regular PanelPlex is just getting the list of all the SNPs, and these are SNPs and indels, by the way. We call the location of the SNP or the indel the junction location, and we specify a jmin and a jmax, the junction minimum and junction maximum, so that either a SNP site or an indel can be specified.
So, for example, the first row of data here is data we got from the COSMIC database for chromosome 17. There’s a SNP at position 952. So, since it’s a SNP, we put the same value in for both jmin and jmax. All right? And here’s a different SNP for the same location. Here’s a third SNP three nucleotides away at 955 here. So, that SNP is close, it’s too close; we couldn’t design two separate PCRs in one reaction that would be able to separately amplify something at 952 and something at 955. They’re only three nucleotides apart.
So, all three of these targets that are highlighted in yellow are targets that we would combine. We’d make a new jmin of 952 and a new jmax of 955, include all three of these. The program does that process of combining the assays automatically. It recognizes things that are close, so, all the ones in green are also a group; they form a group that are too close to have separate assays for each one. You can see in this case, they go from 3,966 to 4,008. So, they have a range there of 42 nucleotides. That’s not big enough for a separate amplicon. All those clusters that…
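The grouping of nearby junctions described above can be sketched in a few lines. This is only an illustration, not PanelPlex's actual algorithm: the function name `cluster_junctions` and the `max_gap` threshold are hypothetical, and the position 3980 in the example is made up to stand in for the intermediate members of the second group.

```python
from typing import List, Tuple


def cluster_junctions(positions: List[int], max_gap: int) -> List[Tuple[int, int]]:
    """Merge junction positions into (jmin, jmax) clusters.

    A position joins the current cluster when it lies at most max_gap
    nucleotides past the cluster's current jmax. max_gap is a hypothetical
    threshold chosen for illustration only.
    """
    clusters: List[Tuple[int, int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][1] <= max_gap:
            jmin, _ = clusters[-1]
            clusters[-1] = (jmin, pos)  # extend the cluster's jmax
        else:
            clusters.append((pos, pos))  # a lone SNP has jmin == jmax
    return clusters


# Two SNPs at 952, one at 955, and a second group spanning 3966 to 4008.
print(cluster_junctions([952, 952, 955, 3966, 3980, 4008], max_gap=50))
```

With these inputs the three nearby SNPs collapse into one junction with jmin 952 and jmax 955, and the second group becomes a single junction from 3966 to 4008.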
I’m going to review for you a little bit about how the software does this sort of clustering, and where the calculations come from for the design regions, for the forward primer and the reverse primer. Shown here is an example for the TP53 gene, a tumor suppressor gene, just as an example. This is a particular SNP site that we would like to be able to detect, and the software allows you to specify a number of things. One is it allows you to specify a minimum nucleotide, number of nucleotides, on either side of the junction. That’s abbreviated as Min Nucs in this particular slide.
That’s just spacing to sort of allow you to have regions to do sequencing. So, whatever result you get, you’re going to have enough sequence there to be sure that that wasn’t a false amplicon or false read. So, you can differentiate that.
Now, the other issue is, how do we determine where to start and end the forward primer design region, and where do we put the start and end of the reverse primer design region? Those are really determined based on the length of the amplicons that you want, which I’ll show you in the next slide.
But while we’re on this slide, I do want to mention that outside of this critical region where this SNP site is, within the design region for the forward primer, there just happens to be a SNP site, a high-occurring SNP site. So, we want to avoid having our primers land there because if they did, then there would be a certain segment of the population for which the assay might not work. That SNP variation might make the primer fail.
So, basically, the program is automatically interfaced with dbSNP to detect those sites. We have a default setting of one percent, so for any SNP that occurs at a frequency of more than one percent in the human genome, we avoid having primers go in those locations. You can adjust that to a higher value or lower value if you desire. It’s in the advanced parameters, which I’ll show later on.
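As a rough illustration of the frequency cutoff, the sketch below filters a table of SNP positions by frequency and checks whether a candidate primer overlaps any of the remaining sites. The function names, the dictionary layout, and the example frequencies are all assumptions made for this sketch; they do not reflect how PanelPlex queries the SNP database.

```python
from typing import Dict, List


def snps_to_avoid(snp_freqs: Dict[int, float], threshold: float = 0.01) -> List[int]:
    """Return the positions whose frequency is above the threshold (1% by default)."""
    return sorted(pos for pos, freq in snp_freqs.items() if freq > threshold)


def primer_is_clear(start: int, end: int, avoid: List[int]) -> bool:
    """True when no position to avoid falls inside the inclusive span [start, end]."""
    return not any(start <= pos <= end for pos in avoid)


# Hypothetical positions and frequencies, for illustration only.
snps = {100: 0.002, 120: 0.05, 200: 0.30}
avoid = snps_to_avoid(snps)
print(avoid)
print(primer_is_clear(90, 115, avoid))
print(primer_is_clear(110, 130, avoid))
```

Raising the threshold makes the filter more permissive; at a threshold of 1.0 no SNP can exceed it, so nothing is avoided.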
When you submit your job, if any SNPs were detected in the locality of your design, the email with your results shows you where the avoided SNPs were. Now, this slide is meant to show you a more detailed presentation about how the regions are calculated. So, jmin and jmax are what the user initially specified, and then beyond that, we have the maximum length of the gap and minimum nucleotides and the maximum [inaudible 000855]. Those all get used to compute where the forward primer start position is and where the reverse primer ending position is; all of them need to be specified.
But these details are given in the PanelPlex manual, but if there are questions specifically you have about these, because you can manually set these… if you see how PanelPlex sets them up and you want to change one of them, you can do that. You have to re-run the program, but let’s say there’s a SNP right near the end here and you want to avoid it, or some other reason why you want to move the FP start position, you can do that in that junctions file.
So, the junctions file is this file I showed you here. This is an actual picture of a junctions file. So, you have to give a name for each junction, the accession; that is the chromosome that it’s coming from, and then the position within that accession, where the junction is.
So, you can either specify the junction, or you can directly specify the forward primer start and end, probe start and end, reverse primer start and end, and reverse transcription start and end if it’s an RNA. That’s what that file’s all about.
Now, in the event that you have clustered junctions like we have in this case with HER2 and HER3, where we have multiple junctions that are all right on top of each other, it would be impossible to have individual PCRs for each one. That’s why we specify that jmin is the first SNP in the group and jmax is the last SNP in the group, and the program will automatically take the raw values that you gave it and figure out this clustering, and set up the forward primer region and the reverse primer region to accommodate.
There are certain cases where junctions, or the SNP sites, are far enough away that you could make separate amplicons. But they’re close enough that you can’t maybe… Well, if they’re very far away, then you’d just make two separate amplicons, but if they’re in that middle region where they’re close enough where you could make a choice of either amplification reactions or having
So, in this case here, you can see that the forward primer design region for junction one overlaps with the reverse primer design region for junction two, and we can’t have that. If we kept it like that, then it would be possible for the forward primer of one of them to be very close to the reverse primer of the other, like a mini amplicon. So, the program is aware of that problem, and it sets the design up again in one of two ways. It gives the user a choice. Do you want to combine, or do you want to split?
So, this is one of the common questions we have: what’s going on when we do split versus combine?
What’s happening when you do split, we’ll take that choice first, is it’s trying to make two separate amplicons: one amplicon for junction one, and a separate amplicon for junction two. All right? And it does allow for a little bit of overlap between these design regions. In principle, if the five prime ends of your primers overlapped a little bit, that would not cause a mini amplicon; it’s only if the three prime ends of the primers overlap that you create problems. So, we allow up to a 13-nucleotide overlap, and then we get two separate amplicons.
Now, choosing split will result in a larger plex size for your PCR, but it also limits the design space because everything is crunched up here. In order to accommodate these two sets of primers, instead of having a wide forward primer range and a wide reverse primer range, they had to be narrowed to prevent that sort of mini amplicon problem.
The other alternative is to use the combine option. The combine option will make it so that one amplicon will amplify both junction one and junction two in one bigger amplicon. But you can see here, this also restricts the design range, but you do get a smaller plex. And generally, we recommend using the combine option, but either one can work in different circumstances.
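To make the overlap idea concrete, here is a small sketch that measures how many positions two design regions share and compares that with the 13-nucleotide allowance mentioned for the split option. The function names and the example coordinates are hypothetical, and the real program narrows the regions itself rather than simply testing them.

```python
from typing import Tuple

Region = Tuple[int, int]  # inclusive (start, end) coordinates


def overlap_length(region_a: Region, region_b: Region) -> int:
    """Number of positions shared by two inclusive regions (0 if they are disjoint)."""
    start = max(region_a[0], region_b[0])
    end = min(region_a[1], region_b[1])
    return max(0, end - start + 1)


def overlap_within_limit(fp_region: Region, rp_region: Region, max_overlap: int = 13) -> bool:
    """True when the two design regions overlap by no more than max_overlap nucleotides."""
    return overlap_length(fp_region, rp_region) <= max_overlap


# Hypothetical design regions: junction one's forward region and junction two's reverse region.
print(overlap_length((100, 150), (140, 180)))
print(overlap_within_limit((100, 150), (140, 180)))
print(overlap_within_limit((100, 150), (130, 180)))
```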
Another use case of the PanelPlex software is to detect mRNA splice junctions. So, for messenger RNA assays, it’s desirable to amplify the RNA in preference to the genomic DNA. And one of the tricks to that is to choose the splice junction sites. The reason to choose a splice junction site is that, if you choose a splice junction between two exons… so, the forward primer’s in one exon, the reverse primer’s in the other exon, then for the genomic DNA, that large intron will separate the primers and make the amplification of the genomic DNA inefficient.
And then, second of all, the amplicon made by the mRNA is smaller and thus more efficient, but also we recommend generally putting the probe in a location to straddle that junction. So, that can make it so that the probe only binds to amplicons made from the messenger RNA. An amplicon made from the genomic DNA wouldn’t work because the genomic DNA does not contain the splice junction sequence. So, that makes it unique.
All right, so, that’s sort of a general strategy for setting up a splice junction, or… detecting mRNA, specifically, in the presence of genomic DNA. Now, one of the other challenges that comes up with this topic is that you need to find conserved splice junctions. That is, sometimes there are splice variants and you don’t want to choose a junction location that’s not conserved. If there are three different splice variants that are all biologically important for your particular purposes, you want to make sure you choose a splice junction that is actually found in all of them, and not left out of some of them.
So, I’ve made a little [inaudible 001437] is one particular one. This is a ribosomal RNA protein, [inaudible 001443] protein 13, that is encoded here. There’s three different copies of this messenger, or three different variants of this messenger RNA; it gets spliced differently, so all three of these different messenger RNAs have a splice junction at these locations that are conserved.
But the splice junction at this location, in the middle here that’s shown in red, that one is not conserved. One of the splice junctions would be different because it does not contain one of the exons. One of the exons has been left out. So, two out of the three are the same, but one is different, so that one, it’s not conserved. That would not be a good site to target for an assay.
Some other ones are marginal, like you might have a site here that’s conserved; you’ll notice that the exon on the left-hand side is the same in two of them, and one of them’s a lot shorter. So, we would want to check that to see if that is a conserved splice site or not in more detail. But there’s some other tricks you can do. You can use these pictures here from…
This is a picture from GenBank looking at the RefSeq annotations here. But you can also go and actually get the actual positions of the exons within each messenger RNA, and then line them up and see, are the exons the same length? If they are, those are likely places where you could target your site.
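One way to automate the check for conserved splice junctions is sketched below. It assumes you have already collected each variant's exons as (start, end) pairs in shared genomic coordinates, which is a simplification of the per-mRNA comparison described above; the function names, variant names, and coordinates are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]      # inclusive (start, end) in shared genomic coordinates
Junction = Tuple[int, int]  # (last base of upstream exon, first base of downstream exon)


def splice_junctions(exons: List[Exon]) -> Set[Junction]:
    """Junctions formed between consecutive exons of one transcript variant."""
    ordered = sorted(exons)
    return {(left[1], right[0]) for left, right in zip(ordered, ordered[1:])}


def conserved_junctions(variants: Dict[str, List[Exon]]) -> List[Junction]:
    """Junctions that are present in every variant."""
    junction_sets = [splice_junctions(exons) for exons in variants.values()]
    if not junction_sets:
        return []
    return sorted(set.intersection(*junction_sets))


# Three made-up variants; variant_c skips the third exon.
variants = {
    "variant_a": [(1, 100), (201, 300), (401, 500), (601, 700), (801, 900)],
    "variant_b": [(1, 100), (201, 300), (401, 500), (601, 700), (801, 900)],
    "variant_c": [(1, 100), (201, 300), (601, 700), (801, 900)],
}
print(conserved_junctions(variants))
```

In this toy case the first and last junctions survive, while the junctions around the skipped exon drop out because one variant splices differently there.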
Another challenge for these is dealing with pseudogenes. Pseudogenes can really throw a wrench in this because pseudogenes are often the result of taking that spliced messenger RNA and making a copy DNA, and then inserting that copy DNA into the human genome. But those look quite similar if not identical sometimes to the messenger RNAs, and that can be a concern and something to look for. Make sure that your assay is not picking up pseudogenes.
Doing a multiplex of those messenger RNAs is highly doable. What we recommend doing is collecting the conserved… different targets; in this case, I have four different targets. These ones that are highlighted in green are the junctions that are conserved. So, now we want [inaudible 001654] choose one splice junction from each of these four different targets and put them into a junctions file. So, here are these four different targets; for each one, I chose one of the splice junctions.
And you just run that as a multiplex [inaudible 001711]. And if you get a good result, you’re set to go. On the other hand, you might find that your choice is not a good choice. Maybe [inaudible 001720] happens, you get a low [inaudible 001721] and that is an indication that a particular [inaudible 001727] make a mini amplicon that you want for these assays. And we found two of those in this case that I’ve highlighted in red. These were regions where the RNA was just so folded that we couldn’t get the primers to bind tight enough within the region that we gave it.
So, we have other sites. I’d just choose a different splice junction until you find one where all of them are compatible. And I want to show you a little bit about a demo on PanelPlex and use that as a means to answer some other common questions that we get. This is right on the one that all of you guys use.
Choose the human genome as the target, and hit Next here. All right. So, typically you would give this a name like HER2, HER3; we’re going to do a little assay on that. Now, we have a number of choices here. We can choose to do a multiplex PCR or we could do… If we choose multiplex, then it’s going to try to design all the primers so they’re compatible with each other and can work in a single reaction.
If we choose batch mode singleplex, then what that does is allow you to get a single PCR assay for every single target. In this case, it would make 29 different PCR reactions for this HER2 and HER3. If I chose automated tiling, this is for when you have a single target and you want to have PCRs that are lined up one right after the other and design all of them to cover an exon, for example. I’m not going to cover that example any further today. I want to make sure we get through this multiplexing.
Then in the detection type, you want to choose… Sequencing means there’s going to be no probe design. If you want to design a TaqMan probe, that doesn’t make it any harder to do the design, but it will do both the primers and the probe if you do that. Right? So, sequencing is this sort of typical mode that people will use.
Minimum amplicon gap means that the three prime ends of the reverse primer and the forward primer must be at least 25 nucleotides apart. So, all solutions that don’t satisfy that are not going to be shown to the user. The maximum amplicon gap is 95 nucleotides. So, if the gap is 95 nucleotides, then the total amplicon would be 95 plus the forward primer length plus the reverse primer length. So, you get amplicons in this case around 140 nucleotides, if you choose that.
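The arithmetic behind the gap settings is simple enough to write down. In the sketch below the function names are hypothetical, and the primer length of 23 nucleotides is an assumed example value rather than something the software guarantees.

```python
def amplicon_length(gap: int, forward_len: int, reverse_len: int) -> int:
    """Total amplicon length: the gap between the primers' 3' ends plus both primer lengths."""
    return gap + forward_len + reverse_len


def gap_is_allowed(gap: int, min_gap: int = 25, max_gap: int = 95) -> bool:
    """True when the gap lies within the minimum and maximum amplicon gap settings."""
    return min_gap <= gap <= max_gap


# A 95-nucleotide gap with two assumed 23-nucleotide primers.
print(amplicon_length(95, 23, 23))
print(gap_is_allowed(24))
```

With those assumed primer lengths, the largest allowed gap gives a 141-nucleotide amplicon, which matches the "around 140" figure quoted for circulating tumor DNA.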
This is a setting that is a useful one to set to a bigger value. If you’re not dealing with circulating tumor DNA, then this is something that, in order to give the software a little bit more freedom to do the design, I would set that to 120 nucleotides.
All right, so, this next set, primer length range, this is also a sort of advanced user setting here. If you have a very AT-rich sequence, you know that it’s AT-rich, then this setting of 28 is usually big enough. But if you had something really crazy, you could set that up a little higher, maybe up to 32 nucleotides. If you know that your target is very GC-rich, then you might want to maybe set this region a little bit smaller than 23, down to 18 or so.
But this range would allow you to sort of capture every possibility. The default settings work, like, 99% of the time, but sometimes you have an extreme target where it’s useful to adjust those ranges a little bit. Here, I just opened up this file that had the 29 targets from HER2 and HER3, and it automatically processes it and asks us this question: “Do you want to combine, or do you want to split?” So, there’s just a few where it could have this choice, and I wanted to choose combine, in this case.
So, when it did that, it took the 29 different HER2 and HER3 variants and it combined them, and it automatically set up the forward primer range, the junction range, the reverse primer range, so, these design settings are all there. And you can see which ones it combined. It combined this 488 mutation with the 385 mutation, for example. So, two of them are combined here. Here, there’s a cluster of eight of them. Here’s a cluster of six of them. Insertions and deletions as well as SNPs got combined here. But it was smart to combine them all in this sort of intelligent way.
The sort of default settings we have here for the primer concentration, temperatures, and for the buffer, these are generally applicable but if you have something in particular, your [inaudible 002149] temperature would be a particular one that folks would design… Maybe they do 60 degrees, some other number; whatever your assay temperature is, put that in there. That one is important.
The primers are designed to work at that 60-degree temperature. The probes we designed to work at the extension temperature, if you had probes in your design. In this case, 72 degrees. And then, if you have tails that you want to add to each of your primers, you can do that here. If you have a universal primer sequence and a spacer sequence, you can… generally, the spacer, we can’t put in a random nucleotide for a spacer. We just recommend putting a bunch of A’s. That’s a way to run that. So, you can put universal tails on your primers if you would like.
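Building a tailed primer is just string concatenation, as the sketch below shows. It assumes the usual 5' to 3' layout of universal tail, then spacer, then target-specific primer; the function name and the example sequences are made up for illustration.

```python
def add_tail(primer: str, universal_tail: str = "", spacer_length: int = 0) -> str:
    """Return universal tail + poly-A spacer + target-specific primer, written 5' to 3'."""
    if spacer_length < 0:
        raise ValueError("spacer_length must not be negative")
    return universal_tail.upper() + "A" * spacer_length + primer.upper()


# Made-up sequences: a 4-nt tail, a 3-A spacer, and a 4-nt primer.
print(add_tail("ACGT", universal_tail="GGCC", spacer_length=3))
```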
Then going to the advanced settings here, there’s lots of features here that are described in great detail in the user manual. I’m not going to go through every single one of them, but I’m going to give you some highlights here of some things that are settings you might want to consider changing.
Down here, the minimum SNP, the [inaudible 002247] frequency, I told you we have it set to one percent. So, rare SNPs in the human genome, anything more than one percent will be detected and the primers will avoid that. If you want to set that higher, make it more permissive, you’re welcome to do that. Or you can turn it off altogether by setting that value to one. Set it to one, then everything would be… it would just ignore this SNP.
Some other settings here are a little more advanced. This one, number of solutions to output; some users don’t want to get deluged with hundreds of different multiplex solutions, but we allow up to 254 solutions that you can output, and there’s no extra cost and time. It’s not any harder, it’s just amount of output of alternative multiplexes that you can bear to see. So, in this case, you can do 254, or one. One is the default setting. Most people just want to see that that’s multiplex and try it, see what happens.
Some other things here, this panel limit size… What happens is the space of multiplex, the number of possible multiplexes you can do can be [inaudible 002354] exploding. So, if you have 100 targets, if I saved 10 different primers for each of those 100, then I would have 10 to the 100th power possible combinations. So, we generally recommend using four. This would mean for every single singleplex in your reaction, we’re going to save the four top designs and those are candidates for multiplexing.
If you find that your multiplex fails, you try it and it just cannot find a solution to it, then you could set that a little higher. You could bump it up to eight solutions per panel. But that will increase the runtime, approx… If you double the number of panel [inaudible 002436] you’re going to double, more or less, the runtime of the program.
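The size of the raw combination space is easy to compute, and it shows why the number of saved designs per target is kept small. The function name below is hypothetical, and the calculation only counts combinations; it says nothing about how PanelPlex actually searches them.

```python
def candidate_combinations(num_targets: int, designs_per_target: int) -> int:
    """Number of ways to pick one saved design for each target."""
    return designs_per_target ** num_targets


# 100 targets with 10, 4, or 8 saved designs each.
for designs in (10, 4, 8):
    total = candidate_combinations(100, designs)
    print(designs, "designs per target ->", len(str(total)), "digit count")
```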
Those are sort of the main features that I wanted to show you from the regular PanelPlex. So, we’ve come up on exactly 30 minutes of the talk. I want to now take you back to the presentation and now begin talking a little bit about PanelPlex Consensus, all right? So, let’s spend another half hour just on that topic.
All right, so, the first topic with PanelPlex Consensus… Let me make this full-screen again for you. With PanelPlex Consensus, we have the challenge of gathering together all of the variant genomes that we want it to detect. For example, for the Zika virus, we were able to collect 168 complete genomes of the Zika virus, and we want to make sure that whatever diagnostic we develop is able to detect all of the variants of the Zika virus. So, that’s our inclusivity set.
The inclusivity are variants of what you want to detect. The exclusivity are near-neighbors, things you don’t want to detect; those are things that might cause a false positive in your assay. And generally, the things that we put into the exclusivity are organisms that are either phylogenetically related to the Zika virus, or are examples that might actually show the similar symptoms as the Zika virus, and we want to make sure that any test we have would not give a false positive to those.
So, in this case, we chose a series of fever-causing viruses like Dengue fever causing virus, the [inaudible 002611] virus [inaudible 002612] et cetera. So, we put those into an exclusivity playlist. But it’s important that these are somehow near-neighbors. You don’t want to put the whole kitchen sink into the exclusivity, because that’s not how the program works.
And on the other hand, in the background is anything in your sample that is a contaminant that you do not want to detect. For example, if I’m trying to make a Zika virus diagnostic, I would not want a false positive [inaudible 002638] human genome, and since Zika virus is an RNA virus, we wouldn’t want a false positive to one of the human messenger RNAs. So, I would put the human RNA RefSeq in this background, as well. Gut microbiome, things like that, the human microbiome, those are things that also could be a contaminant in your sample that could cause a problem.
So, unrelated fever-causing virus, like Influenza A; it’s not related to Zika virus at all, but it could present in the clinic similarly in a way that you might want to say, “Well, let’s just make sure an assay doesn’t give a false positive for that.” So, that’s sort of the way we set it up. We have this inclusivity, exclusivity and background, and now I want to give you some more detail about best practices for each of those so that you have some good guidelines on how to do that.
Before I do that, I want to talk about the two key ideas. The quality of the inclusivity, exclusivity and background databases is absolutely essential to getting high-quality design. If you have a garbage database that you input, then no matter how good the software is, you are not going to get good designs out. So, it is really important for the user to spend some time to validate their databases to make sure that they’re good.
One bad sequence can spoil the whole design, so, one bad apple can spoil the whole bunch. Those are very important philosophies to keep in mind. We have become more and more aware, based on projects we’ve done for many customers, that this issue of database quality is often the most time-consuming step of the whole process, and we are actually, as we speak, developing methods to check your database quality. But nonetheless, let’s get into talking about some more best practices about this.
First of all, just the definition of an inclusivity playlist; this is from the user manual. The inclusivity playlist is the collection of target sequences that you want to detect. For example, variants of the Ebola virus. Now, I have an extensive slide here that gives all of our recommendations for having an optimal inclusivity list. We’re going to be providing these slides to anyone who wants them. We can go through those in detail, but I want to hit some of the highlights here of some of the recommendations.
The first recommendation is that it is very important for the inclusivity list to have high-quality sequences. You don’t want to put in all the sequences that you could find from all the databases in the world, including partial sequences. You want full-length, high-quality genomes here because, if you have poor-quality sequences in there, what will happen is the design will not be able to find a solution even though… you can have a SNP in a particular site that’s really not a real SNP; it’s actually a sequencing error. All right?
Or some of the sequences have sequence missing in them, and so, the program will try to avoid those regions because if some of the members of the inclusivity are missing a sequence, it’ll avoid that. And those might have been great locations to do a design. So, generally, you want to collect as many full-length genomes as you can possibly get, and here are some of the databases we commonly use. There’s the NCBI viral genomes databases, Virulogical is a wonderful organization that has curated databases for high-value viruses.
Virus Database and ViPR. Those are the ones that we use very commonly for viruses. For bacteria, there aren’t as many resources, and we have to sort of live with GenBank and trying to wade through those, generally. Buyer beware, caveat emptor: the databases are notoriously unreliable, and you should really be fully aware of that. For example, we recently did a project for Bacillus anthracis where some of the sequences that were called Bacillus anthracis were in fact misnamed, they were misclassified, they should have been called Bacillus cereus.
And also, some of the exclusivity sequences for Bacillus cereus were in fact Bacillus anthracis. So, they were misnamed, and as a result, that sort of wreaked havoc with trying to find… So, what we saw was we ran these assays initially and found low coverage. We got 70% coverage, that kind of thing. Then when we looked more carefully, we realized, “Oh, it’s because it’s got Bacillus cereus in there,” and so we weeded the Bacillus cereus out of that inclusivity, and since the exclusivity also had some anthracis sequences in there, we had to move those into the inclusivity.
So, basically, we sort of had to do a little bit of work to curate those databases and make them more accurate than what was just off the shelf. So, it’s definitely really important to do that. Also, a lot of the annotations in GenBank are incorrect. For example, in Brucella suis and Brucella abortus, those genomes have two chromosomes, chromosome one and chromosome two, and different groups did their sequencing and labeled the chromosomes differently.
That created problems, because we had chromosome ones in our inclusivity, chromosome two in our exclusivity, and we don’t want to be mixing. We don’t want to put things that should be in the inclusivity in the exclusivity because the program is going to try to avoid hits to the exclusivity. And you don’t want things that should be in the exclusivity in your inclusivity. That’s just a very, very important general principle.
One of the things we do to check our playlist is we make a preliminary playlist. So, we just take [inaudible 003225] our sort of data dump from as many databases as we can get, and we use ThermoBLAST to create that playlist, that list of sequences. If you use ThermoBLAST, you can then download the playlist, and what’s nice about that download is that it’s a very short summary of everything in your playlist. And one of the nice features is that it gives the length of each accession.
So, you can use that length, and I recommend sorting by the length. What you can find immediately is some of the sequences are very short, and if they’re very short, that’s something you can weed out of your list. For example, you could also do this from the graphical user interface. You can go to the playlist management part of our software, and it shows you, for each one, you can see the name of the accession, sort of the annotation, and you can see the length of those. You can see some of these HIV genomes are full-length, 9,500 nucleotides or so, and some are just very short fragments here.
And we don’t want to be mixing and matching short fragments with long fragments in the inclusivity, unless you’re targeting a specific gene. Generally, you want to have full-length genomes. In the case of HIV, there’s no shortage of genomes. There are more than 10,000 full-length genomes that have been sequenced to date. So, you don’t want to have all this stuff.
The reason you don’t want the short sequences in there is that it will bias the design to try to cover as many of those short sequences as possible, which means that it will discount the whole rest of the genome. And if some of your sequences are [inaudible 003359] to one gene, and some are [inaudible 003401] to a different gene, you’ll get different results there. Or, poor results for the coverage. So, generally, in the inclusivity list, we want full-length is the best [inaudible 003412].
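As a rough sketch of the sort-by-length check described above, the following Python snippet sorts a playlist summary by length and separates near-full-length entries from short fragments. The accessions, the lengths, and the 90%-of-longest cutoff are all assumptions made for illustration; they are not taken from ThermoBLAST or PanelPlex.

```python
# Hypothetical playlist summary: (accession, length in nucleotides).
# Accessions and lengths are made up for illustration.
playlist = [
    ("ACC_0001", 9719),
    ("ACC_0002", 312),
    ("ACC_0003", 9181),
    ("ACC_0004", 1045),
    ("ACC_0005", 9650),
]


def split_by_length(entries, min_fraction=0.9):
    """Sort by length and separate near-full-length entries from fragments.

    An entry is kept when its length is at least min_fraction of the
    longest entry. min_fraction is an assumed cutoff, not a PanelPlex setting.
    """
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    longest = ordered[0][1]
    keep = [e for e in ordered if e[1] >= min_fraction * longest]
    drop = [e for e in ordered if e[1] < min_fraction * longest]
    return keep, drop


full_length, fragments = split_by_length(playlist)
print("keep:", [acc for acc, _ in full_length])
print("weed out:", [acc for acc, _ in fragments])
```

If you are deliberately targeting a single gene, the same idea applies with the expected gene length as the reference instead of the longest genome.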
Generally we recommend that our users try to use the full genome search; let the program tell you what the best sequence is to go to, but sometimes people have strong feelings about a particular gene they want to go to, and you can use PanelPlex to specify a particular gene site. I will tell you that, as a general rule, 16S ribosomal RNA is a very poor choice of gene for making assays because it’s conserved. It’s widely conserved throughout the entire bacterial domain. So, it’s really not a good choice to try to get specific assays to a specific pathogen when so many things are going to look very similar.
A general rule of thumb is that members of the inclusivity should be about 90% sequence-identical to each other to form a group. If you find that your sequences have less sequence identity than that, if you have members of your inclusivity that are more divergent than that, then that is an indication that you might want to break your panel up into two or more panels, two or more separate designs, because at that point, they’re really not the same virus. If they’re more than 10% different, most biologists consider that to be a different species of the virus. So, you’re really dealing with two separate things.
For example, we did a test once for respiratory syncytial virus, RSV, and we at first just dumped all the RSVs we had into one list, and then we found coverage is around 50% and realized, “Oh, this is really RSV-A and RSV-B.” So, sometimes you don’t realize that until you run the test.
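One way to apply the 90% rule of thumb before running a design is to compute pairwise percent identity on sequences that have already been aligned and flag the pairs that fall below the threshold. The sketch below assumes pre-aligned, equal-length toy sequences; the names, the sequences, and the helper functions are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences.

    Columns where both sequences have a gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def divergent_pairs(aligned, threshold=90.0):
    """Return (name1, name2, identity) for every pair below the threshold."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(aligned.items(), 2):
        identity = percent_identity(s1, s2)
        if identity < threshold:
            flagged.append((n1, n2, identity))
    return flagged


# Toy aligned sequences, made up for illustration.
aligned = {
    "s1": "ACGTACGTACGTACGTACGT",
    "s2": "ACGTACGTACGTACGTACGA",
    "s3": "ACGAACGAACGAACGAACGT",
}

for n1, n2, identity in divergent_pairs(aligned):
    print(f"{n1} vs {n2}: {identity:.0f}% identical, consider separate panels")
```

Here s3 falls below 90% against both other sequences, which is the kind of signal that would suggest splitting the inclusivity.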
Another thing for quality is to look for these sequences… you know, if you actually look at your sequences, you can see they have a lot of ambiguity codes in them. Those are poorly sequenced, they’re usually not very reliable; the software will allow you to use them, but it will avoid those Ns in the design. And if you have several of these with sort of a shotgun, parts of it that you can’t design to, and you have multiple of them like that, it will result in not getting a successful design.
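A quick way to spot poorly sequenced entries is to measure what fraction of each sequence consists of IUPAC ambiguity codes. This is a minimal sketch: the sequences are made up, and the 10% cutoff is an assumption, not a PanelPlex parameter.

```python
# IUPAC nucleotide ambiguity codes (everything other than A, C, G, T/U).
AMBIGUITY_CODES = set("RYSWKMBDHVN")


def ambiguity_fraction(seq):
    """Fraction of positions in seq that are IUPAC ambiguity codes."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in AMBIGUITY_CODES for base in seq) / len(seq)


# Made-up sequences for illustration only.
sequences = {
    "clean_genome": "ACGTACGTACGTACGTACGT",
    "few_ambiguous": "ACGTACGRACGTACGTACGT",
    "poorly_sequenced": "ACGTNNNNNNACGTNNNNGT",
}

MAX_FRACTION = 0.10  # assumed cutoff, not a PanelPlex parameter

for name, seq in sequences.items():
    frac = ambiguity_fraction(seq)
    verdict = "review" if frac > MAX_FRACTION else "ok"
    print(f"{name}: {frac:.0%} ambiguous -> {verdict}")
```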
Making a successful exclusivity list here, with the exclusivity… the exclusivity is much more tolerant. You can put as many poorly sequenced genomes or partial genomes in there as you want. The exclusivity cannot contain genomes that are greater than 90% similar to the inclusivity, all right? So, as I said, if you have a sequence that looks like the inclusivity but it’s in your exclusivity, that will really mess up the design as it’s trying to make things that are specific for the inclusivity.
So, as I mentioned earlier, this exclusivity, you want to put near-neighbors in this, but you don’t want them to be too near, okay? They have to be less than 90% similar.
All right, for the background, the background is even more permissive. This is a place where you put in… you can think about putting in the kitchen sink here. Human genome, human RefSeq, the human microbiome, soil microbes, things of that nature. You can put a lot of things into this background that might be contaminants in your sample and might cause a false positive.
You don’t want to put all of NR in here, the non-redundant database. Why? Because it’s very unlikely that the chimpanzee genome is going to be a contaminant in your actual sample. So, you don’t want the chimpanzee genome, or the rat genome, or the mouse genome, or all the other genomes that are in GenBank, you don’t want those here. You just want things that could possibly be a contaminant in your design, or in your sample.
Now, another topic, I alluded to it a little bit, is the idea of separating that inclusivity into groups. I had that idea with the respiratory syncytial virus. It really was two different viruses and needed to be split into two different ones. How do you know which ones to put in which?
Well, one way is just run PanelPlex and you see 50% coverage, okay, take those 50%; that’s one group. All the ones that were not covered, that’s a separate group, and we do now have a place where it tells you the accessions that were not covered. So, you could just do that.
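The split described above amounts to partitioning the inclusivity list by the accessions reported as not covered. A minimal sketch follows; the accession names are hypothetical, and nothing is assumed about the format of the PanelPlex report beyond having the list of uncovered accessions in hand.

```python
# Hypothetical accession names, for illustration only.
inclusivity = ["RSV_001", "RSV_002", "RSV_003", "RSV_004", "RSV_005", "RSV_006"]
not_covered = {"RSV_002", "RSV_005", "RSV_006"}


def split_inclusivity(all_accessions, uncovered):
    """Partition the inclusivity list into covered and uncovered groups."""
    covered_group = [a for a in all_accessions if a not in uncovered]
    uncovered_group = [a for a in all_accessions if a in uncovered]
    return covered_group, uncovered_group


group_one, group_two = split_inclusivity(inclusivity, not_covered)
coverage = len(group_one) / len(inclusivity)
print(f"first-pass coverage: {coverage:.0%}")
print("group one:", group_one)
print("group two:", group_two)
```

Each group would then be saved as its own inclusivity playlist and designed separately.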
Another is to do this up-front work with a multiple sequence alignment algorithm. Now, this is something, again, we’re working at DNA Software to make a version of multiple sequence alignment that would allow you to do this automatically. But for now, you would have to do this yourself. So, we recommend the tool at the European [inaudible 003838] Institute called [inaudible 003841]. This works great for up to 1,000 sequences. If you have more than 1,000, it’s not going to work. If you have long genomes, that also doesn’t work. So, this works mainly for viruses, not so well for bacteria.
So, these problems with a limited number of sequences and limited lengths are one of the reasons we want to make a native multiple sequence alignment algorithm for PanelPlex. That will be coming out in the future. But one of the things you can do is if you have a [inaudible 003911] and it has not more than 1,000 of them, then you can do these sequence alignments, and these produce a phylogram. The phylogram shows you the groups that are most closely related; in this case, showing an example with the Lassa virus.
We found that there were seven different groups that Lassa… and [inaudible 003930] seven different groups, within the group, they had 93% similarity to each other, average similarity. That, I call Group A. Group B had 93% also. Group C, 97% similarity, et cetera. So, it’s much better to start off trying to get separate assays for each of these different groups rather than trying to do one assay for Lassa, which is impossible. The virus was in fact seven different viruses. They’re all called Lassa, but they’re really so different from each other that they really are different viruses.
So, doing the multiple sequence alignment is helpful for that. Another challenge is this idea of the keystone. How do you choose the best keystone sequence? First of all, what is a keystone? A keystone is the sequence that is going to be used to do design. When we do design of the primers, it actually is only designing to one sequence. That is the keystone sequence, and it’s going to make primers that are a perfect match to that keystone sequence.
Now, the design is only to one, but then it’s checking to see, we actually score each one of those primers, how many of the inclusivity does it cover? How many of the exclusivity does it falsely hit? Right? So, that design is actually taking into account the inclusivity/exclusivity in an indirect way. But the actual thermodynamics of designing primers is done on the keystone.
What you want to do is, the idea here is you want to find the sequence that is the most like all the other sequences in the list, the inclusivity list. And I have an analogy here to just distance. Consider these five dots here that are spaced away from one another. If I just asked you which of those five points is closest to all the other points, maybe by eye you would recognize that point three, point number three here, is the closest to all others.
But suppose you didn’t know that. How would you go about rigorously determining the point that’s closest to all other points? Well, I could measure the distance from one to two, from one to three, one to four, one to five, and get the average distance from one to all others. All right? If I do that, or sum the distances, in this case I get 10.5 if I add up these distances. All right? If I then look at point number two and get all of its distances to everything else, well, it’s close to one, it’s a little further from three, four and five.
If I added those distances, point number two has a total sum distance of 7.5. Point three has a sum distance [inaudible 004205] six. Point four to all others, 6.5; point five has 9.5. So, you can see the one with the shortest distance to all others, adding them all together, is point number three.
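The sum-of-distances procedure can be sketched in a few lines of Python. The positions below are an assumption: the layout of the five dots on the slide is not given in the talk, so the points are placed on a line at hypothetical coordinates chosen so that the totals match the figures quoted (10.5, 7.5, 6, 6.5, 9.5).

```python
# Hypothetical 1-D positions for the five points. The real slide layout is
# not known; these were chosen so the sums match the totals quoted in the talk.
points = {1: 0.0, 2: 1.0, 3: 2.5, 4: 3.0, 5: 4.0}


def sum_distances(positions):
    """For each point, add up its distance to every other point."""
    return {
        name: sum(abs(x - other) for other in positions.values())
        for name, x in positions.items()
    }


totals = sum_distances(points)
closest = min(totals, key=totals.get)

for name, total in totals.items():
    print(f"point {name}: sum of distances = {total}")
print(f"closest to all others: point {closest}")
```

The point with the smallest total is the one closest to all others, which is point three here.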
Now, the equivalent of that idea, this point number three here is the one that we would call our keystone. That’s the one most similar to all others. We want to find something similar to that with our sequence databases, and we can also use a sequence alignment algorithm to do that. One of the outputs of the multiple sequence alignment algorithm is this percent identity matrix; it’s a key output. This percent identity matrix compares, in this case, all 45 sequences to each other to figure out the percent identity for each pair.
If I take the first sequence and I compare it to all the other sequences in the list, those are their percent identities, and I average those, I get 91%. So, this sequence is 91% similar, on average, to all the others. If I do that for each one, I find out over here, this sequence, I look at its percent identity to all others, it averages 97% identity to all other sequences in the list. So, this one is the center of the phylogeny. In other words, this sequence, like that point number three, it’s the most similar to all others, right? You can see this is great. 97% identity, 97.8, this would be…
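The same idea applied to a percent identity matrix: average each row, leaving out the sequence's identity to itself, and take the sequence with the highest average as the keystone candidate. This is a sketch with a small made-up 4x4 matrix, not the 45-sequence matrix from the slide, and the labels are hypothetical.

```python
# Made-up symmetric percent identity matrix for four hypothetical sequences.
labels = ["seq_A", "seq_B", "seq_C", "seq_D"]
pim = [
    [100.0, 92.0, 95.0, 88.0],
    [92.0, 100.0, 97.0, 93.0],
    [95.0, 97.0, 100.0, 96.0],
    [88.0, 93.0, 96.0, 100.0],
]


def mean_identity_to_others(matrix, names):
    """Average each row of the matrix, skipping the diagonal (self) entry."""
    means = {}
    for i, name in enumerate(names):
        others = [value for j, value in enumerate(matrix[i]) if j != i]
        means[name] = sum(others) / len(others)
    return means


means = mean_identity_to_others(pim, labels)
keystone = max(means, key=means.get)

for name, value in means.items():
    print(f"{name}: {value:.1f}% average identity to all others")
print("keystone candidate:", keystone)
```

A clinically important isolate may still be preferred over the most central sequence, so this only produces a candidate.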
Think about the challenge that we’re giving our target analysis or a fast compare algorithm. You want to find the conserved regions, right? So, finding conserved regions where all of them are within 97% of each other, is a lot better than finding conserved regions where they’re 90% identical to each other. This brings us to then talking about some of the challenges of bacteria versus viruses. Viruses, well, they’re small, so that makes them easier, between 1,000 and 40,000 [inaudible 004358] typically, unless you’re dealing with a virus. But most viruses are below 40,000.
The genomes are linear; they’re usually complete, high-quality genomes. But usually, there are a lot more virus genome variants, more variation in viruses than there is in bacteria. So, that idea of more sequence variation requires a very thorough algorithm in order to find regions that are conserved, and our target analysis algorithm [inaudible 004423] is appropriate for dealing with those cases. But it has a limit. It only goes up to 40,000 nucleotides.
So, either if you were doing a virus because it’s less than 40,000… if you’re doing a bacteria, you’re going to need to specify your [inaudible 004436] what gene you want to do. Now, bacteria have their own challenges. For one thing, bacteria are much larger than viruses, about 1,000 times larger. Also, their genomes are usually circular genomes. That creates its own problems. So, there’s a circular permutation of the genes; that is, there’s not an agreed-upon place, if it’s a circle, where to start the genome sequence.
So, different groups will submit their genome sequences with different starting locations, and that is very bad for doing sequence alignments. In addition, there’s no agreement on which strand is the sense strand and which strand is the antisense strand. Well, it’s a circular genome, and you can’t really tell. So, unless everyone in a particular bacterial field agrees on which one’s sense or antisense, you get a mixture of bacterial genome submissions where some are sense and some are antisense.
This can wreak havoc with doing design, and we need to deal with all that. So, we’ve made this new algorithm called fast compare, which takes all of these issues into account and allows us to run whole genome search even for bacteria, and takes into account that they’re circular genomes. All these things are taken into account in a transparent way. All you need to do is there’ll be a new feature called [inaudible 004553] usage that, you just select Use Fast Compare.
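To make the circular permutation and strand problem concrete, the toy sketch below checks whether two submissions are the same circular sequence up to a different start position or the opposite strand. This is not the fast compare algorithm, whose internals are not described in the talk; it only handles exactly identical sequences, and the example genome is made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_rotation(a, b):
    """True if b is a circular rotation of a (same circle, different start)."""
    return len(a) == len(b) and b in a + a


def same_circular_genome(a, b):
    """True if a and b match up to start position and strand choice."""
    a, b = a.upper(), b.upper()
    return is_rotation(a, b) or is_rotation(a, reverse_complement(b))


# Made-up toy "genome" and two alternative submissions of it.
genome = "ATGCGTACCTGA"
rotated = genome[5:] + genome[:5]      # same circle, different start
flipped = reverse_complement(rotated)  # different start, opposite strand

print(genome == rotated)                       # plain string comparison fails
print(same_circular_genome(genome, rotated))   # rotation-aware check
print(same_circular_genome(genome, flipped))   # rotation- and strand-aware check
```

Real genomes also differ by mutations, so a practical tool has to combine this idea with approximate matching.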
If your sequence is longer than 40,000, it’ll force you to use fast compare. If your sequence is shorter than 40,000 you can choose either of the two methods. Here’s PanelPlex Consensus, when you just sort of choose a…
Here’s PanelPlex Consensus. Here we need to choose an inclusivity and an exclusivity and a background, so I would have had to pre-make these panels. I already made a Zika virus inclusivity panel, a Zika virus exclusivity, and you could choose the human genome as your background, for example, or whatever playlist you make up. All right? Next…
Here we have a choice of choosing different keystones. Sometimes there’s a good reason, you know a keystone… like, this Brazil isolate is one of the clinically most important isolates, and I might choose that as my keystone. Another idea would be for you to just suggest a keystone, because you did that multiple sequence alignment ahead of time and you know which one you really want, but that’s a general principle there for choosing the keystone.
Beyond that, I’ll just continue showing you a few more of the details here. We can give a name for some bacteria. Oh, this is the Zika virus; I’ll give it a Zika name here. You can choose whether you want to have TaqMan detection, or whether you want to have the sequencing primers, can do either. If you choose whole target, then it will try to use the entire… Zika virus is about 10,000 nucleotides. You don’t need to find the gene; it will find the most conserved sequence among those Zika virus inclusivity that you gave that are conserved in the inclusivity, but not found in the exclusivity. And also that don’t hit the background.
You can also specify a design range; if you know a particular gene and you know that it occurs between nucleotides 1,000 and 3,000, then you can limit it to that gene if you want. But generally, if your sequence is less than 40,000 nucleotides, we recommend that you use whole target. Now, this version of the software doesn’t have the fast compare in there yet. Once you have fast compare there, you’d be able to do whole target even for bacteria, but you’ll have to use fast compare for it.
Moving forward, yes, again, we have some basic hybridization conditions and strand concentrations. I don’t think we need to belabor those points. In the advanced configuration, I do want to go into here and give you a few words about this particular advanced parameter. Some of the sort of highlights.
Number of candidates to output; here, the default is 100 candidates, so it’s giving me 100 primer pairs. Usually that is more than enough for you to find some really great primer solution. This setting right here is a very important one: number of primer pairs per solution. So, let’s say in the Zika virus case, if it’s able to find a set of primers that will cover 100%, all 168 variants of the Zika virus, then one solution is enough. A singleplex is enough to cover that assay.
If, on the other hand, let’s say 80% of the Zika viruses are covered with one primer set, but 20% are not. Then the program will cycle through and it will, as shown here, it will find a set of final designs, let’s say coverage is 80%, so the 20% that are not covered will get redesigned automatically. [inaudible 004917] come through and make another set of primers, so that’s what will happen on round two, a separate set of primers that are compatible with the first set of primers, so, a multiplex will be made here.
And this will allow up to three rounds of design. If it finds a good solution after one round, it’ll stop, but it’ll do up to three rounds. So, if it’s going three rounds, it’s literally doing three full rounds of design, so the program will take longer to run if you give it…
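The round-by-round behavior can be pictured as a loop that keeps adding primer sets until everything is covered or the round limit is reached. The sketch below is a simplification under stated assumptions: it picks from a fixed table of hypothetical candidate primer sets with known coverage, whereas the program described above redesigns against the uncovered variants each round, and primer compatibility is not modeled at all.

```python
def design_rounds(variants, candidate_coverage, max_rounds=3):
    """Greedy sketch: each round, add the candidate primer set that covers
    the most still-uncovered variants; stop early once all are covered."""
    uncovered = set(variants)
    chosen = []
    for _ in range(max_rounds):
        if not uncovered:
            break
        best = max(
            candidate_coverage,
            key=lambda name: len(candidate_coverage[name] & uncovered),
        )
        gained = candidate_coverage[best] & uncovered
        if not gained:
            break  # no candidate helps any further
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical variants and candidate primer sets, for illustration only.
variants = [f"v{i}" for i in range(1, 11)]
candidate_coverage = {
    "set_1": {f"v{i}" for i in range(1, 9)},  # covers 80% of the variants
    "set_2": {"v9"},
    "set_3": {"v9", "v10"},
}

chosen, uncovered = design_rounds(variants, candidate_coverage)
print("primer sets chosen:", chosen)
print("still uncovered:", sorted(uncovered))
```

In this toy case the first round covers 80%, the second round picks up the remaining 20%, and the third round is never needed.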
A minimum amplicon [inaudible 004947]. This says that a primer must bind at least 30% to be considered a valid primer. So, if you have a very folded target, like an RNA target, and it is impossible to bind that tightly, then you’ll get a failure. The program will say, “Low amplicon percent [inaudible 005006].” We’ve been working very hard, by the way, to make error messages more informative for the user, as we’ve heard your pain on those issues where, “Hey, it failed but I don’t know why.” We’re really trying to figure out the reasons why failures happen, and give informative answers to the user that allow them to adjust their panels appropriately.
So, this is one, if you have an RNA target, you could consider lowering that. It’s okay. If it’s 25%, it might be okay. It still works. But typically, these are only for that first round of PCR, you get a low binding; after it starts making the amplicon, then it is okay, but it’s getting that initial round of PCR to work is the most challenging part, and that filter is very important.
Changing the length ranges can also help. If you can tolerate longer amplicons, then by all means, give it a little freedom to have a larger amplicon gap here. Maximum amplicon gap of 200 even is just fine for pathogen diagnostics.
I think that’s pretty much the advanced parameters that I wanted to cover for you today for regular PanelPlex Consensus…