Described herein are methods and systems for predicting the genotype of one or more genes utilizing high throughput sequencing data. The provided methods and systems allow for accurate genotyping of genes, including ADME genes, and can be used to identify novel alleles.
Legal claims defining the scope of protection, as filed with the USPTO.
. A method for genotyping a gene, the method comprising:
. The method according to, wherein the method further comprises refining a genotype for an allele where two or more star-alleles were selected by assigning each of the selected star-alleles a penalty score based on minimizing a number of missing and additional non-functional variations in the allele in order to match the database as closely as possible, and calling a genotype for the allele as that genotype associated with one or more star-alleles having the lowest penalty score.
. The method according to, wherein the gene is an absorption, distribution, metabolism, and excretion (ADME) gene.
. The method according to any one of, wherein the method is repeated for one or more additional genes.
. The method according to any one of, wherein the high throughput data is targeted hybrid capture with consistent read distribution data or whole genome sequencing (WGS) data.
. The method according to, wherein the hybrid capture with consistent read distribution data is PGRNseq data.
. The method according to any one of, wherein the high throughput data is received as a FASTQ file, a uSAM file, or a uBAM file.
. The method according to any one of, wherein alignment of target sample reads of the high throughput sequencing data is accomplished by BWA-MEM, BWA-backtrack, BWA-SW, LAST, Partek Flow, Bowtie 2, Stampy, SHRiMP2, SNP-o-matic, CLC Workbench, NextGenMap, Mosaik, ERNE-MAP, mrFAST, or mrsFAST-Ultra.
. The method according to any one of, wherein alignment of target sample reads of the high throughput sequencing data comprises performing a local indel realignment.
. The method according to any one of, wherein nucleic acid sequence variants are identified by FreeBayes, MuTect2, or SAMtools.
. The method according to any one of, wherein the high throughput sequencing data coverage is consistent across samples but non-uniform across regions of the gene and sequencing depth is not known in advance.
. The method according to any one of, wherein identifying gene-disrupting mutations comprises comparing nucleic acid sequence variants identified in the allele and/or structural variation detected in the allele to the known alleles of the reference genome database.
. The method according to any one of, wherein selecting a set of star-alleles that most closely match the identified one or more gene-disrupting mutations or lack of gene-disrupting mutations in the allele comprises:
. The method according to, wherein gene structural arrangement is determined by the method of.
. The method according to any one of, wherein the method is executable using a suitably programmed computer.
. The method according to, wherein the method according to any one of improves the computational capacity of the suitably programmed computer.
. A system for predicting a genotype of one or more genes, the system comprising:
. The system of, wherein the at least one processor comprises a sequence aligner, a sequence variant identifier, a structural variant identifier, a gene-disrupting mutation identifier, a star-allele identifier, and a genotype caller.
. The system of, wherein:
. The system according to any one of, wherein elements of the system are integrated into a standalone system located at a single site.
. The system according to any one of, wherein one or more elements of the system are located remotely with respect to each other.
. A system for predicting a genotype of one or more genes, the system comprising:
Complete technical specification and implementation details from the patent document.
This PCT application claims priority to U.S. Provisional Application No. 62/508,870, filed on May 19, 2017, the entire disclosure of which is expressly incorporated herein by reference for all purposes.
This invention was made with government support under GM108348 awarded by National Institutes of Health. The government has certain rights in the invention.
The present disclosure relates generally to methods and systems for predicting the genotype of one or more genes utilizing high throughput sequencing data.
The use of genetic testing in personalized medicine is increasingly allowing for movement away from a standard of care that has been generalized for the population as a whole and toward a more personalized, genome-based approach aimed at preventing, diagnosing, and treating disease in the individual.
The rapid development of high throughput sequencing (HTS) technologies has made a considerable impact on clinical genomics research. In principle, modern HTS platforms offer time-efficient, cost-effective, and highly accurate means for genotyping clinically relevant genes. However, analyzing the sequencing data has posed problems in certain instances. Since many functionally and clinically important genes are highly polymorphic and have multiple copies, as well as sequence-wise similar pseudogenes with which they frequently hybridize/fuse to produce novel alleles, analyzing the sequence data is highly challenging from a computational point of view. In addition, some of these genes have been subject to structural alterations, making their allelic decomposition (i.e., determining the number of copies of a gene and the exact sequence content of each of its copies) computationally difficult.
Current computational tools are unable to utilize HTS data to perform allelic decomposition of genes that have been subject to structural alterations. Available structural variation detection tools aim to identify the type and locus of large “structure altering events” (e.g., large-scale deletions, novel sequence insertions, segmental duplications, and inversions), typically in uniquely mappable regions of the genome. In contrast, available copy number alteration detection/copy number phasing tools aim to identify the number of copies of a particular gene in each chromosome under the implicit assumption that the gene duplications or deletions always affect the entire (rather than a part of the) gene of interest, but do not reconstruct the exact sequence content of the gene. While certain methods have been developed that utilize small variants for copy number phasing, these methods are limited to detecting copy number changes only. They cannot also determine the exact sequence content of each copy of a gene that has been subject to structural alterations. No existing tool aims to find out what happens when structural alterations affect genes with multiple copies or those with highly homologous pseudogenes. Such genes are algorithmically difficult to resolve, as reads that originate from such genes have high mapping ambiguity.
Those genes involved in the Absorption, Distribution, Metabolism, and/or Excretion (ADME) of pharmaceutical compounds are examples of highly polymorphic genes having multiple copies.
Accurately determining the genotypes of genes involved in Absorption, Distribution, Metabolism, and/or Excretion (ADME) is essential to drug treatment and dosage decisions, and is highly recommended prior to treatment with certain drugs. Unfortunately, existing array-based genotyping assays are limited in scope in that they do not cover all genes and all potential variants of each gene, can be costly, and are sometimes inaccurate.
Genotyping ADME genes can play an important role in identifying responders and non-responders to pharmaceutical compounds, avoiding adverse events, and optimizing drug dose, and assist treatment and dosage decisions for more than 90% of all prescribed drugs.
Targeted genotyping platforms, like Affymetrix DMET™ Plus arrays and the Illumina ADME assays, are able to detect the common set of predefined variations and genotypes. However, rare variations that impact drug response are common across sites. The PGRNseq capture protocol was recently introduced, and offers a targeted sequencing platform for ADME genes. The PGRNseq protocol currently targets 84 ADME genes. Algorithmic challenges in exact ADME genotyping exist, however, due to the presence of pseudogenes, structural rearrangements, and copy number variation. This has resulted in a major roadblock to the use of HTS platforms in pharmacogenomics analysis. Additional obstacles, such as the short read lengths offered by prominent sequencing technologies and non-uniformity of sequencing coverage for alternate sequencing technologies, further complicate ADME genotyping.
The methods and systems described herein can be used to predict the genotype of one or more genes utilizing high throughput sequencing data. The provided methods and systems allow for accurate genotyping of genes, including ADME genes, and can be used to identify novel alleles.
In some embodiments, methods for genotyping a gene comprise receiving high throughput sequencing data for the gene from a target sample, wherein the high throughput sequencing data comprises a plurality of target sample reads; aligning the target sample reads of the high throughput sequencing data to one or more star-alleles of a reference genome allele database, wherein the reference genome allele database comprises nucleic acid sequences for known star alleles of the gene; identifying one or more nucleic acid sequence variants, or a lack of nucleic acid variants, in an allele of the gene relative to the one or more star-alleles of the reference genome allele database; detecting structural variants or a lack of structural variants in the allele; identifying one or more gene-disrupting mutations (i.e., functional mutations) or a lack of gene-disrupting mutations in the allele; selecting a set of one or more reference star alleles from the reference genome allele database that most closely match the identified one or more gene-disrupting mutations or a lack of gene-disrupting mutations in the allele; and calling, for an allele where a single star-allele was selected, a genotype associated with the selected star-allele.
In some embodiments, the method further comprises refining a genotype for an allele where two or more star-alleles were selected by assigning each of the selected star-alleles a penalty score based on minimizing a number of missing and additional non-functional variations in the allele in order to match the database as closely as possible, and calling a genotype for the allele as that genotype associated with one or more star-alleles having the lowest penalty score.
In some embodiments, the gene to be genotyped is an absorption, distribution, metabolism, and excretion (ADME) gene.
In some embodiments, the method is repeated for one or more additional genes.
In some embodiments, the high throughput data is targeted hybrid capture with consistent read distribution data or whole genome sequencing (WGS) data. In some embodiments, the hybrid capture with consistent read distribution data is PGRNseq data.
In some embodiments, the high throughput data is received as a FASTQ file, a uSAM file, or a uBAM file.
In some embodiments, alignment of target sample reads of the high throughput sequencing data is accomplished by BWA-MEM, BWA-backtrack, BWA-SW, LAST, Partek Flow, Bowtie 2, Stampy, SHRiMP2, SNP-o-matic, CLC Workbench, NextGenMap, Mosaik, ERNE-MAP, mrFAST, or mrsFAST-Ultra.
In some embodiments, alignment of target sample reads of the high throughput sequencing data comprises performing a local indel realignment.
In some embodiments, nucleic acid sequence variants are identified by FreeBayes, MuTect2, or SAMtools.
In some embodiments, the high throughput sequencing data coverage is consistent across samples but non-uniform across regions of the gene and sequencing depth is not known in advance.
In some embodiments, detecting structural variations in the allele comprises the steps of: estimating a gene copy number for one or more regions of the allele; determining an observed coverage for each of the one or more regions; and identifying an optimal gene arrangement by determining a minimal difference between the observed coverage for each of the one or more regions and coverage formed by one or more known possible gene arrangements, wherein a structural rearrangement is detected when the optimal gene arrangement is not a reference gene arrangement.
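The comparison between observed coverage and the coverage expected under each known gene arrangement can be illustrated with a minimal Python sketch. It is not the disclosed implementation: it assumes that coverage has already been normalized to copy-number units per region, that the "difference" is a sum of absolute per-region differences (the text does not fix a metric), and that the region names, arrangement names, and values are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical representation: region name -> copy number (or normalized coverage).
Profile = Dict[str, float]


def coverage_distance(observed: Profile, expected: Profile) -> float:
    """Sum of absolute per-region differences (an assumed distance metric)."""
    return sum(abs(observed[r] - expected.get(r, 0.0)) for r in observed)


def best_arrangement(observed: Profile, known: Dict[str, Profile]) -> Tuple[str, float]:
    """Return the known arrangement whose expected coverage is closest to the observed one."""
    name = min(known, key=lambda k: coverage_distance(observed, known[k]))
    return name, coverage_distance(observed, known[name])


# Illustrative (made-up) expected profiles for three possible arrangements.
known_arrangements = {
    "reference": {"exon1": 2.0, "exon2": 2.0, "exon3": 2.0},
    "whole_gene_deletion": {"exon1": 1.0, "exon2": 1.0, "exon3": 1.0},
    "partial_duplication": {"exon1": 2.0, "exon2": 3.0, "exon3": 3.0},
}
observed_coverage = {"exon1": 2.1, "exon2": 2.9, "exon3": 3.2}

arrangement, distance = best_arrangement(observed_coverage, known_arrangements)
# A structural rearrangement is flagged when the optimum is not the reference arrangement.
rearranged = arrangement != "reference"
```

In this toy input the partial duplication profile is the closest match, so a rearrangement would be reported.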
In some embodiments, identifying gene-disrupting mutations comprises the steps of comparing nucleic acid sequence variants identified in the allele and/or structural variation detected in the allele to the known alleles of the reference genome database.
In some embodiments, selecting a set of star-alleles that most closely match the identified one or more gene-disrupting mutations or lack of gene-disrupting mutations in the allele comprises the steps of: receiving a nucleic acid sequence for each known gene allele of the reference genome database; excluding nucleic acid sequences of known gene alleles that are not in agreement with a determined allele structural arrangement; excluding nucleic acid sequences of known gene alleles of the reference genome database that include neutral mutations; and selecting one or more known gene alleles of the reference genome database that most closely match the identified one or more gene-disrupting mutations or lack of gene-disrupting mutations in the allele.
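As a rough illustration of the selection steps above, the following Python sketch filters a list of known alleles by structural arrangement, drops entries that carry neutral mutations, and keeps the entries closest to the observed gene-disrupting mutations. The `KnownAllele` type, the mutation labels, and the use of symmetric set difference as the closeness measure are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class KnownAllele:
    """Hypothetical database entry for one known star-allele."""
    name: str
    arrangement: str
    disrupting: FrozenSet[str]
    neutral: FrozenSet[str] = frozenset()


def select_star_alleles(observed: FrozenSet[str], arrangement: str,
                        database: List[KnownAllele]) -> List[str]:
    """Return the names of the known alleles that best match the observed mutations."""
    # Exclude alleles with a different structural arrangement or with neutral mutations.
    candidates = [a for a in database
                  if a.arrangement == arrangement and not a.neutral]
    if not candidates:
        return []
    # Assumed closeness measure: size of the symmetric difference of mutation sets.
    best = min(len(a.disrupting ^ observed) for a in candidates)
    return [a.name for a in candidates if len(a.disrupting ^ observed) == best]


# Made-up database; "m1", "n1", etc. are placeholder mutation labels.
database = [
    KnownAllele("*1", "reference", frozenset()),
    KnownAllele("*2", "reference", frozenset({"m1", "m2"})),
    KnownAllele("*2A", "reference", frozenset({"m1", "m2"}), frozenset({"n1"})),
    KnownAllele("*4", "reference", frozenset({"m3"})),
    KnownAllele("*5", "deletion", frozenset()),
]

exact = select_star_alleles(frozenset({"m1", "m2"}), "reference", database)
ambiguous = select_star_alleles(frozenset({"m1"}), "reference", database)
```

When several alleles tie, as in the second call, more than one star-allele is selected and a further refinement step would be needed to choose among them.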
In some embodiments, the methods described herein are executable using a suitably programmed computer. In some embodiments, the methods described herein improve the computational capacity of the suitably programmed computer.
Also described herein are systems for predicting a genotype of one or more genes. In some embodiments, a system comprises: a sample generator; a sequencer; at least one database having information regarding the one or more genes; and a sequence analyzer comprising a user interface and a system controller comprising at least one processor configured to perform the method according to an embodiment described herein. In some embodiments, the at least one processor comprises a sequence aligner, a sequence variant identifier, a structural variant identifier, a gene-disrupting mutation identifier, a star-allele identifier, and a genotype caller.
In some embodiments, the sequence aligner is configured to align target sample reads of the high throughput sequencing data to a reference genome database; the sequence variant identifier is configured to identify nucleic acid sequence variants in a gene allele relative to the reference genome database; the structural variant identifier is configured to detect structural variants or a lack of structural variants in the gene allele; the gene-disrupting mutation identifier is configured to identify one or more gene-disrupting mutations or a lack of gene-disrupting mutations in the gene allele; the star-allele identifier is configured to identify one or more star-alleles comprising the identified one or more gene-disrupting mutations or lack of gene-disrupting mutations; and the genotype caller is configured to determine the allele to have the genotype associated with the identified one or more star-alleles.
In some embodiments, elements of a system described herein are integrated into a standalone system located at a single site. In other embodiments, elements of a system described herein are located remotely with respect to each other.
Corresponding reference characters indicate corresponding parts throughout the several views.
The embodiments disclosed herein are not intended to be exhaustive or limit the disclosure to the precise forms disclosed in the following detailed description. Rather, the embodiments are chosen and described so that others skilled in the art may utilize their teachings.
Described herein are a tool and methods that can utilize HTS data to perform allelic decomposition of genes that have been subject to structural alterations. In some embodiments, the methods comprise i) finding out how many copies of a gene there are and which HTS read belongs to which copy (i.e., mapping ambiguity resolution), and ii) implicitly or explicitly assembling each copy of the gene from the read set (this is inherently intermingled with mapping ambiguity resolution) and identifying the gene copy's origins relative to a reference genome. In some embodiments, the methods further comprise a) identifying all structural alteration breakpoints and carefully reconstructing the sequence content of each breakpoint region, while taking into account all micro-structural alterations, indels, and single nucleotide variants (SNVs) each copy of the gene has been subject to, and b) identifying fusions/hybridizations between the gene and its highly homologous pseudogenes. The tools and methods available to date fail to address these issues.
Existing structural variation discovery tools are based on the following general strategies: detection of variants using discordantly mapping paired-end reads (e.g., Variation Hunter (Hormozdiari, F. et al. (2009), Genome Res, 19:1270-1278; Hormozdiari, F. et al. (2010), Bioinformatics, 26:i350-i357) and Hydra (Quinlan, A. R. et al. (2010), Genome Res, 20:623-635), which report only the rough loci of structural variants but not their sequence content); detection of variants using split-read mappings (e.g., Socrates (Schroeder, J. et al. (2014), Bioinformatics, 30:1064-1072)); and detection of variants by mapping de novo assembled contigs to a reference genome (e.g., Barnacle (Swanson, L. et al. (2013), BMC Genomics, 14:550) and Dissect (Yorukoglu, D. et al. (2012), Bioinformatics, 28:i179-i187), which are RNA-Seq analysis tools that can also be used to analyze genomic data).
There are several tools available that employ a combination of these strategies (e.g., Pindel (Ye, K. et al. (2009), Bioinformatics, 25:2865-2871), Delly (Rausch, T. et al. (2012), Bioinformatics, 28:i333-i339), novoBreak (Chong, Z. et al. (2017), Nat. Methods, 14:65-67), and GASVPro (Sindi, S. et al. (2012), Genome Biol, 13:R22)), which only consider uniquely mapped reads and cannot identify alterations in repetitive DNA. No available tool, even those designed to identify gene fusions only (e.g., deFuse (McPherson, A. et al. (2011), PLoS Comput Biol, 7:1-16)), aims to reconstruct the sequence content of a fusion between a gene and a highly similar pseudogene. Further, no existing tool aims to infer variants from targeted capture sequencing data which are highly non-uniform in coverage (e.g., PGRNseq, which is discussed herein). Even tools that aim to genotype a particular gene such as the ADME gene CYP2D6, namely Cypiripi (Numanagic, I. et al. (2015), Bioinformatics, 31:i27-i34) and Astrolabe (formerly Constellation; Twist, G. P. et al. (2016), NPJ Genomic Med, 1:15007), respectively, work only on uniform coverage sequencing data, or can determine the gene's sequence content only if it differs from a reference sequence by SNVs but not structural variation.
The methods provided herein address these challenges, and provide for the first time a framework to perform allelic decomposition of any gene of interest in HTS data. In some embodiments, the methods provided herein can perform allelic decomposition of any gene that differs from a reference genome by i) SNVs, ii) short indels, iii) full gene duplications or deletions (leading to copy number alteration), iv) partial gene duplications or deletions, as well as v) “balanced” fusions (i.e., those that preserve the structure of a gene) with highly homologous pseudogenes (the fusions can have one or more breakpoints). In some embodiments, all possible combinations of genomic alterations are identified, and the sequence content of all copies of a gene is determined in whole genome or targeted genome sequencing data.
Over 300 genes have been identified to participate in some way in the Absorption, Distribution, Metabolism and/or Excretion (ADME) of pharmaceutical compounds. Of these, 32 have been identified as core ADME genes, essential to ADME of pharmaceutical compounds. An additional 184 genes have been identified as related ADME genes, which includes those genes determined to be related to ADME of pharmaceutical compounds. ADME genotyping plays an important role in identifying responders and non-responders to pharmaceutical compounds, avoiding adverse events, and optimizing drug dose. Over 230 FDA-approved drugs provide pharmacogenomic information in their labeling.
Advances in DNA sequencing over the past two decades have made it possible to explore the human genome in unprecedented detail. Whole genome sequencing (WGS) is now routinely performed in less than a day, and the Illumina HiSeq X sequencing system has driven the cost of WGS under $1,000 per sample. Furthermore, Illumina-style WGS data offers high coverage depth, uniform read distribution, and low error rates, all of which are useful for genotyping purposes. However, WGS is still considered costly and time-consuming compared to the targeted genotyping panels. Whole exome sequencing (WES) provides a cheaper alternative to WGS, but in its current iteration, it is not able to sequence non-coding regions. This makes WES unsuitable for genotyping of ADME pharmacogenes, where variations in the non-coding regions can significantly affect phenotype.
Much of clinical genotyping is still performed through targeted genotyping panels. These targeted genotyping panels, like Affymetrix DMET™ Plus arrays and the Illumina ADME assays, are able to detect a common set of predefined variations and genotypes. However, rare or personal variants, while functionally significant, often cannot be captured by these panels. Rare variants of pharmacogenes (e.g., CYP2D6) can impact drug response. As a result, new HTS-based targeted captures are being introduced to help identify novel variants in a cost-effective manner. A prime example is the PGRNseq capture protocol, which was recently introduced and offers a targeted sequencing platform for ADME genes. The PGRNseq protocol currently targets 84 ADME genes (Table 1; see Gordon, A. S. et al. (April 2016), Pharmacogenet Genomics, 26(4):161-168 (Epub January 2016), which is hereby incorporated by reference in its entirety). For each of these genes, PGRNseq covers at least its exonic region and a few kilobases upstream and downstream of the gene's untranslated region (UTR), covering more than 960 KB of the human genome in its first iteration (PGRNseq v.1). PGRNseq maintains backward compatibility with previous DMET™ Plus and Illumina ADME assays by targeting all single nucleotide variations (SNVs) included in those panels. PGRNseq capture products are sequenced on the Illumina HiSeq 2000 platform, which provides low error rates while maintaining very high depth of coverage (averaging 500× per chromosome). Most importantly, PGRNseq is significantly less expensive than WGS and even WES. For example, PGRNseq is up to ten times less expensive than WGS. Thus, PGRNseq offers a competitively priced platform for clinical genotyping of targeted genes, while providing all the benefits of standard WGS.
Even though PGRNseq (or WGS) data for some of the pharmacogenes are relatively straightforward to interpret, other genes, such as CYP2D6, have proven difficult to analyze.
PGRNseq inherits some of the problems that come with WGS, including short read length and data interpretation issues. Genotype inference for ADME genes harboring various structural rearrangements still presents a major challenge. In order to assist the analysis of such structural variants, the second iteration of PGRNseq covers the whole genic clusters containing the targeted ADME genes (e.g., for CYP2D6, the entire 30 KB CYP2D cluster is sequenced, which includes CYP2D6 and pseudogenes CYP2D7 and CYP2D8).
The present disclosure provides the first computational tool to exactly genotype ADME genes based on PGRNseq and Illumina WGS data, with the capability to accurately handle gene duplications, fusions, and genomic deletions. Available CYP2D6 genotyping tools such as Cypiripi (see Numanagic, I. et al. (2015), Bioinformatics, 31:i27-i34) and Astrolabe (see Twist, G. P. et al. (2016), NPJ Genomic Med, 1:15007) are not only limited to uniform coverage WGS data, but are also unable to properly detect some structural rearrangements. And while genotyping CYP2D6 is valuable in that its encoded enzyme is involved in metabolism of 20-25% of clinically prescribed drugs, neither of these tools provides support for genotyping other ADME genes. Previous attempts to analyze PGRNseq data, which relied on SNP callers to infer genotype, resulted in Mendelian inconsistencies (see Gordon, A. S. et al. (April 2016), Pharmacogenet Genomics, 26(4):161-168 (Epub January 2016)). The methods and systems described herein are demonstrated on a large selection of WGS and PGRNseq samples to be a highly accurate and very fast tool for ADME genotyping.
The tool and methods described herein are not limited to the genotyping of ADME genes, and can be applied broadly to any gene or group of genes. While embodiments and examples provided herein describe the genotyping of ADME genes, the described methods can be similarly applied to other genes by those of skill in the art having the benefit of the present disclosure.
The tool and the methods described herein are capable of reconstructing the structure and sequence content of each copy of a particular gene present in a sample being analyzed. Following the well-established star-allele nomenclature in pharmacogenomics, a star-allele of a gene is defined as a gene sequence which differs from the “wild type” (or canonical) gene sequence by a (non-empty) set of mutations. Thus, reconstructing the sequence content of a gene copy is identical to identification of the gene copy's star-allele, which could either be already known or possibly novel.
Two types of mutations, and as a consequence, star-alleles, are distinguished. Any mutation that has an impact on the resulting protein product of the gene is referred to herein as a “gene-disrupting mutation” (also known as a “functional mutation”). Gene-disrupting mutations include codon-changing single nucleotide polymorphisms (SNPs) and indels, as well as mutations outside the coding regions that affect the protein enzyme activity. Star-alleles which are defined solely by gene-disrupting mutations are referred to as “major star-alleles,” and are assigned a unique number. For example, the canonical “wild type” star-allele is always assigned *1, while *2 describes a star-allele that harbors one or more gene-disrupting mutations compared to *1. If a new major star-allele is discovered that has not previously been reported in the literature, the new star-allele is identified by *(n+1), where n is the number of major star-alleles known up to that point. It is possible that two major star-alleles can share a common mutation.
A mutation that does not impact the protein product is referred to herein as a “neutral mutation” (also known as a “non-functional mutation”). Any major star-allele can be extended with neutral mutations, and such an extension is referred to as a “minor star-allele.” If a copy of a gene includes only neutral mutations, then it is considered to be an extension of the wild-type star-allele. In order to distinguish various minor star-alleles, a unique symbol, and in certain instances, a pair of symbols, is attached to the major star-allele's number for each such extension. For example, minor star-allele *2A is formed by taking the set of gene-disrupting mutations for the major *2 allele and extending it with some neutral mutations; *2B is formed in a similar manner, however the sets describing the neutral mutations of *2A and *2B are not identical, although the sets describing their gene-disrupting mutations are. If a new minor star-allele that is an extension to the star-allele *κ is discovered, it is commonly called *κX where X is the lexicographically smallest letter which has not yet been used for minor alleles of *κ.
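The two labeling conventions just described (a new major star-allele numbered one past the count of known majors, and a new minor star-allele taking the smallest unused letter) can be sketched in Python as follows. The function names are hypothetical, and the sketch handles only single-letter suffixes; the paired-symbol case mentioned above is deliberately left out.

```python
import string
from typing import Collection


def next_major_label(known_major_labels: Collection[str]) -> str:
    """Label a novel major star-allele as *(n+1), where n is the count of known majors."""
    return f"*{len(set(known_major_labels)) + 1}"


def next_minor_label(major_label: str, used_letters: Collection[str]) -> str:
    """Label a novel minor star-allele with the smallest unused uppercase letter."""
    for letter in string.ascii_uppercase:
        if letter not in used_letters:
            return f"{major_label}{letter}"
    # Two-symbol suffixes are not modeled in this simplified sketch.
    raise ValueError("all single-letter suffixes are already in use")


new_major = next_major_label(["*1", "*2", "*3"])   # three known majors
new_minor = next_minor_label("*2", {"A", "B"})     # *2A and *2B already exist
```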
In view of these definitions, the tool and methods described characterize the sequence composition of each copy of a gene present in a sample, which is by definition equivalent to inferring the major and minor star-allele label of such gene copy. Where there is a need to define a new star-allele, this is done by minimizing the number of novel mutations and structural variations that need to be added to or subtracted from a known star-allele to describe the new star-allele. In certain aspects, and as depicted in, the methods for genotyping (i.e., characterizing the sequence composition of a gene) comprise the steps of: 1) read alignment and mutation detection, where high throughput sequencing (HTS) reads are aligned to a reference genome and mutations present in a target gene region are identified; 2) copy number and structural variation estimation, where copy number of the gene is identified, and if present, various structural variations are identified; 3) major star-allele identification, where the major star-allele of each gene copy is established; 4) genotype refinement, where the supporting set of neutral mutations is assigned to each major star-allele, and the “score” of such an assignment (see “Genotype Refining” section) is used to rank each allelic configuration identified; and 5) genotype calling, where the final genotype (i.e., minor star-allele) is obtained by choosing the set of allelic configurations with the best ranking score. In instances where multiple configurations have the same score, all configurations will be reported as equally likely genotypes.
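Steps 4 and 5 can be illustrated with a short Python sketch that scores candidate minor star-alleles by how many neutral mutations are missing from, or additional to, the observed set, and reports every candidate that ties for the best score. The equal weighting of missing and additional mutations, the function names, and the mutation labels are assumptions for illustration; the actual scoring is described in the “Genotype Refining” section.

```python
from typing import Dict, FrozenSet, List, Tuple


def penalty(observed_neutral: FrozenSet[str], allele_neutral: FrozenSet[str]) -> int:
    """Count neutral mutations the allele expects but the sample lacks, plus extras."""
    missing = len(allele_neutral - observed_neutral)
    additional = len(observed_neutral - allele_neutral)
    return missing + additional  # equal weights are an assumption of this sketch


def refine(observed_neutral: FrozenSet[str],
           candidates: Dict[str, FrozenSet[str]]) -> Tuple[List[str], int]:
    """Return all candidates sharing the lowest penalty, together with that penalty."""
    scores = {name: penalty(observed_neutral, muts) for name, muts in candidates.items()}
    best = min(scores.values())
    return sorted(name for name, s in scores.items() if s == best), best


# Made-up candidates; "n1" to "n6" are placeholder neutral-mutation labels.
candidates = {
    "*2A": frozenset({"n1", "n2"}),
    "*2B": frozenset({"n1", "n3"}),
    "*2C": frozenset({"n4", "n5", "n6"}),
}
calls, score = refine(frozenset({"n1"}), candidates)
```

Here *2A and *2B tie, so both would be reported as equally likely genotypes.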
In some embodiments, the primary input is HTS data in SAM/BAM file format, as well as one or more databases comprised of information about the gene to be genotyped. Such databases will contain basic information about the gene (e.g., its location within a reference genome, locations of pseudogenes, intron/exon boundaries), as well as a listing of all known major and minor star-alleles for that gene, each described as a unique set of gene-disrupting and neutral mutations. The databases will also contain a listing of all known structural variations involving the gene of interest; i.e., duplications and deletions, as well as all known hybridizations with its pseudogene, either in the form of fusions (when a prefix or suffix of the hybrid gene sequence is from the pseudogene) or gene conversions (when a segment other than a suffix/prefix of the hybrid gene is from the pseudogene). In some embodiments, all information about the gene to be genotyped is maintained by and obtained from a single database. In some embodiments, the information about the gene to be genotyped is maintained by and obtained from two or more separate databases.
The tool and methods described herein utilize the gene information from the one or more databases to “guide” star-allele discovery, aiming to assign a known major and minor star-allele label for each copy of the gene. In instances where no known star-allele description “matches” the input data, the tool and methods described will infer previously unknown major or minor star-allele descriptions. Mutations considered herein include SNVs and short indels; the structural variations considered include (partial) deletions or duplications of the gene, and hybridizations (a.k.a. fusions) with a specified pseudogene.
depicts one embodiment of methods for genotyping, which follows the general method of. As depicted in, in some embodiments, methods for genotyping comprise the steps of: 1) receiving high throughput sequencing (HTS) data for a gene from a target sample, 2) aligning target sample reads from the HTS data to a reference genome allele database for the gene, 3) determining whether the alignment is acceptable for both alleles of the gene, and if yes, 4) calling the genotype for each allele, but if no, 5) identifying nucleic acid variants for each allele, 6) detecting structural variants or a lack of structural variants in each allele, 7) identifying one or more gene-disrupting mutations or a lack of gene-disrupting mutations in each allele, 8) for each allele, selecting a set of reference star-alleles which most closely match the identified (i.e., observed) set of gene-disrupting mutations in each allele, and 9) determining each allele of the gene to have the genotype associated with the identified set of reference star-alleles for that allele. In some embodiments, the set of reference star-alleles can include one or more different star-alleles. In some embodiments, the set of reference star-alleles includes a single star-allele. In these embodiments, an allele genotype can be called based on that single star-allele. In other embodiments, the set of reference star-alleles includes two or more star-alleles. In these embodiments, the genotyping method can further comprise a genotype refining step. In some embodiments, the genotype refining step comprises ranking each possible solution, or identified star-allele in the set of reference star-alleles, and the one or more solutions with the best ranking score are identified as the genotype for the allele. In some embodiments, the method for genotyping is repeated for each gene of interest. In some embodiments, one or more steps of the methods for genotyping are performed by a suitably programmed computer.
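One possible in-memory shape for such a gene database entry is sketched below in Python. Every field name, the gene and pseudogene names, and all coordinates are hypothetical placeholders chosen for illustration; the disclosure does not prescribe a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

Interval = Tuple[int, int]  # (start, end) on the reference, placeholder coordinates


@dataclass(frozen=True)
class StarAllele:
    """A known star-allele described by its gene-disrupting and neutral mutations."""
    name: str
    disrupting: FrozenSet[str]
    neutral: FrozenSet[str] = frozenset()

    @property
    def is_major(self) -> bool:
        # In this sketch, an entry with no neutral mutations stands for a major allele.
        return not self.neutral


@dataclass
class GeneRecord:
    """Basic gene information plus known alleles and structural variations."""
    gene: str
    chromosome: str
    location: Interval
    exons: List[Interval]
    pseudogenes: Dict[str, Interval]
    star_alleles: List[StarAllele] = field(default_factory=list)
    structural_variations: List[str] = field(default_factory=list)


record = GeneRecord(
    gene="GENE_X",                      # placeholder gene name
    chromosome="chrN",                  # placeholder chromosome
    location=(1000, 5000),
    exons=[(1000, 1200), (2000, 2300)],
    pseudogenes={"PSEUDO_X": (6000, 10000)},
    star_alleles=[
        StarAllele("*1", frozenset()),
        StarAllele("*2", frozenset({"m1"})),
        StarAllele("*2A", frozenset({"m1"}), frozenset({"n1"})),
    ],
    structural_variations=["deletion", "duplication", "fusion", "gene_conversion"],
)
major_names = [a.name for a in record.star_alleles if a.is_major]
```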
In some embodiments, one or more genes of interest are genotyped simultaneously. Each of the steps is described herein in detail.
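Step 8 above, selecting the reference star-alleles that most closely match the observed gene-disrupting mutations, can be illustrated with a minimal sketch. The text does not specify how closeness is measured; the sketch assumes the size of the symmetric difference between the observed and reference mutation sets. The database contents, variant labels (`m1`, `m2`, `m3`), and the function name `closest_star_alleles` are hypothetical and are used only for illustration.

```python
from typing import Dict, FrozenSet, List

# Hypothetical star-allele database: star-allele name -> gene-disrupting mutations.
# The labels below are placeholders, not entries from any real database.
STAR_ALLELES: Dict[str, FrozenSet[str]] = {
    "*1": frozenset(),
    "*2": frozenset({"m1"}),
    "*3": frozenset({"m1", "m2"}),
    "*4": frozenset({"m3"}),
}


def closest_star_alleles(
    observed: FrozenSet[str], database: Dict[str, FrozenSet[str]]
) -> List[str]:
    """Return every star-allele whose mutation set is closest to `observed`.

    Closeness is ASSUMED to be the number of mutations present in one set
    but not the other (symmetric difference). Ties are all returned, so the
    result may contain more than one star-allele.
    """
    distances = {name: len(observed ^ muts) for name, muts in database.items()}
    best = min(distances.values())
    return sorted(name for name, dist in distances.items() if dist == best)


if __name__ == "__main__":
    # An exact match yields a single star-allele.
    print(closest_star_alleles(frozenset({"m1"}), STAR_ALLELES))
    # No exact match: two star-allele candidates tie and both are returned.
    print(closest_star_alleles(frozenset({"m2"}), STAR_ALLELES))
```

When the returned list holds a single star-allele, the genotype can be called directly; when it holds two or more, a refining step of the kind described above would be needed to choose among them.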
As depicted in the figures, in some embodiments, a method for genotyping a gene comprises receiving HTS data for a gene from a target sample, aligning target sample reads from the HTS data to gene information from one or more reference databases for the gene, determining whether the alignment is acceptable for both alleles of the gene, and, where the alignment is determined to be acceptable for each allele, calling the genotype for each allele. The genotype of each allele can optionally be confirmed. In some embodiments, the genotype of each allele is not confirmed, and the genotyping for each allele of the gene is complete. In other embodiments, the genotype of each allele is confirmed. Where the genotype of each allele is to be confirmed, the allele sequence may optionally be surveyed to identify nucleic acid variants and detect structural variants. Gene-disrupting mutations for each allele are then identified, either after these optional steps or, in some embodiments, without identifying nucleic acid variants or detecting structural variants in the sequence of each allele. Where an acceptable alignment for both alleles has been achieved, these two optional steps can be skipped, because the exact sequence of each allele, and thus any nucleic acid variant or structural variant, is known. Following identification of gene-disrupting mutations for each allele, a set of reference star-alleles that most closely match the identified (i.e., observed) set of gene-disrupting mutations for each allele is selected, and it is determined whether more than one reference star-allele was selected for each allele. The set of reference star-alleles can include one or more star-alleles. In some embodiments, the set of reference star-alleles includes a single star-allele. In these embodiments, an allele genotype can be called based on that single reference star-allele. In other embodiments, the set of reference star-alleles includes two or more star-alleles. In these embodiments, the genotyping method can further comprise a genotype refining step. In some embodiments, the genotype refining step comprises ranking each possible solution (i.e., each identified star-allele in the set of reference star-alleles) and identifying the one or more solutions with the best ranking score as the genotype for the allele.
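The genotype refining step can be sketched as a penalty ranking. The claims describe a penalty score based on the number of missing and additional non-functional variations relative to the database, with the lowest-scoring star-allele(s) called as the genotype. The sketch below assumes that missing and additional variations are weighted equally, which the source does not state. The allele names, variant labels (`n1` to `n4`), and function names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical database of candidate star-alleles and the non-functional
# variations each is expected to carry. Labels are placeholders only.
NON_FUNCTIONAL: Dict[str, FrozenSet[str]] = {
    "*2A": frozenset({"n1"}),
    "*2B": frozenset({"n1", "n2"}),
    "*2C": frozenset({"n3"}),
}


def penalty(observed: FrozenSet[str], expected: FrozenSet[str]) -> int:
    """Count missing plus additional non-functional variations.

    Equal weighting of the two counts is an ASSUMPTION of this sketch.
    """
    missing = expected - observed      # expected by the database, not observed
    additional = observed - expected   # observed, not expected by the database
    return len(missing) + len(additional)


def refine(
    candidates: Iterable[str],
    observed: FrozenSet[str],
    database: Dict[str, FrozenSet[str]],
) -> Tuple[List[str], Dict[str, int]]:
    """Rank candidate star-alleles and return the lowest-penalty one(s)."""
    scores = {name: penalty(observed, database[name]) for name in candidates}
    best = min(scores.values())
    winners = sorted(name for name, score in scores.items() if score == best)
    return winners, scores


if __name__ == "__main__":
    winners, scores = refine(NON_FUNCTIONAL, frozenset({"n1", "n2"}), NON_FUNCTIONAL)
    print(winners, scores)
```

Returning every lowest-penalty candidate, rather than a single one, mirrors the wording that "one or more solutions" with the best score may be identified.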
As depicted in the figures, in some embodiments, an acceptable alignment for both alleles of a gene is not obtained at the determining step. In such embodiments, further reference-guided assembly of the HTS data for each allele is performed. Following alignment of the HTS data for each allele, nucleic acid variants are identified, structural variants are detected, and gene-disrupting mutations for each allele are identified. Following identification of gene-disrupting mutations for each allele, a set of reference star-alleles that most closely match the identified (i.e., observed) set of gene-disrupting mutations for each allele is selected, and it is determined whether more than one reference star-allele was selected for each allele. The set of reference star-alleles can include one or more star-alleles. In some embodiments, the set of reference star-alleles includes a single star-allele. In these embodiments, an allele genotype can be called based on that single reference star-allele. In other embodiments, the set of reference star-alleles includes two or more star-alleles. In these embodiments, the genotyping method can further comprise a genotype refining step. In some embodiments, the genotype refining step comprises ranking each possible solution (i.e., each identified star-allele in the set of reference star-alleles) and identifying the one or more solutions with the best ranking score as the genotype for the allele.
As depicted in the figures, in some embodiments, a genotyping method ignores whether an acceptable alignment for both alleles of a gene is found. In such embodiments, reference-guided assembly of the HTS data for each allele is performed. Following alignment of the HTS data for each allele, nucleic acid variants are identified, structural variants are detected, and gene-disrupting mutations for each allele are identified. Following identification of gene-disrupting mutations for each allele, a set of reference star-alleles that most closely match the identified (i.e., observed) set of gene-disrupting mutations for each allele is selected, and it is determined whether more than one reference star-allele was selected for each allele. The set of reference star-alleles can include one or more star-alleles. In some embodiments, the set of reference star-alleles includes a single star-allele. In these embodiments, an allele genotype can be called based on that single star-allele. In other embodiments, the set of reference star-alleles includes two or more star-alleles. In these embodiments, the genotyping method can further comprise a genotype refining step. In some embodiments, the genotype refining step comprises ranking each possible solution (i.e., each identified star-allele in the set of reference star-alleles) and identifying the one or more solutions with the best ranking score as the genotype for the allele.
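The three workflow variants described above differ only in which steps run and in what order. The following sketch lays out that branching as an ordered list of step names. It is an illustrative reading of the text, not the implementation: the step names, the function `plan_steps`, and its flags (`confirm`, `survey`, `always_assemble`) are hypothetical.

```python
from typing import List

# Hypothetical step names used only to make the branching explicit.
SURVEY = ["identify_nucleic_acid_variants", "detect_structural_variants"]
DOWNSTREAM = [
    "identify_gene_disrupting_mutations",
    "select_closest_reference_star_alleles",
    "refine_if_multiple_star_alleles_selected",
]


def plan_steps(
    alignment_acceptable: bool,
    confirm: bool = False,
    survey: bool = False,
    always_assemble: bool = False,
) -> List[str]:
    """Return the ordered steps for one gene under the described variants."""
    steps = ["receive_hts_data", "align_reads"]
    if always_assemble:
        # Variant that ignores whether an acceptable alignment was found.
        return steps + ["reference_guided_assembly"] + SURVEY + DOWNSTREAM
    steps.append("check_alignment_acceptable")
    if not alignment_acceptable:
        # No acceptable alignment: assemble, then survey and match.
        return steps + ["reference_guided_assembly"] + SURVEY + DOWNSTREAM
    steps.append("call_genotype")
    if confirm:
        # Confirmation is optional; the survey steps may be skipped because
        # the exact sequence of each allele is already known.
        steps += (SURVEY if survey else []) + DOWNSTREAM
    return steps


if __name__ == "__main__":
    for step in plan_steps(alignment_acceptable=False):
        print(step)
```

Expressing the variants as data (lists of step names) keeps the shared downstream steps in one place, which reflects how the three descriptions repeat the same selection and refinement logic.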
November 6, 2025