Introduction Accurate genotyping of Killer cell Immunoglobulin-like Receptor (KIR) genes plays a pivotal role in enhancing our understanding of innate immune responses, disease correlations, and the advancement of personalized medicine. However, due to the high variability of the KIR region and high level of sequence similarity among different KIR genes, the generic genotyping workflows are unable to accurately infer copy numbers and complete genotypes of individual KIR genes from next-generation sequencing data. Thus, specialized genotyping tools are needed to genotype this complex region. Methods Here, we introduce Geny, a new computational tool for precise genotyping of KIR genes. Geny utilizes available KIR allele databases and proposes a novel combination of expectation-maximization filtering schemes and integer linear programming-based combinatorial optimization models to resolve ambiguous reads, provide accurate copy number estimation, and estimate the correct allele of each copy of genes within the KIR region. Results & Discussion We evaluated Geny on a large set of simulated short-read datasets covering the known validated KIR region assemblies and a set of Illumina short-read samples sequenced from 40 validated samples from the Human Pangenome Reference Consortium collection and showed that it outperforms the existing state-of-the-art KIR genotyping tools in terms of accuracy, precision, and recall. We envision Geny becoming a valuable resource for understanding immune system response and consequently advancing the field of patient-centric medicine.
SUMMARY Natural killer (NK) cells are essential components of the innate immune system, with their activity significantly regulated by Killer cell Immunoglobulin-like Receptors (KIRs). The diversity and structural complexity of KIR genes present significant challenges for accurate genotyping, essential for understanding NK cell functions and their implications in health and disease. Traditional genotyping methods struggle with the variable nature of KIR genes, leading to inaccuracies that can impede immunogenetic research. These challenges extend to high-quality phased assemblies, which have been recently popularized by the Human Pangenome Consortium. This paper introduces BAKIR (Biologically-informed Annotator for KIR locus), a tailored computational tool designed to overcome the challenges of KIR genotyping and annotation on high-quality, phased genome assemblies. BAKIR aims to enhance the accuracy of KIR gene annotations by structuring its annotation pipeline around identifying key functional mutations, thereby improving the identification and subsequent relevance of gene and allele calls. It uses a multi-stage mapping, alignment, and variant calling process to ensure high-precision gene and allele identification, while also maintaining high recall for sequences that are significantly mutated or truncated relative to the known allele database. BAKIR has been evaluated on a subset of the HPRC assemblies, where BAKIR was able to improve many of the associated annotations and call novel variants. BAKIR is freely available on GitHub, offering ease of access and use through multiple installation methods, including pip, conda, and singularity container, and is equipped with a user-friendly command-line interface, thereby promoting its adoption in the scientific community. AVAILABILITY AND IMPLEMENTATION BAKIR is available at github.com/algo-cancer/bakir. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Abstract Plant viral infections cause significant economic losses, totalling $350 billion USD in 2021. With no treatment for virus-infected plants, accurate and efficient diagnosis is crucial to preventing and controlling these diseases. High-throughput sequencing (HTS) enables cost-efficient identification of known and unknown viruses. However, existing diagnostic pipelines face challenges. First, many methods depend on subjectively chosen parameter values, undermining their robustness across various data sources. Second, artifacts (e.g. false peaks) in the mapped sequence data can lead to incorrect diagnostic results. While some methods require manual or subjective verification to address these artifacts, others overlook them entirely, affecting the overall method performance and leading to imprecise or labour-intensive outcomes. To address these challenges, we introduce IIMI, a new automated analysis pipeline using machine learning to diagnose infections from 1583 plant viruses with HTS data. It adopts a data-driven approach for parameter selection, reducing subjectivity, and automatically filters out regions affected by artifacts, thus improving accuracy. Testing with in-house and published data shows IIMI’s superiority over existing methods. Besides a prediction model, IIMI also provides resources on plant virus genomes, including annotations of regions prone to artifacts. The method is available as an R package (iimi) on CRAN and will integrate with the web application www.virtool.ca, enhancing accessibility and user convenience.
Natural killer (NK) cells are essential components of the innate immune system, with their activity significantly regulated by Killer cell Immunoglobulin-like Receptors (KIRs). The diversity and structural complexity of KIR genes present significant challenges for accurate genotyping, essential for understanding NK cell functions and their implications in health and disease. Traditional genotyping methods struggle with the variable nature of KIR genes, leading to inaccuracies that can impede immunogenetic research. These challenges extend to high-quality phased assemblies, which have been recently popularized by the Human Pangenome Consortium. This paper introduces BAKIR (Biologically-informed Annotator for KIR locus), a tailored computational tool designed to overcome the challenges of KIR genotyping and annotation on high-quality, phased genome assemblies. BAKIR aims to enhance the accuracy of KIR gene annotations by structuring its annotation pipeline around identifying key functional mutations, thereby improving the identification and subsequent relevance of gene and allele calls. It uses a multi-stage mapping, alignment, and variant calling process to ensure high-precision gene and allele identification, while also maintaining high recall for sequences that are significantly mutated or truncated relative to the known allele database. BAKIR has been evaluated on a subset of the HPRC assemblies, where BAKIR was able to improve many of the associated annotations and call novel variants. BAKIR is freely available on GitHub, offering ease of access and use through multiple installation methods, including pip, conda, and singularity container, and is equipped with a user-friendly command-line interface, thereby promoting its adoption in the scientific community.
Background Next-generation sequencing (NGS), including whole genome sequencing (WGS) and whole exome sequencing (WES), is increasingly being used for clinic care. While NGS data have the potential to be repurposed to support clinical pharmacogenomics (PGx), current computational approaches have not been widely validated using clinical data. In this study, we assessed the accuracy of the Aldy computational method to extract PGx genotypes from WGS and WES data for 14 and 13 major pharmacogenes, respectively. Methods Germline DNA was isolated from whole blood samples collected for 264 patients seen at our institutional molecular solid tumor board. DNA was used for panel-based genotyping within our institutional Clinical Laboratory Improvement Amendments- (CLIA-) certified PGx laboratory. DNA was also sent to other CLIA-certified commercial laboratories for clinical WGS or WES. Aldy v3.3 and v4.4 were used to extract PGx genotypes from these NGS data, and results were compared to the panel-based genotyping reference standard that contained 45 star allele-defining variants within CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP3A4, CYP3A5, CYP4F2, DPYD, G6PD, NUDT15, SLCO1B1, TPMT, and VKORC1. Results Mean WGS read depth was >30x for all variant regions except for G6PD (average read depth was 29 reads), and mean WES read depth was >30x for all variant regions. For 94 patients with WGS, Aldy v3.3 diplotype calls were concordant with those from the genotyping reference standard in 99.5% of cases when excluding diplotypes with additional major star alleles not tested by targeted genotyping, ambiguous phasing, and CYP2D6 hybrid alleles. Aldy v3.3 identified 15 additional clinically actionable star alleles not covered by genotyping within CYP2B6, CYP2C19, DPYD, SLCO1B1, and NUDT15. Within the WGS cohort, Aldy v4.4 diplotype calls were concordant with those from genotyping in 99.7% of cases. When excluding patients with CYP2D6 copy number variation, all Aldy v4.4 diplotype calls except for one CYP3A4 diplotype call were concordant with genotyping for 161 patients in the WES cohort. Conclusion Aldy v3.3 and v4.4 called diplotypes for major pharmacogenes from clinical WES and WGS data with >99% accuracy. These findings support the use of Aldy to repurpose clinical NGS data to inform clinical PGx.
Domain-specific languages (DSLs) are able to provide intuitive high-level abstractions that are easy to work with while attaining better performance than general-purpose languages. Yet, implementing new DSLs is a burdensome task. As a result, new DSLs are usually embedded in general-purpose languages. While low-level languages like C or C++ often provide better performance as a host than high-level languages like Python, high-level languages are becoming more prevalent in many domains due to their ease and flexibility. Here, we present Codon, a domain-extensible compiler and DSL framework for high-performance DSLs with Python's syntax and semantics. Codon builds on previous work on ahead-of-time type checking and compilation of Python programs and leverages a novel intermediate representation to easily incorporate domain-specific optimizations and analyses. We showcase and evaluate several compiler extensions and DSLs for Codon targeting various domains, including bioinformatics, secure multi-party computation, block-based data compression and parallel programming, showing that Codon DSLs can provide benefits of familiar high-level languages and achieve performance typically only seen with low-level languages, thus bridging the gap between performance and usability.
Secure multiparty computation (MPC) is a cryptographic tool that allows computation on top of sensitive biomedical data without revealing private information to the involved entities. Here, we introduce Sequre, an easy-to-use, high-performance framework for developing performant MPC applications. Sequre offers a set of automatic compile-time optimizations that significantly improve the performance of MPC applications and incorporates the syntax of Python programming language to facilitate rapid application development. We demonstrate its usability and performance on various bioinformatics tasks showing up to 3–4 times increased speed over the existing pipelines with 7-fold reductions in codebase sizes.
High-throughput sequencing provides sufficient means for determining genotypes of clinically important pharmacogenes that can be used to tailor medical decisions to individual patients. However, pharmacogene genotyping, also known as star-allele calling, is a challenging problem that requires accurate copy number calling, structural variation identification, variant calling, and phasing within each pharmacogene copy present in the sample. Here we introduce Aldy 4, a fast and efficient tool for genotyping pharmacogenes that uses combinatorial optimization for accurate star-allele calling across different sequencing technologies. Aldy 4 adds support for long reads and uses a novel phasing model and improved copy number and variant calling models. We compare Aldy 4 against the current state-of-the-art star-allele callers on a large and diverse set of samples and genes sequenced by various sequencing technologies, such as whole-genome and targeted Illumina sequencing, barcoded 10x Genomics, and Pacific Biosciences (PacBio) HiFi. We show that Aldy 4 is the most accurate star-allele caller with near-perfect accuracy in all evaluated contexts, and hope that Aldy remains an invaluable tool in the clinical toolbox even with the advent of long-read sequencing technologies.
Pharmacogenomics (PGx)-guided drug treatment is one of the cornerstones of personalized medicine. However, the genes involved in drug response are highly complex and known to carry many (rare) variants. Current technologies (short-read sequencing and SNP panels) are limited in their ability to resolve these genes and characterize all variants. Moreover, these technologies cannot always phase variants to their allele of origin. Recent advance in long-read sequencing technologies have shown promise in resolving these problems. Here we present a long-read sequencing panel-based approach for PGx using PacBio HiFi sequencing. A capture based approach was developed using a custom panel of clinically-relevant pharmacogenes including up- and downstream regions. A total of 27 samples were sequenced and panel accuracy was determined using benchmarking variant calls for 3 Genome in a Bottle samples and GeT-RM star(*)-allele calls for 21 samples.. The coverage was uniform for all samples with an average of 94% of bases covered at >30×. When compared to benchmarking results, accuracy was high with an average F1 score of 0.89 for INDELs and 0.98 for SNPs. Phasing was good with an average of 68% the target region phased (compared to ~20% for short-reads) and an average phased haploblock size of 6.6kbp. Using Aldy 4, we compared our variant calls to GeT-RM data for 8 genes (CYP2B6, CYP2C19, CYP2C9, CYP2D6, CYP3A4, CYP3A5, SLCO1B1, TPMT), and observed highly accurate star(*)-allele calling with 98.2% concordance (165/168 calls), with only one discordance in CYP2C9 leading to a different predicted phenotype. We have shown that our long-read panel-based approach results in high accuracy and target phasing for SNVs as well as for clinical star(*)-alleles.
High-throughput sequencing provides sufficient means for determining genotypes of clinically important pharmacogenes that can be used to tailor medical decisions to individual patients. However, pharmacogene genotyping, also known as star-allele calling, is a challenging problem that requires accurate copy number calling, structural variation discovery, variant calling and phasing within each pharmacogene copy present in the sample. Here we introduce Aldy 4, a fast and efficient tool for genotyping pharmacogenes that utilizes combinatorial optimization for accurate star-allele calling across different sequencing technologies. Aldy 4 adds support for long reads and ships with a novel phasing model and improved copy number and variant calling models. We compare Aldy 4 against the current state-of-the-art star-allele callers on a large and diverse set of samples and genes sequenced by various sequencing technologies, such as whole-genome and targeted Illumina sequencing, barcoded 10X Genomics and PacBio HiFi. We show that Aldy 4 is the most accurate star-allele caller with near-perfect accuracy in all evaluated contexts. We hope that Aldy remains an invaluable tool in the clinical toolbox even with the advent of long-read sequencing technologies. Availability Aldy 4 is available at https://github.com/0xTCG/aldy.
Background Pharmacogenomics (PGx) testing can reduce toxicities and improve efficacy of several drugs used to treat cancer and associated symptoms. PGx results can be determined from germline whole-exome sequencing (WES), but somatic mutations may cause discordance between tumor and germline DNA. Since clinical diagnostic sequencing in oncology frequently only includes tumor DNA, there would be clinical value in calling germline PGx genotypes from tumor DNA. Thus, the purpose of this study was to assess the feasibility of using somatic WES data to call germline PGx genotypes. Methods Germline and somatic WES data were obtained as part of the clinical workflow for 64 patients treated at the solid molecular tumor board clinic at Indiana University. Aldy v3.3 was implemented in LifeOmic’s Precision Health Cloud™ to call PGx genotypes from somatic WES. Somatic Aldy calls were compared with previously validated Aldy germline calls for 8 genes: CYP2C9, CYP2C19, CYP2D6, CYP3A4, CYP3A5, CYP4F2, DPYD, and TPMT. Somatic read depth was >100x, except for the intronic CYP3A4*22 variant, which was >30x. Results Somatic and germline Aldy calls were compared for a total of 512 genotypes and 56 (11%) calls were discordant. Discordant calls were most common for CYP2B6 (23.4%), followed by CYP2D6 (14.1%), CYP2C19 (10.9%), CYP2C8 (6.3%), and DPYD (6.3%). In contrast, all Aldy calls were concordant for CYP3A5 and TPMT. 38 out of 64 subjects (59%) had discordant calls for at least one gene. The most common first cancer diagnoses in our cohort were colorectal (9.3%), breast (7.8%), and pancreatic (7.8%), and the rates of discordant Aldy calls did not differ by cancer type (p>0.05 for all cancer types). Based on our analyses of discordant calls, we anticipate that adjusting Aldy’s thresholds for variant calling may allow Aldy to determine genotypes from somatic WES data. Conclusion In most cases, genotype calls of drug metabolism genes from tumor DNA reflected the germline genotypes; however, additional work needs to be done to determine if the remaining discordant calls can be corrected by modifying the informatics tools or if they are due to somatic mutations. Citation Format: Wilberforce A. Osei, Tyler Shugg, Reynold C. Ly, Steven M. Bray, Benjamin A. Salisbury, Ryan R. Ratcliff, Victoria M. Pratt, Ibrahim Numanagić, Todd Skaar. Pharmacogenomics genotyping from clinical somatic whole exome sequencing: Aldy, a computational tool [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 1151.
Genomic data leaks are irreversible. Leaked DNA cannot be changed, stays disclosed indefinitely, and affects the owner's family members as well. The recent large-scale genomic data collections [1], [2] render the traditional privacy protection mechanisms, like the Health Insurance Portability and Accountability Act (HIPAA), inadequate for protection against the novel security attacks [3]. On the other hand, data access restrictions hinder important clinical research that requires large datasets to operate [4]. These concerns can be naturally addressed by the employment of privacy-enhancing technologies, such as a secure multiparty computation (MPC) [5]–[10]. Secure MPC enables computation on data without disclosing the data itself by dividing the data and computation between multiple computing parties in a distributed manner to prevent individual computing parties from accessing raw data. MPC systems are being increasingly adopted in fields that operate on sensitive datasets [11]–[13], such as computational genomics and biomedical research [14]–[22].
The increasing availability of high-quality genome assemblies raised interest in the characterization of genomic architecture. Major architectural elements, such as common repeats and segmental duplications (SDs), increase genome plasticity that stimulates further evolution by changing the genomic structure and inventing new genes. Optimal computation of SDs within a genome requires quadratic-time local alignment algorithms that are impractical due to the size of most genomes. Additionally, to perform evolutionary analysis, one needs to characterize SDs in multiple genomes and find relations between those SDs and unique (non-duplicated) segments in other genomes. A naïve approach consisting of multiple sequence alignment would make the optimal solution to this problem even more impractical. Thus there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today. Here we introduce a new approach, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves earlier tools by (i) scaling the detection of SDs with low homology to multiple genomes while introducing further 7–33×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications to as far as 300 million years. BISER is implemented in Seq programming language and is publicly available at https://github.com/0xTCG/biser.
Nema pronađenih rezultata, molimo da izmjenite uslove pretrage i pokušate ponovo!
Ova stranica koristi kolačiće da bi vam pružila najbolje iskustvo
Saznaj više