US-12597491-B2

Method and apparatus for compressing fastq data through character frequency-based sequence reordering

Published: April 7, 2026

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

A method and apparatus for compressing FASTQ data through character frequency-based sequence reordering implemented by a computer apparatus, the method including separating genome sequencing data into components of an identifier, a nucleotide sequence read, and prediction quality information; measuring character frequency for the entire data of each of the nucleotide sequence read and the prediction quality information; producing a score by applying the measured character frequency for the nucleotide sequence read and the prediction quality information; reordering the nucleotide sequence read and the prediction quality information based on a condition that is preset based on the score; and compressing at least one of information of the identifier, an identifier of the nucleotide sequence read, and an identifier of the prediction quality information through a compression program by including the reordered nucleotide sequence read and the reordered prediction quality information and generating compressed genome sequencing data.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. A method of compressing genome sequencing data in FASTQ format through character frequency-based sequence reordering implemented by a computer apparatus, the method comprising:

. The method of, wherein the separating of the genome sequencing data comprises separating again the identifier into a unique number of the identifier and additional information of the identifier, and

. The method of, wherein the producing of the score comprises:

. The method of, wherein the producing of the score comprises producing a score based on at least one of priority information and exclusion target information obtained through the measured character frequency for the nucleotide sequence read and the prediction quality information, and producing a score using all a letter distribution value that is character frequency information and a distribution value that is obtained by rounding the letter distribution value.

. The method of, wherein the reordering of the nucleotide sequence read and the prediction quality information comprises combining the all nucleotide sequence reads with the respective corresponding identifiers and performing lexicographic order based on the produced score.

. The method of, wherein the reordering of the nucleotide sequence read and the prediction quality information comprises combining the prediction quality information with the identifier and performing lexicographic order based on the produced score.

. The method of, wherein the generating of the compressed genome sequencing data comprises storing the reordered nucleotide sequence read in combination with an identifier of the nucleotide sequence read through the compression program to remember order and storing the reordered prediction quality information in combination with an identifier of the prediction quality information through the compression program.

. The method of, further comprising:

. An apparatus for compressing genome sequencing data in FASTQ format through character frequency-based sequence reordering implemented by a computer apparatus, the apparatus comprising:

. The apparatus of, wherein the genome sequencing data separator is configured to separate again the identifier into a unique number of the identifier and additional information of the identifier, and

. The apparatus of, wherein the character frequency measurer is configured to measure a letter distribution for the entire data of each of the nucleotide sequence read and the prediction quality information and excluding a corresponding letter if the measured letter distribution is below a threshold.

. The apparatus of, wherein the score producer is configured to measure character frequency for a single nucleotide sequence read and produce a score, and to repeat scoring for all nucleotide sequence reads including repetition of the single nucleotide sequence read.

. The apparatus of, wherein the score producer is configured to measure character frequency for single prediction quality information and produce a score, and to repeat scoring for all prediction quality information including repetition of the single prediction quality information.

. The apparatus of, wherein the score producer is configured to produce a score based on at least one of priority information and exclusion target information obtained through the measured character frequency for the nucleotide sequence read and the prediction quality information, and to produce a score using all a letter distribution value that is character frequency information and a distribution value that is obtained by rounding the letter distribution value.

. The apparatus of, wherein the score-based sorter is configured to combine the all nucleotide sequence reads with the respective corresponding identifiers and to perform lexicographic order based on the produced score.

. The apparatus of, wherein the score-based sorter is configured to combine the prediction quality information and the identifier and to perform lexicographic order based on the produced score.

. The apparatus of, wherein the genome sequencing data compressor is configured to store the reordered nucleotide sequence read in combination with an identifier of the nucleotide sequence read through the compression program to remember order and to store the reordered prediction quality information in combination with an identifier of the prediction quality information through the compression program.

. The apparatus of, further comprising:

Detailed Description

Complete technical specification and implementation details from the patent document.

This application claims the priority benefit of Korean Patent Application No. 10-2020-0179632, filed on Dec. 21, 2020, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference.

Example embodiments of the following description relate to a dedicated compression technology for efficiently storing and transmitting FASTQ data that is a representative format of genomic data, and more particularly, to a method and apparatus for compressing FASTQ data through a character frequency-based sequence reordering.

Genome sequencing data is growing rapidly as Next Generation Sequencing (NGS) technology reduces production cost and cell analysis schemes develop. Since NGS technology appeared in 2008, the production capacity of genome sequencing data has been improving. With the advent of 3G and 4G sequencing technologies and single-cell analysis since then, the production capacity of genome sequencing data is doubling every 7 months, outpacing Moore's law. Given this trend, it is estimated that genome sequencing data will enter the realm of big data, along with text and images, by 2025. As a result, the cost of storing and transmitting genome sequencing data has become an issue.

Although some general-purpose compression techniques are applied to address this issue, they achieve a degraded compression ratio because genome sequencing data is stored in a special format. To solve this, compression programs dedicated to genome sequencing data are developing novel methods that exploit the following structure of genome sequencing data.

Genome sequencing data uses FASTQ, a text-based format consisting of ASCII characters, to store nucleotide sequences and their corresponding quality scores. In general, FASTQ uses 4 lines per nucleotide sequence. Line 1 starts with ‘@’ and includes a nucleotide sequence identifier and an additional explanation. Line 2 holds the nucleotide sequence read, generally as alphabetical characters (A, C, T, G, N), although some devices output numbers (0, 1, 2, 3) instead. Line 3 starts with ‘+’ and may include the same data as line 1. Line 4 holds a quality value for each base of the read in line 2 and uses the Phred quality score to express the accuracy of a base. If Q=10, the base accuracy is 90%, and if Q=20, the base accuracy is 99%. The higher the value of Q, the more accurate the output base. This value is stored as an ASCII character and generally takes about 40 distinct letters. However, devices that use only four distinct letters have recently appeared.
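The four-line layout and the Phred relationship described above can be illustrated with a short Python sketch. The names `FastqRecord`, `parse_fastq`, `phred_accuracy`, and `decode_quality` are hypothetical, and the ASCII offset of 33 is an assumption (it is the common convention, but the text does not state it). The accuracy formula 1 - 10^(-Q/10) is the standard Phred definition and reproduces the 90% and 99% figures given for Q=10 and Q=20.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading '@'
    read: str        # line 2, the nucleotide sequence read
    quality: str     # line 4, ASCII-encoded quality values


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        # Unpacking raises ValueError if the final group is incomplete.
        header, read, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield FastqRecord(header[1:], read, quality)


def phred_accuracy(q: int) -> float:
    """Base-call accuracy implied by Phred score Q: 1 - 10^(-Q/10)."""
    return 1.0 - 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to Phred scores (offset 33 is assumed)."""
    return [ord(c) - offset for c in quality]
```

For example, `list(parse_fastq("@r1 demo\nACGT\n+\nIIII\n"))` yields a single record whose quality line decodes to four scores of 40.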

Existing genome sequencing compression programs that use the above features compress genome sequencing data in consideration of each component (the identifier, the nucleotide sequence read, and the prediction quality scores) and of its distribution and meaning. However, with the recent development of new sequencing technology, the types of data have increased as data lengths vary and production platforms diversify. Because of this change in data, an existing genome sequence compression program may not operate, depending on the size of the data and the type of production platform.

As described above, although FASTQ data is widely used as a representative standard format for genomic data, FASTQ data is very large, so it is not easy to store and the cost of storage is very high. Although existing technologies reduce the compressed size to address this issue, they simply use an overlapping ratio of nucleotide sequences, and their compression ratio is not high. Also, the prediction quality information of the data, called a quality value, may be compressed lossily, so that it cannot be restored on decompression, in order to increase the compression ratio. Therefore, a compression method dedicated to genome sequencing data that operates stably regardless of the variety of data is required.

Example embodiments provide a method and apparatus for compressing FASTQ data through a character frequency-based sequence reordering, and more particularly, provide compression technology dedicated to genome sequencing data that operates stably on a variety of genome sequencing data and exhibits excellent performance on recently generated data.

Example embodiments provide a method and apparatus for compressing FASTQ data through a character frequency-based sequence reordering that may perform lossless compression on prediction quality information called a quality value and improve a compression ratio accordingly and may prevent damage of quality of data itself by improving the compression ratio using a new reordering scheme based on character frequency of a nucleotide sequence instead of using an overlapping ratio of nucleotide sequences or a lexicographic order scheme.

According to an aspect of at least one example embodiment, there is provided a method of compressing FASTQ data through character frequency-based sequence reordering implemented by a computer apparatus, the method including separating genome sequencing data into components of an identifier, a nucleotide sequence read, and prediction quality information; measuring character frequency for the entire data of each of the nucleotide sequence read and the prediction quality information; producing a score by applying the measured character frequency for the nucleotide sequence read and the prediction quality information; reordering the nucleotide sequence read and the prediction quality information based on a condition that is preset based on the score; and compressing at least one of information of the identifier, an identifier of the nucleotide sequence read, and an identifier of the prediction quality information through a compression program by including the reordered nucleotide sequence read and the reordered prediction quality information and generating compressed genome sequencing data.

The separating of the genome sequencing data may include separating again the identifier into a unique number of the identifier and additional information of the identifier, and the additional information of the identifier may be used as information of the identifier when performing compression through the compression program to generate the compressed genome sequencing data.

The measuring of the character frequency may include measuring a letter distribution for the entire data of each of the nucleotide sequence read and the prediction quality information and excluding a corresponding letter if the measured letter distribution is below a threshold.
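As a minimal sketch of this measuring step, the Python below counts letters over an entire stream (all reads, or all quality lines) and drops letters whose share falls below a threshold. The function names and the default threshold of 0.01 are assumptions for illustration; the text does not give a threshold value.

```python
from collections import Counter
from typing import Dict, Iterable


def letter_distribution(lines: Iterable[str]) -> Dict[str, float]:
    """Fraction of each letter over the entire data of one stream."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def drop_rare_letters(dist: Dict[str, float],
                      threshold: float = 0.01) -> Dict[str, float]:
    """Exclude letters whose measured share is below the threshold.

    The default threshold is a hypothetical value, not one from the text.
    """
    return {ch: p for ch, p in dist.items() if p >= threshold}
```

With the reads `["AAAA", "AACC", "AAAN"]` and a threshold of 0.1, `N` (1 of 12 letters) is excluded while `A` and `C` are kept.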

The producing of the score may include measuring character frequency for a single nucleotide sequence read and producing a score; and repeating scoring for all nucleotide sequence reads including repetition of the single nucleotide sequence read.

The producing of the score may include measuring character frequency for single prediction quality information and producing a score; and repeating scoring for all prediction quality information including repetition of the single prediction quality information.

The producing of the score may include producing a score based on at least one of priority information and exclusion target information obtained through the measured character frequency for the nucleotide sequence read and the prediction quality information, and producing a score using both a letter distribution value that is character frequency information and a distribution value that is obtained by rounding the letter distribution value.

The reordering of the nucleotide sequence read and the prediction quality information may include combining all the nucleotide sequence reads with the respective corresponding identifiers and performing lexicographic ordering based on the produced score.

The reordering of the nucleotide sequence read and the prediction quality information may include combining the prediction quality information with the identifier and performing lexicographic ordering based on the produced score.
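The text does not give a concrete scoring formula, so the following Python is only a sketch under stated assumptions. Here the priority information is taken to be the letters ordered by global frequency, the per-line score is the tuple of rounded per-letter fractions in that priority order, and the original position of each line stands in for its identifier. All function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def priority_letters(lines: Sequence[str]) -> List[str]:
    """Letters ordered from most to least frequent (ties broken by letter)."""
    counts = Counter("".join(lines))
    return sorted(counts, key=lambda ch: (-counts[ch], ch))


def line_score(line: str, priority: Sequence[str],
               digits: int = 1) -> Tuple[float, ...]:
    """Assumed score: rounded share of each priority letter in one line."""
    n = len(line) or 1
    counts = Counter(line)
    return tuple(round(counts[ch] / n, digits) for ch in priority)


def reorder_by_score(lines: Sequence[str]) -> List[Tuple[int, str]]:
    """Pair each line with its original index and sort by (score, line)."""
    priority = priority_letters(lines)
    keyed = [(line_score(s, priority), s, i) for i, s in enumerate(lines)]
    keyed.sort()
    return [(i, s) for _, s, i in keyed]
```

Sorting on the score first and the line second places lines with similar composition, and in particular identical lines, next to each other. The same function can be applied to the quality lines.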

The generating of the compressed genome sequencing data may include storing the reordered nucleotide sequence read in combination with an identifier of the nucleotide sequence read through the compression program to remember order and storing the reordered prediction quality information in combination with an identifier of the prediction quality information through the compression program.

The FASTQ data compression method may further include decompressing the compressed genome sequencing data through the compression program; reordering all the decompressed nucleotide sequence reads and all the decompressed prediction quality information based on the respective corresponding identifiers; and producing the original genome sequencing data by separating the reordered nucleotide sequence reads and prediction quality information from the respective corresponding identifiers and then combining them.
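The storing and restoring steps can be sketched as a round trip in Python. This is an illustration, not the patented implementation: `zlib` stands in for the unnamed compression program, an integer index stands in for the identifier, and the tab-separated layout and function names are assumptions.

```python
import zlib
from typing import List, Sequence, Tuple


def compress_reordered(pairs: Sequence[Tuple[int, str]]) -> bytes:
    """Store each reordered line together with its identifier, then compress."""
    payload = "\n".join(f"{ident}\t{line}" for ident, line in pairs)
    return zlib.compress(payload.encode("ascii"), 9)


def restore_original_order(blob: bytes) -> List[str]:
    """Decompress, sort by identifier, and separate the identifier again."""
    text = zlib.decompress(blob).decode("ascii")
    if not text:
        return []
    pairs = []
    for row in text.split("\n"):
        ident, line = row.split("\t", 1)
        pairs.append((int(ident), line))
    pairs.sort(key=lambda p: p[0])
    return [line for _, line in pairs]
```

Because the identifier travels with every reordered line, sorting on it after decompression recovers the original order exactly, which is what makes the reordering lossless.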

According to another aspect of at least one example embodiment, there is provided an apparatus for compressing FASTQ data through character frequency-based sequence reordering implemented by a computer apparatus, the apparatus including a genome sequencing data separator configured to separate genome sequencing data into components of an identifier, a nucleotide sequence read, and prediction quality information; a character frequency measurer configured to measure character frequency for the entire data of each of the nucleotide sequence read and the prediction quality information; a score producer configured to produce a score by applying the measured character frequency for the nucleotide sequence read and the prediction quality information; a score-based sorter configured to reorder the nucleotide sequence read and the prediction quality information based on a condition that is preset based on the score; and a genome sequencing data compressor configured to compress at least one of information of the identifier, an identifier of the nucleotide sequence read, and an identifier of the prediction quality information through a compression program by including the reordered nucleotide sequence read and the reordered prediction quality information and to generate compressed genome sequencing data.

The genome sequencing data separator may be configured to separate again the identifier into a unique number of the identifier and additional information of the identifier, and the additional information of the identifier may be used as information of the identifier when performing compression through the compression program to generate the compressed genome sequencing data.

The character frequency measurer may be configured to measure a letter distribution for the entire data of each of the nucleotide sequence read and the prediction quality information and to exclude a corresponding letter if the measured letter distribution is below a threshold.

The score producer may be configured to measure character frequency for a single nucleotide sequence read and produce a score, and to repeat scoring for all nucleotide sequence reads including repetition of the single nucleotide sequence read.

The score producer may be configured to measure character frequency for single prediction quality information and produce a score, and to repeat scoring for all prediction quality information including repetition of the single prediction quality information.

The score producer may be configured to produce a score based on at least one of priority information and exclusion target information obtained through the measured character frequency for the nucleotide sequence read and the prediction quality information, and to produce a score using both a letter distribution value that is character frequency information and a distribution value that is obtained by rounding the letter distribution value.

The score-based sorter may be configured to combine all the nucleotide sequence reads with the respective corresponding identifiers and to perform lexicographic ordering based on the produced score.

The score-based sorter may be configured to combine the prediction quality information and the identifier and to perform lexicographic ordering based on the produced score.

The genome sequencing data compressor may be configured to store the reordered nucleotide sequence read in combination with an identifier of the nucleotide sequence read through the compression program to remember order and to store the reordered prediction quality information in combination with an identifier of the prediction quality information through the compression program.

The FASTQ data compression apparatus may further include a genome sequencing data decompressor configured to decompress the compressed genome sequencing data through the compression program, to reorder all the decompressed nucleotide sequence reads and all the decompressed prediction quality information based on the respective corresponding identifiers, and to produce the original genome sequencing data by separating the reordered nucleotide sequence reads and prediction quality information from the respective corresponding identifiers and then combining them.

According to example embodiments, there may be provided a method and apparatus for compressing FASTQ data through a character frequency-based sequence reordering that may perform lossless compression on prediction quality information called a quality value and improve a compression ratio accordingly and may prevent damage of quality of data itself by improving the compression ratio using a new reordering scheme based on character frequency of nucleotide sequence instead of using an overlapping ratio of nucleotide sequences or a dictionary-type reordering scheme.

Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.

Hereinafter, some example embodiments will be described in detail with reference to the accompanying drawings. The following detailed structural or functional description of example embodiments is provided as an example only and various alterations and modifications may be made to the example embodiments. Accordingly, the example embodiments are not construed as being limited to the disclosure and should be understood to include all changes, equivalents, and replacements within the technical scope of the disclosure.

The terminology used herein is for describing various example embodiments only, and is not to be used to limit the disclosure. The singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises/comprising” and/or “includes/including” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.

Terms, such as first, second, and the like, may be used herein to describe components. Each of these terms is used not to define an essence, order, or sequence of a corresponding component but merely to distinguish the corresponding component from other components. For example, a first component may be referred to as a second component, and similarly the second component may also be referred to as the first component, without departing from the scope of the disclosure.

Unless otherwise defined, all terms, including technical and scientific terms, used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. Terms, such as those defined in commonly used dictionaries, are to be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Regarding the reference numerals assigned to the elements in the drawings, it should be noted that the same elements will be designated by the same reference numerals, wherever possible, even though they are shown in different drawings. Also, in the description of embodiments, detailed description of well-known related structures or functions will be omitted when it is deemed that such description will cause ambiguous interpretation of the present disclosure.

Hereinafter, example embodiments will be described with reference to the accompanying drawings. However, the example embodiments may be modified in various forms and the scope of the disclosure is not limited by the following example embodiments. In addition, some example embodiments are provided to further completely explain the disclosure for those skilled in the art. Shapes and sizes, etc., of components in the drawings may be exaggerated for further clear explanation.

Due to next generation sequencing (NGS), genome sequencing data has increased in size and includes many overlapping portions. Therefore, the time and cost needed to store and transmit genome sequencing data increase. Although many dedicated compression programs have been studied to solve these issues, they do not work properly on data that has changed with the advent of new sequencing and analysis technology.

The present disclosure is conceived to solve the aforementioned issues found in the related art and may improve a compression ratio using a new reordering scheme (e.g., based on character frequency of a nucleotide sequence) instead of using an overlapping ratio of a nucleotide sequence or lexicographic order scheme. In the case of adopting this scheme, prediction quality information of data called a quality value may be compressed without loss and the compression ratio may be improved and quality of data itself may not be damaged accordingly.

The example embodiments are provided to work properly on the newly appeared genome sequencing data that has long nucleotide sequences (long-read). The example embodiments are based on reordering, as are existing genome sequencing data compression programs, but differ in that the reordering uses the character frequency of a nucleotide sequence instead of an overlapping portion. Various types of data were used as benchmark data, and the scheme was verified to work well regardless of the size of the data and the length of the nucleotide sequences. In particular, for long nucleotide sequencing data, higher compression performance is observed compared to other compression programs.

Most compression schemes dedicated to genome sequencing data internally use a general-purpose compression scheme, so the performance of each program depends on its data preprocessing. Preprocessing of genome sequencing data starts by separating the data into the components of the FASTQ format. Additional preprocessing steps then improve compression performance, such as tokenization of the identifier (read-identifier tokenization), bit encoding of the nucleotide sequence read (bit nucleotide encoding), comparison of the nucleotide sequence read against a reference genome, lossy compression of the prediction quality information (quality score) of the nucleotide sequence read, and reordering of the entire set of nucleotide sequences.

Each identifier has a unique value for each nucleotide sequence. Also, additional information, such as a device name, an execution identifier, and tile coordinates used to generate the data, may be recorded in the identifier area. This information may be similar or duplicated across all the nucleotide sequences. Such unnecessary duplicate information may increase the size of the genome sequencing data. To solve this, identifiers are separated into more detailed areas (tokens), which is referred to as tokenization of an identifier. The identifier area is split at the period (.), underscore (_), space ( ), hyphen (-), colon (:), equals sign (=), and slash (/). Through this method, unnecessarily duplicated tokens are removed and compression performance is improved through delta encoding or run-length encoding. This method is used by the programs LFastqC, LFQC, DSRC2, FQC, and Fastqz.
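Identifier tokenization can be sketched in Python as follows. The delimiter set comes from the text, while the function names and the output convention (an empty string for an unchanged token, a signed difference for a changed numeric token) are assumptions made for illustration. A real encoder would need an unambiguous on-disk format, which this sketch does not define.

```python
import re
from typing import List

# Split on the delimiters listed in the text and keep them as tokens.
TOKEN_SPLIT = re.compile(r"([._ \-:=/])")


def tokenize(identifier: str) -> List[str]:
    """Split an identifier into field and delimiter tokens."""
    return [tok for tok in TOKEN_SPLIT.split(identifier) if tok != ""]


def delta_encode(prev: List[str], cur: List[str]) -> List[str]:
    """Describe `cur` relative to the previous identifier's tokens."""
    out = []
    for i, tok in enumerate(cur):
        old = prev[i] if i < len(prev) else None
        if tok == old:
            out.append("")  # unchanged token, nothing new to store
        elif old is not None and tok.isdigit() and old.isdigit():
            out.append(f"{int(tok) - int(old):+d}")  # signed numeric delta
        else:
            out.append(tok)  # new or non-numeric token stored verbatim
    return out
```

For two consecutive identifiers such as `SRR001.1 run:7` and `SRR001.2 run:7`, only the read counter changes, so the encoded form is mostly empty strings plus a single `+1`, which a run-length or general-purpose coder handles well.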

Most nucleotide sequence reads include the bases A, C, G, and T, and may additionally use the base N. Therefore, a nucleotide sequence read may be expressed using bits rather than a byte for each base. This approach is not currently used to improve the compression ratio, since general-purpose compression schemes already include entropy encoding. However, there is a study that uses it to effectively improve the compression ratio. Since the present program is designed for a stable and high compression ratio rather than compression speed, the bit encoding method is not used.
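For reference, bit encoding of bases can be sketched as below, packing four bases into one byte at 2 bits each. The code table and function names are assumptions, and the sketch deliberately handles only A, C, G, and T; a read containing N would need an extra escape mechanism that is not shown.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_bases(read: str) -> bytes:
    """Pack a read into 2 bits per base (raises KeyError on N)."""
    out = bytearray()
    for i in range(0, len(read), 4):
        chunk = read[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODES[ch]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_bases(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

The read length must be stored separately because the final byte may contain padding bits.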

For genome sequencing data, a reference genome is present. The reference genome is a hypothetical complete nucleotide sequence representing a species, built as a composite of multiple sequences rather than from a single nucleotide sequence so that rare variation is minimized. When a separated nucleotide sequence is compared to the completed reference genome, there are many similarities. Using this, the approach compresses data by recording the position in the reference genome that is similar to the separated nucleotide sequence and separately recording the differences. However, this approach requires the reference genome to be stored with the compressed data, and the data cannot be restored properly if the reference genome changes. Although a recent study generates and uses a virtual reference genome from the genome sequencing data itself, that method is slow. Therefore, current studies prefer lossy compression of prediction quality information or reordering over the aforementioned approaches.
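The idea of recording a position plus differences can be shown with a deliberately naive Python sketch. It scans every offset and handles substitutions only, whereas real tools use indexed alignment and also handle insertions and deletions. The function names are hypothetical.

```python
from typing import List, Tuple


def encode_against_reference(read: str,
                             reference: str) -> Tuple[int, List[Tuple[int, str]]]:
    """Return the best-matching position and the substituted bases."""
    best_pos, best_diffs = 0, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
        if best_diffs is None or len(diffs) < len(best_diffs):
            best_pos, best_diffs = pos, diffs
    if best_diffs is None:
        raise ValueError("read is longer than the reference")
    return best_pos, best_diffs


def decode_against_reference(pos: int, diffs: List[Tuple[int, str]],
                             length: int, reference: str) -> str:
    """Rebuild the read from the reference, the position, and the differences."""
    bases = list(reference[pos:pos + length])
    for i, base in diffs:
        bases[i] = base
    return "".join(bases)
```

Decoding needs exactly the same reference that was used for encoding, which is the dependency the paragraph above identifies as the weakness of this approach.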

The prediction quality information may use a larger number of letters than the nucleotide sequence read and may not be readily compressed because it differs for each device that produces data. Therefore, existing compression programs regard compression of the prediction quality information as an important factor in reducing the size of genome sequencing data. The prediction quality information may be mixed with noise in the process of generating genome sequencing data and may express similar values for adjacent scores. Because of this feature, there is a study reporting that the existing prediction quality information is not perfect and that replacing it with new prediction quality information produced by a lossy compression scheme does not affect subsequent analysis. A binary threshold method expresses a quality value in a specific byte pattern if the prediction quality information is higher than 30, and otherwise expresses the quality value as 2. Also, there is a study that replaces prediction quality information having low frequency with prediction quality information having high frequency, using the frequency of the prediction quality information over the entire genome sequencing data. Programs that use lossy compression of prediction quality information include, for example, RQS, QVZ, QualComp, BEETL, and PBlock. This method may improve the compression ratio for data generated in the past, but may not be required for recently generated short nucleotide sequencing data (short-read), since the lossy compression scheme is applied.

Quality values from HiSeq 2000, which generated short nucleotide sequences in the past, are widely distributed, ranging from 2 to 40. However, NovaSeq 6000, which generates recent short nucleotide sequences, uses four or fewer letters. Therefore, the lossy compression scheme of SPRING, one of the compression programs dedicated to genome data, may improve the compression ratio on past HiSeq 2000 data but shows insignificant improvement on recent NovaSeq 6000 data.

For genome sequencing data stored in the FASTQ format, the order of the nucleotide sequences is determined randomly, so a higher compression ratio may be obtained through reordering. This approach may be efficient and enhances the locality of data when the nucleotide sequence reads of the genome sequencing data are largely duplicated, and thus may perform well with general-purpose compression schemes based on LZ-77.
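The effect of reordering on an LZ-77 based compressor can be observed with a small synthetic experiment. The Python sketch below uses `zlib` (DEFLATE, which combines LZ-77 with Huffman coding) and plain lexicographic sorting as a stand-in for any reordering scheme. The data is randomly generated and is not from the document, and the read count, read length, and duplication factor are arbitrary assumptions.

```python
import random
import zlib
from typing import Sequence


def compressed_size(reads: Sequence[str]) -> int:
    """Size in bytes of the newline-joined reads after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))


rng = random.Random(0)

# 2,000 synthetic reads of 100 bases, each duplicated 5 times, in random order.
unique = ["".join(rng.choice("ACGT") for _ in range(100)) for _ in range(2000)]
reads = unique * 5
rng.shuffle(reads)

shuffled_size = compressed_size(reads)
sorted_size = compressed_size(sorted(reads))

print(f"random order: {shuffled_size} bytes")
print(f"sorted order: {sorted_size} bytes")
```

In random order, copies of the same read are usually farther apart than the 32 KB DEFLATE window, so the compressor cannot reference them. After sorting, duplicates are adjacent and become cheap back-references, which is the locality effect described above.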

However, existing approaches have limitations: they do not work properly on data that has changed with the development of new technology, and they do not work properly when the data size increases due to the appearance of 3G sequencing technology and the analysis of cancer genome sequencing data. As a representative example, it has been verified that LFastqC and LFQC do not operate properly on samples of 17,696 MB or larger.

