A method and system for identifying a biological sample generate a data set indicative of the composition of the biological sample. In a particular example, the data set is DNA spectrometry data received from a mass spectrometer. The data set is denoised, and a baseline is removed. Since possible compositions of the biological sample may be known, expected peak areas may be determined. Using the expected peak areas, a residual baseline is generated to further correct the data set. Probable peaks are then identifiable in the corrected data set and are used to identify the composition of the biological sample. In a disclosed example, statistical methods are employed to determine the probability that a probable peak is an actual peak or is not an actual peak, or to determine that the data are too inconclusive to call.
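The residual-baseline step described above can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the window widths `left` and `right`, and the use of linear interpolation across the masked regions are all assumptions made for illustration, since the abstract does not say how the baseline is fitted. The sketch also assumes that `masses` is strictly increasing.

```python
from bisect import bisect_left


def residual_baseline(masses, intensities, peak_centers, left=50.0, right=50.0):
    """Estimate a baseline using only points outside the expected peak areas.

    Hypothetical helper: window widths and linear interpolation are
    illustrative assumptions, not values taken from the source.
    """
    # Keep points that fall outside every [center - left, center + right] window.
    keep = [
        i for i, m in enumerate(masses)
        if not any(c - left <= m <= c + right for c in peak_centers)
    ]
    if not keep:
        raise ValueError("every point lies inside an expected peak area")
    kept_m = [masses[i] for i in keep]
    kept_y = [intensities[i] for i in keep]

    baseline = []
    for m in masses:
        j = bisect_left(kept_m, m)
        if j == 0:
            # At or before the first kept point: hold its value.
            baseline.append(kept_y[0])
        elif j == len(kept_m):
            # After the last kept point: hold its value.
            baseline.append(kept_y[-1])
        else:
            # Interpolate linearly between the neighbouring kept points.
            m0, m1 = kept_m[j - 1], kept_m[j]
            y0, y1 = kept_y[j - 1], kept_y[j]
            t = (m - m0) / (m1 - m0)
            baseline.append(y0 + t * (y1 - y0))
    return baseline


def remove_residual_baseline(masses, intensities, peak_centers, **windows):
    """Subtract the estimated residual baseline from the intensities."""
    base = residual_baseline(masses, intensities, peak_centers, **windows)
    return [y - b for y, b in zip(intensities, base)]
```

For example, with masses `[0, 100, 200, 300, 400]`, intensities `[1, 2, 13, 4, 5]`, and one expected peak at 200, the point at 200 is masked, the baseline there is interpolated from its neighbours, and only the peak height above that sloped baseline remains after subtraction.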
Legal claims defining the scope of protection, as filed with the USPTO.
1. An automated method for identifying a component in a DNA sample, comprising: using a mass spectrometer to generate a computer readable data set comprising data representing components in the biological sample for analysis by a computer, and using the computer to: denoise the data set to generate denoised data; correct a baseline from the denoised data to generate an intermediate data set, the intermediate data set having a plurality of data values associated with respective points in an array of data; compress the intermediate data set to obtain compressed data; define putative peaks in the compressed data, wherein the putative peaks represent components in the DNA sample; generate a residual baseline by removing the putative peaks from the compressed data; remove the residual baseline from the compressed data to generate a corrected data set; locate a putative peak in the corrected data set; and identify the component that corresponds to the located putative peak; wherein the compressed data comprises compressed data points and wherein a compressed data point is a real number that includes a whole number portion that is determined by calculating the difference between the whole number portions of two consecutive points in the array of data.
2. An automated method for identifying a component in a DNA sample, comprising: using a mass spectrometer to generate a computer readable data set comprising data representing components in the biological sample for analysis by a computer, and using the computer to: denoise the data set to generate denoised data; correct a baseline from the denoised data to generate an intermediate data set, the intermediate data set having a plurality of data values associated with respective points in an array of data; compress the intermediate data set to obtain compressed data; define putative peaks in the compressed data, wherein the putative peaks represent components in the DNA sample; generate a residual baseline by removing the putative peaks from the compressed data; remove the residual baseline from the compressed data to generate a corrected data set; locate a putative peak in the corrected data set; and identify the component that corresponds to the located putative peak; wherein the compressed data comprises compressed data points and wherein a compressed data point is a real number that includes a decimal portion representing the difference between a maximum value of all the data values and a value at a particular point in the array.
3. An automated method for identifying a component in a DNA sample, comprising: using a mass spectrometer to generate a computer readable data set comprising data representing components in the biological sample for analysis by a computer, and using the computer to: denoise the data set to generate denoised data; correct a baseline from the denoised data to generate an intermediate data set; define putative peaks in the intermediate data set, wherein the putative peaks represent components in the DNA sample; generate a residual baseline by removing the putative peaks from the intermediate data set, comprising the steps of a) identifying the center line of each putative peak; b) removing an area to the right of the center line of each putative peak; and c) removing an area equal to twice the width of the Gaussian curve fit to each putative peak from the left of the center line of each putative peak; remove the residual baseline from the intermediate data set to generate a corrected data set; locate a putative peak in the corrected data set; and identify the component that corresponds to the located putative peak.
4. An automated method for identifying a component in a DNA sample, comprising: using a mass spectrometer to generate a computer readable data set comprising data representing components in the biological sample for analysis by a computer, and using the computer to: denoise the data set to generate denoised data; correct a baseline from the denoised data to generate an intermediate data set; define putative peaks in the intermediate data set, wherein the putative peaks represent components in the DNA sample; generate a residual baseline by removing the putative peaks from the intermediate data set, comprising the steps of a) identifying the center line of each putative peak; b) removing an area equal to the area corresponding to 50 Daltons along the x-axis to the right of the center line of each putative peak; and c) removing an area to the left of the center line of each putative peak; remove the residual baseline from the intermediate data set to generate a corrected data set; locate a putative peak in the corrected data set; and identify the component that corresponds to the located putative peak.
5. The method of claim 1, 2, 3, or 4 further comprising: determining a peak probability for the putative peak; and multiplying the peak probability by an allelic penalty to obtain a final peak probability.
6. The method of claim 1, 2, 3, or 4 further comprising: calculating a peak probability that a putative peak in the corrected data is a peak indicating composition of the DNA sample; calculating a peak probability for each of a plurality of putative peaks in the corrected data; and comparing the highest peak probability to a second-highest peak probability to generate a calling ratio.
7. The method according to claim 6 wherein the calling ratio is used to determine if the composition of the DNA sample will be called.
8. The method according to claim 5, wherein the peak probability is determined from a probability profile.
9. The method according to claim 5, comprising determining an allelic ratio, wherein the allelic ratio is a comparison of two peak heights in the corrected data, and assigning the allelic penalty to the allelic ratio.
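The probability steps in claims 5 through 7 can be sketched in a few lines. This is an illustration under stated assumptions, not the claimed implementation: the function names and the `threshold` parameter are hypothetical, and the claims say only that the two highest probabilities are "compared", so forming the calling ratio by division is an assumption. How an allelic penalty is assigned to an allelic ratio is not specified in the claims, so the penalty is taken here as an input.

```python
import math


def final_peak_probability(peak_probability, allelic_penalty):
    """Claim 5: multiply the peak probability by an allelic penalty."""
    return peak_probability * allelic_penalty


def calling_ratio(peak_probabilities):
    """Claim 6: compare the highest and second-highest peak probabilities.

    Assumption: the comparison is expressed as highest / second-highest.
    """
    ranked = sorted(peak_probabilities, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two putative peaks")
    best, second = ranked[0], ranked[1]
    return math.inf if second == 0 else best / second


def call_composition(probability_by_label, threshold):
    """Claim 7: use the calling ratio to decide whether to call.

    Returns the label of the most probable peak, or None when the ratio
    falls below the (hypothetical) threshold and no call is made.
    """
    ratio = calling_ratio(probability_by_label.values())
    if ratio < threshold:
        return None
    return max(probability_by_label, key=probability_by_label.get)
```

With probabilities `{"A": 0.75, "G": 0.25}` the calling ratio is 3.0, so a threshold of 2.0 yields the call `"A"`, while a threshold of 4.0 yields no call.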
September 19, 2000
March 29, 2011