Inferences drawn from clustering analysis cannot be reliably assessed until errors originating in the data are quantified, an exacting task that is often not performed. This invention presents a clustering method suited to this purpose. Designed for systems with normally distributed error, a trait common to many data systems, and built on a framework of agglomerative hierarchical clustering, the method treats each observation as a Gaussian distribution function, uses an exact mathematical relation to track error, and gives results from which quantitative statistics are easily extracted.
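The procedure summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes each observation is a (mean, variance) tuple, compares every pair of datums, and uses the distance, variance, and mean relations recited in the claims. The function names `distance`, `merge`, and `cluster`, and the example threshold, are hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def distance(a, b):
    """|t| = |mu_a - mu_b| / sqrt(var_a + var_b) for (mean, variance) datums."""
    (mu_a, var_a), (mu_b, var_b) = a, b
    return abs(mu_a - mu_b) / math.sqrt(var_a + var_b)


def merge(a, b):
    """Combine two datums: inverse variances add, means are weighted by them."""
    (mu_a, var_a), (mu_b, var_b) = a, b
    var = 1.0 / (1.0 / var_a + 1.0 / var_b)
    mu = var * (mu_a / var_a + mu_b / var_b)
    return (mu, var)


def cluster(datums, threshold):
    """Merge the closest pair repeatedly until no pair is below the threshold."""
    datums = list(datums)
    while len(datums) > 1:
        # Pick the pair of indices with the least distance value.
        i, j = min(
            combinations(range(len(datums)), 2),
            key=lambda p: distance(datums[p[0]], datums[p[1]]),
        )
        if distance(datums[i], datums[j]) >= threshold:
            break  # closest pair is not below the threshold: stop clustering
        new = merge(datums[i], datums[j])
        # Replace the two merged datums with the new one (j > i, so drop j first).
        del datums[j]
        del datums[i]
        datums.append(new)
    return datums


# Hypothetical data: two nearby observations and one distant observation.
result = cluster([(0.0, 1.0), (0.2, 1.0), (5.0, 1.0)], threshold=2.0)
```

In this sketch the merged datum carries a smaller variance than either input, which is how the error estimate is tracked through successive merges.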
Legal claims defining the scope of protection, as filed with the USPTO.
1. A computer-implemented method for clustering data, comprising: obtaining a refined dataset, wherein the refined dataset comprises a plurality of refined datums, each of the refined datums comprises a refined mean and a refined variance; calculating a plurality of refined distance values of a plurality of refined datum pairs, wherein each of the refined datum pairs is formed by two of the refined datums, the refined distance values of the refined datum pairs are calculated using the refined means and the refined variances of the refined datums which form the refined datum pairs; selecting one of the refined datum pairs with the least distance value; clustering the refined datums, which form the selected refined datum pair, into a new datum; and replacing the refined datums, which form the selected refined datum pair, with the new datum.
2. The computer-implemented method for clustering data of claim 1 , further comprising: obtaining a plurality of raw datums; and refining the raw datums into the refined datums which have normally distributed errors.
3. The computer-implemented method for clustering data of claim 1, wherein the refined distance values of the refined datum pairs are calculated as follows: $|t_{i,j}| = |\mu_i - \mu_j| / \sqrt{\sigma_i^2 + \sigma_j^2}$, wherein $|t_{i,j}|$ is the refined distance values of the refined datum pair formed by two of the refined datums i and j, $\mu_i$ and $\mu_j$ are the refined means of the refined datums i and j respectively, $\sigma_i^2$ and $\sigma_j^2$ are the refined variances of the refined datums i and j respectively.
4. The computer-implemented method for clustering data of claim 1 , wherein clustering the refined datums, which form the selected refined datum pair, into the new datum comprises: calculating a new variance of the new datum according to the refined variances of the refined datums which form the selected refined datum pair.
5. The computer-implemented method for clustering data of claim 4, wherein the new variance of the new datum is calculated as follows: $(\sigma^2)^{-1} = (\sigma_i^2)^{-1} + (\sigma_j^2)^{-1}$, wherein $\sigma^2$ is the new variance of the new datum, $\sigma_i^2$ and $\sigma_j^2$ are, respectively, the refined variances of the refined datums i and j which form the selected refined datum pair.
6. The computer-implemented method for clustering data of claim 5 , wherein clustering the refined datums, which form the selected refined datum pair, into the new datum further comprises: calculating a new mean of the new datum according to the new variance of the new datum and the refined means of the refined datums which form the selected refined datum pair.
7. The computer-implemented method for clustering data of claim 6, wherein the new mean of the new datum is calculated as follows: $\mu(\sigma^2)^{-1} = \mu_i(\sigma_i^2)^{-1} + \mu_j(\sigma_j^2)^{-1}$, wherein $\mu$ is the new mean of the new datum, $\mu_i$ and $\mu_j$ are, respectively, the refined means of the refined datums i and j which form the selected refined datum pair.
8. The computer-implemented method for clustering data of claim 1 , further comprises: obtaining a distance threshold; before clustering the refined datums, which form the selected refined datum pair, into the new datum, determining if the distance value of the selected refined datum pair is less than the distance threshold; clustering the refined datums, which form the selected refined datum pair, into the new datum if the distance value of the selected refined datum pair is less than the distance threshold.
9. The computer-implemented method for clustering data of claim 8 , further comprises: not clustering the refined datums, which form the selected refined datum pair, into the new datum if the distance value of the selected refined datum pair is not less than the distance threshold.
10. The computer-implemented method for clustering data of claim 1 , wherein obtaining the refined dataset comprises: obtaining log 2-ratios of fluorescent intensities measured by probesets of a first microarray to fluorescent intensities measured by corresponding probesets of a second microarray; and taking the obtained log 2-ratios as the refined means of the refined datums of the refined dataset.
11. The computer-implemented method for clustering data of claim 10, wherein the log 2-ratios of the fluorescent intensities measured by the probesets of the first microarray to the fluorescent intensities measured by the corresponding probesets of the second microarray is calculated as follows: $\text{log2-ratio}_i = \log_2(I_i^T / I_i^N)$, wherein $\text{log2-ratio}_i$ is the log 2-ratio of probeset i, $I_i^T$ and $I_i^N$ are, respectively, the fluorescent intensity measured by probeset i of the first microarray T and the fluorescent intensity measured by probeset i of the second microarray N.
12. The computer-implemented method for clustering data of claim 10 , wherein calculating the refined distance values of the refined datum pairs further comprises: sorting the refined datums by their genomic positions, wherein each of the refined datum pairs is formed by two contiguous refined datums of the same chromosome.
13. The computer-implemented method for clustering data of claim 10 , wherein calculating the refined distance values of the refined datum pairs further comprises: sorting the refined datums by their genomic positions, wherein each of the refined datum pairs is formed by two refined datums of the same exon.
14. The computer-implemented method for clustering data of claim 10 , wherein calculating the refined distance values of the refined datum pairs further comprises: sorting the refined datums by their genomic positions, wherein each of the refined datum pairs is formed by two refined datums of the same promoter region.
15. A computer-readable medium encoded with a computer program to execute a method for clustering data, wherein the method for clustering data comprises: obtaining a refined dataset, wherein the refined dataset comprises a plurality of refined datums, each of the refined datums comprises a refined mean and a refined variance; calculating a plurality of refined distance values of a plurality of refined datum pairs, wherein each of the refined datum pairs is formed by two of the refined datums, the refined distance values of the refined datum pairs are calculated using the refined means and the refined variances of the refined datums which form the refined datum pairs; selecting one of the refined datum pairs with the least distance value; clustering the refined datums, which form the selected refined datum pair, into a new datum; and replacing the refined datums, which form the selected refined datum pair, with the new datum.
16. The computer-readable medium of claim 15 , wherein the method for clustering data further comprises: obtaining a plurality of raw datums; and refining the raw datums into the refined datums which have normally distributed errors.
17. The computer-readable medium of claim 15, wherein the refined distance values of the refined datum pairs are calculated as follows: $|t_{i,j}| = |\mu_i - \mu_j| / \sqrt{\sigma_i^2 + \sigma_j^2}$, wherein $|t_{i,j}|$ is the refined distance values of the refined datum pair formed by two of the refined datums i and j, $\mu_i$ and $\mu_j$ are the refined means of the refined datums i and j respectively, $\sigma_i^2$ and $\sigma_j^2$ are the refined variances of the refined datums i and j respectively.
18. The computer-readable medium of claim 15 , wherein clustering the refined datums, which form the selected refined datum pair, into the new datum comprises: calculating a new variance of the new datum according to the refined variances of the refined datums, which form the selected refined datum pair.
19. The computer-readable medium of claim 18, wherein the new variance of the new datum is calculated as follows: $(\sigma^2)^{-1} = (\sigma_i^2)^{-1} + (\sigma_j^2)^{-1}$, wherein $\sigma^2$ is the new variance of the new datum, $\sigma_i^2$ and $\sigma_j^2$ are, respectively, the refined variances of the refined datums i and j which form the selected refined datum pair.
20. The computer-readable medium of claim 19 , wherein clustering the refined datums, which form the selected refined datum pair, into the new datum further comprises: calculating a new mean of the new datum according to the new variance of the new datum and the refined means of the refined datums, which form the selected refined datum pair.
21. The computer-readable medium of claim 20, wherein the new mean of the new datum is calculated as follows: $\mu(\sigma^2)^{-1} = \mu_i(\sigma_i^2)^{-1} + \mu_j(\sigma_j^2)^{-1}$, wherein $\mu$ is the new mean of the new datum, $\mu_i$ and $\mu_j$ are, respectively, the refined means of the refined datums i and j which form the selected refined datum pair.
22. The computer-readable medium of claim 15 , wherein the method for clustering data further comprises: obtaining a distance threshold; before clustering the refined datums, which form the selected refined datum pair, into the new datum, determining if the distance value of the selected refined datum pair is less than the distance threshold; clustering the refined datums, which form the selected refined datum pair, into the new datum if the distance value of the selected refined datum pair is less than the distance threshold.
23. The computer-readable medium of claim 22 , wherein the method for clustering data further comprises: not clustering the refined datums, which form the selected refined datum pair, into the new datum if the distance value of the selected refined datum pair is not less than the distance threshold.
24. The computer-readable medium of claim 15 , wherein obtaining the refined dataset comprises: obtaining log 2-ratios of fluorescent intensities measured by probesets of a first microarray to fluorescent intensities measured by corresponding probesets of a second microarray; and taking the obtained log 2-ratios as the refined means of the refined datums of the refined dataset.
25. The computer-readable medium of claim 24, wherein the log 2-ratios of the fluorescent intensities measured by the probesets of the first microarray to the fluorescent intensities measured by the corresponding probesets of the second microarray is calculated as follows: $\text{log2-ratio}_i = \log_2(I_i^T / I_i^N)$, wherein $\text{log2-ratio}_i$ is the log 2-ratio of probeset i, $I_i^T$ and $I_i^N$ are, respectively, the fluorescent intensity measured by probeset i of the first microarray T and the fluorescent intensity measured by probeset i of the second microarray N.
26. The computer-readable medium of claim 24 , wherein calculating the refined distance values of the refined datum pairs further comprises: sorting the refined datums by their genomic positions, wherein each of the refined datum pairs is formed by two contiguous refined datums of the same chromosome.
27. The computer-readable medium of claim 24 , wherein calculating the refined distance values of the refined datum pairs further comprises: sorting the refined datums by their genomic positions, wherein each of the refined datum pairs is formed by two refined datums of the same exon.
28. The computer-readable medium of claim 24 , wherein calculating the refined distance values of the refined datum pairs further comprises: sorting the refined datums by their genomic positions, wherein each of the refined datum pairs is formed by two refined datums of the same promoter region.
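The microarray variant in claims 10 to 12 and 24 to 26 can be sketched as follows. This is an illustrative reading, not the patented implementation. It assumes all probesets lie on one chromosome, that each probeset comes with a genomic position and a variance supplied by the caller (the claims do not say how the variance is obtained), and that only neighbouring datums in genomic order form candidate pairs. The names `log2_ratio` and `cluster_adjacent`, the intensities, the variance of 0.01, and the threshold of 3.0 are all hypothetical.

```python
import math


def log2_ratio(intensity_t, intensity_n):
    """log2 of the first-array intensity over the second-array intensity."""
    return math.log2(intensity_t / intensity_n)


def cluster_adjacent(probes, threshold):
    """Cluster (position, mean, variance) probes on one chromosome.

    Only datums that are contiguous in genomic order are paired.
    Returns a list of (mean, variance) segments in genomic order.
    """
    ordered = sorted(probes, key=lambda p: p[0])  # sort by genomic position
    segs = [(mean, var) for _, mean, var in ordered]
    while len(segs) > 1:
        # Distance |t| between each datum and its right-hand neighbour.
        dists = [
            abs(segs[k][0] - segs[k + 1][0])
            / math.sqrt(segs[k][1] + segs[k + 1][1])
            for k in range(len(segs) - 1)
        ]
        k = min(range(len(dists)), key=dists.__getitem__)
        if dists[k] >= threshold:
            break  # no neighbouring pair is below the threshold
        (m1, v1), (m2, v2) = segs[k], segs[k + 1]
        var = 1.0 / (1.0 / v1 + 1.0 / v2)   # inverse variances add
        mean = var * (m1 / v1 + m2 / v2)    # inverse-variance-weighted mean
        segs[k:k + 2] = [(mean, var)]       # replace the pair with the new datum
    return segs


# Hypothetical intensities for four probesets at positions 10, 20, 30, 40.
tumor = [200.0, 210.0, 400.0, 410.0]
normal = [100.0, 100.0, 100.0, 100.0]
probes = [
    (pos, log2_ratio(t, n), 0.01)  # assumed variance of 0.01 per probeset
    for pos, t, n in zip([10, 20, 30, 40], tumor, normal)
]
segments = cluster_adjacent(probes, threshold=3.0)
```

Restricting candidate pairs to neighbours keeps each merged datum a contiguous genomic segment, which is one plausible reason for the sorting step in these claims.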
June 23, 2010
May 15, 2012