Data Profiling Method and System

Published: November 10, 2015

Assignee: not available in the USPTO data we have

Inventors: HongLei Guo, Zhi Li Guo, Zhong Su

Technical Abstract

Patent Claims

7 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A non-transitory computer program storage device storing computer program product comprising program codes for performing the steps of a data profiling method for profiling new input data entries not previously processed for data profiling and data integration, wherein said method comprises the steps of: an automatic data processing step of reading a new free-text input data entry set, extracting fragments of each input data entry, assigning semantic features to the fragments, and labeling semantic fragments of each of said input data entries according to said fragments and semantic features of said fragments, wherein said fragment comprises a token sequence representing an independent semantic concept and information unit; and an automatic data analyzing step of, based on said labeled semantic fragments, performing a semantic-level data analysis on the input data entry set to obtain analysis results; wherein the data processing step comprises: a semantic feature extracting step of extracting the fragments of each input data entry and the semantic features of said fragments; a scoring step of scoring said fragments according to the semantic features of said fragments; and a fragment labeling step of labeling semantic fragments according to the scores of said fragments, and wherein said scoring step further comprises: clustering the data entries into a multi-aspect data entry community based on clustering of the fragments; and scoring headwords of each data entry community of the multi-aspect data entry community, wherein said scoring comprises, for a given word unit W_i in a data entry e, calculating a score (W_i, e) by:

$$\mathrm{Score}(W_i, e) = \frac{\sum_{C_j \in C(e) \,\&\, W_j \in \mathrm{Headword}(C_j)} \mathrm{Weight}(W_i, C_j) * \mathrm{Weight}(C_j, e)}{\mathrm{CommunityNum}(e)} \tag{1}$$

$$\mathrm{Weight}(C_j, e) = \frac{\mathrm{CommunitySize}(C_j)}{\sum_{C_i \in C(e)} \mathrm{CommunitySize}(C_i)} \tag{2}$$

where, Weight(W_i, C_j) in equation (1) denotes a weight of word unit W_i in Headword(C_j), Weight(C_j, e) denotes a weight of community C_j in C(e), CommunitySize(C_i) denotes a number of data entries in the community C_i, and CommunityNum(e) denotes a number of communities in which the data entry e is involved.

2. A data profiling method for profiling new data input entries not previously processed for data profiling and data integration comprising: an automatic data processing step of reading a new free-text input data entry set, extracting fragments of each input data entry, assigning semantic features to the fragments, and labeling semantic fragments of each of said input data entries according to said fragments and semantic features of said fragments, wherein said fragment comprises a token sequence representing an independent semantic concept and information unit; and an automatic data analyzing step of, based on said labeled semantic fragments, performing a semantic-level data analysis on the input data entry set to obtain analysis results; wherein the data processing step comprises: a semantic feature extracting step of extracting the fragments of each input data entry and the semantic features of said fragments; a scoring step of scoring said fragments according to the semantic features of said fragments; and a fragment labeling step of labeling semantic fragments according to the scores of said fragments, and wherein said scoring step further comprises: clustering the data entries into a multi-aspect data entry community based on clustering of the fragments; and scoring headwords of each data entry community of the multi-aspect data entry community, wherein said scoring comprises, for a given word unit W_i in a data entry e, calculating a score (W_i, e) by:

$$\mathrm{Score}(W_i, e) = \frac{\sum_{C_j \in C(e) \,\&\, W_j \in \mathrm{Headword}(C_j)} \mathrm{Weight}(W_i, C_j) * \mathrm{Weight}(C_j, e)}{\mathrm{CommunityNum}(e)} \tag{1}$$

$$\mathrm{Weight}(C_j, e) = \frac{\mathrm{CommunitySize}(C_j)}{\sum_{C_i \in C(e)} \mathrm{CommunitySize}(C_i)} \tag{2}$$

where, Weight(W_i, C_j) in equation (1) denotes a weight of word unit W_i in Headword(C_j), Weight(C_j, e) denotes a weight of community C_j in C(e), CommunitySize(C_i) denotes a number of data entries in the community C_i, and CommunityNum(e) denotes a number of communities in which the data entry e is involved.
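
Equations (1) and (2) of the claim can be illustrated with a small sketch. This is not the patentees' implementation; it is a minimal reading of the formula under stated assumptions. The names `community_weight`, `score`, `size`, and `headword_weight` are hypothetical, the headword condition under the sum is read as "W_i is a headword of C_j", and the headword weights Weight(W_i, C_j) are assumed to be given, since the claim does not say how they are computed.

```python
from typing import Dict, List


def community_weight(c: str, communities_of_e: List[str], size: Dict[str, int]) -> float:
    """Equation (2): size of community c divided by the total size of
    all communities in C(e), the communities that entry e belongs to."""
    total = sum(size[ci] for ci in communities_of_e)
    return size[c] / total


def score(
    word: str,
    communities_of_e: List[str],
    size: Dict[str, int],
    headword_weight: Dict[str, Dict[str, float]],
) -> float:
    """Equation (1): sum Weight(W_i, C_j) * Weight(C_j, e) over the
    communities of e in which the word is a headword (assumed reading),
    then divide by CommunityNum(e)."""
    if not communities_of_e:
        return 0.0  # assumption: an entry in no community scores zero
    total = 0.0
    for cj in communities_of_e:
        w = headword_weight.get(cj, {}).get(word)
        if w is None:
            continue  # word is not a headword of this community
        total += w * community_weight(cj, communities_of_e, size)
    return total / len(communities_of_e)
```

For example, if an entry belongs to communities "A" (30 entries) and "B" (10 entries), and a word has headword weight 0.8 in "A" and 0.4 in "B", the community weights are 0.75 and 0.25, and the score is approximately (0.8 * 0.75 + 0.4 * 0.25) / 2 = 0.35.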

3. The method according to claim 2, wherein said semantic feature extracting step further comprises: performing segmentation on the input data entries to obtain a plurality of segmentation units; obtaining a fragment set of said data entries according to said segmentation units; and extracting the semantic features of each fragment in said fragment set to obtain a semantic feature set of said fragment set.
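
The three sub-steps of claim 3 (segmentation, fragment set, semantic features) can be pictured with a toy pipeline. Everything below is an assumption made for illustration: the claim does not specify a tokenizer, how fragments are formed from segmentation units, or which features are extracted. Here segmentation is a simple alphanumeric regex, fragments are contiguous runs of up to `max_len` units, and the "features" are shallow surface properties standing in for real semantic features. The function names `segment`, `fragments`, and `features` are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def segment(entry: str) -> List[str]:
    """Split a free-text entry into segmentation units (assumed: alphanumeric runs)."""
    return re.findall(r"[A-Za-z0-9]+", entry)


def fragments(units: List[str], max_len: int = 2) -> List[Tuple[str, ...]]:
    """Build a candidate fragment set: every contiguous run of 1..max_len units."""
    out = []
    for n in range(1, max_len + 1):
        for i in range(len(units) - n + 1):
            out.append(tuple(units[i:i + n]))
    return out


def features(fragment: Tuple[str, ...]) -> Dict[str, object]:
    """Placeholder surface features for one fragment (not real semantic features)."""
    text = "".join(fragment)
    return {
        "length": len(fragment),
        "has_digit": any(ch.isdigit() for ch in text),
        "all_alpha": text.isalpha(),
    }


units = segment("ThinkPad T410, 4GB")
feature_set = {frag: features(frag) for frag in fragments(units)}
```

A real system would likely replace the surface features with dictionary lookups, part-of-speech tags, or similar signals; the sketch only shows how the three sub-steps feed one another.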

4. The method according to claim 2, wherein said fragment labeling step further comprises: obtaining unique fragments and general fragments according to the scores of said fragments; merging continuous unique fragments into a larger unique fragment; and labeling semantic types of said fragments according to the semantic features of each fragment.
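
The first two sub-steps of claim 4 (splitting fragments into unique and general by score, then merging adjacent unique fragments) can be sketched as follows. The claim does not state how scores map to the two classes, so the cutoff `threshold` and the direction flag `unique_if_below` are hypothetical parameters, as is the function name `label_and_merge`. The third sub-step, labeling semantic types, is not modeled here.

```python
from typing import List, Tuple


def label_and_merge(
    scored: List[Tuple[str, float]],
    threshold: float,
    unique_if_below: bool = True,
) -> List[Tuple[str, str]]:
    """Classify each (fragment, score) pair as 'unique' or 'general' using an
    assumed threshold, and merge runs of adjacent unique fragments into one."""
    merged: List[Tuple[str, str]] = []
    for text, s in scored:
        unique = (s < threshold) if unique_if_below else (s >= threshold)
        kind = "unique" if unique else "general"
        if kind == "unique" and merged and merged[-1][1] == "unique":
            # Extend the previous unique fragment instead of starting a new one.
            merged[-1] = (merged[-1][0] + " " + text, "unique")
        else:
            merged.append((text, kind))
    return merged


example = [("ThinkPad", 0.05), ("T410", 0.02), ("laptop", 0.6), ("battery", 0.4)]
result = label_and_merge(example, threshold=0.3)
```

With the example scores and a threshold of 0.3, "ThinkPad" and "T410" are both classed as unique and merged into the single fragment "ThinkPad T410", while "laptop" and "battery" stay as separate general fragments.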

5. A computer data profiling system for profiling new input data entries not previously processed for data profiling and data integration comprising: a central processing unit (CPU) for implementing processor and analyzer units; a data processor unit for reading a new free-text input data entry set, extracting fragments of each input data entry, assigning semantic features to the fragments, and labeling semantic fragments of each of said input data entries according to said fragments and semantic features of said fragments, wherein said fragment comprises a token sequence representing an independent semantic concept and information unit; and a data analyzer unit connected to said data processor unit and for, based on said labeled semantic fragments from the data processor unit, performing a semantic-level data analysis on the input data entry set to obtain analysis results; wherein said data processor unit operates on components comprises: semantic feature extracting component for extracting the fragments of each input data entry and the semantic features of said fragments; scoring component connected with said semantic feature extracting component and for scoring said fragments according to the semantic features of said fragments from said semantic feature extracting component; and fragment labeling component connected with said scoring component and for labeling semantic fragments according to the scores of said fragments from said scoring component, and wherein said scoring component further comprises: data entry clustering component for clustering the data entries into a multi-aspect data entry community based on clustering of the fragments; and headword scoring component for scoring headwords of each data entry community of the multi-aspect data entry community, wherein said scoring component calculates, for a given word unit W_i in a data entry e, a score (W_i, e) by:

$$\mathrm{Score}(W_i, e) = \frac{\sum_{C_j \in C(e) \,\&\, W_j \in \mathrm{Headword}(C_j)} \mathrm{Weight}(W_i, C_j) * \mathrm{Weight}(C_j, e)}{\mathrm{CommunityNum}(e)} \tag{1}$$

$$\mathrm{Weight}(C_j, e) = \frac{\mathrm{CommunitySize}(C_j)}{\sum_{C_i \in C(e)} \mathrm{CommunitySize}(C_i)} \tag{2}$$

where, Weight(W_i, C_j) in equation (1) denotes a weight of word unit W_i in Headword(C_j), Weight(C_j, e) denotes a weight of community C_j in C(e), CommunitySize(C_i) denotes a number of data entries in the community C_i, and CommunityNum(e) denotes a number of communities in which the data entry e is involved.

6. The system according to claim 5, wherein said semantic feature extracting component further comprises: segmentation component for performing segmentation on the input data entries to obtain a plurality of segmentation units; fragmentation component for obtaining a fragment set of said data entries according to said segmentation units; and semantic feature extraction component for extracting the semantic features of each fragment in said fragment set to obtain a semantic feature set of said fragment set.

7. The system according to claim 5, wherein said fragment labeling component further comprises: fragment identifying component for obtaining unique fragments and general fragments according to the scores of said fragments; merging component for merging continuous unique fragments into a larger unique fragment; and semantic labeling component for semantically labeling types of said fragments according to the semantic features of each fragment.

Patent Metadata

Filing Date

Unknown

