The present disclosure discloses a method and system for constructing a knowledge graph of a standard data element of a biomedical dataset, comprising collecting relevant standard texts of data elements of different types of biomedical datasets and data of a relevant standard of the biomedical dataset; analyzing and summarizing the relevant standard texts of the data elements of the different types of biomedical datasets and the data of the relevant standard of the biomedical dataset; constructing a knowledge model of the knowledge graph of the standard data element of the biomedical dataset; extracting entity type data and attribute data from structured data and an unstructured text in the structured data; and obtaining the knowledge graph of the standard data element of the biomedical dataset by performing knowledge fusion on a plurality of types of data based on a plurality of types of semantic associative relationships between one or more entity types.
. The method of, further comprising:
. The method of, further comprising storing and performing quality inspection on the knowledge graph, wherein
. (canceled)
. The method of, wherein a process for extracting the entity type data and the attribute data from the unstructured text includes:
. The method of, wherein the performing the knowledge fusion on the plurality of types of data includes:
The present disclosure claims priority to Chinese Patent Application 202410595015.0, filed on May 14, 2024, the entire contents of which are incorporated herein by reference.
The present disclosure relates to the field of medical data processing technology, and in particular, to methods and systems for constructing knowledge graphs of standard data elements of biomedical datasets.
Currently, sharing biomedical data has the potential to improve the efficiency of medical research and enhance the transparency of studies. The academic community has also set stringent requirements for research reproducibility and data openness. As a result, more and more biomedical researchers are opting to publicly share their raw data. However, the complexity of biomedical data, particularly in terms of semantics, often leads to challenges such as synonyms and ambiguities. Moreover, the lack of standardized regulations and unified guidelines for data fields and value domains contributes to unclear data semantics, making it difficult to compare datasets or conduct joint analyses across different sources. For example, the English name of the field or variable “gender” in a dataset can be represented as either “gender” or “sex”. In terms of value range, it can be directly represented by text as “male” or “female”, or it can be represented numerically with 0 for male and 1 for female. Without standardized names of data elements and value range specifications, it becomes impossible to integrate or jointly analyze fields or variables with the same semantics across different datasets. Researchers also face difficulties in understanding data semantics, which hampers their ability to effectively utilize the data for analysis, which significantly obstructs data sharing. Therefore, data elements and data element standards of datasets are crucial as they can standardize and unify data structure and semantic expression. However, current data standards are often published in unstructured forms such as PDFs. Many dataset standards in clinical specialties involve 200 to 300 data elements, and different data elements may be defined differently or use different value domains. At present, these standards only provide text-based search, reading, and understanding, making it difficult to effectively utilize them when creating data elements. 
They are not machine-readable and have poor processability, which is why these standards are difficult to apply and implement.
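As a minimal illustration of the naming and value-range problem described above, the following Python sketch maps the “gender”/“sex” field variants and the text/numeric value variants onto a single representation. The alias tables and the function name `harmonize` are assumptions made for illustration only and are not part of the disclosed method; the 0/1 coding follows the example given in the text.

```python
# Assumed alias tables, based only on the example in the text: the field may be
# named "gender" or "sex", and values may be text or 0 (male) / 1 (female).
FIELD_ALIASES = {"gender": "gender", "sex": "gender"}
VALUE_ALIASES = {"male": "male", "female": "female", "0": "male", "1": "female"}


def harmonize(record: dict) -> dict:
    """Map known field and value variants onto one standard representation."""
    out = {}
    for field, value in record.items():
        # Unknown field names are passed through unchanged.
        std_field = FIELD_ALIASES.get(field.strip().lower(), field)
        if std_field == "gender":
            # Unknown values are passed through unchanged.
            value = VALUE_ALIASES.get(str(value).strip().lower(), value)
        out[std_field] = value
    return out
```

Without an agreed data element name and value domain, such alias tables would have to be maintained by hand for every pair of datasets, which is the integration burden the background describes.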
Therefore, how to improve machine readability and semantic interoperability while enhancing the usability and utilization of metadata, data elements, classifications, and value domain standards in field-specific datasets is an urgent problem that needs to be addressed by professionals in this field.
In view of this, the present disclosure provides a method and system for constructing a knowledge graph of a standard data element of a biomedical dataset, which aim to collect dataset standards, classifications of data standards, and value domain standards in the field of biomedical science data, perform fragmented and standardized processing, and merge semantic meanings of data elements through part-of-speech and semantic calculations to establish effective associations. Subsequently, a knowledge model of the standard data element of the biomedical dataset is designed and the knowledge graph is constructed to support the standardization of data fields/variables and their value domains. The present disclosure takes the standard data element of the biomedical dataset as an example, and the method and system disclosed can be generalized to the design and implementation of knowledge graphs of data elements of datasets in other fields. On the one hand, the method and system disclosed can enhance the usability and utilization of field-specific sets of data elements, classifications of data elements, and value domain standards. On the other hand, they are conducive to unifying data elements, standardizing the establishment of sets of data elements, and refining and enriching the associations between different dataset standards, sets of data elements, data elements, concepts of data elements, and value domains of data elements, thereby improving machine readability and semantic interoperability.
In order to realize the above purposes, the present disclosure adopts the following technical solution:
One of the embodiments of the present disclosure provides a method for constructing a knowledge graph of a standard data element of a biomedical dataset, comprising:
In some embodiments, the structured data and the unstructured text in the structured data are obtained by performing optical character recognition (OCR) on the relevant standard texts of the data elements of the different types of the biomedical datasets and parsing the relevant standard texts using a natural language processing (NLP) manner.
In some embodiments, the method further comprises storing and performing quality inspection on the knowledge graph. The storing includes establishing a plurality of entity attribute tables and a plurality of entity triple relationship tables, performing batch conversion, importing and converting triple data to UTF-8, and storing the knowledge graph using a Neo4j graph database. The quality inspection includes, after importing the triple data into the Neo4j graph database, performing data sampling to verify the correctness of the triple data, so as to ensure the correctness of entity types and relevance relationships.
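As a minimal sketch of the storage preparation and sampling described above, the following Python serializes an entity attribute table and a triple relationship table as UTF-8 CSV and draws a random sample of triples for checking. The column names, identifiers, relation names, and the helpers `write_tables` and `sample_triples` are hypothetical and chosen for illustration; the disclosure does not specify them, and the import into the Neo4j graph database itself is not shown.

```python
import csv
import io
import random

# Hypothetical example data; ids, type labels, and relation names are illustrative.
ENTITIES = [
    {"id": "DE001", "type": "DataElement", "name": "gender"},
    {"id": "VD001", "type": "ValueDomain", "name": "gender code"},
    {"id": "DS001", "type": "DataElementSet", "name": "basic information"},
]
TRIPLES = [
    ("DS001", "INCLUDES", "DE001"),
    ("DE001", "HAS_VALUE_DOMAIN", "VD001"),
]


def write_tables(entities, triples):
    """Serialize the entity table and the triple table to UTF-8 encoded CSV bytes."""
    ent_buf = io.StringIO(newline="")
    ent_writer = csv.DictWriter(ent_buf, fieldnames=["id", "type", "name"])
    ent_writer.writeheader()
    ent_writer.writerows(entities)

    rel_buf = io.StringIO(newline="")
    rel_writer = csv.writer(rel_buf)
    rel_writer.writerow(["head", "relation", "tail"])
    rel_writer.writerows(triples)

    return ent_buf.getvalue().encode("utf-8"), rel_buf.getvalue().encode("utf-8")


def sample_triples(triples, entities, k, seed=0):
    """Draw up to k triples and flag whether both endpoints are known entities."""
    known = {e["id"] for e in entities}
    rng = random.Random(seed)
    picked = rng.sample(triples, min(k, len(triples)))
    return [(t, t[0] in known and t[2] in known) for t in picked]
```

In this sketch the sampling check only confirms that both endpoints of a triple exist in the entity table; verifying that the entity type and the relationship are semantically correct would still require review against the source standard.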
In some embodiments, a process for extracting the entity type data and the attribute data from the structured data is as follows:
In some embodiments, a process for extracting the entity type data and the attribute data from the unstructured text in the structured data is as follows:
In some embodiments, the types of semantic associative relationships between the one or more entity types include: a relationship between data standards, a relationship between a set of data elements and the data elements, a relationship between the data elements and concepts of the data elements, a relationship between the data elements, a relationship between the data elements and value domains of the data elements, a relationship between a dataset standard and a medical scale/questionnaire, and a relationship between the data elements and the medical scale/questionnaire. The relationship between data standards is pluralistic. The data standards and the dataset of data elements are in an inclusion relationship, the dataset of data elements and the data elements are in an inclusion relationship, and the dataset of data elements includes a plurality of data elements. The relationship between the data elements includes a synonymous relationship, a relevant relationship, and an irrelevant relationship. The value domains of the data elements are classified, based on a source of the value domains and a usage manner, into four types: an enumeration with external reference type, an enumeration with internal reference type, an enumeration defined within a standard type, and a non-enumerated type. The medical scale is used in the dataset standard; a scale name and information are extracted from a text, and a connection between a specific medical scale and the data element is established by complementing resources of the medical scale. Each of the data elements is a storage name in a standardized dataset of the medical scale, and an association between the data elements and the specific medical scale is established.
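The three relationships between data elements and the four value domain types named above can be sketched as Python enumerations. The classification rule in `classify_value_domain`, the `ValueDomain` fields, and the `"external"`/`"internal"` labels are assumptions introduced for illustration; they are one possible reading of "source of the value domains and usage manner" and do not reproduce the criteria of the disclosure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ElementRelation(Enum):
    SYNONYMOUS = "synonymous"
    RELEVANT = "relevant"
    IRRELEVANT = "irrelevant"


class ValueDomainType(Enum):
    ENUM_EXTERNAL_REFERENCE = "enumeration with external reference"
    ENUM_INTERNAL_REFERENCE = "enumeration with internal reference"
    ENUM_DEFINED_IN_STANDARD = "enumeration defined within a standard"
    NON_ENUMERATED = "non-enumerated"


@dataclass(frozen=True)
class ValueDomain:
    name: str
    is_enumerated: bool
    # Assumed encoding of the source: "external", "internal", or None.
    reference: Optional[str] = None


def classify_value_domain(vd: ValueDomain) -> ValueDomainType:
    """Assumed rule of thumb, not the criteria used in the disclosure."""
    if not vd.is_enumerated:
        return ValueDomainType.NON_ENUMERATED
    if vd.reference == "external":
        return ValueDomainType.ENUM_EXTERNAL_REFERENCE
    if vd.reference == "internal":
        return ValueDomainType.ENUM_INTERNAL_REFERENCE
    # Enumerated values listed directly, with no reference to another source.
    return ValueDomainType.ENUM_DEFINED_IN_STANDARD
```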
In some embodiments, a process for determining the relationship between the data elements includes:
In some embodiments, a process for determining the relationship between the data elements and the value domains of the data elements and determining the types of the value domains includes:
In some embodiments, performing the knowledge fusion on the plurality of types of data specifically includes:
One of the embodiments of the present disclosure provides a system for constructing a knowledge graph of a standard data element of a biomedical dataset, comprising the following modules:
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present disclosure, and it is clear that the embodiments described are only a portion of the embodiments of the present disclosure, and not all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without making creative labor fall within the scope of protection of the present disclosure.
is a flowchart illustrating a method for constructing a knowledge graph of a standard data element of a biomedical dataset according to some embodiments of the present disclosure. The method may be executed by a processor. The processor may be located on any terminal device. The terminal device may include a computer, a tablet, a cell phone, or the like.
As shown in, embodiments of the present disclosure disclose the method for constructing a knowledge graph of a standard data element of a biomedical dataset (hereinafter referred to as the method), including stepto step.
Step, collecting relevant standard texts of data elements of different types of biomedical datasets and data of a relevant standard of the biomedical dataset.
The biomedical dataset may be a set including various data within a biomedical field. For example, the biomedical dataset may include various forms of data collected in processes of biomedical research, clinical practice, health monitoring, or the like.
Different types of biomedical datasets may be understood as biomedical datasets from different data sources.
In some embodiments, the biomedical dataset may include a plurality of data elements.
The data element may be a storage name in a standardized database of a medical scale, and is configured to establish an association between the data element and a particular medical scale. The medical scale may be used to assist in evaluating the severity of a patient's disease. The data element may standardize and normalize various data in the biomedical dataset, making data exchange between different systems more convenient.
The standard text refers to a text that has been normalized or standardized. For example, the standard text may include documents in formats such as .pdf, .doc, and so on.
The data of the relevant standard of the biomedical dataset may be data related to the relevant standard of the biomedical dataset. The relevant standard of the biomedical dataset may include, but is not limited to, a dataset standard, a classification standard, a coding standard, and a value domain code standard, as well as relevant external resources involved in a relevant standard of an extended biomedical dataset.
The relevant external resources may include scientific literature, medical glossaries (ICD, UMLS, etc.), etc. A level of the relevant standard of the biomedical dataset may include a national standard, an industry standard, a local standard, and a group standard. The dataset standard may include standards of various types of general datasets, such as a directory of health information data elements, a value domain code of health information data elements, a basic dataset for disease control, a basic information dataset, a basic dataset of medical service, and a basic dataset of electronic medical record (EMR), and may also include standards of specialized disease datasets, such as those for orthopedics, traditional Chinese medicine, hypertension, or the like.
Step, analyzing and summarizing the relevant standard texts of the data elements of the different types of biomedical datasets and the data of the relevant standard of the biomedical dataset to construct a knowledge model of the knowledge graph of the standard data element of the biomedical dataset, parse the data, and extract fine-grained content.
The standard data element refers to a data element that has undergone standardization. The knowledge graph refers to a data structure that organizes standard data elements in a form of nodes and relationships. The knowledge graph can help users better understand and work with complex standard data elements.
In some embodiments, the processor may analyze and summarize raw data involved in the relevant standard texts regarding provisions in the data of the relevant standards of the biomedical dataset, so as to obtain analyzed and summarized data. For example, if the data of the relevant standard of the biomedical dataset includes Compilation Standard of Basic Health Information Dataset, the processor may, according to the Compilation Standard of Basic Health Information Dataset, analyze and summarize raw data from relevant standard documents of the data elements to obtain analyzed and summarized data.
Parsing the data refers to converting the analyzed and summarized data into data in a form that is easy to understand and process. For example, parsing the data involves cleaning, transforming, or formatting the analyzed and summarized data.
Extracting fine-grained content refers to a process for decomposing the parsed data into a plurality of portions and extracting feature information. The feature information may include, for example, differences in images displayed on different image data. The feature information may be used for fine-grained analysis to reveal a pattern in data.
is a schematic diagram illustrating an exemplary structure of a knowledge model of a knowledge graph of a standard data element of a biomedical dataset according to some embodiments of the present disclosure.
In some embodiments, the knowledge model may be a knowledge model as shown in.
Step, constructing the knowledge model of the knowledge graph of the standard data element of the biomedical dataset, defining one or more entity types, establishing an attribute of each of the one or more entity types and types of semantic associative relationships between the one or more entity types.
The entity types in the knowledge model of the knowledge graph of the standard data element of the biomedical dataset may include, but are not limited to, 21 types, including a standard, terminology, abbreviation, specified content, applicable scope, preface, introduction, set of data elements, data element, concept of data elements, value domain code, disease, domain, department, publication, responsible institution, proposing institution, drafting institution, etc. At the same time, the attribute of each of the one or more entity types and the types of semantic associative relationships between the one or more entity types may be established. More content about the set of data elements can be found in the related descriptions later.
The attribute of the entity type refers to a characteristic that the entity type has. For example, the attribute of the entity type includes that a data standard and a set of data elements are in an inclusion relationship, the set of data elements and the data elements are in an inclusion relationship, the set of data elements includes a plurality of data elements, or the like.
In some embodiments, the entity types and the attribute of each entity type may be predefined by those skilled in the art based on experience.
In some embodiments, the types of semantic associative relationships between the one or more entity types include: a relationship between data standards, a relationship between a set of data elements and the data elements, a relationship between the data elements and concepts of the data elements, a relationship between the data elements, a relationship between the data elements and value domains of the data elements, a relationship between a dataset standard and a medical scale/questionnaire, and a relationship between the data elements and the medical scale/questionnaire.
In some embodiments, more content about a process for establishing the types of semantic associative relationships between the one or more entity types can be found in the description below entitled “a process for establishing types of semantic associative relationships between one or more entity types”.
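The relationship types listed above can be expressed as a small schema against which candidate triples are checked. The following Python is a sketch only: the type labels and relation names (for example `INCLUDES` and `HAS_VALUE_DOMAIN`) are hypothetical, and the disclosure does not prescribe this representation.

```python
# Each key is a (head entity type, tail entity type) pair taken from the list of
# relationship types above; the relation names are illustrative assumptions.
SCHEMA = {
    ("Standard", "Standard"): "RELATED_STANDARD",
    ("DataElementSet", "DataElement"): "INCLUDES",
    ("DataElement", "DataElementConcept"): "HAS_CONCEPT",
    ("DataElement", "DataElement"): "RELATED_TO",
    ("DataElement", "ValueDomain"): "HAS_VALUE_DOMAIN",
    ("Standard", "MedicalScale"): "USES_SCALE",
    ("DataElement", "MedicalScale"): "ITEM_OF_SCALE",
}


def validate_triple(head_type: str, relation: str, tail_type: str) -> bool:
    """Return True if the relation is the one the schema allows for this type pair."""
    return SCHEMA.get((head_type, tail_type)) == relation


def allowed_tail_types(head_type: str) -> list:
    """List the entity types that a given head type may point to, in sorted order."""
    return sorted(tail for (head, tail) in SCHEMA if head == head_type)
```

Keeping the allowed type pairs in one table makes it possible to reject malformed triples before they are written to the graph store.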
In some embodiments, a processor may construct the knowledge model of the knowledge graph of the standard data element of the biomedical dataset through stepto step.
Step, extracting entity type data and attribute data from structured data and an unstructured text in the structured data.
The structured data refers to a database including a set of data of a specific data type, for example, a medical Hospital Information System (HIS) database. The unstructured text refers to textual content that does not have a fixed format or regularity. The unstructured text includes, for example, emails, news, blogs, and so on, which can exist in text formats such as .pdf and .doc.
The entity type data refers to data related to the entity type, e.g., a preface, an introduction, etc., in the unstructured text. The attribute data may be data related to the attribute of the entity type. For example, since the patient's age has only one value and belongs to a scalar attribute, if the patient's age is recorded in the unstructured text, the patient's age is determined as the attribute data.
In some embodiments, the processor may obtain the structured data and the unstructured text in the structured data by performing optical character recognition (OCR) on relevant standard texts of data elements of different types of biomedical datasets and parsing the relevant standard texts using a natural language processing (NLP) manner.
Optical character recognition (OCR) refers to a technology that scans and recognizes a text on documents and converts the recognized text into a digital text format that may be edited and processed by a computer. The documents may include word documents, pdf documents, or the like.
The natural language processing (NLP) manner enables language interaction between humans and computers, as well as implements text processing, language analysis, text mining, and other tasks.
For more content about how to extract the entity type data and the attribute data from the structured data and the unstructured text in the structured data, please refer to related descriptions of “a specific process for extracting entity type data and attribute data from structured data” and related descriptions of “a specific process for extracting entity type data and attribute data from an unstructured text in structured data”.
Step, obtaining the knowledge graph of the standard data element of the biomedical dataset by performing knowledge fusion on a plurality of types of data based on the types of semantic associative relationships between the one or more entity types.
The knowledge fusion refers to a process for integrating information of a same data element in biomedical datasets from different data sources to obtain more comprehensive information about the data element. More content about how to perform the knowledge fusion on the plurality of types of data can be found in step c1 to step c4 below.
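The general idea of integrating information about a same data element from different sources can be sketched as follows. This Python example is a simplified illustration under stated assumptions: records are matched by a normalized name, the record layout (`name`, `source`, `attributes`) and the function names `normalize` and `fuse` are hypothetical, and the specific fusion steps of the disclosure are not reproduced here.

```python
def normalize(name: str) -> str:
    """Lowercase a name and collapse whitespace (assumed matching key)."""
    return " ".join(name.lower().split())


def fuse(records):
    """Merge records that share a normalized name.

    The first non-conflicting value of each attribute is kept, and every
    contributing source is recorded once, in order of first appearance.
    """
    fused = {}
    for rec in records:
        key = normalize(rec["name"])
        entry = fused.setdefault(key, {"name": key, "sources": [], "attributes": {}})
        if rec["source"] not in entry["sources"]:
            entry["sources"].append(rec["source"])
        for attr, value in rec.get("attributes", {}).items():
            # setdefault keeps an existing value instead of overwriting it.
            entry["attributes"].setdefault(attr, value)
    return fused
```

Matching on a normalized name alone would miss synonymous data elements with different names, such as "gender" and "sex"; handling those cases depends on the semantic relationships between data elements described earlier.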
November 20, 2025