Systems and methods are disclosed for creating a governance graph representing data set interconnection. The data set interconnections may be based on common fields, sources, databases, applications, or patterns of usage. For example, the interconnections may be direct connections, where one data set is directly downstream from another data set. Alternatively, the interconnections may be indirect connections based on patterns showing the data sets are commonly used together. For example, given data sets “A”, “B”, and “C”, if “B” is directly connected to “A” because it is downstream from “A”, and a particular group of users commonly use “B” and “C” together, “A” may be indirectly related to “C” based on the pattern of usage. In this example, the governance graph is configured to indicate the connection between “A” and “B” is stronger than the connection between “A” and “C”, whilst still showing the connection.
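By way of non-limiting illustration, the “A”, “B”, “C” example above may be sketched in Python as follows. The constants `DIRECT_STRENGTH` and `CO_USAGE_STRENGTH`, the function name `build_governance_graph`, and the rule that an inferred strength is the product of the strengths along the connecting path are assumptions made for illustration only and are not taken from the disclosure.

```python
from collections import defaultdict

# Hypothetical strengths (assumptions, not values from the disclosure).
DIRECT_STRENGTH = 1.0      # one data set is directly downstream from another
CO_USAGE_STRENGTH = 0.5    # two data sets are commonly used together


def build_governance_graph(downstream_pairs, co_usage_pairs):
    """Return {frozenset({x, y}): strength} for direct and indirect connections."""
    edges = {}
    direct_neighbors = defaultdict(set)

    # Direct connections: record each upstream/downstream pair.
    for upstream, downstream in downstream_pairs:
        edges[frozenset((upstream, downstream))] = DIRECT_STRENGTH
        direct_neighbors[upstream].add(downstream)
        direct_neighbors[downstream].add(upstream)

    # Indirect connections: co-used pairs, plus anything directly connected
    # to one member of a co-used pair is weakly related to the other member.
    for x, y in co_usage_pairs:
        edges.setdefault(frozenset((x, y)), CO_USAGE_STRENGTH)
        for near, far in ((x, y), (y, x)):
            for other in direct_neighbors[near]:
                if other != far:
                    inferred = DIRECT_STRENGTH * CO_USAGE_STRENGTH
                    edges.setdefault(frozenset((other, far)), inferred)
    return edges


graph = build_governance_graph(
    downstream_pairs=[("A", "B")],
    co_usage_pairs=[("B", "C")],
)
```

In this sketch the “A” to “B” edge keeps the larger strength while the “A” to “C” edge is still present with a smaller one, which mirrors the behavior described above.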
Legal claims defining the scope of protection, as filed with the USPTO.
. A computing system, comprising:
. The system of, wherein the one or more data asset characteristics further include at least one characteristic selected from the group consisting of user interactions, downstream reports, and presence in workflows.
. The system of, wherein the representation further includes edges and labels.
. The system of, wherein the representation further indicates a size of a network node of the nodes, and wherein the lines are between the nodes, a thickness of the lines between the nodes is indicative of a degree of the interconnections between the multiple data assets, a thicker line being representative of a greater degree of connection and a thinner line being representative of a lesser degree of connection, and respective lengths of the lines between the nodes varies for the multiple data assets.
. The system of, wherein the representation comprises a first line between a first node and a second node and a second line between the second node and a third node, wherein the first line has a thickness greater than the second line.
. The system of, wherein the first line thickness that is greater than the second line thickness indicates that the first node is more closely associated with the second node than the second node is with the third node.
. The system of, wherein the one or more data asset characteristics further include either a direct connection or an inferred connection, wherein the determining of the one or more data asset characteristics further includes determining the direct connection or the inferred connection and the comparing of the input or output connections further includes comparing the direct connection or the inferred connection to identify whether there is a common characteristic associated with the direct connection or the inferred connection.
. The system of, wherein the common characteristic comprises a first data asset of the one or more data assets having the common input or output connections being directly associated with a second data asset of the one or more data assets having the common input or output connections, wherein the first data asset and the second data asset are directly related based on common metadata fields.
. The system of, wherein the common characteristic comprises the second data asset of the one or more data assets having the common input or output connections being indirectly associated with a third data asset of the one or more data assets having the common input or output connections, wherein the second data asset and the third data asset are indirectly associated based on the second data asset and the third data asset being commonly used together.
. The system of, wherein changes made to the third data asset affect the first data asset and the second data asset.
. A computer-implemented method, comprising:
. The method of, wherein the representation further indicates a size of a network node of the nodes, and wherein the lines are between the nodes, a thickness of the lines between the nodes is indicative of a degree of the interconnections between the multiple data assets, a thicker line being representative of a greater degree of connection and a thinner line being representative of a lesser degree of connection, and respective lengths of the lines between the nodes varies for the multiple data assets.
. The method of, wherein the representation comprises a first line between a first node and a second node and a second line between the second node and a third node, wherein the first line has a thickness greater than the second line.
. The method of, wherein the first line thickness that is greater than the second line thickness indicates that the first node is more closely associated with the second node than the second node is with the third node.
. The method of, wherein the one or more data asset characteristics further include at least one characteristic selected from the group consisting of user interactions, downstream reports, and presence in workflows.
. The method of, wherein the representation further includes edges and labels.
. A computing system, comprising:
. The system of, wherein first data set is directly related to the second data set based on the at least one common input or output connection to the machine learning models.
. The system of, wherein the executable code, when executed, further causes the processor to:
. The system of, wherein governance graph further includes a depiction of the second connection between the second data set and the third data set.
Complete technical specification and implementation details from the patent document.
This application is a continuation of and claims priority to co-pending U.S. patent application Ser. No. 18/517,106, filed Nov. 22, 2023, entitled, SYSTEMS AND METHODS FOR A DATA ECOSYSTEM, the entire contents of which are hereby expressly incorporated herein by reference.
This invention relates generally to the field of data governance, and more particularly embodiments of the invention relate to systems and methods for creating data governance graphs and implementing them to manage data sets, whether they are user.
The Data Governance Institute defines data governance as “a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods” at https://datagovernance.com/defining-data-governance/. In any organization, as new data sources emerge from various customer touch points, being able to leverage them to create a master customer profile in a unified repository is key towards providing better products and services, and at the same time increasing loyalty, and reducing churn. Organizations would like to leverage the wealth of data created within their enterprise and generated across their network, for operational and commercial use cases. Using this data as part of the digital transformation program enables better customer satisfaction and promotes sales of existing and emerging products through enhanced merchandising of goods and services. This type of initiative requires creating master records using a Master Data Management (MDM) approach. It is the goal of any MDM solution to enable organizations and their partners to both identify and know their customers and products better in order to: provide better customer service; make better bespoke decisions for customers; identify further opportunities for ancillary sales; and identify customer preferred interactions and touch points.
Machine learning techniques help integrate customer data silos even in the absence of unique identifiers from various operational systems. Such systems can use probabilistic matching for record linkage, data clustering and classification techniques along with reinforcement learning for automation on scale out platforms to add significant value to how data can be leveraged as an asset. Delivering MDM functionality can be done on a big data scale by various unified data governance platforms. These platforms provide a Spark-based scale out implementation for matching, linking and mastering, with support for pluggable machine learning libraries that will enable end users to master customer, product and additional data domains using a set of consistent processes and methodologies. The model is flexible based on an organization's business requirements and does not require a specific type of data model for the data entities to be mastered. Spark-based machine learning has several advantages over traditional data matching. It matches all types of data domains, it has “live” training that provides unlimited flexibility, and it scales to volumes that weren't previously attainable. The end result is an agile master data management capability.
A key component of MDM is the classification of datasets to enable users to locate stored data relevant to a work task. Various known algorithms have been used in the classification process. Some of the commonly used types of classification algorithms are described below with advantages and disadvantages.
Logistic Regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function. An advantage of logistic regression is that it is designed for classification purposes and is most useful for understanding the influence of several independent variables on a single outcome variable. Disadvantages are that it works only when the predicted variable is binary, assumes all predictors are independent of each other and assumes data is free of missing values.
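As a minimal sketch of the logistic function referred to above, the following Python fragment maps a weighted sum of predictor values to a probability between 0 and 1. The function names and the example weights are hypothetical and are shown only to illustrate the calculation; no fitting procedure is shown.

```python
import math


def logistic(z):
    """Standard logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability of the positive class for one observation.

    The weights and bias are assumed to have been fitted elsewhere.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical coefficients for two predictors.
probability = predict_probability(weights=[0.8, -0.4], bias=0.1, features=[2.0, 1.0])
is_positive = probability >= 0.5  # binary decision at a 0.5 threshold
```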
The Naive Bayes algorithm is based on Bayes' theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering. Advantages of this algorithm are that it requires only a small amount of training data to estimate the necessary parameters and that Naive Bayes classifiers are extremely fast compared to more sophisticated methods. A disadvantage is that Naive Bayes is known to be a poor estimator of class probabilities.
Stochastic Gradient Descent is a simple and very efficient approach to fit linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification. Advantages are efficiency and ease of implementation. Disadvantages are that it requires a number of hyper-parameters and it is sensitive to feature scaling.
K-Nearest Neighbors classification is a type of lazy learning as it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the K nearest neighbors of each point. Advantages are that this algorithm is simple to implement, robust to noisy training data, and effective if training data is large. Disadvantages are the need to determine the value of K and the computation cost is high as it needs to compute the distance of each instance to all the training samples.
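The majority vote described above may be sketched as follows. This is an illustrative, brute-force version under simple assumptions (Euclidean distance, points given as numeric tuples); the function name `knn_classify` and the sample data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training, point, k=3):
    """Classify `point` by a simple majority vote of its k nearest neighbors.

    `training` is a list of (coordinates, label) pairs. No model is built:
    the stored instances are scanned at prediction time, which is why the
    computation cost grows with the number of training samples.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


training_data = [
    ((0.0, 0.0), "a"),
    ((0.0, 1.0), "a"),
    ((5.0, 5.0), "b"),
    ((6.0, 5.0), "b"),
    ((5.0, 6.0), "b"),
]
label = knn_classify(training_data, (0.2, 0.1), k=3)
```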
A Decision Tree, given data consisting of attributes together with their classes, produces a sequence of rules that can be used to classify the data. Advantages are that it is simple to understand and visualize, requires little data preparation, and can handle both numerical and categorical data. Disadvantages are that it can create complex trees that do not generalize well, and decision trees can be unstable because small variations in the data might result in a completely different tree being generated.
The Random Forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of datasets and uses averaging to improve the predictive accuracy of the model and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement. Advantages are a reduction in over-fitting and, in most cases, greater accuracy than a single decision tree. Disadvantages are slow real-time prediction, difficulty of implementation, and algorithmic complexity.
The Support Vector Machine is a representation of the training data as points in space separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. Advantages are that it is effective in high dimensional spaces and uses a subset of training points in the decision function, so it is also memory efficient. A disadvantage is that the algorithm does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
The invention described herein provides a dataset classification method that improves the classification accuracy as compared to known dataset classification methods.
Owners and managers of data assets in an organization are required to properly control how data is handled and ensure that certain data governance policies, practices, and processes are in place to manage their data effectively and ensure its quality, security, and compliance. Key components of proper data governance include data ownership, data quality, data security, data privacy, data cataloging, data lifecycle management, data access and permissions, data governance compliance and auditing, data documentation, data stewardship, data governance framework, and data training and awareness. Thus, proper data governance allows organizations to maximize the value of their data assets, minimize risks such as data misuse, and enhance decision-making processes by using the data to better understand the needs of the organization or its customers. Although data or digital assets have inherent traits such as metadata and data fields, without a proper visual or structural representation of the interconnectedness of the data, a data steward may spend unnecessary time and resources trying to understand the data. Thus, a need exists for improved systems and methods for managing and governing digital assets that are streamlined and automated to address these shortcomings.
Shortcomings of the prior art are overcome and additional advantages are provided through the provision of systems and methods for implementing representations of digital assets and their connectedness through the use of a data governance graph based on various similarities in the data to improve efficiency, security, and overall data management.
A computer-implemented method for creating a representation of interconnections between data sets includes: (1) receiving a plurality of data sets from a plurality of sources using a computer, the data sets including a plurality of traits; (2) storing the plurality of data sets into a data catalog; (3) determining at least one common trait for a first data set and a second data set from the plurality of data sets; (4) generating a representation of a first interconnection between the first data set and the second data set based on the at least one common trait, wherein the representation of the first interconnection comprises a first value; (5) determining at least one common trait for the second data set and a third data set from the plurality of data sets; (6) generating a representation of a second interconnection between the second data set and the third data set based on the at least one common trait, wherein the representation of the second interconnection comprises a second value; and (7) displaying, via a graphical user interface, a governance graph comprising the first interconnection and the second interconnection.
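Steps (1) through (6) above may be sketched as follows, leaving the display step aside. The sketch assumes each data set is summarized by a set of trait names and, purely as an illustrative choice, uses the Jaccard similarity of two trait sets as the interconnection value; the disclosure does not prescribe a particular formula. The names `build_interconnections` and `received` are hypothetical.

```python
from itertools import combinations


def build_interconnections(data_sets):
    """Map each pair of data sets sharing a trait to an interconnection value.

    `data_sets` is {name: set of trait names}. The value is the Jaccard
    similarity of the two trait sets (an assumption for illustration).
    """
    catalog = dict(data_sets)  # store the received data sets in a data catalog
    interconnections = {}
    for first, second in combinations(sorted(catalog), 2):
        common = catalog[first] & catalog[second]
        if common:  # at least one common trait
            union = catalog[first] | catalog[second]
            interconnections[(first, second)] = len(common) / len(union)
    return interconnections


received = {
    "first": {"customer_id", "region", "date"},
    "second": {"customer_id", "region", "amount"},
    "third": {"amount", "sku"},
}
values = build_interconnections(received)
```

Here the first and second data sets share two traits and receive a larger value than the second and third data sets, which share one; the first and third share none and therefore receive no interconnection.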
For the computer-implemented method, at least one of the first interconnection and the second interconnection includes at least one of an interconnection between data sets, data policies, data procedures, and data usage patterns.
The method further includes the at least one of the first interconnection being based on data usage patterns.
The method further includes generating and displaying via graphical user interface, a recommendation of one or more additional data sets having at least one common trait with at least one of the first data set, the second data set, and the third data set.
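One way the recommendation step might be approached is sketched below. It assumes the catalog maps each data set name to a set of trait names and ranks candidates by how many traits they share with the data sets already shown; the ranking rule and the function name `recommend_data_sets` are illustrative assumptions.

```python
def recommend_data_sets(catalog, displayed):
    """Return names of additional data sets sharing a trait with those displayed.

    `catalog` is {name: set of traits}; `displayed` lists names already on
    the governance graph. Results are ordered by number of shared traits,
    most first, with ties broken alphabetically.
    """
    displayed_traits = set()
    for name in displayed:
        displayed_traits |= catalog[name]

    scored = []
    for name, traits in catalog.items():
        shared = len(traits & displayed_traits)
        if name not in displayed and shared > 0:
            scored.append((-shared, name))
    return [name for _, name in sorted(scored)]


example_catalog = {
    "orders": {"customer_id", "sku"},
    "customers": {"customer_id", "email"},
    "inventory": {"sku", "warehouse"},
    "marketing": {"email", "customer_id", "campaign"},
    "hr": {"employee_id"},
}
suggestions = recommend_data_sets(example_catalog, displayed=["orders", "customers"])
```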
For the interconnection value, the first value represents a stronger connection than the second value. Likewise, the first value may be indicated by a short line on the displayed governance graph and the second value may be indicated by a long line on the displayed governance graph.
The method further includes determining at least one common trait for the first data set and the third data set from the plurality of data sets, generating a representation of a third interconnection between the first data set and the third data set based on the at least one common trait, wherein the representation of the third interconnection includes a third value, and displaying via the graphical user interface, the governance graph comprising the first interconnection, the second interconnection, and the third interconnection.
The computing system for creating a representation of associations between data sets, the system including at least one processor, a communication interface communicatively coupled to the at least one processor, and a memory device storing executable code that, when executed, causes the at least one processor to, in part, receive a plurality of data assets from a plurality of data sources, the data assets including a plurality of data sets having one or more characteristics. The system stores the plurality of data assets into a data catalog and determines at least one common characteristic for a first data asset and a second data asset from the plurality of data assets. The system then generates a representation of a first interconnection between the first data asset and the second data asset based on the at least one common characteristic and displays, via a graphical user interface, a governance graph comprising the first interconnection.
In various embodiments, the processor is further caused to determine at least one common characteristic for the second data asset and the third data asset based on the at least one common characteristic; and display, via a graphical user interface, the governance graph including the first interconnection and the second interconnection.
In particular embodiments, the system determines at least one common characteristic for the first data asset and the third data asset from the plurality of data sets, generates a representation of a third interconnection between the first data asset and the third data asset based on the at least one common characteristic, and displays, via the GUI, the governance graph including the first interconnection, the second interconnection, and the third interconnection.
In some embodiments, the system determines a connection value for the first interconnection, the second interconnection, and the third interconnection. For example, the connection value for the first interconnection may be indicative of a stronger connection than the connection value for the third interconnection. In some embodiments, at least one of the first interconnection and the second interconnection includes at least one of an interconnection between data sets, data policies, data procedures, and data usage patterns.
Additionally, disclosed herein is a system for creating a representation of interconnections between data sets, the system including at least one processor, a communication interface communicatively coupled to the at least one processor, and a memory device storing executable code that, when executed, causes the at least one processor to, in part, receive a plurality of data sets from a plurality of data sources; determine one or more data set characteristics for each of the plurality of data sets, wherein the one or more data set characteristics comprise a data asset; and compare the one or more data set characteristics to determine one or more data sets having common characteristics. In response to determining one or more data sets having common characteristics, the system generates a representation of one or more interconnections between the data sets.
In various embodiments, the system is further configured to display, via a graphical user interface, a governance graph comprising a representation of the one or more interconnections between the data sets.
In some embodiments, the one or more interconnections comprise at least one of an interconnection between data sets, data policies, data procedures, and data usage patterns. When determining one or more data sets having common characteristics, the system determines at least one of a common field, a common usage, a common source, a common database, a common generating application, and a common pattern of usage.
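The comparison of characteristics described above may be sketched as follows. The record type `DataSetProfile`, its attribute names, and the labels returned are hypothetical; only field, source, database, and generating application are compared here, and usage-based characteristics are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetProfile:
    """Hypothetical summary of one data set's characteristics."""
    name: str
    fields: frozenset = frozenset()
    source: str = ""
    database: str = ""
    generating_application: str = ""


def common_characteristics(a, b):
    """List which kinds of characteristic two data sets have in common."""
    found = []
    if a.fields & b.fields:
        found.append("common field")
    for attribute, label in (
        ("source", "common source"),
        ("database", "common database"),
        ("generating_application", "common generating application"),
    ):
        value = getattr(a, attribute)
        # Empty strings mean "unknown" and never count as a match.
        if value and value == getattr(b, attribute):
            found.append(label)
    return found


orders = DataSetProfile(
    "orders", frozenset({"customer_id", "sku"}), source="web_store", database="sales_db"
)
invoices = DataSetProfile(
    "invoices", frozenset({"customer_id", "total"}), source="billing", database="sales_db"
)
shared = common_characteristics(orders, invoices)
```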
According to example embodiments, a system is disclosed herein for data asset access governance, the system including at least one processor, a communication interface communicatively coupled to the at least one processor, and a memory device storing executable code that, when executed, causes the at least one processor to, receive a plurality of data assets from a plurality of data sources and determine one or more data asset characteristics for each of the plurality of data assets. The system compares the one or more data asset characteristics to determine one or more data assets having common characteristics. In response to determining one or more data assets having common characteristics, the system generates a representation of one or more interconnections between the data assets having common characteristics, wherein the representation comprises a governance graph. After receiving, via a user device, a request to access the governance graph, the system displays, via a graphical user interface associated with the user device, the governance graph. The system receives, via the user device, a user selection of at least one data asset of the one or more data assets and, in response, displays via the graphical user interface, the at least one data asset and at least one additional data asset determined to have common characteristics with the at least one data asset selected by the user.
Various embodiments disclose a computer-implemented method for data governance implementation using a governance graph, where the method includes: (1) receiving a plurality of data assets from a plurality of data sources; (2) determining one or more data asset characteristics for each of the plurality of data assets; (3) comparing the one or more data asset characteristics to determine one or more data assets having common characteristics; (4) in response to determining one or more data assets having common characteristics, generating a representation of one or more interconnections between the data assets having common characteristics, wherein the representation comprises a governance graph; (5) receiving, via a user device, a request to access the governance graph; (6) displaying, via a graphical user interface associated with the user device, the governance graph; (7) receiving, via the user device, a user selection of at least one data asset of the one or more data assets; (8) in response to receiving the user selection of the at least one data asset of the one or more data assets, displaying, via the graphical user interface, the at least one data asset and at least one additional data asset determined to have common characteristics with the at least one data asset selected by the user.
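Steps (7) and (8) above, in which a user selection leads to the display of additional related data assets, may be sketched as a lookup over an already-built graph. The nested-dictionary representation, the numeric strengths, and the function name `related_assets` are assumptions for illustration.

```python
def related_assets(graph, selected):
    """Return the assets connected to `selected`, strongest connection first.

    `graph` is {asset: {neighbor: strength}}. An asset that is absent from
    the graph simply has no related assets.
    """
    neighbors = graph.get(selected, {})
    return sorted(neighbors, key=lambda name: (-neighbors[name], name))


example_graph = {
    "orders": {"customers": 0.8, "returns": 0.3},
    "customers": {"orders": 0.8},
    "returns": {"orders": 0.3},
}
to_display = ["orders"] + related_assets(example_graph, "orders")
```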
In various embodiments, the representation of the one or more interconnections between the data assets having common characteristics displayed on the governance graph comprises a size of a network node, a line between nodes, a thickness of lines between the data assets, and a length of lines between the data assets.
In particular embodiments, a system is disclosed herein for creating a representation of interconnections between data sets, the system including at least one processor, a communication interface communicatively coupled to the at least one processor, and a memory device storing executable code that, when executed, causes the at least one processor to receive a plurality of data sets from a plurality of sources using a computer, the data sets including a plurality of fields, and store the plurality of data sets into a data catalog. The system determines at least one common field for a first data set and a second data set from the plurality of data sets and generates a representation of a first connection between the first data set and the second data set based on the at least one common field. In response to receiving, via a user device, a request to access the first data set, the system displays, via a graphical user interface, a governance graph depicting the first data set, the second data set, and the first connection between the first data set and the second data set.
According to example embodiments, a system is disclosed herein for creating user-specific representations of associations between data assets, the system including at least one processor, a communication interface communicatively coupled to the at least one processor, and a memory device storing executable code that, when executed, causes the at least one processor to determine that a first user has accessed, via a first user device, a user profile associated with an entity. The system is further configured to receive a plurality of data assets from a plurality of sources, each of the plurality of data assets having one or more characteristics. The system stores the plurality of data assets into a data catalog and compares the one or more data asset characteristics to determine one or more data assets having common characteristics, wherein the one or more common characteristics indicate a connection between the one or more data assets. In response to determining one or more data assets having common characteristics, the system generates a governance graph depicting the one or more connections between the one or more data assets and displays, via a graphical user interface associated with the first user device, the governance graph depicting the one or more connections. The system then determines that a second user has accessed, via a second user device, an executive-level profile associated with the entity. The system generates an executive-level governance graph depicting one or more factors associated with the plurality of data assets and displays, via a graphical user interface associated with the second user device, the executive-level governance graph depicting the one or more factors associated with the plurality of data assets.
In example embodiments, the first user has a first level of security clearance associated with the entity and the second user has a second level of security clearance associated with the entity, wherein the second level of security clearance is higher than the first level of security clearance.
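A simple way to picture the profile-dependent behavior described above is a function that chooses which graph to return based on the clearance level of the accessing profile. The `Clearance` enumeration, its two levels, and the shape of the returned dictionaries are hypothetical and shown for illustration only.

```python
from enum import IntEnum


class Clearance(IntEnum):
    """Hypothetical security clearance levels; a higher value means higher clearance."""
    STANDARD = 1
    EXECUTIVE = 2


def graph_view_for(clearance, connections, factors):
    """Choose which governance graph to present for the accessing profile.

    A standard user profile receives the graph of connections between data
    assets; an executive-level profile receives the graph of factors
    associated with the data assets.
    """
    if clearance >= Clearance.EXECUTIVE:
        return {"view": "executive", "factors": factors}
    return {"view": "standard", "connections": connections}


connections = {("orders", "customers"): 0.8}
factors = {"orders": {"usage": "high", "risk": "low"}}
first_user_view = graph_view_for(Clearance.STANDARD, connections, factors)
second_user_view = graph_view_for(Clearance.EXECUTIVE, connections, factors)
```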
In particular embodiments, the executive-level governance graph depicts one or more usage patterns as nodes, wherein a first node having a first size depicts a first level of usage and a second node having a second size depicts a second level of usage, wherein the first node is larger than the second node, and wherein the larger node depicts a higher or greater level of usage of a data set.
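The relationship between usage level and node size may be illustrated with a small mapping function. The square-root scaling, the default constants, and the name `node_radius` are assumptions chosen so that heavily used data sets appear larger without dominating the display.

```python
import math


def node_radius(usage_count, base=10.0, scale=5.0):
    """Radius for a node: larger for a higher level of usage.

    Square-root scaling is an illustrative choice, not part of the disclosure.
    """
    if usage_count < 0:
        raise ValueError("usage_count must be non-negative")
    return base + scale * math.sqrt(usage_count)


first_node = node_radius(100)   # heavily used data set
second_node = node_radius(4)    # lightly used data set
```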
In various embodiments, the one or more factors associated with the plurality of data assets comprises one or more relationships between each of the plurality of data assets, and wherein the one or more relationships are depicted as lines. In particular embodiments, the length of each of the lines depicting the one or more relationships between the plurality of data assets indicates the degree of connection between each of the plurality of data assets, wherein a longer line indicates a lesser degree of connection and a shorter line indicates a greater degree of connection between the data sets. Similarly, the thickness of each of the lines may also be used to indicate the degree of connection, where a thinner line indicates a lesser degree of connection and a thicker line indicates a greater degree of connection between the assets.
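The mapping from degree of connection to line length and thickness may be sketched as a linear interpolation. The sketch assumes the degree of connection has been normalized to a value between 0 and 1; the pixel ranges and the function name `edge_style` are hypothetical.

```python
def edge_style(strength, min_width=1.0, max_width=6.0, min_length=40.0, max_length=200.0):
    """Map a degree of connection in [0, 1] to line thickness and length.

    A greater degree of connection yields a thicker and shorter line; a
    lesser degree yields a thinner and longer line.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    width = min_width + strength * (max_width - min_width)
    length = max_length - strength * (max_length - min_length)
    return {"width": width, "length": length}


strong = edge_style(0.9)
weak = edge_style(0.2)
```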
In example embodiments, the executive-level governance graph depicts a lineage of the plurality of data assets, wherein the lineage of the plurality of data assets tracks flows and changes of the plurality of data assets over time.
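Lineage tracking of the kind described above may be sketched by recording each flow as a dated event and walking upstream from a chosen data asset. The event tuple layout `(date, source, target, change)` and the function name `lineage_of` are assumptions for illustration.

```python
from datetime import date


def lineage_of(events, asset):
    """Return, in date order, every recorded flow that feeds into `asset`.

    `events` is a list of (date, source, target, change) tuples. The walk
    follows sources upstream and visits each asset once, so cycles in the
    recorded flows do not cause an endless loop.
    """
    relevant, frontier, seen = [], [asset], {asset}
    while frontier:
        current = frontier.pop()
        for event in events:
            _, source, target, _ = event
            if target == current:
                relevant.append(event)
                if source not in seen:
                    seen.add(source)
                    frontier.append(source)
    return sorted(relevant)


flow_events = [
    (date(2024, 3, 1), "staging", "report", "aggregated"),
    (date(2024, 1, 1), "raw", "staging", "cleaned"),
    (date(2024, 2, 1), "other", "unrelated", "copied"),
]
report_lineage = lineage_of(flow_events, "report")
```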
In some embodiments, the system may be accessed by a data steward to clean up the data sets by eliminating any duplicate data assets stored in the data catalog.
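One possible approach to identifying duplicate data assets is to fingerprint each asset's content and keep only the first asset seen for each fingerprint. The sketch below assumes each asset is a list of row tuples and treats two assets as duplicates when they contain the same rows in any order; that definition of "duplicate" and the function name `deduplicate` are illustrative assumptions.

```python
import hashlib


def deduplicate(catalog):
    """Split a catalog into kept assets and duplicates of an earlier asset.

    `catalog` is {name: list of row tuples}. Returns (kept, duplicates),
    where `duplicates` maps each removed name to the name it duplicates.
    """
    first_seen = {}
    kept = {}
    duplicates = {}
    for name, rows in catalog.items():
        # Sorting makes the fingerprint independent of row order.
        canonical = repr(sorted(rows)).encode("utf-8")
        fingerprint = hashlib.sha256(canonical).hexdigest()
        if fingerprint in first_seen:
            duplicates[name] = first_seen[fingerprint]
        else:
            first_seen[fingerprint] = name
            kept[name] = rows
    return kept, duplicates


stored = {
    "sales": [(1, "a"), (2, "b")],
    "sales_copy": [(2, "b"), (1, "a")],
    "returns": [(3, "c")],
}
kept_assets, removed = deduplicate(stored)
```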
According to example embodiments, a computer-implemented method for creating user-specific representations of associations between data assets is disclosed including the steps of: (1) determining that a first user has accessed, via a first user device, a user profile associated with an entity; (2) receiving a plurality of data assets from a plurality of sources, the data assets having one or more characteristics; (3) storing the plurality of data assets into a data catalog; (4) determining one or more data asset characteristics for each of the plurality of data assets; (5) comparing the one or more data asset characteristics to determine one or more data assets having common characteristics, wherein the one or more common characteristics indicate a connection between the one or more data assets; (6) in response to determining one or more data assets having common characteristics, generating a governance graph depicting the one or more connections between the one or more data assets; (7) displaying, via a graphical user interface associated with the first user device, the governance graph depicting the one or more connections; (8) determining that a second user has accessed, via a second user device, an executive-level profile associated with the entity; (9) generating an executive-level governance graph depicting one or more factors associated with the plurality of data assets; and (10) displaying, via a graphical user interface associated with the second user device, the executive-level governance graph depicting the one or more factors associated with the plurality of data assets.
In example embodiments, the executive-level governance graph depiction of the one or more factors associated with the plurality of data assets is at least one of a pattern of usage, a level of risk, and a degree of confidentiality for each of the data assets of the plurality of data assets.
In various embodiments, the one or more usage patterns of the plurality of data assets includes at least one of compliance with a governance policy associated with the entity and non-compliance with the governance policy associated with the entity.
In some embodiments, the one or more data assets have common characteristics that are direct connections and some data assets have indirect common characteristics.
According to various embodiments, a computer-implemented method for creating a user-specific representation of associations between data assets includes the steps of: (1) receiving a plurality of data assets from a plurality of sources, the data assets having one or more characteristics; (2) storing the plurality of data assets into a data catalog; (3) determining that a user has accessed, via a user device, an executive-level profile associated with an entity; (4) generating an executive-level governance graph depicting one or more factors associated with the plurality of data assets; and (5) displaying, via a graphical user interface associated with the user device, the executive-level governance graph depicting the one or more factors associated with the plurality of data assets.
The features, functions, and advantages that have been described herein may be achieved independently in various embodiments of the present invention including computer-implemented methods, computer program products, and computing systems or may be combined in yet other embodiments, further details of which can be seen with reference to the following description and drawings.
Embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like numbers refer to like elements throughout. Unless described or implied as exclusive alternatives, features throughout the drawings and descriptions should be taken as cumulative, such that features expressly associated with some particular embodiments can be combined with other embodiments. Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood to one of ordinary skill in the art to which the presently disclosed subject matter pertains.
The exemplary embodiments are provided so that this disclosure will be both thorough and complete, and will fully convey the scope of the invention and enable one of ordinary skill in the art to make, use, and practice the invention.
The terms “coupled,” “fixed,” “attached to,” “communicatively coupled to,” “operatively coupled to,” and the like refer to both (i) direct connecting, coupling, fixing, attaching, communicatively coupling; and (ii) indirect connecting, coupling, fixing, attaching, communicatively coupling via one or more intermediate components or features, unless otherwise specified herein. “Communicatively coupled to” and “operatively coupled to” can refer to physically and/or electrically related components.
Embodiments of the present invention described herein, with reference to flowchart illustrations and/or block diagrams of methods or apparatuses (the term “apparatus” includes systems and computer program products), will be understood such that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a particular machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create mechanisms for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instructions, which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions, which execute on the computer or other programmable apparatus, provide steps for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. Alternatively, computer program implemented steps or acts may be combined with operator or human implemented steps or acts in order to carry out an embodiment of the invention.
November 20, 2025