A system and method for visual Bayesian data fusion are disclosed. In an example, a plurality of datasets associated with a topic are obtained from a data lake. Each of the plurality of datasets includes information corresponding to various attributes of the topic. The plurality of datasets are then joined to obtain a joined dataset. A distribution associated with a target attribute is then predicted using Bayesian modeling by: selecting a plurality of attributes (k) based on mutual information with the target attribute in the joined dataset; learning a minimum spanning tree based Bayesian structure using the selected attributes and the target attribute; learning conditional probabilistic tables at each node of the minimum spanning tree based Bayesian structure; and predicting the distribution associated with the target attribute by querying the conditional probabilistic tables, thereby facilitating visual Bayesian data fusion.
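The attribute selection and structure learning steps summarized above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation. It assumes an already joined dataset held as a dictionary of discrete columns, and it assumes that edges are weighted by negated pairwise mutual information so that a minimum spanning tree keeps the most informative edges (the source does not state the edge weight). The names `mutual_information`, `select_top_k`, `spanning_tree`, and the toy columns are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def select_top_k(columns, target, k):
    """Return the k attribute names with the highest MI against the target."""
    scores = {
        name: mutual_information(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def spanning_tree(columns, nodes):
    """Kruskal's algorithm over the feature graph.

    Assumption: edge weight = negated pairwise MI, so the minimum
    spanning tree retains the highest-MI edges.
    """
    edges = sorted(
        (-mutual_information(columns[a], columns[b]), a, b)
        for a, b in combinations(nodes, 2)
    )
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # adding this edge does not create a cycle
            parent[ra] = rb
            tree.append((a, b))
    return tree


# Hypothetical joined dataset: each key is a column, each list its rows.
columns = {
    "region":  ["n", "n", "s", "s", "n", "s", "n", "s"],
    "weather": ["sun", "sun", "rain", "rain", "sun", "rain", "rain", "sun"],
    "noise":   ["a", "b", "a", "b", "a", "a", "b", "b"],
    "sales":   ["hi", "hi", "lo", "lo", "hi", "lo", "hi", "lo"],
}

selected = select_top_k(columns, "sales", k=2)
tree = spanning_tree(columns, selected + ["sales"])
```

The resulting tree is undirected. Orienting its edges, for example by comparing the cross entropy of candidate directed graphs as the claims describe, would be a separate step that this sketch does not cover.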
Legal claims defining the scope of protection, as filed with the USPTO.
1. A processor-implemented method comprising: obtaining, by one or more hardware processors, a plurality of datasets associated with a topic from a data lake, wherein each of the plurality of datasets comprise information corresponding to various attributes of the topic; joining, by the one or more hardware processors, the plurality of datasets to obtain a joined dataset; predicting, by the one or more hardware processors, distribution associated with a target attribute using Bayesian modeling by selecting a plurality of attributes (k) based on mutual information with the target attribute in the joined dataset; learning a minimum spanning tree based Bayesian structure on a feature graph that is created by calculating pairwise mutual information between selected attributes and the target attributes; learning conditional probabilistic tables at each node of the minimum spanning tree based Bayesian structure; predicting the distribution associated with the target attribute by querying the conditional probabilistic tables, thereby facilitating visual Bayesian data fusion; and automatically generating a plurality of tags, based on column headers, to index files in the conditional probabilistic tables.
2. The method of claim 1, wherein the plurality of datasets are joined based on a type of join and wherein the type of join comprises an inner join, an outer join, a left join, and a right join.
3. The method of claim 1, wherein learning a minimum spanning tree based Bayesian structure on the feature graph that is created by calculating the pairwise mutual information between the selected attributes and the target attributes comprises: learning the minimum spanning tree on the plurality of attributes and the target attribute using the pairwise mutual information as a threshold; initializing each edge in the minimum spanning tree to random direction and dropping edge with mutual information less than the threshold; flipping each edge direction to compute 2^(k) directed graphs; calculating the cross entropy of each graph; and selecting a graph with least cross entropy as the minimum spanning tree based Bayesian structure.
4. The method of claim 1, wherein the plurality of attributes and the target attribute comprise discrete and continuous variables.
5. The method of claim 4, wherein learning the conditional probabilistic tables at each node of the minimum spanning tree based Bayesian structure comprises: discretizing the continuous variables by fixed size binning; and learning the conditional probabilistic tables at each node of the minimum spanning tree based Bayesian structure upon discretizing the continuous variables.
6. The method of claim 1, further comprising: computing a confidence score for the distribution associated with the target attribute predicted by querying the conditional probabilistic tables using ideal distribution and probabilistic distribution; predicting distribution associated with the target attribute using textual similarity; and selecting one of a) the distribution associated with the target attribute predicted using the textual similarity and b) the distribution associated with the target attribute predicted by querying the conditional probabilistic tables based on the computed confidence score.
7. A system comprising: one or more memories; and one or more hardware processors, the one or more memories coupled to the one or more hardware processors, wherein the one or more hardware processors are configured to execute programmed instructions stored in the one or more memories to: obtain a plurality of datasets associated with a topic from a data lake, wherein each of the plurality of datasets comprise information corresponding to various attributes of the topic; join the plurality of datasets to obtain a joined dataset; predict distribution associated with a target attribute using Bayesian modeling by selecting a plurality of attributes (k) based on mutual information with the target attribute in the joined dataset; learning a minimum spanning tree based Bayesian structure on a feature graph that is created by calculating pairwise mutual information between selected attributes and the target attributes; learning conditional probabilistic tables at each node of the minimum spanning tree based Bayesian structure; predicting the distribution associated with the target attribute by querying the conditional probabilistic tables, thereby facilitating visual Bayesian data fusion; and automatically generating a plurality of tags, based on column headers, to index files in the conditional probabilistic tables.
8. The system of claim 7, wherein the plurality of datasets are joined based on a type of join and wherein the type of join comprises an inner join, an outer join, a left join, and a right join.
9. The system of claim 7, wherein one or more hardware processors are further configured to execute the programmed instructions to: learn the minimum spanning tree on the plurality of attributes and the target attribute using pairwise mutual information as a threshold; initialize each edge in the minimum spanning tree to random direction and dropping edge with mutual information less than the threshold; flip each edge direction to compute 2^(k) directed graphs; calculate the cross entropy of each graph; and select a graph with least cross entropy as the minimum spanning tree based Bayesian structure.
10. The system of claim 7, wherein the plurality of attributes and the target attribute comprise discrete and continuous variables.
11. The system of claim 10, wherein one or more hardware processors are further configured to execute the programmed instructions to: discretize the continuous variables by fixed size binning; and learn the conditional probabilistic tables at each node of the minimum spanning tree based Bayesian structure upon discretizing the continuous variables.
12. The system of claim 7, wherein one or more hardware processors are further configured to execute the programmed instructions to: compute a confidence score for the distribution associated with the target attribute predicted by querying the conditional probabilistic tables using ideal distribution and probabilistic distribution; predict distribution associated with the target attribute using textual similarity; and select one of a) the distribution associated with the target attribute predicted using the textual similarity and b) the distribution associated with the target attribute predicted by querying the conditional probabilistic tables based on the computed confidence score.
13. A non-transitory computer readable medium embodying a program executable in a computing device, said program comprising: a program code for obtaining a plurality of datasets associated with a topic from a data lake, wherein each of the plurality of datasets comprise information corresponding to various attributes of the topic; a program code for joining the plurality of datasets to obtain a joined dataset; a program code for predicting distribution associated with a target attribute using Bayesian modeling by selecting a plurality of attributes (k) based on mutual information with the target attribute in the joined dataset; learning a minimum spanning tree based Bayesian structure on a feature graph that is created by calculating pairwise mutual information between selected attributes and the target attributes; learning conditional probabilistic tables at each node of the minimum spanning tree based Bayesian structure; predicting the distribution associated with the target attribute by querying the conditional probabilistic tables, thereby facilitating visual Bayesian data fusion; and automatically generating a plurality of tags, based on column headers, to index files in the conditional probabilistic tables.
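As an illustration of the fixed size binning and conditional probabilistic table steps recited in claims 5 and 11, the following minimal sketch discretizes a continuous parent attribute and estimates the table for one child node by counting. It assumes that "fixed size binning" means equal-width bins, which the claims do not spell out, and it handles only a single parent per node. The function names and the toy `price` and `demand` columns are hypothetical.

```python
from collections import Counter, defaultdict


def fixed_width_bins(values, n_bins):
    """Map each continuous value to a bin index using equal-width bins.

    Assumption: "fixed size" is read here as equal width.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant columns
    # Clamp so the maximum value falls in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def learn_cpt(child, parent):
    """Estimate P(child | parent) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for c, p in zip(child, parent):
        counts[p][c] += 1
    return {
        p: {c: n / sum(row.values()) for c, n in row.items()}
        for p, row in counts.items()
    }


# Hypothetical columns from a joined dataset.
price = [1.0, 2.0, 3.0, 4.0, 9.0, 10.0, 11.0, 12.0]
demand = ["hi", "hi", "hi", "lo", "lo", "lo", "lo", "hi"]

price_bins = fixed_width_bins(price, n_bins=2)
cpt = learn_cpt(child=demand, parent=price_bins)

# Querying the table: distribution of demand given the low-price bin.
low_price_distribution = cpt[0]
```

Querying here is a plain dictionary lookup. A node with several parents would need a table keyed by tuples of parent values, and sparse counts would usually call for smoothing.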
March 9, 2017
October 1, 2019