Embodiments are directed towards generating a representative sampling as a subset of a larger dataset that includes unstructured data. A graphical user interface enables a user to provide data selection parameters, including a data source and one or more desired subset types: latest records, earliest records, diverse records, outlier records, and/or random records. Diverse and/or outlier subset types may be obtained by generating clusters from an initial selection of records obtained from the larger dataset. An iteration analysis determines whether the number of clusters and/or cluster types generated exceeds at least one threshold; when it does not, additional clustering is performed on additional records. From the resultant clusters and/or other subtype results, a subset of records is obtained as the representative sampling subset.
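The iteration analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the greedy Jaccard-similarity clustering, the digit masking, the batch-doubling policy, and every name and default here (`tokens`, `cluster`, `cluster_until_sufficient`, `sample`, `min_clusters`, `threshold`) are assumptions chosen only to make the idea concrete.

```python
import re


def tokens(event):
    # Assumption: mask digits so events that differ only in numbers look alike.
    return frozenset(re.sub(r"\d+", "#", event).split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(events, threshold=0.5):
    """Greedy single-pass clustering: an event joins the first cluster whose
    exemplar is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (exemplar_tokens, member_events)
    for event in events:
        toks = tokens(event)
        for exemplar, members in clusters:
            if jaccard(toks, exemplar) >= threshold:
                members.append(event)
                break
        else:
            clusters.append((toks, [event]))
    return [members for _, members in clusters]


def cluster_until_sufficient(records, initial=4, min_clusters=3):
    """Cluster an initial selection; while too few clusters result and more
    records remain, cluster a larger selection (hypothetical doubling policy)."""
    size = initial
    while True:
        clusters = cluster(records[:size])
        if len(clusters) >= min_clusters or size >= len(records):
            return clusters
        size *= 2


def sample(clusters, count, kind="diverse"):
    # diverse: one record from each of the most populous clusters;
    # outlier: one record from each of the least populous clusters.
    ordered = sorted(clusters, key=len, reverse=(kind == "diverse"))
    return [members[0] for members in ordered[:count]]


records = [
    "GET /index.html 200 12ms",
    "GET /index.html 200 15ms",
    "GET /index.html 200 9ms",
    "GET /index.html 200 11ms",
    "ERROR disk full on volume 3",
    "GET /index.html 200 10ms",
    "user alice logged in",
    "ERROR disk full on volume 7",
]
clusters = cluster_until_sufficient(records)
print(sample(clusters, 2, "diverse"))
print(sample(clusters, 1, "outlier"))
```

In this sketch the first four records form a single cluster, which falls short of the assumed minimum of three, so the selection is enlarged and clustered again before any records are drawn.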
Legal claims defining the scope of protection, as filed with the USPTO.
1. A computer implemented method for managing selection of a representative data subset, comprising: receiving, from a user via a graphical user interface, selections of: (i) a data source type from which to generate the representative data subset, (ii) one or a combination of subset types, of a plurality of defined event subset types, for identifying events to include in the subset, and (iii) a number of desired representative events to be included in the subset; retrieving events from the selected data source according to the received selection of subset type; clustering to identify similarities between the retrieved events to determine whether the particular events can be characterized as forming a group; extracting from the retrieved, clustered events a number of events corresponding to the user-selected number of desired representative events, wherein the events are extracted based on a field-extraction rule that specifies how to extract values from raw machine data included in each of the one or more events; and causing display of the subset of representative events in the graphical user interface.
2. The method of claim 1, wherein clustering further comprises placing events into a same cluster based on similarities in the machine data in each of the events.
3. The method of claim 1, wherein extracting further comprises selecting events from one or more populous clusters.
4. The method of claim 1, wherein the plurality of defined event subset types corresponds to a plurality of subtype processes that include one or more of a diverse event-identification process, an outlier event-identification process, a random event identification process, an earlier event-identification process, or a later event-identification process.
5. The method of claim 1, wherein clustering further comprises: clustering a group of events in the plurality of events to form a plurality of clusters; determining that a number of clusters in the plurality of clusters is not of a sufficiently large number; and clustering a larger group of events in the plurality of events than the group of events.
6. The method of claim 1, wherein each event in the plurality of events is associated with a time stamp.
7. The method of claim 1, wherein each event in the plurality of events is associated with a time stamp that has been extracted from the portion of raw machine data in that event.
8. The method of claim 1, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify outlier events.
9. The method of claim 1, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with earliest events in the plurality of events.
10. The method of claim 1, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with latest events in the plurality of events.
11. A non-transitory, computer-readable storage medium storing instructions, an execution of which in a computer system causes the computer system to perform operations comprising: receiving, from a user via a graphical user interface, selections of: (i) a data source type from which to generate the representative data subset, (ii) one or a combination of subset types, of a plurality of defined event subset types, for identifying events to include in the subset, and (iii) a number of desired representative events to be included in the subset; retrieving events from the selected data source according to the received selection of subset type; clustering to identify similarities between the retrieved events to determine whether the particular events can be characterized as forming a group; extracting from the retrieved, clustered events a number of events corresponding to the user-selected number of desired representative events, wherein the events are extracted based on a field-extraction rule that specifies how to extract values from raw machine data included in each of the one or more events; and causing display of the subset of representative events in the graphical user interface.
12. The computer-readable storage medium of claim 11, wherein clustering further comprises placing events into a same cluster based on similarities in the machine data in each of the events.
13. The computer-readable storage medium of claim 11, wherein extracting further comprises selecting events from one or more populous clusters.
14. The computer-readable storage medium of claim 11, wherein the plurality of defined event subset types corresponds to a plurality of subtype processes that include one or more of a diverse event-identification process, an outlier event-identification process, a random event identification process, an earlier event-identification process, or a later event-identification process.
15. The computer-readable storage medium of claim 11, wherein clustering further comprises: clustering a group of events in the plurality of events to form a plurality of clusters; determining that a number of clusters in the plurality of clusters is not of a sufficiently large number; and clustering a larger group of events in the plurality of events than the group of events.
16. The computer-readable storage medium of claim 11, wherein each event in the plurality of events is associated with a time stamp.
17. The computer-readable storage medium of claim 11, wherein each event in the plurality of events is associated with a time stamp that has been extracted from the portion of raw machine data in that event.
18. The computer-readable storage medium of claim 11, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify outlier events.
19. The computer-readable storage medium of claim 11, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with earliest events in the plurality of events.
20. The computer-readable storage medium of claim 11, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with latest events in the plurality of events.
21. A computer system comprising: computer memory for storing machine data; and a processor for: receiving, from a user via a graphical user interface, selections of: (i) a data source type from which to generate the representative data subset, (ii) one or a combination of subset types, of a plurality of defined event subset types, for identifying events to include in the subset, and (iii) a number of desired representative events to be included in the subset; retrieving events from the selected data source according to the received selection of subset type; clustering to identify similarities between the retrieved events to determine whether the particular events can be characterized as forming a group; extracting from the retrieved, clustered events a number of events corresponding to the user-selected number of desired representative events, wherein the events are extracted based on a field-extraction rule that specifies how to extract values from raw machine data included in each of the one or more events; and causing display of the subset of representative events in the graphical user interface.
22. The computer system of claim 21, wherein clustering further comprises placing events into a same cluster based on similarities in the machine data in each of the events.
23. The computer system of claim 21, wherein extracting further comprises selecting events from one or more populous clusters.
24. The computer system of claim 21, wherein the plurality of defined event subset types corresponds to a plurality of subtype processes that include one or more of a diverse event-identification process, an outlier event-identification process, a random event identification process, an earlier event-identification process, or a later event-identification process.
25. The computer system of claim 21, wherein clustering further comprises: clustering a group of events in the plurality of events to form a plurality of clusters; determining that a number of clusters in the plurality of clusters is not of a sufficiently large number; and clustering a larger group of events in the plurality of events than the group of events.
26. The computer system of claim 21, wherein each event in the plurality of events is associated with a time stamp.
27. The computer system of claim 21, wherein each event in the plurality of events is associated with a time stamp that has been extracted from the portion of raw machine data in that event.
28. The computer system of claim 21, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify outlier events.
29. The computer system of claim 21, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with earliest events in the plurality of events.
30. The computer system of claim 21, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with latest events in the plurality of events.
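The retrieval by subset type and the field-extraction rule recited in the claims can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expression `RULE`, the event layout (ISO timestamp, level, message), the sample events, and the helper names `extract_fields`, `event_time`, and `select_subset` are all hypothetical and do not come from the patent.

```python
import random
import re
from datetime import datetime

# Hypothetical field-extraction rule: a regex with named groups that says
# how to pull values out of the raw machine data in each event.
RULE = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def extract_fields(raw):
    """Apply the rule to one raw event; return {} if it does not match."""
    match = RULE.match(raw)
    return match.groupdict() if match else {}


def event_time(raw):
    # Time stamp extracted from the raw data of the event itself.
    # Assumes every event matches RULE; a non-matching event raises KeyError.
    return datetime.fromisoformat(extract_fields(raw)["timestamp"])


def select_subset(events, subset_type, count, seed=0):
    """Return `count` events for the requested subset type."""
    if subset_type == "latest":
        return sorted(events, key=event_time, reverse=True)[:count]
    if subset_type == "earliest":
        return sorted(events, key=event_time)[:count]
    if subset_type == "random":
        return random.Random(seed).sample(events, min(count, len(events)))
    raise ValueError(f"unsupported subset type: {subset_type}")


raw_events = [
    "2024-01-01T10:00:00 INFO service started",
    "2024-01-01T10:05:00 ERROR disk full",
    "2024-01-01T09:55:00 WARN low memory",
]
latest = select_subset(raw_events, "latest", 1)
earliest = select_subset(raw_events, "earliest", 2)
print(latest, earliest, extract_fields(latest[0]))
```

The diverse and outlier subset types are left out here because they depend on clustering; only the time-ordered and random types are shown.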
January 31, 2017
March 10, 2020