US-8862522

Incremental machine learning for data loss prevention

Published: October 14, 2014

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

A computing device receives a document that was incorrectly classified as sensitive data based on a machine learning-based detection (MLD) profile. The computing device modifies a training data set that was used to generate the MLD profile by adding the document to the training data set as a negative example of sensitive data to generate a modified training data set. The computing device then analyzes the modified training data set using machine learning to generate an updated MLD profile.
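The feedback loop in the abstract can be illustrated with a small sketch. This is not the patented implementation: the toy naive Bayes classifier stands in for whatever learner produces an MLD profile, and the names `train_profile`, `is_sensitive`, and `add_negative_example`, along with the sample documents, are hypothetical.

```python
import math
from collections import Counter


def train_profile(training_set):
    """Build a toy profile: per-label token counts and document counts."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, label in training_set:
        counts[label].update(text.lower().split())
        docs[label] += 1
    return {"counts": counts, "docs": docs}


def is_sensitive(profile, text):
    """Classify with naive Bayes and add-one smoothing (an assumed learner)."""
    counts, docs = profile["counts"], profile["docs"]
    vocab_size = len(set(counts[True]) | set(counts[False]))
    total_docs = docs[True] + docs[False]
    scores = {}
    for label in (True, False):
        total_tokens = sum(counts[label].values())
        score = math.log((docs[label] + 1) / (total_docs + 2))
        for token in text.lower().split():
            score += math.log(
                (counts[label][token] + 1) / (total_tokens + vocab_size + 1)
            )
        scores[label] = score
    return scores[True] > scores[False]


def add_negative_example(training_set, document):
    """Return a modified training set with the document labeled non-sensitive."""
    return training_set + [(document, False)]


# Hypothetical training data: (text, is_sensitive) pairs.
training_set = [
    ("quarterly revenue forecast confidential", True),
    ("merger plan confidential draft", True),
    ("lunch menu for friday", False),
    ("team outing photos", False),
]
profile = train_profile(training_set)

# A document a user reports as wrongly flagged (a false positive).
false_positive = "public revenue forecast newsletter"
flagged_before = is_sensitive(profile, false_positive)

# Add it as a negative example and regenerate the profile.
modified_set = add_negative_example(training_set, false_positive)
updated_profile = train_profile(modified_set)
flagged_after = is_sensitive(updated_profile, false_positive)

print(flagged_before, flagged_after)  # True False
```

In this sketch the original training set is left untouched and a new list is returned, so the old and updated profiles can be compared side by side.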

Patent Claims

18 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method comprising: receiving a plurality of first documents that were incorrectly classified as sensitive data based on a machine learning-based detection (MLD) profile; modifying a training data set that was used to generate the MLD profile by adding the first documents to the training data set as negative examples of sensitive data to generate a modified training data set; determining that there are at least a threshold number of the first documents; and analyzing, by a processing device, the modified training data set using machine learning to generate an updated MLD profile in response to determining that there are at least the threshold number of the first documents.

2. The method of claim 1, wherein the analyzing is performed on a periodic basis in accordance with an MLD profile retraining schedule.

3. The method of claim 1, further comprising: assigning a first quality rating to the training data set and a second quality rating to the modified training data set; determining that the first quality rating is lower than or equal to the second quality rating; and analyzing the modified training data set in response to determining that the first quality rating is lower than or equal to the second quality rating.

4. The method of claim 1, further comprising: assigning a first quality rating to the training data set and a second quality rating to the modified training data set; determining that the second quality rating is higher than or equal to the first quality rating; and deploying the updated MLD profile in response to determining that the second quality rating is higher than or equal to the first quality rating.

5. The method of claim 3, wherein the first quality rating and the second quality rating are assigned based on performing at least one of latent semantic indexing or k-fold cross validation.

6. The method of claim 1, further comprising: receiving a second document that was incorrectly classified as non-sensitive data based on the MLD profile; and modifying the training data set that was used to generate the MLD profile by adding the second document to the training data set as a positive example of sensitive data to generate the modified training data set.

7. A non-transitory computer-readable storage medium having instructions stored therein that, when executed by a processing device, cause the processing device to perform operations comprising: receiving a plurality of first documents that were incorrectly classified as sensitive data based on a machine learning-based detection (MLD) profile; modifying a training data set that was used to generate the MLD profile by adding the first documents to the training data set as negative examples of sensitive data to generate a modified training data set; determining that there are at least a threshold number of the first documents; and analyzing, by the processing device, the modified training data set using machine learning to generate an updated MLD profile in response to determining that there are at least the threshold number of the first documents.

8. The non-transitory computer-readable storage medium of claim 7, wherein the analyzing is performed on a periodic basis in accordance with an MLD profile retraining schedule.

9. The non-transitory computer-readable storage medium of claim 7, wherein the operations further comprise: assigning a first quality rating to the training data set and a second quality rating to the modified training data set; determining that the first quality rating is lower than or equal to the second quality rating; and analyzing the modified training data set in response to determining that the first quality rating is lower than or equal to the second quality rating.

10. The non-transitory computer-readable storage medium of claim 7, wherein the operations further comprise: assigning a first quality rating to the training data set and a second quality rating to the modified training data set; determining that the second quality rating is higher than or equal to the first quality rating; and deploying the updated MLD profile in response to determining that the second quality rating is higher than or equal to the first quality rating.

11. The non-transitory computer-readable storage medium of claim 9, wherein the first quality rating and the second quality rating are assigned based on performing at least one of latent semantic indexing or k-fold cross validation.

12. The non-transitory computer-readable storage medium of claim 7, wherein the operations further comprise: receiving a second document that was incorrectly classified as non-sensitive data based on the MLD profile; and modifying the training data set that was used to generate the MLD profile by adding the second document to the training data set as a positive example of sensitive data to generate the modified training data set.

13. A system comprising: a memory to store instructions; and a processing device, coupled to the memory, to execute the instructions to: receive a plurality of first documents that were incorrectly classified as sensitive data based on a machine learning-based detection (MLD) profile; modify a training data set that was used to generate the MLD profile by adding the first documents to the training data set as negative examples of sensitive data to generate a modified training data set; determine that there are at least a threshold number of the first documents; and analyze the modified training data set using machine learning to generate an updated MLD profile in response to the determination that there are at least the threshold number of the first documents.

14. The computing device of claim 13, wherein the processing device is further to: assign a first quality rating to the training data set and a second quality rating to the modified training data set; determine that the first quality rating is lower than or equal to the second quality rating; and analyze the modified training data set in response to the determination that the first quality rating is lower than or equal to the second quality rating.

15. The system of claim 13, wherein the analyzing is performed on a periodic basis in accordance with an MLD profile retraining schedule.

16. The system of claim 13, wherein the processing device is further to: assign a first quality rating to the training data set and a second quality rating to the modified training data set; determine that the second quality rating is higher than or equal to the first quality rating; and deploy the updated MLD profile in response to the determination that the second quality rating is higher than or equal to the first quality rating.

17. The system of claim 14, wherein the first quality rating and the second quality rating are assigned based on performing at least one of latent semantic indexing or k-fold cross validation.

18. The system of claim 13, wherein the processing device is further to: receive a second document that was incorrectly classified as non-sensitive data based on the MLD profile; and modify the training data set that was used to generate the MLD profile by adding the second document to the training data set as a positive example of sensitive data to generate the modified training data set.
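As a reading aid for the claims above, the following sketch combines the threshold condition of the independent claims with the quality-rating comparison of the dependent claims. It rests on several assumptions: the class `RetrainingGate`, the helpers `train`, `predict`, and `kfold_quality`, and the sample documents are all hypothetical; a toy naive Bayes learner stands in for the unspecified machine learning step; and held-out accuracy under k-fold cross validation is used as the quality rating. Latent semantic indexing and the periodic retraining schedule are not shown.

```python
import math
from collections import Counter


def train(examples):
    """Toy naive Bayes 'profile': per-label token counts and document counts."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, label in examples:
        counts[label].update(text.lower().split())
        docs[label] += 1
    return counts, docs


def predict(profile, text):
    """Return True if the text scores higher as sensitive than non-sensitive."""
    counts, docs = profile
    vocab_size = len(set(counts[True]) | set(counts[False]))

    def score(label):
        total = sum(counts[label].values())
        s = math.log((docs[label] + 1) / (sum(docs.values()) + 2))
        for token in text.lower().split():
            s += math.log((counts[label][token] + 1) / (total + vocab_size + 1))
        return s

    return score(True) > score(False)


def kfold_quality(examples, k=3):
    """Fraction of examples predicted correctly while held out (k folds)."""
    correct = 0
    for fold in range(k):
        held_out = examples[fold::k]  # every k-th example, starting at fold
        kept = [ex for i, ex in enumerate(examples) if i % k != fold]
        profile = train(kept)
        correct += sum(predict(profile, t) == label for t, label in held_out)
    return correct / len(examples)


class RetrainingGate:
    """Retrain only after enough false positives and no drop in quality."""

    def __init__(self, training_set, threshold, k=3):
        self.deployed_set = list(training_set)   # set behind the live profile
        self.modified_set = list(training_set)   # set with new negatives added
        self.new_count = 0                       # false positives since deploy
        self.threshold = threshold
        self.k = k
        self.profile = train(self.deployed_set)

    def report_false_positive(self, text):
        """Add a negative example; return True if a new profile was deployed."""
        self.modified_set.append((text, False))
        self.new_count += 1
        if self.new_count < self.threshold:
            return False  # not enough reported documents yet
        first = kfold_quality(self.deployed_set, self.k)
        second = kfold_quality(self.modified_set, self.k)
        if second < first:
            return False  # modified set rates worse; keep the current profile
        self.profile = train(self.modified_set)
        self.deployed_set = list(self.modified_set)
        self.new_count = 0
        return True


# Hypothetical seed data: (text, is_sensitive) pairs.
seed_examples = [
    ("quarterly revenue forecast confidential", True),
    ("lunch menu for friday", False),
    ("merger plan confidential draft", True),
    ("team outing photos", False),
    ("confidential salary review", True),
    ("office parking notice", False),
]
```

A caller would construct `RetrainingGate(seed_examples, threshold=2)` and call `report_false_positive` for each reported document. If the quality check rejects the modified set, the reported documents stay in it and the check runs again on the next report.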

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

G06F H04L

Patent Metadata

Filing Date

December 14, 2011

Publication Date

October 14, 2014
