Spam Identification Using an Algorithm Based on Histograms and Lexical Vectors (one-Pass Algorithm)

Published: August 16, 2011

Assignee: not available in USPTO data

Technical Abstract

Patent Claims

13 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for identifying spam in an email, the method comprising:
(a) normalizing an email text morphologically and identifying unique words in the email text;
(b) filtering words from the email text, including filtering multi-symbol meaningless human-language words and noise human-language words;
(c) determining a number of occurrences of each unique word in the email text;
(d) creating a unique numerical identifier for each unique word, the identifier being based on a numerical value corresponding to the unique word;
(e) assigning a unique numerical identifier to each unique word in the email text;
(f) generating a lexical vector of the email text as a plurality of the assigned identifiers and a frequency of occurrence of each corresponding unique word in the email text;
(g) generating a histogram of the lexical vector for each unique numerical identifier of each corresponding unique word in the email text;
(h) performing only a single comparison of the histogram of the lexical vector to histograms of lexical vectors of known spam texts; and
(i) determining if the email text is spam based on a result of comparison of the histograms.
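Steps (a) through (g) of claim 1 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: lowercasing stands in for morphological normalization, `NOISE_WORDS` is a hypothetical stop list, `is_meaningless` is a hypothetical heuristic for meaningless multi-symbol words, CRC32 stands in for the numerical value assigned to each word, and the histogram is taken to be the relative frequency of each identifier.

```python
import re
import zlib
from collections import Counter

# Hypothetical noise-word list; the claims do not enumerate one.
NOISE_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "you"}

def normalize(text):
    """Step (a), simplified: lowercase and split into alphabetic tokens.
    A real system would apply morphological normalization here."""
    return re.findall(r"[a-z]+", text.lower())

def is_meaningless(word):
    """Step (b), hypothetical heuristic: drop very short tokens
    and tokens that contain no vowel."""
    return len(word) < 3 or not re.search(r"[aeiouy]", word)

def word_id(word):
    """Steps (d)-(e): map a word to a numerical identifier.
    CRC32 is an assumption; claim 7 stores such values in a database."""
    return zlib.crc32(word.encode("utf-8"))

def lexical_vector(text):
    """Steps (c) and (f): identifier -> number of occurrences."""
    words = [w for w in normalize(text)
             if w not in NOISE_WORDS and not is_meaningless(w)]
    return dict(Counter(word_id(w) for w in words))

def histogram(vector):
    """Step (g), under the assumption that the histogram is the
    relative frequency of each identifier in the lexical vector."""
    total = sum(vector.values())
    return {i: n / total for i, n in vector.items()} if total else {}

vector = lexical_vector("Buy cheap pills! Buy now, xzq")
print(len(vector), vector[word_id("buy")])
```

In this example the token `xzq` is dropped by the heuristic filter and `buy` is counted twice, so the vector holds four identifiers.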

2. The method of claim 1, further comprising calculating a length of the lexical vector and comparing it to lengths of the lexical vectors of the known spam texts prior to comparing the histograms.

3. The method of claim 2, further comprising excluding from consideration the lexical vectors of the known spam texts having a length that does not coincide with the length of the lexical vector within a pre-determined threshold range.

4. The method of claim 3, further comprising generating the histograms of the lexical vectors of known spam texts that remain after comparison of the lengths.
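The length pre-filter of claims 2 through 4 might look like the following sketch. The claims do not define "length", so this assumes it is the number of unique identifiers in the lexical vector (a Euclidean norm of the counts would be another plausible reading). The function names and the relative `tolerance` parameter are hypothetical.

```python
def vector_length(vector):
    """Assumed definition: number of unique identifiers in the vector."""
    return len(vector)

def prefilter_by_length(candidate, known_spam, tolerance=0.25):
    """Keep only known spam vectors whose length lies within a relative
    tolerance of the candidate's length (claims 2 and 3). Histograms
    would then be built only for the survivors (claim 4)."""
    target = vector_length(candidate)
    return [v for v in known_spam
            if abs(vector_length(v) - target) <= tolerance * target]

# Lexical vectors as identifier -> count dictionaries (toy data).
email = {1: 2, 2: 1, 3: 1, 4: 1}
known = [
    {1: 1, 2: 1, 3: 1, 4: 3},                           # length 4, kept
    {1: 1, 2: 1, 3: 1, 4: 1, 5: 1},                     # length 5, kept
    {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},   # length 8, dropped
]
print(len(prefilter_by_length(email, known)))
```

The point of the design is cost: comparing two integers is cheaper than building and comparing histograms, so obviously mismatched spam samples are discarded first.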

5. The method of claim 1, wherein the result of comparison of the histograms is a control value.

6. The method of claim 5, wherein the email text is considered to be spam if the control value is within the pre-set threshold range.
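One way to read the control value of claims 5 and 6 is as a similarity score between two histograms. The claims do not name a formula, so the sketch below assumes histogram intersection (the sum of the smaller relative frequency for each shared identifier) and a hypothetical threshold range of 0.8 to 1.0. Each known spam histogram is compared to the email histogram exactly once.

```python
def control_value(hist_a, hist_b):
    """Histogram intersection (an assumed measure): 0.0 for no shared
    identifiers, 1.0 for identical normalized histograms."""
    return sum(min(p, hist_b[i]) for i, p in hist_a.items() if i in hist_b)

def is_spam(email_hist, spam_hists, low=0.8, high=1.0):
    """Flag the email if any single comparison yields a control value
    inside the (hypothetical) pre-set threshold range [low, high]."""
    return any(low <= control_value(email_hist, h) <= high
               for h in spam_hists)

# Histograms as identifier -> relative frequency dictionaries (toy data).
email_hist = {1: 0.5, 2: 0.5}
similar_spam = {1: 0.5, 2: 0.4, 3: 0.1}
unrelated_spam = {7: 1.0}

print(is_spam(email_hist, [unrelated_spam, similar_spam]))
print(is_spam(email_hist, [unrelated_spam]))
```

With this toy data, the first call finds a sufficiently similar known spam histogram and the second does not.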

7. The method of claim 1, wherein the numerical values corresponding to the unique words are stored in a database.

8. The method of claim 1, wherein the lexical vectors of the known spam texts are stored in a lexical vector database.

9. A system for identifying spam in an email text, the system comprising: a processor; a memory; and computer code stored in the memory and executed on the processor for implementing the steps (a)-(i) of claim 1.

10. A system for identifying spam in an email text, the system comprising a processor, a memory, and computer code loaded into the memory for implementing:
(a) a lexical vector module coupled to a database containing numerical values corresponding to unique words of the email text, the lexical vector module being configured to generate a lexical vector of the email text as a plurality of the unique numerical values corresponding to a unique word and a number of occurrences of each corresponding unique word in the email text;
(b) a histogram module for generating histograms of lexical vectors for each unique numerical identifier of each corresponding unique word in the email text;
(c) a lexical vector database accessible by the histogram module;
(d) a length calculation module coupled to the lexical vector module and connected to the lexical vector database; and
(e) a comparison module coupled to the histogram module,
(f) wherein the histogram of the lexical vector of the incoming email text is generated in the histogram module and compared only a single time to histograms of lexical vectors of known spam texts stored in the lexical vector database, and
(g) wherein the lexical vector is generated after the email text is normalized morphologically and after meaningless and noise words are filtered out from the email text, filtering multi-symbol meaningless human-language words and noise human-language words.

11. The system of claim 10, wherein the length of the lexical vector of the incoming email text is compared to lengths of the lexical vectors of known spam texts stored in the lexical vector database in the length calculation module.

12. The system of claim 10, wherein the comparison module produces a control value.

13. The system of claim 12, wherein if the control value is within a pre-set threshold, the incoming email text is considered to be spam.

Patent Metadata

Filing Date

Unknown

Publication Date

August 16, 2011

Inventors

ANDREY L. KALININ
