Automatic Genre Classification Determination of Web Content to Which the Web Content Belongs Together with a Corresponding Genre Probability

Published: October 23, 2018

Assignee: not available in the USPTO data

Inventors: Dirk Harz, Ralf Iffert, Mark Keinhoerster, Mark Usher

Patent Claims

7 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method, in a data processing system comprising a hardware processor and a memory coupled to the hardware processor, the memory comprising instructions executed by the hardware processor to cause the hardware processor to implement a training process and a classification process for automatic genre determination of web content, the method comprising:
the training process, wherein for each of at least one type of web content genre to be trained in the training process, comprises the steps of:
collecting first labeled example data representing a first type of training material reflecting the type of web content genre to be trained;
collecting second labeled example data representing a second type of training material not reflecting the type of web content genre to be trained;
extracting a set of feature types comprising genre features and non-genre features from the collected first type of training material and second type of training material, wherein the genre features and the non-genre features are represented by tokens consisting of fixed length character strings extracted from content strings of the first and second type of training material; and
storing each token in a corresponding feature database together with a first integer count (C_G) representing a frequency of appearance of the token in the first type of training material and a second integer count (C_NG) representing a frequency of appearance of the token in the second type of training material; and
the classification process, wherein the classification process comprises the steps of:
providing web content, wherein the web content is a HyperText Markup Language (HTML) document, which is parsed to generate HTML document object model (DOM) data providing a tree representation of the HTML document, where each tag, attribute and text data of the web content is represented as a node in the tree, wherein a first feature type is generated by joining together attribute values of all HTML meta data tags to form a single content string, wherein each attribute value is separated by a single space character, further text content from a HTML title tag and HTML anchor tags is extracted and appended to the content string, wherein characters are converted to lower case and only alpha-numeric and space characters are added to the content string and sequences of space characters are compressed to a single space character; wherein a second feature type is generated by joining together attribute values of all HTML anchor tags and all link tags to form a single content string, wherein each attribute value is separated by a single space character, and wherein characters are converted to lower case and only alpha-numeric and space characters are added to the content string and sequences of space characters are compressed to a single space character;
extracting fixed length tokens for each feature type of the set of feature types from different text and structural elements of the web content;
looking up frequencies of appearance in the corresponding feature database for each extracted token;
calculating for each feature type of the set of feature types a corresponding feature probability that the web content belongs to a corresponding specific trained web content genre by combining probabilities of the genre features and non-genre features;
combining the feature probabilities to an overall genre probability that the web content belongs to a specific trained web content genre; and
outputting a genre classification result comprising at least one specific trained web content genre to which the web content belongs together with a corresponding genre probability.
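The storage and lookup steps of claim 1 can be illustrated with a minimal Python sketch. The claim does not state how token probabilities are combined, so the smoothed log-odds rule and the arithmetic mean used below are assumptions made purely for illustration, and the names `FeatureDatabase`, `feature_probability`, and `overall_probability` are hypothetical.

```python
import math
from collections import Counter


class FeatureDatabase:
    """Token counts for one feature type: C_G (genre) and C_NG (non-genre)."""

    def __init__(self):
        self.c_g = Counter()   # frequency of each token in genre material
        self.c_ng = Counter()  # frequency of each token in non-genre material

    def train(self, tokens, is_genre):
        """Add the token frequencies of one labeled training document."""
        (self.c_g if is_genre else self.c_ng).update(tokens)

    def feature_probability(self, tokens):
        """Assumed rule: sum add-one-smoothed log-odds, then squash to 0..1."""
        total_g = sum(self.c_g.values())
        total_ng = sum(self.c_ng.values())
        vocab = len(set(self.c_g) | set(self.c_ng)) or 1
        log_odds = 0.0
        for tok in tokens:
            p_g = (self.c_g[tok] + 1) / (total_g + vocab)
            p_ng = (self.c_ng[tok] + 1) / (total_ng + vocab)
            log_odds += math.log(p_g) - math.log(p_ng)
        # Clamp so that math.exp cannot overflow on very long documents.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


def overall_probability(feature_probs):
    """Assumed rule: arithmetic mean of the per-feature-type probabilities."""
    return sum(feature_probs) / len(feature_probs)


# Usage: one database per feature type (here only a metadata-like one).
meta_db = FeatureDatabase()
meta_db.train(["blog ", "log p"], is_genre=True)
meta_db.train(["shop ", "hop c"], is_genre=False)
result = {"blog": overall_probability([meta_db.feature_probability(["blog "])])}
```

A real system would keep one such database per feature type and per trained genre; any other combination rule that turns the two counts into a probability would fit the claim wording equally well.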

2. The method according to claim 1, wherein at least one of the following genres are trained as type of web content genre: blog, forum, chat room, social media site, or internet discussion site.

3. The method according to claim 1, wherein a set of tokens is extracted from a content string of the first and second type of training material or the web content by passing a fixed length sliding window over the content string forming the token from characters of the content string which lie within the fixed length sliding window, starting from a left-most character of the content string, wherein the fixed length sliding window is shifted by one character to the right until an end of the fixed length sliding window lies at a right-most character of the content string.
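The sliding-window extraction of claim 3 amounts to taking every substring of a fixed length, one character apart. A minimal Python sketch follows; the function name `extract_tokens` and the default window length of 5 are assumptions, since the claims do not fix a token length.

```python
def extract_tokens(content, window=5):
    """Return every fixed-length token, shifting the window by one character."""
    if window <= 0:
        raise ValueError("window must be positive")
    # The last window ends at the right-most character, which gives
    # len(content) - window + 1 positions; shorter strings yield no tokens.
    return [content[i:i + window] for i in range(len(content) - window + 1)]


tokens = extract_tokens("blog post", window=5)
# tokens == ["blog ", "log p", "og po", "g pos", " post"]
```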

4. The method according to claim 1, wherein the first feature type denotes the fixed length tokens extracted from meta data contained in the first type of training material and second type of training material or the web content.
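As a rough illustration of how the metadata content string of claims 1 and 4 might be built, the Python sketch below uses the standard-library `html.parser` event parser rather than a full DOM, which is a simplification. The names `MetaCollector`, `normalize`, and `meta_content_string` are hypothetical, and treating Unicode letters and digits as alpha-numeric (via `str.isalnum`) is an assumption.

```python
import re
from html.parser import HTMLParser


def normalize(text):
    """Lower-case, keep only alphanumerics and spaces, collapse space runs."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ")
    return re.sub(r" +", " ", kept)


class MetaCollector(HTMLParser):
    """Collects meta attribute values plus text inside title and anchor tags."""

    def __init__(self):
        super().__init__()
        self.meta_values = []
        self.texts = []
        self._depth = 0  # number of currently open title/a elements

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_values.extend(v for _, v in attrs if v)
        elif tag in ("title", "a"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("title", "a") and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.texts.append(data.strip())


def meta_content_string(html):
    """Join meta values first, then append title and anchor text."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return normalize(" ".join(parser.meta_values + parser.texts))
```

Note that a literal reading of the claim drops punctuation without replacing it by a space, so a value such as `text/html` becomes `texthtml`.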

5. The method according to claim 1, wherein the second feature type denotes the fixed length tokens extracted from uniform resource locator (URL) data contained in the first type of training material and second type of training material or the web content.
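The URL-oriented content string of claims 1 and 5 can be sketched in the same way. The Python example below collects every attribute value of anchor and link tags, as the claim wording suggests, using the standard-library `html.parser`; `LinkCollector`, `normalize`, and `url_content_string` are hypothetical names, and the use of an event parser in place of a DOM tree is a simplification.

```python
import re
from html.parser import HTMLParser


def normalize(text):
    """Lower-case, keep only alphanumerics and spaces, collapse space runs."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ")
    return re.sub(r" +", " ", kept)


class LinkCollector(HTMLParser):
    """Collects all attribute values of anchor (a) and link tags."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            # Valueless attributes arrive as None and are skipped.
            self.values.extend(v for _, v in attrs if v)


def url_content_string(html):
    """Join the collected values with single spaces, then normalize."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return normalize(" ".join(parser.values))
```

Because only alpha-numeric and space characters survive, a URL such as `https://example.com/blog/2018/10` collapses to `httpsexamplecomblog201810`, which the sliding window then cuts into tokens.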

6. The method according to claim 1, wherein a third feature type denotes the fixed length tokens extracted from structural information of the first type of training material and second type of training material or the web content, wherein the structural information comprises numeric codes in a defined range, each code word representing a tag node of web content structure.

7. The method according to claim 1, wherein a third feature type is generated by traversing the HTML document object model (DOM) tree and converting each HTML tag node to a numeric code in a range from 0 to 255 which represents this tag, wherein resulting codes are concatenated to form a sequence of tag codes, and this tag code sequence is used as content string to extract a set of tokens by passing the fixed length sliding window over the content string.
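The structural feature of claims 6 and 7 can be pictured with the Python sketch below. The claims only require a code between 0 and 255 per tag, so the `TAG_CODES` table, the `UNKNOWN_TAG` fallback, and the window length of 3 are all assumptions. Start-tag events from the standard-library `html.parser` are used as a stand-in for a pre-order DOM traversal; a real DOM parser would also insert implied elements.

```python
from html.parser import HTMLParser

# Hypothetical code table; any fixed mapping into 0..255 would do.
TAG_CODES = {"html": 1, "head": 2, "title": 3, "meta": 4, "body": 5,
             "div": 6, "p": 7, "a": 8, "ul": 9, "li": 10}
UNKNOWN_TAG = 255  # assumed fallback for tags missing from the table


class TagCoder(HTMLParser):
    """Appends one byte per start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.codes = bytearray()

    def handle_starttag(self, tag, attrs):
        self.codes.append(TAG_CODES.get(tag, UNKNOWN_TAG))


def tag_code_sequence(html):
    """Return the concatenated tag codes as a bytes content string."""
    coder = TagCoder()
    coder.feed(html)
    coder.close()
    return bytes(coder.codes)


def extract_tokens(seq, window=3):
    """Slide a fixed-length window over the code sequence, one code at a time."""
    return [seq[i:i + window] for i in range(len(seq) - window + 1)]


codes = tag_code_sequence("<html><body><div><p>hi</p><a href='#'>x</a></div></body></html>")
structure_tokens = extract_tokens(codes)
```

Storing the sequence as `bytes` keeps each code in one position, so the same sliding-window logic used for text applies unchanged.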

Patent Metadata

Filing Date

Unknown

Publication Date

October 23, 2018

Inventors

Dirk Harz

Ralf Iffert

Mark Keinhoerster

Mark Usher
