Systems and methods are described for efficiently processing, searching and/or rewriting variable width encoded data, such as UTF-8 encoded data. Embodiments of the systems and methods modify and adapt search algorithms, such as the Horspool and Wu-Manber algorithms, to efficiently search variable width encoded text in large blocks of text, such as text carried via a stream of packets through a network device, such as an intermediary device.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A method for case insensitive searching of a variable width encoded pattern in a block of text, the method comprising: (a) determining, by a device, for each character in a pattern for which to search for a match within a block of text, a corresponding lower case Unicode value, the pattern comprising variable-width encoded characters; (b) establishing, by the device, an index table of jump values for the pattern, the index table comprising a hash to each corresponding lower case Unicode value that identifies a number of byte lengths for the corresponding character; (c) jumping, by the device responsive to the index table of jump values, a pointer to the block of text to a pivot element in the block of text based on a byte length of the pattern and the byte length of a last character of the pattern; and (d) comparing, by the device, a lower case Unicode value of the pivot element to the corresponding lower case Unicode value of the character of the last character of the pattern.
2. The method of claim 1, wherein step (a) further comprises determining, by the device, that a character in the pattern is upper case and calculating a lower case Unicode value.
3. The method of claim 1, wherein the pattern comprises variable-width encoded characters that are UTF-8 encoded.
4. The method of claim 1, wherein step (b) further comprises establishing, by the device, index tables for each of size of variable-width encoded characters.
5. The method of claim 1, wherein step (c) further comprises setting, by the device, the pointer at a beginning of the block of text.
6. The method of claim 1 wherein step (c) further comprises determining, by the device, that the pointer has jumped to a middle of a character in the block of text.
7. The method of claim 6, further comprising moving, by the device responsive to the determination that the pointer has jumped to the middle of the character in the block of text, the pointer to beginning of the character.
8. The method of claim 1, wherein step (d) further comprises determining, by the device, that the jump to the pivot element is a jump to a character boundary and responsive to determining that the jump is to the character boundary, the device performs the comparing.
9. The method of claim 1, further comprising determining, by the device, that the lower case Unicode value of the pivot element matches the corresponding lower case Unicode value of the character of the last character of the pattern.
10. The method of claim 9, further comprising jumping, by the device, the pointer to the byte location in the block of text that corresponds to the beginning of the pattern and comparing the pattern against a corresponding portion of the block of text identified by the pointer.
11. A method for simultaneously performing case insensitive searches of variable width encoded patterns in a block of text, the method comprising: (a) converting, by a device, each of the patterns to be searched within a block of text to a corresponding lower case pattern, each pattern comprising variable-width encoded characters; (b) establishing, by the device for each of the patterns, a shift table comprising a hash of a predetermined number of bytes of the corresponding lower case pattern and a jump value; (c) jumping, by the device responsive to the shift table, a pointer to a pivot block in the block of text; (d) identifying, by the device, an encoded string within the pivot block that comprises bytes from the predetermined number of bytes of the pivot block; (e) computing, by the device, a hash of the bytes of the lower case of the encoded string corresponding to the predetermined number of bytes; and (f) obtaining, by the device using the hash of the bytes, the jump value from the shift table.
12. The method of claim 1, wherein the pattern comprises variable-width encoded characters that are UTF-8 encoded.
13. The method of claim 11, wherein step (b) further comprises maintaining, by the device, a list of patterns that have zero jumps at each hash.
14. The method of claim 11, wherein step (c) further comprises initially jumping by the device, the pointer to an initial pivot block based on a minimum byte length across each of the patterns.
15. The method of claim 11, wherein step (d) further comprises identifying, by the device, a minimal valid encoded string comprising bytes from the predetermined number of bytes of the pivot block.
16. The method of claim 11, wherein step (f) further comprises identifying, by the device, that the jump value is zero.
17. The method of claim 16, further comprising determining, by the device, whether any patterns associated with the jump value of zero match corresponding text in the block of text.
18. The method of claim 16, further comprising moving, by the device for each pattern, the pointer to the block of text back a number of byte lengths of the pattern.
19. The method of claim 18, further comprising determining, by the device, that the pattern does not match a corresponding portion of text identified by the pointer responsive to identifying that the pointer is not a character boundary.
20. The method of claim 18, further comprising comparing, by the device, the pattern to the text of the block of text identified by the pointer responsive to identifying that the pointer is at a character boundary.
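For illustration only, the single-pattern search of claims 1 to 10 might be sketched in Python as follows. This is not the patented implementation: the function names (`fold`, `char_start`, `char_at`, `find_all`), the use of a dictionary as the jump table, and the per-character lowering via `str.lower()` are all assumptions made for this sketch. It also assumes the text is valid UTF-8 and that the upper and lower case forms of a character have the same UTF-8 byte length, which holds for most but not all characters.

```python
def fold(ch):
    """Lower-case one character; keep it unchanged if lowering yields several."""
    low = ch.lower()
    return low if len(low) == 1 else ch

def char_start(buf, i):
    """Move i back to the first byte of the UTF-8 character that contains it."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:   # 10xxxxxx marks a continuation byte
        i -= 1
    return i

def char_at(buf, i):
    """Decode the character whose lead byte is at i; return (char, byte length)."""
    lead = buf[i]
    n = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return buf[i:i + n].decode("utf-8"), n

def find_all(pattern, text):
    """Return byte offsets in `text` (bytes) where `pattern` (str) matches, ignoring case."""
    if not pattern:
        return []
    pat = "".join(fold(c) for c in pattern)          # lower-case pattern (assumed folding)
    sizes = [len(c.encode("utf-8")) for c in pat]    # byte length of each character
    last_off = sum(sizes) - sizes[-1]                # byte offset of the last character
    # Jump table (hypothetical layout): folded character -> bytes to advance the pivot.
    jump, off = {}, 0
    for c, n in zip(pat[:-1], sizes[:-1]):
        jump[c] = last_off - off                     # later occurrences overwrite earlier ones
        off += n
    hits = []
    pivot = last_off
    while pivot < len(text):
        pivot = char_start(text, pivot)              # realign if the jump landed mid-character
        ch, n = char_at(text, pivot)
        ch = fold(ch)
        start = pivot - last_off                     # where the pattern would have to begin
        if ch == pat[-1] and start >= 0 and char_start(text, start) == start:
            window = text[start:pivot + n].decode("utf-8")
            if "".join(fold(c) for c in window) == pat:
                hits.append(start)
        # Characters absent from the pattern let the window move past the pivot.
        pivot += max(jump.get(ch, last_off + n), n)
    return hits
```

A call such as `find_all("CAFÉ", "Un café, deux CAFÉS, cafe".encode("utf-8"))` returns the byte offsets of the two accented occurrences and skips the unaccented `cafe`. The sketch does not cover the multi-pattern search of claims 11 to 20.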
March 27, 2012
August 26, 2014