Novel Arabic Spell Checking Error Model

Published: December 24, 2019

Assignee: not available in the USPTO data we have

Inventors: Sabri A. Mahmoud, Wasfi G. Al-Khatib, Tamim Alnethary

Patent Claims

19 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A data-driven spell checking computer system for a language, the computer system including one or more processors, the computer system comprising: an error-correct patterns generator that learns types and forms of the language's morphological patterns from an annotated corpus, and analyzes types of errors using the morphological patterns to generate error encodings, which are a string that specifies the positions of changes and change types in error patterns; an error-correct patterns database managed by the one or more processors; and a correction candidates generator performed by the one or more processors, wherein the error-correct patterns generator generates the error patterns based on the analysis of the types of errors, wherein the error patterns contain at least one of the language's characters and at least one affix position symbol, the error-correct patterns database stores the error patterns and the error encodings, the correction candidates generator generates correction candidates by, for a particular word error, matching all the error patterns, having a length equal to the word error, against the word error, and generating all the correction candidates according to the matched error patterns' error encodings.

2. The data-driven computer system of claim 1, wherein generated correction candidates are checked to determine if they contain a valid root, the correction candidates generator outputting correction candidates that have a valid root.

3. The data-driven computer system of claim 1, wherein the error-correct patterns generator for a word error, morphologically analyzes the corresponding correction to get one or more word error's morphological patterns; generates one or more error patterns using the pairs of morphological patterns and the corresponding word error; and generates the error encoding and correction codes.

4. The data-driven computer system of claim 3, wherein the generated error patterns, error encodings and correction codes are stored in the error-correct patterns database.

5. The data-driven computer system of claim 3, wherein the error encoding includes at least one of no change, transposition of two characters, insertion, deletion, and substitution.

6. The data-driven computer system of claim 1, wherein the error-correct patterns database is created by: storing error patterns generated by the error-correct patterns generator; for each correct stem, examining all possible combinations of prefixes and suffixes, where at least one of the combinations has an error, adding combinations that satisfy the condition that the stem is compatible with the correct affixes to the database; and storing the error patterns with and correction information using dictionaries according to the length of the error pattern.

7. The data-driven computer system of claim 1, wherein the language is Arabic.

8. The data-driven computer system of claim 1, wherein the language is a Semitic language.

9. The data-driven computer system of claim 1, wherein the correction candidates generator generates correction candidates by, for a particular word error, matching all the error patterns, having a length equal to the word error, against the word error in order to minimize the number of correction candidates in which the correction candidates contain a correct correction.

10. A non-transitory computer-readable storage medium storing a spell checking program and a database, when executed by a computer, the spell checking program performs an error-correct patterns generation process, including: learning types and forms of the language's morphological patterns from an annotated corpus; analyzing types of errors using the morphological patterns to generate error encodings, which are a string that specifies the positions of changes and change types in an error pattern; generating the error patterns based on the analysis of the types of errors, wherein the error patterns contain at least one of the language's characters and at least one affix position symbol; and storing the error patterns and the error encodings in the database, and a correction candidate generation process, including: inputting a particular word error; and generating correction candidates for the word error by, matching all the stored error patterns, having a length equal to the word error, against the word error, and generating all the correction candidates according to the matched error patterns' error encodings.

11. The computer-readable storage medium of claim 10, wherein the language is Arabic.

12. A spell checking method performed by one or more processors, wherein the one or more processors manage a database that stores error patterns, the method performed by the one or more processors comprising: learning types and forms of the language's morphological patterns from an annotated corpus; analyzing types of errors using the morphological patterns to generate error encodings, which are a string that specifies the positions of changes and change types in an error pattern; generating the error patterns based on the analysis of the types of errors, wherein the error patterns contain at least one of the language's characters and at least one affix position symbol; storing the error patterns and the error encodings in the database; inputting a particular word error; and generating correction candidates for the word error by, matching all the stored error patterns, having a length equal to the word error, against the word error, and generating all the correction candidates according to the matched error patterns' error encodings.

13. The method of claim 12, wherein the generating of correction candidates further comprises checking the generated correction candidates to determine if they contain a valid root; and outputting correction candidates that have a valid root.

14. The method of claim 12, wherein the learning includes: for a word error, morphologically analyzing the corresponding correction to get one or more word error's morphological patterns; and the analyzing the types of errors includes: generating one or more error patterns using the pairs of morphological patterns and the corresponding word error; generating the error encoding; and generating corrections corresponding to the error encoding that need to be applied to the word error, denoted by correction codes.

15. The method of claim 14, wherein the generated error patterns, error encodings and correction codes are stored in the database.

16. The method of claim 14, wherein the error encoding includes at least one of no change, transposition of two characters, insertion, deletion, and substitution.

17. The method of claim 12, wherein the database is created by: storing the generated error patterns; for each correct stem, examining all possible combinations of prefixes and suffixes, where at least one of the combinations has an error, adding combinations that satisfy the condition that the stem is compatible with the correct affixes to the database; and storing the error patterns with and correction information using dictionaries according to the length of the error pattern.

18. The method of claim 12, wherein the language is Arabic.

19. The method of claim 12, wherein the language is a Semitic language.
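
Illustrative sketch (not part of the patent text). The candidate generation recited in the claims above (look up stored error patterns whose length equals the misspelled word, match them, apply each matched pattern's encoding, then keep candidates with a valid root) might be approximated as follows. Everything concrete here is an assumption made for illustration: the `#` position symbol, the one-letter edit codes `N`/`D`/`S`/`I`/`T`, the separate `fill` string standing in for correction codes, the function names, and the Latin-letter toy data. The claims do not disclose the actual encoding format, and the codes below describe the fix applied to the misspelled word rather than the error type itself.

```python
from collections import defaultdict

WILDCARD = "#"  # hypothetical position symbol: matches any single character


def matches(pattern, word):
    """True when pattern and word have equal length and agree on every literal."""
    return len(pattern) == len(word) and all(
        p == WILDCARD or p == c for p, c in zip(pattern, word)
    )


def apply_encoding(word, encoding, fill):
    """Apply one edit code per position of the misspelled word (assumed format).

    N = no change, D = delete this character, S = substitute with fill[i],
    I = keep this character and insert fill[i] after it,
    T = swap this character with the next one (the next position is consumed).
    """
    out = []
    i = 0
    while i < len(word):
        op = encoding[i]
        if op == "N":
            out.append(word[i])
        elif op == "D":
            pass  # drop the extra character
        elif op == "S":
            out.append(fill[i])
        elif op == "I":
            out.extend((word[i], fill[i]))
        elif op == "T":
            if i + 1 >= len(word):
                raise ValueError("transposition needs a following character")
            out.extend((word[i + 1], word[i]))
            i += 1  # skip the character that was swapped in
        else:
            raise ValueError(f"unknown edit code {op!r}")
        i += 1
    return "".join(out)


def build_database(entries):
    """Group (pattern, encoding, fill) entries in a dictionary keyed by pattern length."""
    db = defaultdict(list)
    for pattern, encoding, fill in entries:
        if not (len(pattern) == len(encoding) == len(fill)):
            raise ValueError("pattern, encoding and fill must have equal length")
        db[len(pattern)].append((pattern, encoding, fill))
    return db


def correction_candidates(word, db, has_valid_root=lambda candidate: True):
    """Match only same-length patterns, apply their encodings, filter by root check."""
    found = []
    for pattern, encoding, fill in db.get(len(word), []):
        if matches(pattern, word):
            candidate = apply_encoding(word, encoding, fill)
            if has_valid_root(candidate) and candidate not in found:
                found.append(candidate)
    return found


# Toy entries with Latin letters for readability; real patterns would be Arabic.
TOY_DB = build_database([
    ("#eh", "NTT", "---"),     # swap the last two characters
    ("c#t", "NSN", "-a-"),     # substitute the middle character with "a"
    ("###x", "NNND", "----"),  # delete a trailing "x"
])

print(correction_candidates("teh", TOY_DB))   # ['the']
print(correction_candidates("catx", TOY_DB))  # ['cat']
```

Keying the dictionary by pattern length means only patterns that could possibly match are examined. The `has_valid_root` predicate is a placeholder for the root check of claims 2 and 13; a real system would need a morphological analyzer there.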

Patent Metadata

Filing Date

Unknown
