Legal claims defining the scope of protection, as filed with the USPTO.
1. A system for identifying entities in email signature blocks, the apparatus comprising: one or more processors; and a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to: create a plurality of scores for each token, in a sequence of tokens from an email signature block, based on a corresponding independent probability distribution that has been previously trained for a plurality of entity types, wherein each token comprises one of a word, a punctuation symbol, and an end-of-line character, an entity being a part of one of a person name, a job title, an enterprise name, a telephone number, an email address, and a uniform resource locator, and being associated with at least one of an entity type, an entity sequence, and a set of entities; identify each entity sequence that has a total number of entities that is identical to a total number of tokens in the sequence of tokens; determine, for each of the identified entity sequences, an entity sequence score by combining corresponding scores for each token in the sequence of tokens, that corresponds to an entity type in an identified entity sequence; identify an entity sequence from the identified entity sequences with a highest entity sequence score; and output the sequence of tokens as an identified set of entities, in the email signature block, based on the entity sequence with the highest score.
2. The system of claim 1, wherein scoring each token in the sequence of tokens from the email signature block based on the plurality of entity types comprises scoring each token based on a k-gram from a token matching at least one of the plurality of entity types, wherein the k-gram from the token comprises a string of consecutive characters in the token, with k as a length of the string of consecutive characters in the token.
3. The system of claim 1, wherein the sequence of tokens from the email signature block comprises a sequence of tokens from a same line of the email signature block.
4. The system of claim 1, wherein identifying each entity sequence in the plurality of entity sequences which comprises the number of entities that matches the number of tokens in the sequence of tokens comprises identifying each entity sequence which corresponds to an initial line of the email signature block.
5. The system of claim 1, wherein identifying each entity sequence in the plurality of entity sequences which comprises the number of entities that matches the number of tokens in the sequence of tokens comprises identifying each entity sequence which excludes entities identified from a previous line of the email signature block.
6. The system of claim 1, wherein combining corresponding scores for each token, in the sequence of tokens, that corresponds to an entity type in an identified entity sequence comprises multiplying scores for each token, in the sequence of tokens, that corresponds to each entity type, for each identified entity sequence.
7. The system of claim 1, comprising further instructions, which when executed, cause the one or more processors to merge adjacent tokens corresponding to a same entity type to generate a single corresponding entity type.
8. A computer program product comprising computer-readable program code to be executed by one or more processors when retrieved from a non-transitory computer-readable medium, the program code including instructions to: create a plurality of scores for each token, in a sequence of tokens from an email signature block, based on a corresponding independent probability distribution that has been previously trained for a plurality of entity types, wherein each token comprises one of a word, a punctuation symbol, and an end-of-line character, an entity being a part of one of a person name, a job title, an enterprise name, a telephone number, an email address, and a uniform resource locator, and being associated with at least one of an entity type, an entity sequence, and a set of entities; identify each entity sequence that has a total number of entities that is identical to a total number of tokens in the sequence of tokens; determine, for each of the identified entity sequences, an entity sequence score by combining corresponding scores for each token, in the sequence of tokens, that corresponds to an entity type in an identified entity sequence; identify an entity sequence from the identified entity sequences with a highest entity sequence score; and output the sequence of tokens as an identified set of entities, in the email signature block, based on the entity sequence with the highest score.
9. The computer program product of claim 8, wherein scoring each token in the sequence of tokens from the email signature block based on the plurality of entity types comprises scoring each token based on a k-gram from a token matching at least one of the plurality of entity types, wherein the k-gram from the token comprises a string of consecutive characters in the token, with k as a length of the string of consecutive characters in the token.
10. The computer program product of claim 8, wherein the sequence of tokens from the email signature block comprises a sequence of tokens from a same line of the email signature block.
11. The computer program product of claim 8, wherein identifying each entity sequence in the plurality of entity sequences which comprises the number of entities that matches the number of tokens in the sequence of tokens comprises identifying each entity sequence which corresponds to an initial line of the email signature block.
12. The computer program product of claim 8, wherein identifying each entity sequence in the plurality of entity sequences which comprises the number of entities that matches the number of tokens in the sequence of tokens comprises identifying each entity sequence which excludes entities identified from a previous line of the email signature block.
13. The computer program product of claim 8, wherein combining corresponding scores for each token, in the sequence of tokens, that corresponds to an entity type in an identified entity sequence comprises multiplying scores for each token, in the sequence of tokens, that corresponds to each entity type, for each identified entity sequence.
14. The computer program product of claim 8, wherein the program code comprises further instructions to merge adjacent tokens corresponding to a same entity type to generate a single corresponding entity type.
15. A method for identifying entities in email signature blocks, the method comprising: creating a plurality of scores for each token, in a sequence of tokens from an email signature block, based on a corresponding independent probability distribution that has been previously trained for a plurality of entity types, wherein each token comprises one of a word, a punctuation symbol, and an end-of-line character, an entity being a part of one of a person name, a job title, an enterprise name, a telephone number, an email address, and a uniform resource locator, and being associated with at least one of an entity type, an entity sequence, and a set of entities; identifying each entity sequence that has a total number of entities that is identical to a total number of tokens in the sequence of tokens; determining, for each of the identified entity sequences, an entity sequence score by combining corresponding scores for each token, in the sequence of tokens, that corresponds to an entity type in an identified entity sequence; identifying an entity sequence from the identified entity sequences with a highest entity sequence score; and outputting the sequence of tokens as an identified set of entities, in the email signature block, based on the entity sequence with the highest score.
16. The method of claim 15, wherein scoring each token in the sequence of tokens from the email signature block based on the plurality of entity types comprises scoring each token based on a k-gram from a token matching at least one of the plurality of entity types, wherein the k-gram from the token comprises a string of consecutive characters in the token, with k as a length of the string of consecutive characters in the token.
17. The method of claim 15, wherein the sequence of tokens from the email signature block comprises a sequence of tokens from a same line of the email signature block.
18. The method of claim 15, wherein identifying each entity sequence in the plurality of entity sequences which comprises the number of entities that matches the number of tokens in the sequence of tokens comprises at least one of identifying each entity sequence which corresponds to an initial line of the email signature block and identifying each entity sequence which excludes entities identified from a previous line of the email signature block.
19. The method of claim 15, wherein combining corresponding scores for each token, in the sequence of tokens, that corresponds to an entity type in an identified entity sequence comprises multiplying scores for each token, in the sequence of tokens, that corresponds to each entity type, for each identified entity sequence.
20. The method of claim 15, the method further comprising merging adjacent tokens corresponding to a same entity type to generate a single corresponding entity type.
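The steps recited in the method claims (per-token scoring, enumerating entity sequences whose length equals the token count, multiplying scores, selecting the highest-scoring sequence, and merging adjacent tokens of the same type) can be illustrated with a minimal Python sketch. It is not the patented implementation: the entity type labels, the function names, and the hand-written scores are all hypothetical, and the scores merely stand in for the previously trained probability distributions, which the claims do not specify.

```python
from itertools import product
from math import prod

# Hypothetical subset of entity types, for illustration only.
ENTITY_TYPES = ("NAME", "TITLE", "PHONE")


def kgrams(token, k):
    """Return every string of k consecutive characters in token (cf. claim 16)."""
    return [token[i:i + k] for i in range(len(token) - k + 1)]


def best_entity_sequence(token_scores):
    """Pick the highest-scoring entity sequence for a sequence of tokens.

    token_scores holds one dict per token mapping entity type -> score.
    Every candidate sequence has exactly one entity type per token, so its
    length equals the number of tokens. A sequence's score is the product
    of the per-token scores for the types it assigns (cf. claim 19).
    """
    best_seq, best_score = None, float("-inf")
    for seq in product(ENTITY_TYPES, repeat=len(token_scores)):
        score = prod(scores[t] for scores, t in zip(token_scores, seq))
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq, best_score


def merge_adjacent(tokens, seq):
    """Merge adjacent tokens that share an entity type (cf. claim 20)."""
    merged = []
    for token, entity_type in zip(tokens, seq):
        if merged and merged[-1][0] == entity_type:
            merged[-1] = (entity_type, merged[-1][1] + " " + token)
        else:
            merged.append((entity_type, token))
    return merged


# Made-up scores standing in for trained distributions (assumption).
tokens = ["Jane", "Doe", "Engineer"]
token_scores = [
    {"NAME": 0.8, "TITLE": 0.1, "PHONE": 0.1},
    {"NAME": 0.7, "TITLE": 0.2, "PHONE": 0.1},
    {"NAME": 0.1, "TITLE": 0.8, "PHONE": 0.1},
]

sequence, score = best_entity_sequence(token_scores)
entities = merge_adjacent(tokens, sequence)
print(sequence)
print(entities)
```

With these illustrative scores the selected sequence assigns NAME to the first two tokens and TITLE to the third, and merging yields "Jane Doe" as a single NAME entity. Exhaustive enumeration grows exponentially with the number of tokens; it is used here only because it mirrors the claim language and would still work if some candidate sequences were filtered out, for example by excluding entity types already identified on a previous line.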
October 23, 2018