A method and system for determining text coherence in an essay are disclosed. A method of evaluating the coherence of an essay includes receiving an essay having one or more discourse elements and text segments. The one or more discourse elements are annotated either manually or automatically. A text segment vector is generated for each text segment in a discourse element using sparse random indexing vectors. The method or system then identifies one or more essay dimensions and measures the semantic similarity of each text segment based on the essay dimensions. Finally, a coherence level is assigned to the essay based on the measured semantic similarities.
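The vector step described above can be illustrated with a minimal sketch of random indexing. This is not the patented implementation: the dimensionality `DIM`, the number of non-zero entries `NONZERO`, the context window, the hash-based seeding, and all function names are assumptions chosen for illustration. Each word gets a sparse index vector of +1/-1 entries, each word's context vector accumulates the index vectors of its neighbors in a training corpus, and a text segment vector is the sum of the context vectors of its words. Cosine similarity is used here as one common way to compare two segment vectors.

```python
import hashlib
import math
import random
import re

DIM = 512      # assumed dimensionality (not given in the source)
NONZERO = 8    # assumed number of non-zero entries per index vector


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def index_vector(word):
    """Sparse random index vector for a word, as {position: +1 or -1}."""
    # Seeding from a hash keeps the vector stable across calls and runs.
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    positions = rng.sample(range(DIM), NONZERO)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def train_context_vectors(corpus, window=2):
    """Accumulate, for every word, the index vectors of its neighbors."""
    context = {}
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            vec = context.setdefault(word, [0.0] * DIM)
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in index_vector(tokens[j]).items():
                        vec[pos] += sign
    return context


def segment_vector(text, context):
    """Text segment vector: sum of the context vectors of its known words."""
    vec = [0.0] * DIM
    for word in tokenize(text):
        for k, value in enumerate(context.get(word, ())):
            vec[k] += value
    return vec


def cosine(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In use, `train_context_vectors` would be run once over a background corpus, after which `cosine(segment_vector(a, ctx), segment_vector(b, ctx))` gives a similarity score for two text segments `a` and `b`.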
Legal claims defining the scope of protection, as filed with the USPTO.
1. A computer-implemented method of evaluating the coherence of an essay, the method comprising:
receiving an essay with a computer, wherein the essay comprises one or more discourse elements and a plurality of text segments;
annotating the one or more discourse elements of the essay;
generating a text segment vector for each text segment corresponding to a discourse element using a vector-based method of random indexing with the computer;
identifying a plurality of essay dimensions with the computer, each representing relatedness of a text segment to an aspect of the essay, each dimension having an associated dimension score that provides a measure of said relatedness, the essay dimensions including a first dimension measuring relatedness between a first text segment corresponding to a first discourse element and a second text segment corresponding to a second discourse element, and a second dimension measuring relatedness between text segments within a discourse element;
measuring semantic similarity with respect to at least one text segment based on the essay dimensions using a support vector machine to calculate a first dimension score associated with the first dimension and a second dimension score associated with the second dimension; and
generating one or more values representing a coherence level to the essay based on the first dimension score and the second dimension score with respect to the at least one text segment.
2. The method of claim 1 wherein the data provided to the support vector machine comprises one or more of the following:
a maximum prompt similarity score for a text segment with a sentence in a prompt;
a task sentence similarity score for the text segment with a required task sentence, wherein the required task sentence is a portion of the prompt including an explicit directive to write about a specific topic;
a maximum thesis similarity score for the text segment with a sentence in a thesis;
a maximum similarity score for the text segment with a sentence in a preceding discourse element;
a predetermined text similarity score for the text segment with each of one or more predetermined text segments;
a number of sentences in a discourse element corresponding to the text segment;
a number of sentences in a discourse element corresponding to the text segment having a prompt similarity score greater than a first specified threshold;
a number of sentences in a discourse element corresponding to the text segment having a task sentence similarity score greater than a second specified threshold;
a number of sentences in a discourse element corresponding to the text segment having a thesis similarity score greater than a third specified threshold;
a length, in words, of the text segment;
a Boolean flag indicating whether the text segment contains an anaphoric element;
a discourse element corresponding to the text segment;
a thesis sentence; and
a sentence numbering position, wherein the sentence numbering position corresponds to a number of sentences that the text segment is from the beginning of the discourse element corresponding to the text segment.
3. The method of claim 2 wherein the maximum prompt similarity score is computed by computing a prompt similarity score for each sentence of the prompt and the text segment and selecting a prompt similarity score that is greater than all other prompt similarity scores.
4. The method of claim 2 wherein the maximum thesis similarity score is computed by computing a thesis similarity score for each sentence of the thesis and the text segment and selecting a thesis similarity score that is greater than all other thesis similarity scores.
5. The method of claim 1 wherein the discourse elements comprise one or more of: background material; and a supporting idea.
6. The method of claim 1 wherein a text segment is assigned a rank based on its relatedness to the thesis.
7. The method of claim 1 wherein the discourse elements comprise a main idea.
8. The method of claim 7 wherein a text segment is assigned a rank based on its relatedness to the main idea.
9. The method of claim 1 wherein annotating comprises annotating each non-thesis discourse element by a human evaluator.
10. The method of claim 1 wherein annotating comprises annotating each non-thesis discourse element by an automatic evaluator.
11. The method of claim 1 wherein the plurality of essay dimensions further comprises a third dimension measuring relatedness of a text segment to a prompt, wherein the essay is written in response to the prompt; wherein the one or more values representing a coherence level to the essay is generated based on a third dimension score associated with the third dimension.
12. The method of claim 11 wherein the plurality of essay dimensions further comprises a fourth dimension measuring the number of errors in one or more of grammar, usage and mechanics for a text segment; wherein the one or more values representing a coherence level to the essay is generated based on a fourth dimension score associated with the fourth dimension.
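The maximum-similarity and threshold-count features recited in claims 2 to 4 can be sketched as follows. This is an illustrative sketch only, not the claimed implementation. The function names are hypothetical, and `token_overlap` (a Jaccard word-overlap score) is a simple stand-in for the random-indexing similarity so that the example is self-contained. Treating a sentence's prompt similarity as its maximum similarity over the prompt's sentences when counting is also an assumption.

```python
import re


def token_overlap(a, b):
    """Stand-in similarity (assumption): Jaccard overlap of word sets."""
    ta = set(re.findall(r"[a-z']+", a.lower()))
    tb = set(re.findall(r"[a-z']+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def max_similarity(segment, sentences, sim=token_overlap):
    """Score the segment against every sentence and keep the largest score.

    With the prompt's sentences this mirrors the maximum prompt similarity
    of claim 3; with the thesis's sentences, the maximum thesis similarity
    of claim 4. Returns 0.0 when there are no sentences to compare against.
    """
    return max((sim(segment, s) for s in sentences), default=0.0)


def count_above(element_sentences, reference_sentences, threshold,
                sim=token_overlap):
    """Count sentences of a discourse element scoring above a threshold."""
    return sum(
        1
        for sentence in element_sentences
        if max_similarity(sentence, reference_sentences, sim) > threshold
    )


prompt = ["Describe a memorable trip.", "Explain why travel matters."]
segment = "Travel matters because it broadens the mind."
features = {
    "max_prompt_similarity": max_similarity(segment, prompt),
    "sentences_above_prompt_threshold": count_above(
        [segment, "Cats sleep often."], prompt, threshold=0.1
    ),
    "length_in_words": len(segment.split()),
}
```

A dictionary like `features` is one plausible way to assemble the per-segment inputs that claim 2 describes as data provided to the support vector machine; the threshold value 0.1 is arbitrary.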
October 26, 2004
May 18, 2010