US-12572828-B2

Method for industry text increment and electronic device

Published: March 10, 2026

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

A method for an industry text increment, as well as an electronic device and a computer readable storage medium for the same, are provided. The method may include: acquiring an original industry text in a target industry field, an order of magnitude of a number of the original industry text being smaller than a preset first order of magnitude; and performing a sample incremental processing on the original industry text by using a distant supervision method, to obtain increased industry texts, an order of magnitude of a number of the increased industry texts being greater than a preset second order of magnitude, wherein the preset second order of magnitude is not smaller than the preset first order of magnitude.

Patent Claims

. A method for an industry text increment, comprising:

. The method according to, wherein the performing a sample incremental processing on the original industry text by using a distant supervision method comprises:

. The method according to, wherein the extracting a subject-predicate-object triple set from an actual industry text by using the trained language model comprises:

. The method according to, further comprising:

. An electronic device, comprising:

. The electronic device according to, wherein the performing a sample incremental processing on the original industry text by using a distant supervision method comprises:

. The electronic device according to, wherein the extracting a subject-predicate-object triple set from an actual industry text by using the trained language model comprises:

. The electronic device according to, further comprising:

. A non-transitory computer readable storage medium, storing computer instructions, wherein the computer instructions, when executed by a computer, cause the computer to perform operations comprising:

. The storage medium according to, wherein the performing a sample incremental processing on the original industry text by using a distant supervision method comprises:

. The storage medium according to, wherein the extracting a subject-predicate-object triple set from an actual industry text by using the trained language model comprises:

Detailed Description

This application claims the priority of Chinese Patent Application No. 202110189733.4, titled “METHOD FOR INDUSTRY TEXT INCREMENT, RELATED APPARATUS, AND COMPUTER PROGRAM PRODUCT”, filed on Feb. 19, 2021, the content of which is incorporated herein by reference in its entirety.

The present disclosure relates to the field of data processing technology, particularly to artificial intelligence technology such as deep learning, natural language processing, knowledge graph construction and smart question answering, and specifically to a method for an industry text increment, as well as an electronic device and a computer readable storage medium for the same.

Information extraction technologies may be used to assist and fulfil the needs of smart question answering, smart customer services and the like that rely on information processing and information search. Benefiting from the development of artificial intelligence and deep learning, natural language processing technologies such as information extraction have developed rapidly in recent years. Unlike traditional machine learning models, deep learning models do not need to rely on manually defined advanced features. High accuracy and high recall in information extraction tasks can be achieved using only basic features, by designing suitable deep learning model structures and training on large-scale labeled data.

Embodiments of the present disclosure are directed to a method for an industry text increment, as well as an electronic device and a computer readable storage medium for the same.

In a first aspect, an embodiment of the present disclosure provides a method for an industry text increment, including: acquiring an original industry text in a target industry field, an order of magnitude of a number of the original industry text being smaller than a preset first order of magnitude, wherein an industry text refers to a text content used to describe a specific object in a corresponding industry field; and performing a sample incremental processing on the original industry text by using a distant supervision method, to obtain increased industry texts, an order of magnitude of a number of the increased industry texts being greater than a preset second order of magnitude, where the preset second order of magnitude is not smaller than the preset first order of magnitude.

In a second aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a storage device in communication with the at least one processor, where the storage device stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, to enable the at least one processor to perform the method for an industry text increment as described in any of the implementations of the first aspect.

In a third aspect, an embodiment of the present disclosure provides a non-transitory computer readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method for an industry text increment as described in any of the implementations of the first aspect.

According to the method for an industry text increment, the electronic device and the computer readable storage medium that are provided in the embodiments of the present disclosure, the original industry text is first acquired in the target industry field, the order of magnitude of the number of the original industry text being smaller than the preset first order of magnitude, where the industry text refers to the text content used to describe the specific object in the corresponding industry field. Then the sample incremental processing is performed on the original industry text by using methods including a distant supervision method, to obtain the increased industry texts, the order of magnitude of the number of the increased industry texts being greater than the preset second order of magnitude, where the preset second order of magnitude is not smaller than the preset first order of magnitude.

It should be understood that the content described in this part is not intended to identify key or important features of the embodiments of the present disclosure, and is not used to limit the scope of the present disclosure. Other features of the present disclosure will be easily understood through the following description.

Example embodiments of the present disclosure are described below in combination with the accompanying drawings, and various details of the embodiments of the present disclosure are included in the description to facilitate understanding, and should be considered as examples only. Accordingly, it should be recognized by one of ordinary skill in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Also, for clarity and conciseness, descriptions for well-known functions and structures are omitted in the following description. It should be noted that the embodiments in the present disclosure and the features in the embodiments may be combined with each other on a non-conflict basis.

The accompanying drawings illustrate an example system architecture to which an embodiment of a method and apparatus for an industry text increment, an electronic device and a computer readable storage medium according to the present disclosure may be applied.

As shown in the drawings, the system architecture may include terminal devices, a network and a server. The network serves as a medium providing a communication link between the terminal devices and the server. The network may include various types of connections, for example, wired or wireless communication links, or optical fiber cables.

A user may use the terminal devices to interact with the server via the network, to receive or send a message, etc. Various applications (e.g., a sample incremental application, a text processing application, and an instant communication application) for implementing an information communication between the terminal devices and the server may be installed on the terminal devices and the server.

The terminal devices and the server may be hardware or software. When being hardware, the terminal devices may be various electronic devices having a display screen, including, but not limited to, a smart phone, a tablet computer, a laptop portable computer, a desktop computer, and the like. When being software, the terminal devices may be installed in the above listed electronic devices, and may be implemented as a plurality of pieces of software or a plurality of software modules, or as a single piece of software or a single software module, which is not specifically limited herein. When being hardware, the server may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When being software, the server may be implemented as a plurality of pieces of software or a plurality of software modules, or as a single piece of software or a single software module, which is not specifically limited herein.

The server may provide various services through various built-in applications. Taking a sample incremental application that may provide a sample incremental service for a text of a low-resource industry as an example, the server may achieve the following effects when running the sample incremental application. First, an original industry text having a stock lower than a preset first order of magnitude in a target industry field, shared by the terminal devices, is received through the network. The industry text refers to text content used to describe a specific object in a corresponding industry field. Then, a sample incremental processing is performed on the original industry text by using methods including a distant supervision method, to obtain increased industry texts, an order of magnitude of a number of the increased industry texts being greater than a preset second order of magnitude, where the preset second order of magnitude is not smaller than the preset first order of magnitude.

Further, after completing a sample incremental task through the above sample incremental application, the server may further train, through the text processing application, a model for precisely extracting a subject-predicate-object triple set from a to-be-processed text, based on the increased industry texts.

It should be noted that, in addition to being acquired from the terminal devices through the network, the original industry text having the stock lower than the preset first order of magnitude in the target industry field may be stored locally in the server in various ways. Accordingly, when detecting that the data is already stored locally (e.g., a to-be-processed sample incremental task stored before the processing starts), the server may choose to directly acquire the data locally. In this situation, the terminal devices and the network may not be provided in the example system architecture.

Since the sample incremental processing requires many computing resources and a strong computing capability, the method for an industry text increment provided in the subsequent embodiments of the present disclosure is generally performed by the server, which has a strong computing capability and many computing resources. Correspondingly, the apparatus for an industry text increment is also generally provided in the server. At the same time, however, it should also be noted that, when having a computing capability and computing resources that meet requirements, the terminal devices may also perform, through the sample incremental application installed thereon, the above operations that would otherwise be performed by the server, and then output the same result as that outputted by the server. In particular, in the situation where many kinds of terminal devices having different computing capabilities are present at the same time, when the sample incremental application determines that a terminal device on which it is installed has a strong computing capability and many remaining computing resources, it may instruct the terminal device to perform the above operations, thereby appropriately reducing the computing pressure of the server. Correspondingly, the apparatus for an industry text increment may also be provided in the terminal devices. In this situation, the server and the network may not be provided in the example system architecture.

It should be appreciated that the numbers of the terminal devices, networks, and servers shown are merely illustrative. Any number of terminal devices, networks, and servers may be provided based on actual requirements.

A flowchart of a method for an industry text increment provided in an embodiment of the present disclosure is described next. The flow includes the following steps.

The first step includes acquiring an original industry text having a stock lower than a preset first order of magnitude in a target industry field.

The purpose of this step is to acquire, by an executing body (e.g., the server of the example system architecture) of the method for an industry text increment, the original industry text in the target industry field, the order of magnitude of the number of the original industry text being smaller than the preset first order of magnitude.

Here, the industry text refers to a text content used to describe a specific object in a corresponding industry field. The preset first order of magnitude, as a preset threshold value, is used to determine an industry field as a low-resource industry field if an industry text in this industry field currently has an actual order of magnitude smaller than the preset first order of magnitude. The low-resource industry field refers to an industry field in which, based on a conventional approach, a model having a precision meeting a desired requirement cannot be trained through a current order of magnitude of the number of original industry texts. The trained model may be used to perform an actual task, such as an entity recognition, an extraction of a subject-predicate-object triple set, and a semantic analysis, on an actual industry text.

The second step includes performing a sample incremental processing on the original industry text by using methods including a distant supervision method, to obtain increased industry texts, an order of magnitude of a number of the increased industry texts being greater than a preset second order of magnitude.

On the basis of the first step, the purpose of this step is to use, by the executing body, the distant supervision method as a sample incremental approach to perform the sample incremental processing on the original industry text, whose order of magnitude does not satisfy a first requirement, and finally obtain the increased industry texts, whose order of magnitude satisfies a second requirement.

Specifically, in the present disclosure, if an order of magnitude of the number of an original text is smaller than the preset first order of magnitude, it indicates that the first requirement is not satisfied. If an order of magnitude of the number of texts after the incremental processing is greater than the preset second order of magnitude, it indicates that the second requirement is satisfied. That is, the magnitude relationship between the preset first order of magnitude and the preset second order of magnitude is that the preset second order of magnitude is not smaller than the preset first order of magnitude. That is, a minimum preset second order of magnitude should be equal to the preset first order of magnitude. In this case, the preset first order of magnitude, as the threshold value, may be used to determine both whether the number of texts satisfies the first requirement and whether the number of texts satisfies the second requirement.
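The two requirements above can be expressed as simple threshold checks. The following is a minimal sketch, assuming that the order of magnitude of a text count is taken as the base-10 exponent of the count; the function names and the treatment of an empty text set are illustrative assumptions and are not specified in the disclosure.

```python
def order_of_magnitude(count: int) -> int:
    """Return the base-10 order of magnitude of a non-negative text count.

    Counting digits keeps the result exact for integers (999 -> 2, 1000 -> 3).
    Treating an empty set as order -1 is an assumption made for this sketch.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return -1
    return len(str(count)) - 1


def check_thresholds(first_order: int, second_order: int) -> None:
    """The preset second order must not be smaller than the preset first order."""
    if second_order < first_order:
        raise ValueError("second order of magnitude must be >= first")


def fails_first_requirement(count: int, first_order: int) -> bool:
    """True when the original texts are too few (a low-resource field)."""
    return order_of_magnitude(count) < first_order


def meets_second_requirement(count: int, second_order: int) -> bool:
    """True when the increased texts exceed the second threshold."""
    return order_of_magnitude(count) > second_order
```

With both thresholds set to 3, a field holding 800 original texts fails the first requirement, and an increased set of 50,000 texts meets the second requirement, while 5,000 texts would not.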

In order to change natural language information on a network into a structured form convenient for an analysis and processing, researchers propose different relationship extraction methods. A relationship extraction refers to detecting a clear or unclear relationship between entities from a text content and classifying the entities. From the perspective of sample acquisition of a machine learning, there are mainly three kinds of methods used for extracting a relationship fact from a text: fully supervised learning, semi-supervised learning and unsupervised learning. Here, the fully supervised learning refers to a learning in which initial sample data is manually labeled, the labeled data is then used to train a classifier, and finally, the trained classifier is used to recognize whether there are two certain entities having a certain relationship in one new sentence. A fully supervised learning method mainly includes a feature-based method and a kernel method. The semi-supervised learning refers to a learning in which a very small data seeding instance or pattern is used for guided learning to extract some new patterns from a large number of texts, these patterns are then used to extract a new instance, the new instance is used to extract a newer pattern, and these steps are repeated until data is finally obtained. The unsupervised learning refers to a learning in which an initial data set is not required, a character string between two entities is extracted from a large number of texts, and the character string is then aggregated and simplified to obtain a relationship character string.

With the advent of the era of big data, the relationship extraction task may be applied in broader and more complex technical fields. Facing massive and heterogeneous data, the researchers propose the distant supervision method. According to this method, the relationship extraction is completed by heuristically aligning a to-be-extracted relationship and a natural sentence. On the basis of this principle, the present disclosure further utilizes the characteristics of the distant supervision method to apply the distant supervision method to the sample increment on low-resource samples. The principle of the sample increment may be described with reference to an example of an extraction of a relationship among a location, country and capital. In a knowledge library, there is an instance (A, B). If there is a sentence “A is the capital of B . . . ” in a text set, the system can automatically match the instance and the sentence by using the distant supervision method, to form a training instance {capital (A, B), A is the capital of B, . . . }, so that a new sentence may be formed through the training instance using another instance similar to the instance (A, B).
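The heuristic alignment described above can be sketched as follows. This is only an illustration, assuming that alignment means both entities of a knowledge-library instance occur in the same sentence; the `Fact` type, the `align` function and the concrete entity names are hypothetical and stand in for the placeholders A and B.

```python
from typing import List, NamedTuple, Tuple


class Fact(NamedTuple):
    """One knowledge-library instance, e.g. capital(head, tail)."""
    relation: str
    head: str
    tail: str


def align(facts: List[Fact], sentences: List[str]) -> List[Tuple[Fact, str]]:
    """Pair each sentence with every fact whose two entities both occur in it.

    Plain substring matching is an assumption of this sketch; a real system
    would normally use entity recognition and linking.
    """
    instances = []
    for sentence in sentences:
        for fact in facts:
            if fact.head in sentence and fact.tail in sentence:
                instances.append((fact, sentence))
    return instances


# Hypothetical data standing in for the instance (A, B) in the text.
knowledge_library = [Fact("capital", "Paris", "France")]
text_set = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "Paris hosted the games.",
]
training_instances = align(knowledge_library, text_set)
```

Only the first sentence mentions both entities, so it is the only one labeled with the `capital` relation.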

Specifically, in addition to the above emphasized distant supervision method, the sample incremental method may include another method through which similar effects are achieved by using another technical principle, such as, a synonym substitution method, a back translation method, and a random generation method. Whether to add other methods on the basis of the adoption of the distant supervision method may be selected according to the requirement in an actual application scenario, which is not to be specifically limited herein.

For the target industry field in which the order of magnitude of the number of the original industry text is lower than the preset first order of magnitude, according to the method for an industry text increment provided in the embodiment of the present disclosure, the sample increment is implemented by using the distant supervision method. Through the distant supervision method, a new text meeting the requirement can be found from another industry field or a public corpus according to an association between nouns in the original industry text, and the new text is used as an added text, thus the sample's magnitude is expanded. Accordingly, with the help of the sample incremental technology, a model used to precisely extract a subject-predicate-object triple set and having a precision satisfying a requirement can also be trained through the text of the low-resource target industry.

A flowchart of the method for an industry text increment provided in another embodiment of the present disclosure is described next. The flow includes the following steps.

The first step includes acquiring an original industry text having a stock lower than a preset first order of magnitude in a target industry field.

This step is consistent with the acquiring step of the previous embodiment. For its content, reference is made to the corresponding part in the previous embodiment, which will not be repeatedly described herein.

The second step includes performing a first sample incremental processing on the original industry text by using a distant supervision method, to obtain a first added industry text.

A method for generating an added industry text in a way including, but not limited to, the distant supervision, may be as follows:

First, an initial subject-predicate-object triple set is extracted from the original industry text of the target industry field. Then, in another industry text of a non-target industry field and a public corpus, a text having a subject and a predicate of the initial subject-predicate-object triple set is determined as a target text. Finally, the target text is used as an added industry text obtained from the original industry text by distant supervision.
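The selection of target texts can be sketched as a filter over external texts. The sketch below assumes that a text "has" a subject when the subject string occurs in it, and "has" a predicate when one of a set of surface cue phrases occurs in it; the `predicate_cues` mapping, the function name and the sample sentences are hypothetical and are not part of the disclosure.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def select_target_texts(
    triples: Iterable[Triple],
    external_texts: Iterable[str],
    predicate_cues: Dict[str, Sequence[str]],
) -> List[str]:
    """Keep external texts that mention a subject and a cue for its predicate."""
    triples = list(triples)
    selected = []
    for text in external_texts:
        for subject, predicate, _obj in triples:
            cues = predicate_cues.get(predicate, ())
            if subject in text and any(cue in text for cue in cues):
                selected.append(text)
                break  # one matching triple is enough to keep the text
    return selected


# Hypothetical triples extracted from the original industry text.
initial_triples = [("M1", "Producing Country", "A1")]
cues = {"Producing Country": ("built in", "ship of")}
other_corpus = [
    "M1 was built in a northern shipyard.",
    "M1 completed sea trials last year.",
    "The M7 is a cargo ship of A3.",
]
first_added = select_target_texts(initial_triples, other_corpus, cues)
```

Here only the first sentence is kept: the second lacks a predicate cue, and the third lacks the subject.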

The third step includes performing a second sample incremental processing on the original industry text and the first added industry text respectively by adopting a subject-object replacement method and/or a back translation method, to obtain a second added industry text.

Here, the subject-object replacement method refers to replacing the original subject and the original object with a new subject and a new object while maintaining the subject-object relationship provided by the predicate of the subject-predicate-object triple set. For a deeper understanding, reference is made to the following example:

A subject (abbreviated as S) dictionary and an object (abbreviated as O) dictionary which belong to the same category are obtained through the statistics for labeled training data. Taking a text of the ship industry as an example, the dictionaries may be obtained as follows: ships: M1, M2, M3 . . . ; and producing countries: A1, A2, A3 . . . . Accordingly, a plurality of new samples may be generated by randomly replacing the subject (S) and the object (O). An example is given below.

An original sample: M1 is a large commercial cargo ship of A1, and its full load displacement far exceeds those of other ships (S: M1; P: Producing Country; O: A1).

A newly generated sample: M2 is a large commercial cargo ship of A2, and its full load displacement far exceeds those of other ships (S: M2; P: Producing Country; O: A2).
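The subject-object replacement in the example above can be sketched as a single-pass string substitution that keeps the predicate unchanged. The function name, the random generator argument and the dictionary contents are illustrative assumptions; the sketch also assumes that the subject and object appear literally in the sample text.

```python
import random
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def replace_subject_object(
    text: str,
    triple: Triple,
    subject_dict: List[str],
    object_dict: List[str],
    rng: random.Random,
) -> Tuple[str, Triple]:
    """Swap in a new subject and object drawn from same-category dictionaries."""
    subject, predicate, obj = triple
    new_subject = rng.choice([s for s in subject_dict if s != subject])
    new_object = rng.choice([o for o in object_dict if o != obj])
    mapping = {subject: new_subject, obj: new_object}
    # A single regex pass avoids re-replacing text that was just inserted.
    pattern = re.compile("|".join(re.escape(key) for key in mapping))
    new_text = pattern.sub(lambda match: mapping[match.group(0)], text)
    return new_text, (new_subject, predicate, new_object)


original = (
    "M1 is a large commercial cargo ship of A1, and its full load "
    "displacement far exceeds those of other ships"
)
new_text, new_triple = replace_subject_object(
    original,
    ("M1", "Producing Country", "A1"),
    subject_dict=["M1", "M2", "M3"],
    object_dict=["A1", "A2", "A3"],
    rng=random.Random(0),
)
```

Passing a seeded `random.Random` keeps the generated samples reproducible, which can help when the generated texts are later checked for errors.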

The back translation method refers to a method in which a sentence is translated into another language and then translated back, for example, from Chinese to English and then back to Chinese. Accordingly, a new sample having slight differences in expression may be obtained. That is, mainly through a slight distortion in the process of translating a sentence between different languages, a new sentence that has the same meaning as the original sentence but differs from it in expression is generated and then used as a sample.
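The round trip can be sketched with a pluggable translation function. The disclosure does not name a translation service, so the `Translator` signature, the `back_translate` function and the toy lookup table below are all hypothetical; in practice the callable would wrap whatever machine translation system is available.

```python
from typing import Callable

# Hypothetical signature: translate(text, source_language, target_language)
Translator = Callable[[str, str, str], str]


def back_translate(
    text: str, translate: Translator, source: str = "zh", pivot: str = "en"
) -> str:
    """Translate to a pivot language and back to obtain a paraphrase."""
    pivot_text = translate(text, source, pivot)
    return translate(pivot_text, pivot, source)


# Toy stand-in for a translation system, used only to make the sketch runnable.
_TOY_TABLE = {
    ("en", "xx", "The ship is big."): "SHIP BIG",
    ("xx", "en", "SHIP BIG"): "The ship is large.",
}


def toy_translate(text: str, source: str, target: str) -> str:
    return _TOY_TABLE[(source, target, text)]


paraphrase = back_translate("The ship is big.", toy_translate, "en", "xx")
```

The returned sentence may be identical to the input when the round trip introduces no change, so duplicates still need to be removed afterwards.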

The fourth step includes removing a text having a content error, a text having a logic error and a duplicate text from the first added industry text and the second added industry text, to obtain the increased industry texts, the order of magnitude of the number of the increased industry texts being greater than the preset second order of magnitude.

On the basis of the preceding steps, regardless of whether the subject-object replacement method, the back translation method or the distant supervision method is used, the new samples generated through the incremental processing may contain various errors, especially after the second incremental operation is performed on the basis of the distant supervision. Therefore, the purpose of this step is to remove, by the executing body, a text having a content error, a text having a logic error and/or a duplicate text from the first added industry text and the second added industry text, so as to obtain increased industry texts that are as usable as possible.

Further, if the order of magnitude of the number of the increased industry texts after the text having the content error, the text having the logic error and the duplicate text are removed, is not greater than the preset second order of magnitude, an incremental processing may be performed on the increased industry texts again according to the above incremental method, until the order of magnitude of the number of the texts is greater than the preset second order of magnitude. Clearly, if the incremental processing is subsequently performed again on the basis of the increased industry texts, a stricter check should be performed, to ensure the reliability of a subsequently trained model based on the principle of maintaining the availability of the sample.
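The cleaning step and the repeated increment can be sketched as a loop. This is a rough illustration only: the `is_valid` callable stands in for the content and logic checks, which the disclosure does not detail, and `augment`, `max_rounds` and the whitespace and case normalization used for duplicate detection are assumptions of the sketch.

```python
from typing import Callable, List


def _order(count: int) -> int:
    """Base-10 order of magnitude of a count (empty set treated as -1)."""
    return len(str(count)) - 1 if count > 0 else -1


def clean(candidates: List[str], is_valid: Callable[[str], bool]) -> List[str]:
    """Drop invalid texts and duplicates, keeping first occurrences in order."""
    seen = set()
    kept = []
    for text in candidates:
        key = " ".join(text.split()).lower()  # normalized form for dedup
        if key in seen or not is_valid(text):
            continue
        seen.add(key)
        kept.append(text)
    return kept


def increment_until(
    texts: List[str],
    augment: Callable[[List[str]], List[str]],
    is_valid: Callable[[str], bool],
    second_order: int,
    max_rounds: int = 5,
) -> List[str]:
    """Repeat augmentation and cleaning until the second threshold is exceeded."""
    pool = clean(texts, is_valid)
    for _ in range(max_rounds):
        if _order(len(pool)) > second_order:
            break
        pool = clean(pool + augment(pool), is_valid)
    return pool
```

The `max_rounds` cap is a safeguard added for the sketch so that the loop ends even when augmentation stops producing new texts. A stricter `is_valid` check could be supplied for later rounds, in line with the stricter check recommended for repeated increments.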

The fifth step includes training a language model based on the increased industry texts, and obtaining a trained language model.

Based on the preceding step, the purpose of this step is to train an initial language model by using the increased industry texts as training samples, to finally obtain the trained language model.

Specifically, according to specific requirements, the initial language model may be selected from language model frameworks having different characteristics, and the activation function and the loss function may also be adjusted according to actual corpus characteristics and requirement characteristics, which is not specifically limited herein.

The sixth step includes extracting a subject-predicate-object triple set from an actual industry text by using the trained language model.

On the basis of the preceding step, the purpose of this step is to extract, by the executing body, the subject-predicate-object triple set from the actual industry text by using the trained language model. It should be understood that the subject-predicate-object triple set (abbreviated as "SPO triple set") is generally extracted sentence by sentence. That is, one SPO triple set should be extractable from one sentence, and the SPO triple set extracted from the sentence generally represents the core of the content expressed by the sentence. Accordingly, the key content can be expressed concisely, and the influence of other content can be eliminated. Meanwhile, this also facilitates performing various kinds of structured processing on the content of the industry text directly through the corresponding relationship in the SPO triple set.
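As one illustration of such structured processing, extracted SPO triples can be indexed by subject and predicate so that the objects of a relationship are looked up directly. The sketch assumes the triples have already been produced by the trained model; the `index_triples` function and the nested-dictionary layout are hypothetical choices and are not described in the disclosure.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def index_triples(triples: Iterable[Triple]) -> Dict[str, Dict[str, List[str]]]:
    """Group objects by subject and predicate for direct lookup."""
    index: Dict[str, Dict[str, List[str]]] = {}
    for subject, predicate, obj in triples:
        index.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return index


# Hypothetical model output for two sentences of the ship industry.
extracted = [
    ("M1", "Producing Country", "A1"),
    ("M2", "Producing Country", "A2"),
]
ship_index = index_triples(extracted)
```

A question such as "which country produces M1" then reduces to reading `ship_index["M1"]["Producing Country"]`.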
