
US-20250362980-A1

Systems and Methods for Censoring Text Inline

Published: November 27, 2025

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

Systems and methods for censoring text-based data are provided. In some embodiments, a censoring system may include at least one processor and at least one non-transitory memory storing application programming interface instructions. The censoring system may be configured to perform operations comprising storing a target pattern type and a computer-based model for identifying a target data pattern corresponding to the target pattern type within text-based data. The censoring system may also be configured to receive text-based data by a server, and to retrieve the stored target pattern type to be censored in the text-based data. The censoring system may be configured to identify, within the received text-based data, a target data pattern corresponding to the retrieved target pattern type. The censoring system may be configured to censor target characters within the identified target data pattern, and transmit the censored text-based data to a receiving party.

Patent Claims


. A system for censoring text-based data comprising:

. The system of, wherein generating the censored text-based data comprises embedding, in the censored text-based data, an indication that the text-based data has been censored.

. The system of, wherein generating the censored text-based data comprises replacing the target characters of the target data pattern in the text-based data with the alternative user information based on a determination that the alternative user information corresponds to (i) a permission level associated with at least one security characteristic of the receiving party and (ii) the target characters of the target data pattern.

. The system of, wherein generating the censored text-based data comprises:

. A method comprising:

. The method of, wherein generating the censored text-based data comprises embedding, in the censored text-based data, an indication that the text-based data has been censored.

. The method of, wherein the alternative user information comprises non-descriptive text data.

. The method of, wherein the alternative user information comprises synthetic characters.

. The method of, the operations further comprising:

. The method of, wherein generating the censored text-based data comprises:

. One or more non-transitory computer-readable media comprising instructions that, when executed by one or more processors, causes operations comprising:

. The one or more non-transitory computer-readable media of, wherein generating the censored text-based data comprises embedding, in the censored text-based data, an indication that the text-based data has been censored.

. The one or more non-transitory computer-readable media of, wherein the alternative user information comprises synthetic characters.

. The one or more non-transitory computer-readable media of, wherein generating the censored text-based data comprises replacing the target characters of the target data pattern in the text-based data with the alternative user information based on an indication of the alternative user information as corresponding to (i) a permission level associated with at least one security characteristic of the receiving party and (ii) the target characters of the target data pattern.

. The one or more non-transitory computer-readable media of, the operations further comprising:

. The one or more non-transitory computer-readable media of, wherein generating the censored text-based data comprises:

Detailed Description


This application is a continuation of U.S. patent application Ser. No. 17/836,614, filed Jun. 9, 2022, which is a continuation of U.S. patent application Ser. No. 16/181,568, filed Nov. 6, 2018, which claims priority from U.S. Provisional Application No. 62/694,968, filed Jul. 6, 2018. The content of the foregoing applications is incorporated herein by reference in its entirety.

The disclosed embodiments generally relate to censoring text. More specifically, the disclosed embodiments relate to censoring text in electronic text-based communications using artificial intelligence.

Computers play a large role in document preparation, analysis, and transformation of numerous forms of information. In many instances during communication of text data, there is a need to protect from disclosure text that contains sensitive information, such as security-sensitive words, characters, or images. For example, private data such as an individual's social security number, credit history, medical history, business trade secrets, and financial data may be restricted from transmission via a network.

Documents containing text may be evaluated by a computer system for sensitive data prior to communication via a network. The computer system may identify the presence of sensitive data and prevent transmission of the document via a network. This approach may create problems for the users attempting to communicate documents containing text as the inability to deliver the documents may limit the usefulness of the system . . . .

Accordingly, there is a need for a dynamic, fine-grained control on how the documents containing text are censored and communicated between the users.

Disclosed embodiments provide systems and methods for improved censoring of the text-based data. Disclosed embodiments improve upon disadvantages of conventional censoring by identifying sensitive text characters within the text-based data and censoring only the identified text characters.

Consistent with a disclosed embodiment, a censoring system for censoring text-based data is provided. The system may comprise at least one processor and at least one non-transitory memory storing application programming interface instructions that, when executed by the at least one processor cause the censoring system to perform operations that may include storing a target pattern type. The operations may further include storing a computer-based model for identifying a target data pattern corresponding to a target pattern type within text based data, for identifying target characters within the target data pattern, and for censoring the target characters within the identified target data pattern in the text-based data. The operations may further include receiving text-based data by a server. The operations may further include retrieving the stored target pattern type to be censored in the text-based data. The operations may further include identifying within the received text-based data, a target data pattern corresponding to the retrieved target pattern type using the computer-based model. The operations may further include censoring target characters within the identified target data pattern in the received text-based data with substitute characters, resulting in censored text-based data; and transmitting the censored text-based data to a receiving party.

Consistent with another disclosed embodiment, a method for censoring text-based data is provided. The method may comprise receiving a target pattern type. The method may further comprise storing a computer-based model for identifying a target data pattern corresponding to a target pattern type within text based data, for identifying target characters within the target data pattern, and for censoring the target characters within the identified target data pattern in the text-based data. The method may further comprise receiving text-based data by a server. The method may further comprise retrieving the stored target pattern type to be censored in the text-based data. The method may further comprise identifying within the received text-based data, a target data pattern corresponding to the retrieved target pattern type using the computer-based model.

Consistent with other disclosed embodiments, non-transitory computer-readable storage media may store program instructions, which are executed by at least one processor device and perform any of the methods described herein.

The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the claims.

Reference will now be made in detail to exemplary embodiments, discussed with regard to the accompanying drawings. In some instances, the same reference numbers will be used throughout the drawings and the following description to refer to the same or like parts. Unless otherwise defined, technical and/or scientific terms have the meaning commonly understood by one of ordinary skill in the art. The disclosed embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosed embodiments. It is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the disclosed embodiments. Thus the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.

The disclosed embodiments describe an artificial intelligence system for censoring text-based data. In the present disclosure, the terms “first party” and “second party” may refer to a person or an entity (e.g., a company, a group or an organization). In the present disclosure, the first party may send the censored text-based data containing sensitive information to a second party. In the present disclosure, the term “censoring” may refer to a process of identifying and removing sensitive data, where the sensitive data is associated with a first party that contains information that, when released to a third party, (e.g., a person or an entity that is not authorized to obtain the text-based data) adversely affects the first party. The sensitive data may include Personal Identifiable Data (PID) such as social security number, address, phone number, description of a person, description of objects possessed by a person, as well as person's license and registration numbers. Examples of other sensitive data for a person or an entity may include financial data, criminal records, educational records, voting records, marital status, or any other data that when released to a third party may adversely affect the person or the entity associated with the sensitive data.

In the present disclosure, the term “text-based data” may refer to any data that contains text characters including alphanumeric and special characters. For example, the data may include email letters, office documents, pictures with included text, ascii art, as well as binary data rendered as text data. Examples of special characters may include quotes, mathematical operators, and formatting characters such as paragraph characters and tab characters. The described examples of special characters are only illustrative, and other special characters may be used. The text-based data may be based on text characters from a variety of languages; for example, the text characters may include Chinese characters, Japanese characters, Cyrillic characters, Greek characters or other text characters. In some embodiments, the text-based data may include data embedded into image data or video data. In some embodiments, the text-based data may be part of the scanned text. For example, the text-based data may be a scanned text image in PDF format.

The artificial intelligence system may include computing resources and software instructions for manipulating text-based data. Computing resources may include one or more computing devices configured to analyze text-based data. The computing devices may include one or more memory units for storing data and software instructions. The data may be stored in a database that may include cloud-based databases (e.g., Amazon Web Services S3 buckets) or on-premises databases. Databases may include, for example, Oracle™ databases, Sybase™ databases, or other relational databases or non-relational databases, such as Hadoop™ sequence files, HBase™, or Cassandra™. Database(s) may include computing components (e.g., database management system, database server, etc.) configured to receive and process requests for data stored in memory devices of the database(s) and to provide data from the database(s). The memory unit may also store software instructions that may perform computing functions and operations when executed by one or more processors, such as one or more operations related to data manipulation and analysis. The disclosed embodiments are not limited to software instructions being separate programs run on isolated computer processors configured to perform dedicated tasks. In some embodiments, software instructions may include many different programs. In some embodiments, one or more computers may include multiple processors operating in parallel. A processor may be a central processing unit (CPU) or a special-purpose computing device, such as graphical processing unit (GPU), a field-programmable gate array (FPGA) or application-specific integrated circuits.

The artificial intelligence system may be configured to receive the text-based data via a secure network by a server. The network may include any combination of electronics communications networks enabling communication between user devices and the components of the artificial intelligence system. For example, the network may include the Internet and/or any type of wide area network, an intranet, a metropolitan area network, a local area network (LAN), a wireless network, a cellular communications network, a Bluetooth network, a radio network, a device bus, or any other type of electronics communications network known to one of skill in the art.

The server may be a computer program or a device that provides functionality for other programs or devices, called “clients”. Servers may provide various functionalities, often called “services”, such as sharing data or resources among multiple clients, or performing computation for a client. A single server can serve multiple clients. The server may be a database server. A database server is a server that houses a database application providing database services to other computer programs or other computers defined as clients. The artificial intelligence system for censoring text-based data may be configured to instruct the server to store the text-based data in a database.

The artificial intelligence system for censoring text-based data may be configured to receive a target pattern type to be censored in the text-based data. The term “target pattern type” may refer to a particular type of sensitive data that requires censorship and may be a string of text identifying the type of the sensitive data. For example, the target pattern type may include a social security number, a name, a mobile telephone, an address, a checking account, a driver's license and/or the like. In various embodiments, the target pattern type may be used as a label to identify the type of sensitive data that an artificial intelligence system needs to censor. As a label, it can be any alphanumerical string. For example, the target pattern type may be “Phone Number”, “Phone Numbers”, “Telephone1”, or any other label that might be associated with the sensitive data pertaining to a phone number.

The artificial intelligence system may be configured to receive a list of various target pattern types that may be associated with various types of sensitive data that can be found in the text-based data. For example, for documents related to financial information, the sensitive data may include checking and savings accounts, information about mutual funds, a person's address, phone number, and salary information, as well as other sensitive data such as credit history. For documents containing a specific type of data, such as financial data, the system may provide a pre-compiled list of target pattern types. For example, the list may include “Social Security Number”, “Checking Account”, “Savings Account”, “Mutual Funds Account”, “Phone”, “Street Address”, “Salary” or other target pattern types.

The target pattern type may identify a collection of target data patterns associated with sensitive information. For example, the target data pattern that corresponds to a social security number may include the social security number and/or a social security number in addition to one or more additional characters and/or words adjacent to the social security number. As an example, a target data pattern (DP) may include DP1: “SSN #123-456-7891” or DP2: “Soc. Sec. No. 123-456-7891” or DP3: “Social Security Number: 123-456-7891”. The described examples are only illustrative, and other target data patterns associated with a social security number may be used. The collection of target data patterns {DP1, DP2, ..., DPN} is identified by the target pattern type. For example, the collection of target data patterns {DP1, DP2, ..., DPN} may be identified by the target pattern type “Social Security Number”.
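As a rough illustration, the relationship between a target pattern type and its collection of target data patterns can be sketched as a lookup table. The regular expressions below are a rule-based stand-in assumed purely for illustration (the disclosure describes a trained computer-based model), and the names `TARGET_DATA_PATTERNS` and `find_target_data_patterns` are hypothetical.

```python
import re

# Digit layout taken from the "123-456-7891" examples in the text.
NUMBER = r"\d{3}-\d{3}-\d{4}"

# Hypothetical mapping: target pattern type (a label) -> regular expressions
# standing in for the collection {DP1, DP2, ..., DPN}. A trained model would
# replace these hand-written rules; the regexes are an assumption.
TARGET_DATA_PATTERNS = {
    "Social Security Number": [
        re.compile(r"SSN #" + NUMBER),                     # DP1
        re.compile(r"Soc\. Sec\. No\. " + NUMBER),         # DP2
        re.compile(r"Social Security Number: " + NUMBER),  # DP3
    ],
}


def find_target_data_patterns(target_pattern_type: str, text: str) -> list[str]:
    """Return every substring of text that matches a pattern for the type."""
    found = []
    for pattern in TARGET_DATA_PATTERNS.get(target_pattern_type, []):
        found.extend(match.group(0) for match in pattern.finditer(text))
    return found
```

An unknown target pattern type simply yields an empty list in this sketch; a real system might instead reject the request.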

In various embodiments, different target data patterns may need to be identified. For example, some target data patterns may be related to the phone numbers located in association with an address of a person and may be identified by a target pattern type “Home Phone Number”. Other target data patterns may include a checking account number located adjacent to the words “checking account” that may be identified by a target pattern type “Bank Account.” The various embodiments discussed above are only illustrative, and other target data patterns and target pattern types may be considered. For example, in the various embodiments, the target data patterns and target pattern types of which a computer-based model may be trained to identify can include any target data pattern and target pattern type that is desired to be identified and/or censored.

The artificial intelligence system may be configured to assemble a computer-based model for identifying a target data pattern corresponding to the received target pattern types. In general, the artificial intelligence system may be configured to assemble a computer-based model for the target pattern type found in the list of target pattern types received by the artificial intelligence system. The computer-based model may include a machine learning model trained to identify sensitive data within text-based data related to a specific target pattern type. For example, the computer-based model may be trained to identify various target data patterns. In addition, the computer-based model may analyze identified target data patterns and detect sensitive information within target data patterns. For example, the target data pattern may be “SSN #123-23-1234”, and the sensitive information within such target data pattern may be “123-23-1234.”

In various embodiments, machine-learning models may include neural networks, recurrent neural networks, generative adversarial networks, decision trees, and models based on ensemble methods, such as random forests. The machine-learning models may have parameters that may be selected for optimizing the performance of the machine-learning model. For example, parameters specific to the particular type of model (e.g., the number of features and number of layers in a generative adversarial network or recurrent neural network) may be optimized to improve the model's performance.

In various embodiments, the computer-based model may identify target characters within a target data pattern. For example, the system may first identify a target data pattern, such as “SSN #123-456-7891”. Within this target data pattern, the system may identify target characters “123-456-7891” that need to be censored. In various embodiments, the identified target characters may be censored by removing or obscuring the character strings or by replacing them with generic text that does not contain sensitive information. For example, the system may replace target characters with characters “Social Security Number1”. In various embodiments of the present disclosure, censoring a target pattern type may imply censoring target characters within target data patterns associated with the target pattern type. Also, in various embodiments, censoring a target data pattern may imply censoring target characters within the target data pattern.
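The distinction between a target data pattern and the target characters inside it can be sketched with a capture group. This is a minimal sketch that assumes a regular expression in place of the trained model; `SSN_PATTERN` and `censor_target_characters` are hypothetical names.

```python
import re

# Assumed rule-based stand-in for the model: group 1 is the surrounding
# label of the target data pattern, group 2 holds the target characters.
SSN_PATTERN = re.compile(r"(SSN #)(\d{3}-\d{3}-\d{4})")


def censor_target_characters(text: str, replacement: str) -> str:
    """Replace only the target characters, keeping the rest of the pattern."""
    # A function is used as the replacement so that any backslashes in
    # `replacement` are inserted literally rather than parsed as escapes.
    return SSN_PATTERN.sub(lambda m: m.group(1) + replacement, text)
```

Calling `censor_target_characters("Her SSN #123-456-7891 is on file.", "Social Security Number1")` replaces the digits with the generic text while leaving the “SSN #” label in place.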

In various embodiments, the artificial intelligence system may be configured to assign an identification token to the target characters corresponding to the identified target data pattern. For example, the target data pattern may be “SSN #123-456-7891”, the corresponding target characters “123-456-7891” and the identification token for the target characters may be “SSN1”. The identification token may be used to quickly locate the target characters within the text-based data and perform operations on the target characters. In an embodiment, target characters may be replaced with a text substitute string, for example, depending on security characteristics. The term “text substitute string” may refer to text characters that may replace the target characters.

The term “security characteristics” may refer to various permission levels related to selecting various text substitute strings. In an example embodiment, a simple set of permission levels (PL) may include a PL1 that allows a receiving party granted PL1 for an identification token, such as the token “SSN1”, to view the target characters 123-456-7891 within the text-based data. In some cases, the receiving party may be granted a PL2 for the identification token, which is different from PL1. In such cases, the receiving party may not see the target characters, but instead may be authorized to see a first text substitute string, which may be, for example, “last four of ssn: 7891”. As another example, the receiving party may be granted a PL3 for the identification token, which is different from PL1 and PL2. In such a case, the receiving party may be authorized to see “NA” in place of the target characters 123-456-7891. In various embodiments, the identification token may correspond to one or more security characteristics.
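The PL1, PL2, and PL3 example can be sketched as a small selection function. The function name and the fallback for unrecognized permission levels are assumptions made for illustration; the disclosure does not specify how unknown levels are handled.

```python
def text_substitute_string(permission_level: str, target_characters: str) -> str:
    """Pick what a receiving party sees for an SSN token at a given level."""
    if permission_level == "PL1":
        return target_characters  # the full value is visible
    if permission_level == "PL2":
        return "last four of ssn: " + target_characters[-4:]
    if permission_level == "PL3":
        return "NA"
    # Assumption: any unrecognized level falls back to the most restrictive.
    return "NA"
```

For the target characters “123-456-7891”, PL1 returns the value unchanged, PL2 returns “last four of ssn: 7891”, and PL3 returns “NA”.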

In some embodiments, the receiving party may not have permissions to receive personal contact information (PCI), personally identifiable information (PII) or non-public information (NPI) within text-based data. PCI may include, for example, address, email and phone number of a person or an entity. PII may be regarded in the information security and privacy fields as any piece of information which can potentially be used to uniquely identify, contact, or locate a person or an entity. PII may include national identification numbers, street addresses, driver's licenses, telephone numbers, IP addresses, email addresses, vehicle registrations, and ages. In general, PII may be broader in scope than PCI. NPI may include names, addresses, telephone numbers, social security numbers, PINs, passwords, account numbers, salaries, medical information, and account balances of a person or an entity. In general, NPI may be broader in scope than PII.

In various embodiments, the receiving party may have a permission level that does not allow receiving any non-public information contained in the text-based data. For example, the receiving party may have permission level PL3 that allows receiving text-based data containing no NPI. In some embodiments, the receiving party may have a permission level (for example, permission level PL2) that allows the receiving party to receive NPI but not PII within text-based data.

In some embodiments, the receiving party may have different permission levels associated with different text-based data. For example, for some text-based data the receiving party may have permission level PL1 that allows the receiving party to receive NPI within the text-based data, but for other text-based data, the receiving party may have permission level PL3 that does not allow the receiving party to obtain NPI within the text-based data. In some embodiments, a user or an entity associated with text-based data may determine what permissions may be granted to the receiving party.

For each pair of an identification token and the security characteristic assigned to that token, the method may provide a unique text substitute string that can replace the target characters within the target data pattern of the text-based data. In some embodiments, the text substitute string can replace a portion of the target data pattern, or the entirety of the target data pattern, depending on the security characteristics. For example, if a receiving party is granted a PL5 for the identification token “SSN1”, the entire target data pattern “SSN #123-456-7891” may be replaced with the text substitute string “Social Security is not available”.

In various embodiments, the artificial intelligence system may receive a request for text-based data from a user having a set of security characteristics. For example, the user may have security characteristics such as {PL1 “SSN1”, PL3 “Home Phone”, PL1 “Name”, PL1 “Office Number”, PL10 “Crime Record”}, where PL1, PL3, and PL10 are security characteristics, and “SSN1”, “Home Phone”, “Name”, “Office Number”, and “Crime Record” may be identification tokens for the related sensitive target characters that may be found in the text-based data. Based on the user's security characteristics, the artificial intelligence system may determine which target characters need to be censored and may substitute those target characters with text substitute strings, resulting in censored text-based data.
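One way to picture the lookup keyed by identification token and security characteristic is sketched below. The tables `TOKENS` and `SUBSTITUTES`, the function name, and the choice to censor with “NA” when no pairing is found are all hypothetical, chosen only to make the idea concrete.

```python
# Hypothetical data: identification tokens mapped to the target characters
# that a model is assumed to have already located in the text-based data.
TOKENS = {"SSN1": "123-456-7891", "Home Phone": "567-342-1238"}

# Hypothetical table: (identification token, permission level) -> substitute.
# None means the receiving party may see the original target characters.
SUBSTITUTES = {
    ("SSN1", "PL1"): None,
    ("SSN1", "PL3"): "NA",
    ("Home Phone", "PL1"): None,
    ("Home Phone", "PL3"): "NA",
}


def censor_for_user(text: str, security_characteristics: dict[str, str]) -> str:
    """Apply per-token substitutes chosen by the user's permission levels."""
    for token, target_characters in TOKENS.items():
        level = security_characteristics.get(token)
        # Assumption: a missing or unknown pairing is censored with "NA".
        substitute = SUBSTITUTES.get((token, level), "NA")
        if substitute is not None:
            text = text.replace(target_characters, substitute)
    return text
```

With security characteristics `{"SSN1": "PL1", "Home Phone": "PL3"}`, the social security number stays visible while the home phone number is replaced.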

In various embodiments, the artificial intelligence system may receive one or more target pattern types requiring censorship, receive text-based data, and apply one or more computer-based models corresponding to one or more target pattern types to censor text-based data. The computer-based models may identify, within the received text-based data, a target data pattern corresponding to the received target pattern type and replace the target characters within the identified target data pattern with substitute characters, resulting in censored text-based data. The censored text-based data may then be transmitted via a network or stored in a computer memory for further use.

The artificial intelligence system may be configured to receive data that require censorship from user devices via a secure network. Components of an artificial intelligence system are illustrated in the accompanying drawings. For example, the drawings show users A-C interacting with the censoring system via user devices A-C. The user devices may include a laptop or desktop computer (schematically represented by device A), a mobile phone such as a smartphone (schematically represented by device B), or a tablet (represented by device C). The various examples of user devices are only illustrative, and other devices may be used by the users to interact with the censoring system. The devices may be configured to communicate with the censoring system via a secure network and be allowed to transmit text-based data containing sensitive information via the secure network. Text-based data transmitted via the secure network may include emails, office documents, text documents, information transmitted from interactive forms, and other types of text-based data. In addition, the text-based data may include images, audio and video files associated with the text-based data. For example, the transmitted text-based data may include a PowerPoint presentation that may include both text data and various audio, video and image data. The sensitive information may be encoded to ensure that it is not intercepted or compromised.

The censoring system may include at least one processor, a server, and a database, as shown in the drawings. The server may be configured to receive text-based data from the secure network, store the text-based data in the database, and transmit the text-based data to the processor. The processor may be configured to execute software instructions for identifying the sensitive data within text-based data and for censoring the text-based data. The censored text-based data may then be submitted to the server and distributed over a network to a receiving party. This network may not need to be secure, since the censored data may not contain sensitive data. In various embodiments, the censored data may undergo further analysis by the artificial intelligence system to ensure that it does not contain any sensitive data prior to transmitting it over the network. The processor may censor text-based data using computer-based models (CBMs) trained to identify sensitive data.

The drawings show an illustrative process of using a CBM. The process may be performed by, for example, the processor of the censoring system. It is to be understood, however, that one or more steps of the process may be implemented by other components of the system (shown or not shown), including, for example, one or more of devices A, B, and C.

In one receiving step, the artificial intelligence system may receive, as a first input, a string of text representing a target pattern type. In another receiving step, the artificial intelligence system may receive, as a second input, training text-based data. For example, the first input may be a string “Social Security Number” representing the target pattern type, and the second input may be a text-based financial document containing user-related information, such as the user's address and the user's phone number. In a selection step, the artificial intelligence system may select an appropriate CBM related to the received target pattern type. In a processing step, the selected CBM may process the text-based data by identifying the sensitive information that needs to be censored. In a further step, the artificial intelligence system may be configured to censor the identified information as a part of the processing step and output the censored text-based data. For example, the CBM may be configured to remove or obscure (e.g., by blacking out or covering over) sensitive information from the text-based data or substitute target characters related to the sensitive information within the text-based data with some default generic characters. In some embodiments, the censoring process may be executed by a different software application not directly related to the CBM.
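The flow of receiving a target pattern type, selecting a matching CBM, and outputting censored text can be sketched as a registry lookup. Everything named here (`CBM_REGISTRY`, `run_censoring_process`, and the regex-based functions standing in for trained models) is a hypothetical illustration, not the disclosed implementation.

```python
import re
from typing import Callable


# Hand-written stand-ins for trained CBMs (an assumption for illustration).
def _censor_ssn(text: str) -> str:
    # Digit layout follows the "123-23-1234" example in the text.
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "xxx-xx-xxxx", text)


def _censor_phone(text: str) -> str:
    return re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "xxx-xxx-xxxx", text)


# Hypothetical registry: target pattern type -> model for that type.
CBM_REGISTRY: dict[str, Callable[[str], str]] = {
    "Social Security Number": _censor_ssn,
    "Phone": _censor_phone,
}


def run_censoring_process(target_pattern_type: str, text: str) -> str:
    """Select the model for the requested type and return censored text."""
    try:
        cbm = CBM_REGISTRY[target_pattern_type]
    except KeyError:
        raise ValueError(
            f"no model for target pattern type {target_pattern_type!r}"
        ) from None
    return cbm(text)
```

Keeping selection separate from censoring mirrors the remark that censoring may be executed by software not directly tied to the CBM: the registry values could be swapped for any callable.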

Identifying the sensitive information by the CBM in the processing step may include the CBM assigning a probability value to each character in the string of characters forming the text-based data. For example, for target pattern type “Phone” and for text-based data “Jane Doe's permanent address is Branch Ave, apt 234, Alcorn, N.H. 20401, and her phone number: 567-342-1238”, the probability value for all the characters in the text-based data except characters “phone number 567-342-1238” may be close to zero. The probability value for each character in the target data pattern “phone number 567-342-1238” may be close to one for probability values obtained from a well-trained CBM. The target data pattern may be identified by selecting the characters within the text-based data that have substantially non-zero probability values, or that have probability values that are close to one. For untrained CBMs, the probability value for various characters within the text-based data may be a random number between zero and one.
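Selecting characters by probability value can be sketched as grouping consecutive high-probability characters into spans. The function name and the default threshold of 0.5 are assumptions; the text only says the values should be substantially non-zero or close to one.

```python
def extract_target_data_patterns(
    text: str, probabilities: list[float], threshold: float = 0.5
) -> list[str]:
    """Return maximal runs of characters whose probability exceeds threshold."""
    if len(text) != len(probabilities):
        raise ValueError("need exactly one probability per character")
    spans = []
    start = None
    for i, p in enumerate(probabilities):
        if p > threshold and start is None:
            start = i  # a high-probability run begins here
        elif p <= threshold and start is not None:
            spans.append(text[start:i])  # the run ended just before i
            start = None
    if start is not None:  # a run extends to the end of the text
        spans.append(text[start:])
    return spans
```

With probabilities from an untrained model (random values between zero and one), the same function would return many short, meaningless spans, which is one way to see why training matters.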

After identifying the target data pattern in the processing step, the CBM may also identify the target characters that need to be censored. For example, within the target data pattern “phone number 567-342-1238”, the target characters that need to be censored may be “567-342-1238”. While the CBM may be trained to identify complex target data patterns such as “phone number 567-342-1238”, which contain both the sensitive characters “567-342-1238” and surrounding descriptive text, the CBM may also identify simpler target data patterns such as “567-342-1238”. In some embodiments, the CBM may be configured or trained to identify target data patterns that include only the characters that need to be censored. For example, the target data pattern may correspond to just the social security number “567-342-123” that needs to be censored. In some embodiments, it may be important to identify complex target data patterns. For example, the text-based data may contain the following string: “the phone number of the customer is 123-435-1234, and the identification number for his hamster is 567-452-1234”. In such a case, the CBM may need to censor only the number “123-435-1234” and may not need to censor the number “567-452-1234” related to the identification number for a pet hamster. For example, if the censored data is transmitted to a second party who is a veterinarian, it may be essential to leave the identification number for the hamster uncensored.
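The hamster example shows why the surrounding words matter. As a minimal sketch, assuming a regular expression in place of a trained CBM, the number is censored only when the words “phone number” precede it; `PHONE_IN_CONTEXT` and `censor_phone_in_context` are hypothetical names.

```python
import re

# Assumed stand-in for a trained CBM: group 1 is the context ("phone number"
# followed by non-digit characters), group 2 is the number to censor.
PHONE_IN_CONTEXT = re.compile(r"(phone number\D*?)(\d{3}-\d{3}-\d{4})")


def censor_phone_in_context(text: str) -> str:
    """Censor a number only when it follows the words 'phone number'."""
    return PHONE_IN_CONTEXT.sub(lambda m: m.group(1) + "xxx-xxx-xxxx", text)
```

Applied to the sentence above, the customer's phone number becomes “xxx-xxx-xxxx” while the hamster's identification number, which shares the same digit layout, is left untouched.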

In step, the CBM may censor the target characters by substituting synthetic characters for the characters that need to be censored. The term “synthetic” may refer to data that resembles sensitive data but does not contain real sensitive information. For example, the synthetic characters for the phone number may be “321-345-2134” or other arrangements of text data that closely resemble the sensitive data but do not actually correspond to real data. In step, the CBM may censor the target characters by substituting generic characters for the characters that need to be censored. The term “generic” may refer to non-descriptive text data that does not necessarily resemble sensitive data. For example, the generic characters for the phone number may be “xxx-xxx-xxxx” or other non-descriptive text data. Various embodiments of censoring target characters by substituting synthetic characters are discussed in U.S. patent application Ser. No. 16/151,407 filed Oct. 5, 2018, and incorporated here by reference.
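The difference between generic and synthetic substitution can be sketched as follows. The helper names are hypothetical, and drawing random digits is only one assumed way of producing synthetic characters; the referenced application may describe other approaches.

```python
import random

def generic_substitute(target: str) -> str:
    """Replace each digit with a non-descriptive character, keeping separators."""
    return "".join("x" if ch.isdigit() else ch for ch in target)

def synthetic_substitute(target: str, rng: random.Random) -> str:
    """Replace each digit with a random digit so the result resembles real data."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in target)

phone = "567-342-1238"
rng = random.Random(0)  # seeded only to make the illustration repeatable
generic = generic_substitute(phone)
synthetic = synthetic_substitute(phone, rng)
print(generic)
print(synthetic)
```

A random draw could in principle reproduce a real number, so a production system would likely need an additional check that the synthetic value does not correspond to real data.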

In step, the CBM may output the censored text-based data to the artificial intelligence system. In an illustrative embodiment, the artificial intelligence system may store the censored text-based data in the database. Additionally, or alternatively, the artificial intelligence system may communicate the censored text-based data via the network to a receiving party. In some embodiments, the artificial intelligence system may communicate text-based data to the server via a secure network. The server may be configured to save the text-based data in a secure database. In some embodiments, the server may request the processor to censor text-based data and store the censored text-based data in another database, which may be less secure or maintain different security standards. In some embodiments, the server may be configured to communicate the censored text-based data via the network to the receiving party.

In various embodiments, CBMs, such as neural networks, may need to be trained to correctly identify target characters within a target data pattern for a given target pattern type. In general, to train a CBM, the artificial intelligence system may provide a set of inputs to the model, determine the output of the model, and adjust parameters of the model to obtain the desired output. The corresponding figure shows an illustrative process of training a CBM. The process may be performed by, for example, the processor of the censoring system. It is to be understood, however, that one or more steps of the process may be implemented by other components of the system (shown or not shown), including, for example, one or more of devices A, B, and C. Various embodiments of training CBMs are discussed in U.S. patent application Ser. No. 16/151,407 filed Oct. 5, 2018, and incorporated here by reference.

In some embodiments, the training may start with the step of selecting a CBM. For example, if a neural network is selected as a CBM, then various parameters of the neural network may be selected during this step. For instance, the number of hidden layers and the number of nodes may be selected during this step. In a receiving step, the CBM may receive training text-based data. The corresponding figure shows a table comprising training text-based data and tags identifying target characters that need to be censored. For example, the training text-based data may include target characters that may have associated numerical or alphabetical tags indicating whether the data requires censoring. For example, the numerical tag zero may indicate that the character does not need to be censored, and the tag one may indicate that the character needs to be censored. In an adjusting step, the parameters of the CBM may be adjusted. The parameters may be adjusted after at least one iteration of the training process. Furthermore, when the CBM is an artificial neural network, the parameters may be adjusted by the training process via backpropagation. In some embodiments, the adjusting step may involve a training specialist (e.g., a computer specialist supervising the training of the CBMs) interacting with the CBM directly to adjust various CBM parameters.
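The per-character tags in the training table can be illustrated with a short sketch. The function name `tag_characters` and the sample sentence are assumptions; in practice the tags may be supplied by a training specialist or another trained CBM rather than computed from a known target string.

```python
from typing import List

def tag_characters(text: str, target: str) -> List[int]:
    """Tag each character: 1 if it needs censoring, 0 otherwise."""
    start = text.find(target)
    if start < 0:
        raise ValueError("target characters not found in the text")
    tags = [0] * len(text)
    for i in range(start, start + len(target)):
        tags[i] = 1
    return tags

text = "her number is 456-123-2344"
target = "456-123-2344"
tags = tag_characters(text, target)

# Print the table row by row: character, then its tag.
for ch, tag in zip(text, tags):
    print(repr(ch), tag)
```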

In various embodiments, the artificial intelligence system may parse text-based data using a language parser, resulting in identified data objects. The language parser may label data objects of the text-based data with labels or tags, including tags identifying parts of speech. The part of speech tags may include: “noun”, “verb”, “adjective”, “adverb”, “pronoun”, “preposition”, “conjunction”, or “interjection”. Such preprocessing may be useful for improving the training and performance of CBMs. For example, the labels identifying parts of speech for the text-based data objects may be used as input values to a CBM during and after training.

In various embodiments, the text-based data may include special or predetermined characters. Such characters may include formatting characters such as space characters, tab characters, and paragraph characters, as well as semiotic characters such as commas, periods, semicolons, and/or the like. The special characters may be used to preprocess the text-based data into segments, with the language parser configured to identify and label the segments. For example, the language parser may be configured to identify and label the sentences within the text-based data.
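Segmenting on special characters can be sketched as below. This is a simplified illustration under the assumption that periods, semicolons, exclamation marks, and question marks followed by whitespace end a segment; the function name `segment_text` is hypothetical, and a real language parser would need to handle abbreviations and similar cases.

```python
import re
from typing import List, Tuple

def segment_text(text: str) -> List[Tuple[int, str]]:
    """Split text after sentence-ending punctuation and label each segment by index."""
    parts = re.split(r"(?<=[.;!?])\s+", text.strip())
    return [(index, part) for index, part in enumerate(p for p in parts if p)]

sample = "Jane lives on Branch Ave. Her phone is 567-342-1238; call after noon."
segments = segment_text(sample)
for index, segment in segments:
    print(index, segment)
```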

In some embodiments, non-textual objects or text-based data properties may be identified by a language parser. For example, the language parser may identify the font properties of the text-based data objects. In some embodiments, the language parser may identify mathematical formulas or tables within the text-based data. The text-based data may then be labeled by the language parser as it relates to the non-textual objects or text-based data properties. For example, if the word “Jennifer” appears in red font, the language parser may label the text characters corresponding to the word “Jennifer” with an appropriate tag, such as a “red font” tag. Similarly, if the word “Jennifer” appears in a table, the language parser may label the text characters corresponding to the word “Jennifer” with an appropriate tag, such as an “in table” tag. Other tags may include other supplementary information associated with the text characters. For example, the tags may include “end of the sentence”, “capital letter”, “in quotes”, “next to colon”, “in parentheses”, “heading”, “within address”, and/or the like.
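A few of the supplementary tags listed above can be derived from plain text alone, as the following sketch shows. The function name `label_characters` and its rules are assumptions made for illustration; tags such as “red font” or “in table” would require formatting information that plain text does not carry.

```python
from typing import List, Set

def label_characters(text: str) -> List[Set[str]]:
    """Attach a set of supplementary tags to every character of the text."""
    labels = []
    in_quotes = False
    paren_depth = 0
    for ch in text:
        tags = set()
        if ch == '"':
            in_quotes = not in_quotes      # the quote mark itself is not tagged
        elif in_quotes:
            tags.add("in quotes")
        if ch == "(":
            paren_depth += 1
        elif ch == ")":
            paren_depth = max(0, paren_depth - 1)
        elif paren_depth > 0:
            tags.add("in parentheses")
        if ch.isupper():
            tags.add("capital letter")
        if ch in ".!?":
            # Simplification: every such mark is treated as a sentence end.
            tags.add("end of the sentence")
        labels.append(tags)
    return labels

sample = 'Call Jennifer (SSN: "234-12-1234").'
labels = label_characters(sample)
for ch, tags in zip(sample, labels):
    print(repr(ch), sorted(tags))
```

Tag sets like these could be encoded as additional input values to a CBM alongside the characters themselves.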

In a processing step, the CBM may process the text-based data by identifying sensitive information that needs to be censored. The CBM may, in some cases, be configured to censor the identified information as a part of the processing step. For example, the CBM may be configured to remove sensitive information from the text-based data or to substitute default generic characters for the target characters related to the sensitive information within the text-based data. In some embodiments, the censoring process may be executed by a different software application not directly related to the CBM. In various embodiments, the process of identifying whether the target characters in the text-based data need to be censored may involve tagging the characters with tags, as shown in the corresponding figure.
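Once characters are tagged, censoring reduces to a pass over the text and its tags. The sketch below assumes tags of zero and one as in the training table; the function name `censor_by_tags` and the two mode names are hypothetical.

```python
from typing import List

def censor_by_tags(text: str, tags: List[int], mode: str = "generic") -> str:
    """Censor characters tagged 1, either by substitution or by removal."""
    if len(text) != len(tags):
        raise ValueError("one tag is required per character")
    if mode not in ("generic", "remove"):
        raise ValueError("mode must be 'generic' or 'remove'")
    out = []
    for ch, tag in zip(text, tags):
        if tag == 0:
            out.append(ch)                              # not sensitive: keep
        elif mode == "generic":
            out.append("x" if ch.isalnum() else ch)     # keep separators
        # mode == "remove": the tagged character is dropped
    return "".join(out)

text = "call 567-342-1238 now"
tags = [0] * 5 + [1] * 12 + [0] * 4
substituted = censor_by_tags(text, tags, mode="generic")
removed = censor_by_tags(text, tags, mode="remove")
print(substituted)
print(removed)
```

Keeping the identification (tagging) and the censoring as separate functions mirrors the embodiment in which censoring is executed by a different software application.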

In an evaluation step, the artificial intelligence system may evaluate the performance of the CBM by comparing the resulting censored text-based data with the target result. For example, the target censored text-based data may be produced by a training specialist or by a separate trained CBM that can correctly identify and censor the text-based data. In the corresponding figure, the tag values may be input by a training specialist or a separate trained CBM. If the output of the CBM does not match the target censored text-based data, that is, if the tags output by the CBM in training do not match the tags of the training text-based data (step; NO), the process may proceed to the adjusting step, and the parameters of the CBM may be adjusted as described above. The training may then proceed again via the processing and evaluation steps.
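The comparison of output tags against target tags can be sketched as follows. The function names are hypothetical, and an exact match is assumed as the pass criterion; a practical system might instead accept a small error rate.

```python
from typing import List

def mismatched_positions(predicted: List[int], target: List[int]) -> List[int]:
    """Return the character positions where the output tag differs from the target tag."""
    if len(predicted) != len(target):
        raise ValueError("tag sequences must have the same length")
    return [i for i, (p, t) in enumerate(zip(predicted, target)) if p != t]

def tags_match(predicted: List[int], target: List[int]) -> bool:
    """YES branch of the evaluation: every output tag equals its target tag."""
    return not mismatched_positions(predicted, target)

target_tags = [0, 0, 1, 1, 1, 0]
output_tags = [0, 1, 1, 1, 0, 0]
errors = mismatched_positions(output_tags, target_tags)
print(errors, tags_match(output_tags, target_tags))
```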

If at the evaluation step the output of the CBM matches the target censored text-based data (step; YES), the process of training may proceed to the step of validating the CBM. At the validating step, the CBM may be further evaluated by censoring various text-based validation data and comparing the censored text-based data to the target censored text-based data. If the CBM satisfactorily censors the text-based validation data (step; YES), the model may be determined to be trained and may be output to the artificial intelligence system. The model may then be stored in a memory of the artificial intelligence system. In the case that the CBM fails the validation step (step; NO) and does not correctly censor the text-based validation data, the training process may be repeated by returning to step. If the training fails after a set number of training iterations, the artificial intelligence system may inform a training specialist about the failure and discard the CBM.
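The overall control flow (process, evaluate, adjust, then validate) can be sketched with a deliberately small model. A single-layer perceptron over three hand-picked character features is used here as a stand-in for the CBM; this is not the neural network or backpropagation procedure of the source, and every name, feature, and limit below is an assumption.

```python
from typing import List, Optional, Tuple

Example = Tuple[str, List[int]]  # (text, per-character tags)

def features(ch: str) -> List[float]:
    # Hypothetical features: is a digit, is a dash, constant bias term.
    return [float(ch.isdigit()), float(ch == "-"), 1.0]

def predict_tag(weights: List[float], ch: str) -> int:
    score = sum(w * x for w, x in zip(weights, features(ch)))
    return 1 if score > 0 else 0

def predict_tags(weights: List[float], text: str) -> List[int]:
    return [predict_tag(weights, ch) for ch in text]

def train_and_validate(train: List[Example], validation: List[Example],
                       max_epochs: int = 50) -> Optional[List[float]]:
    weights = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for text, tags in train:
            for ch, tag in zip(text, tags):
                pred = predict_tag(weights, ch)
                if pred != tag:                       # evaluation: NO
                    mistakes += 1
                    # Adjust parameters (perceptron update rule).
                    weights = [w + (tag - pred) * x
                               for w, x in zip(weights, features(ch))]
        if mistakes == 0:                             # evaluation: YES
            if all(predict_tags(weights, t) == tags for t, tags in validation):
                return weights                        # validation: YES
            # Simplification: retraining on unchanged data cannot help here,
            # so a validation failure discards the model.
            return None
    return None  # iteration limit reached: discard the model

train = [("call 567-342-1238 now", [0] * 5 + [1] * 12 + [0] * 4)]
validation = [("number is 456-123-2344", [0] * 10 + [1] * 12)]
model = train_and_validate(train, validation)
print(model)
```

Returning `None` corresponds to discarding the CBM after a set number of iterations; a caller could use that result to notify a training specialist.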

A further figure shows a process, a variation of the training process described above, wherein the process provides counter examples of data patterns within text-based data. The text-based data may include context data and target data patterns embedded in the context data. The term “context data” may refer to text characters that do not belong to any target data patterns. For example, “Jennifer has a new phone, and her number is 456-123-2344” may include context data “Jennifer has a new”, “and her”, with the target data pattern being “phone”, “number is 456-123-2344”. In various embodiments, the target data pattern may have several disjoint parts. For example, the first part of the target data pattern may be “phone”, and the second part of the target data pattern may be “number is 456-123-2344”. Similarly, the context data may have several disjoint parts, such as a first part “Jennifer has a new” and a second part “and her”.

The text-based data may include context data, the target characters being embedded in the context data, and counter character examples of the target characters embedded in the context data located in proximity to the target characters. The term “counter character examples” or “counter examples” may refer to data patterns that are similar to the target data patterns but do not contain sensitive information related to the information found in the target data patterns. For example, the text-based data may contain the target data pattern “SSN #234-12-1234” and a counter example data pattern “SSN #234-A1-12f4” that does not correspond to a data pattern having a social security number. Another counter example data pattern may include a credit card number adjacent to a social security number. In an example embodiment, the credit card number may be positioned before the social security number, and in another example, the credit card number may be positioned after the social security number. In an example embodiment, the credit card number may be separated from the social security number by some text characters. In another example embodiment, the credit card number may be separated from the social security number by one or more words. In general, counter examples of data patterns may be selected to improve the CBM via training by attempting to confuse the CBM.

A step of the process may be carried out as described in relation to the training process above. The figure shows the step of receiving the training text-based data, which may be similar to the receiving step of the training process described above. At a selection step, the process may select a type of training text-based data to receive. For example, different training text-based data may differ in complexity, type of data, as well as other text metrics. For example, one of the text metrics may include the frequency of sensitive data within the text-based data. At a further step, the process may add counter example data to the received training text-based data. The counter example data may be embedded into the text-based data. In general, the counter example data may include counter character examples of target characters embedded in the context data located in proximity to the target characters. The process may then proceed with the remaining steps as in the training process described above. The type of the training text-based data may be selected based on a performance of the CBM. For example, if the CBM can successfully censor a first type of the training text-based data, as verified, for example, using the validating step, the CBM may then be validated using a second type of the training text-based data. If the CBM fails that validation (step; NO), the training process may be repeated by returning to the step where the second type of the training text-based data may be retrieved for training the CBM.
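Adding counter example data to tagged training text can be sketched as follows. The corruption rule (swapping a few digits for letters, in the spirit of “SSN #234-A1-12f4”), the placement directly after the target, and all function names are assumptions made for illustration.

```python
import random
import string
from typing import List, Tuple

def make_counter_example(target: str, rng: random.Random, n_swaps: int = 2) -> str:
    """Corrupt a target pattern by replacing some digits with letters."""
    chars = list(target)
    digit_positions = [i for i, ch in enumerate(chars) if ch.isdigit()]
    for i in rng.sample(digit_positions, min(n_swaps, len(digit_positions))):
        chars[i] = rng.choice(string.ascii_letters)
    return "".join(chars)

def add_counter_example(text: str, tags: List[int], target: str,
                        rng: random.Random) -> Tuple[str, List[int], str]:
    """Embed a counter example next to the target; its characters are tagged 0."""
    end = text.index(target) + len(target)
    counter = make_counter_example(target, rng)
    addition = " (not " + counter + ")"
    new_text = text[:end] + addition + text[end:]
    new_tags = tags[:end] + [0] * len(addition) + tags[end:]
    return new_text, new_tags, counter

rng = random.Random(7)  # seeded only to make the illustration repeatable
target = "SSN #234-12-1234"
text = "Her record lists " + target + " on file."
tags = [0] * len(text)
start = text.index(target)
for i in range(start, start + len(target)):
    tags[i] = 1

new_text, new_tags, counter = add_counter_example(text, tags, target, rng)
print(new_text)
```

Because the counter example is tagged zero while the nearby target is tagged one, a model trained on such data is pushed to distinguish valid patterns from look-alikes.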

Patent Metadata

Filing Date

Unknown

Publication Date

November 27, 2025

Inventors

Unknown
