US-20250315555-A1

Identification of Sensitive Information in Datasets

Published: October 9, 2025
Assignee: not available in USPTO data we have
Inventors: not available in USPTO data we have
Technical Abstract

A method, a system, and a computer program product for identifying sensitive data. A plurality of text portions associated with one or more data subjects is identified. A machine learning model is applied to the identified plurality of portions to extract one or more entities representative of one or more data subjects. The entities are grouped into one or more entity groups. Based on one or more entity groups, at least one data subject is identified for replacement or redaction in at least one text portion in the plurality of text portions.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A computer-implemented method, comprising:

2. The method of, wherein the grouping includes grouping the one or more entities using at least one of: a semantic similarity between entities, a relationship between entities, and any combination thereof.

3. The method of, further comprising assigning one or more weights to the one or more entities based on a representation of at least one data subject in the one or more data subjects by each entity in the one or more entities.

4. The method of, wherein the grouping includes grouping the one or more entities using the one or more weights.

5. The method of, wherein the plurality of text portions includes: at least one document, at least one portion of a document, and any combination thereof.

6. The method of, wherein the machine learning model is trained using a plurality of historical data subjects.

7. The method of, wherein the one or more data subjects include at least one of: a sensitive data or information, a commercially sensitive data or information, a trade secret data or information, a secret data or information, a non-public data or information, and any combination thereof.

8. A system, comprising:

9. The system of, wherein the grouping includes grouping the one or more entities using at least one of: a semantic similarity between entities, a relationship between entities, and any combination thereof.

10. The system of, wherein the at least one processor is configured to assign one or more weights to the one or more entities based on a representation of at least one data subject in the one or more data subjects by each entity in the one or more entities.

11. The system of, wherein grouping of the one or more entities includes grouping the one or more entities using the one or more weights.

12. The system of, wherein the plurality of text portions includes: at least one document, at least one portion of a document, and any combination thereof.

13. The system of, wherein the machine learning model is trained using a plurality of historical data subjects.

14. The system of, wherein the one or more data subjects include at least one of: a sensitive data or information, a commercially sensitive data or information, a trade secret data or information, a secret data or information, a non-public data or information, and any combination thereof.

15. A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by at least one processor, cause the at least one processor to:

16. The non-transitory computer-readable storage medium of, wherein the at least one processor is configured to assign one or more weights to the one or more entities based on a representation of at least one data subject in the one or more data subjects by each entity in the one or more entities.

17. The non-transitory computer-readable storage medium of, wherein grouping of the one or more entities includes grouping the one or more entities using the one or more weights.

18. The non-transitory computer-readable storage medium of, wherein the plurality of text portions includes: at least one document, at least one portion of a document, and any combination thereof.

19. The non-transitory computer-readable storage medium of, wherein the machine learning model is trained using a plurality of historical data subjects.

20. The non-transitory computer-readable storage medium of, wherein the one or more data subjects include at least one of: a sensitive data or information, a commercially sensitive data or information, a trade secret data or information, a secret data or information, a non-public data or information, and any combination thereof.

Detailed Description

Complete technical specification and implementation details from the patent document.

An electronic document management platform allows organizations to manage a growing collection of electronic documents, such as electronic agreements. In today's data-driven world, the proliferation of unstructured datasets, particularly documents containing commercially sensitive information, poses a significant challenge. Some existing solutions use named entity recognition (NER) and natural language processing (NLP) algorithms to perform data anonymization. Solutions using NER identify and classify sensitive entities, such as personal names, addresses, financial details, and medical information, within vast datasets. NLP algorithms enhance this process by understanding the contextual nuances surrounding these entities, ensuring a more precise identification and classification. An NER algorithm-based solution can discover a plurality of entities; however, it is not capable of connecting discovered entities to a specified sensitive subject, which might be new and/or unknown to the existing NER algorithm. Further, existing NER-based solutions suffer from low accuracy. Hence, existing solutions for data anonymization often fall short in preserving data utility while ensuring confidentiality, leading to potential breaches of privacy and regulatory non-compliance.

Embodiments disclosed herein are generally directed to techniques for identification of sensitive data subjects in one or more documents and/or document portions, where identification of such data subjects is assisted through use of machine learning models and artificial intelligence architectures. In general, a document may include a multimedia record. The term “electronic” may refer to technology having electrical, digital, magnetic, wireless, optical, electromagnetic, or similar capabilities. The term “electronic document” may refer to any electronic multimedia content intended to be used in an electronic form. An electronic document may be part of an electronic record. The term “electronic record” may refer to a contract or other record created, generated, sent, communicated, received, or stored by an electronic mechanism. An electronic document may have an electronic signature. The term “electronic signature” may refer to an electronic sound, symbol, or process, attached to or logically associated with an electronic document, such as a contract or other record, and executed or adopted by a person with the intent to sign the record.

An online electronic document management system provides a host of different benefits to users (e.g., a client or customer) of the system. One advantage is added convenience in generating and signing an electronic document, such as a legally binding agreement. Parties to an agreement can review, revise and sign the agreement from anywhere around the world on a multitude of electronic devices, such as computers, tablets and smartphones.

In some embodiments, the current subject matter relates to identification of sensitive information in datasets, including structured and/or unstructured datasets, through use of a clustering/bucketing/grouping approach. Such datasets may include contracts, agreements, commercial documentation, trade secret data or information, nonpublic data or information, confidential data or information, secret data or information, and/or any other type of sensitive data or information and/or any combination thereof. Sensitive data or information may include information that an entity (e.g., a party to an agreement) may prefer to keep away from public disclosure and/or from disclosure to any unintended recipients. For instance, a trade secret (e.g., soft drink formula, trade secret manufacturing process, etc.), commercially sensitive data, and/or any other secret data may fall into the category of sensitive information.

The current subject matter may be configured to receive electronic documents, text, images, graphics, etc. (hereinafter, “documents”) and may analyze such a collection of documents to identify documents in accordance with each sensitive data subject (e.g., a trade secret, commercially sensitive information, etc.). As part of the identification process, the current subject matter may be configured to receive and/or ingest electronic documents that may be represented in any desired format (e.g., .pdf, .docx, etc.). Moreover, the documents may include, for instance, text, graphics, images, tables, audio, video, computing code (e.g., source code, etc.) and/or any other type of media. Further, the documents may be any type of electronic documents, e.g., agreement types, legal document types, non-legal document types, and any combinations thereof. Further, documents and/or portions of documents (e.g., a sales agreement) may be associated with other documents and/or portions of other documents (e.g., a master services agreement).

Once the documents that may include sensitive subjects for redaction/replacement are identified, entities (e.g., parties, document clauses, sentences, etc.) representative of the sensitive data subjects may be extracted from the identified documents. One or more machine learning (ML) models may be used for the purposes of extracting such entities. The ML model(s) may be trained using set(s) of data representing sensitive data subjects. For example, one ML model may be trained using trade secret data (e.g., a recipe formula) and another ML model may be trained using confidential information (e.g., company employee names, addresses, etc.). As can be understood, a single ML model may be trained on different types of data representing different sensitive data subjects. In some embodiments, the ML models may, for example, include at least one of the following: a large language model, a generative artificial intelligence (AI) model, and any combination thereof, where the generative AI models may be part of the current subject matter system and/or be one or more third party models (e.g., ChatGPT, Bard, DALL-E, Midjourney, DeepMind, etc.).
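By way of non-limiting illustration, the extraction step may be sketched as follows. The sketch is an assumption made for explanatory purposes only: the `Entity` record, the `extract_entities` function, and the regular-expression table `PATTERNS` are hypothetical names, and the patterns merely stand in for a trained ML model, which the disclosure does not specify at this level of detail.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str           # surface text of the extracted entity
    label: str          # sensitive data subject the entity may represent
    portion_index: int  # index of the text portion it was found in


# Hypothetical stand-in for a trained model: label -> pattern.
PATTERNS = {
    "formula": re.compile(r"\b(?:formula|recipe)\b", re.IGNORECASE),
    "person": re.compile(r"\b(?:Mr|Ms|Dr)\.\s+[A-Z][a-z]+"),
}


def extract_entities(portions):
    """Return every entity found in each text portion."""
    entities = []
    for index, portion in enumerate(portions):
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(portion):
                entities.append(Entity(match.group(0), label, index))
    return entities


portions = [
    "Dr. Smith approved the soft drink formula.",
    "Payment is due within 30 days.",
]
entities = extract_entities(portions)
```

In this sketch only the first portion yields entities; a learned model would be expected to generalize beyond fixed patterns.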

The extracted entities may then be grouped into “buckets” or grouped entities. The grouping of entities may be executed based on semantic similarities and/or semantic distances. For instance, a person's name and signature of the person (whether image or text based) may be grouped into a single grouped entity, “name-person”. Entities that are connected to other entities representing sensitive data subjects (e.g., a sales agreement and a product description document, such as a trade secret soft drink formula) may be grouped together into a single grouped entity by virtue of their connection to one another. Further, one or more weighting factors (e.g., importance of a sensitive data subject) may also be used to group entities. For example, a description of a trade secret soft drink formula and a manufacturing process involving the formula may be grouped into a single grouped entity, “trade secret formula”. As can be understood, any other parameters may be used for the purposes of grouping entities.
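By way of non-limiting illustration, grouping by semantic similarity may be sketched as follows. The sketch assumes a toy bag-of-words vector in place of a learned embedding and a greedy, threshold-based grouping; the function names `embed`, `cosine`, and `group_entities` and the threshold value are hypothetical, and weighting factors are omitted for brevity.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption; stands in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_entities(entities, threshold=0.3):
    """Greedily place each entity in the first group whose seed is similar enough."""
    groups = []
    for entity in entities:
        vector = embed(entity)
        for group in groups:
            if cosine(vector, embed(group[0])) >= threshold:
                group.append(entity)
                break
        else:
            groups.append([entity])  # no similar group found: start a new one
    return groups


entities = [
    "soft drink formula",
    "formula manufacturing process",
    "John Doe signature",
    "John Doe",
]
groups = group_entities(entities)
```

Here the two formula-related entities fall into one group and the name and signature fall into another, mirroring the “trade secret formula” and “name-person” examples above.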

Once the entities have been grouped into grouped entities, the current subject matter may be configured to identify at least one data subject (e.g., trade secret, commercially sensitive data/information, etc.) for replacement and/or redaction in the received documents and/or document portions. The identification of replacements/redactions may be accomplished through use of highlighting of text, images, graphics, etc. in the documents/document portions, and/or in any other way. Alternatively, or in addition, metadata, underlying code, etc. associated with the identified text, images, graphics, etc. in the documents/document portions may be used to identify specific text, images, graphics, etc. that may be candidates for replacement/redaction. Moreover, one or more ML models may be used for identifying and/or selecting specific text, images, graphics, etc. that may be candidates for replacement/redaction.
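By way of non-limiting illustration, replacement of identified text may be sketched as follows. The `redact` function, its parameters, and the placeholder string are hypothetical, and the sketch covers only plain text; highlighting, images, and metadata-based identification are outside its scope.

```python
import re


def redact(text, sensitive_terms, placeholder="[REDACTED]"):
    """Replace each occurrence of a sensitive term with a placeholder."""
    # Longest terms first, so a longer term is not partially consumed by a shorter one.
    terms = sorted(set(sensitive_terms), key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile("|".join(re.escape(term) for term in terms), re.IGNORECASE)
    return pattern.sub(placeholder, text)


redacted = redact(
    "The soft drink formula is owned by Acme. The formula is secret.",
    ["soft drink formula", "formula"],
)
```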

In some embodiments, the current subject matter may be configured to receive feedback from at least one user computing device. The feedback may be provided to the identified documents, identified sensitive subjects, associated portions of documents, and/or documents that have been identified as containing sensitive subjects and/or any documents/portions of documents linked to or connected with other documents containing sensitive subjects. Once feedback is received, the current subject matter may be configured to update identified documents, portions and/or sensitive subjects for redaction/replacement. Moreover, the feedback may be used to train, retrain, refresh train, etc. one or more machine learning (ML) models that may be used for the purposes of identification of sensitive subjects in documents/portions, entities in documents, etc. As can be understood, the feedback may be used to perform any desired action and/or any combination of actions.

In some embodiments, the user may provide feedback (e.g., “thumbs up”, “thumbs down”, vote, written feedback, etc.). The feedback may be used to adjust and/or finetune, for example, how documents/portions are identified and how entities are identified. For example, too many thumbs down on a sensitive subject of a particular type may mean that the way the sensitive subject is identified in documents/portions may need to be adjusted to account for more important content, other documents, other portions, etc.
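By way of non-limiting illustration, one possible use of such feedback may be sketched as follows. The sketch assumes, purely for explanation, that identification is governed by a numeric confidence threshold and that a high share of thumbs-down votes indicates over-identification; the disclosure does not prescribe either, and the function name and parameters are hypothetical.

```python
def adjust_threshold(threshold, votes, max_down_ratio=0.5, step=0.05):
    """Raise the threshold when too many votes are thumbs down.

    votes is a list of booleans, where True means thumbs up.
    """
    if not votes:
        return threshold
    down_ratio = votes.count(False) / len(votes)
    if down_ratio > max_down_ratio:
        # Assumption: thumbs down signals false positives, so be stricter.
        return min(1.0, round(threshold + step, 2))
    return threshold
```

For instance, three thumbs down out of four votes would move a threshold of 0.6 to 0.65, while one thumbs down out of three would leave it unchanged.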

The current subject matter may have one or more of the following technical benefits. In particular, the sensitive information/data identification processes executed by the current subject matter enable more accurate identification of all sensitive subjects, including subjects that may be semantically linked to or connected with specific sensitive subjects. Existing solutions, such as Named Entity Recognition (NER) and Natural Language Processing (NLP), are not capable of connecting discovered entities to a specified sensitive subject, which might be new/unknown to existing NER/NLP algorithms. Further, existing solutions suffer from low accuracy. An advantage of the solution disclosed herein is that it is capable of learning to identify sensitive information for which the full list of representative entities is unknown. For example, for a sensitive subject “trade secret” to be discovered, which has an unknown list of representative entities, the above methodology can automatically learn representations of the trade secret subject from a corpus of trade secret samples, which are further used for discovering trade secrets in unprocessed documents (e.g., contracts, agreements, etc.).
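By way of non-limiting illustration, learning a representation of a sensitive subject from samples and applying it to unprocessed text may be sketched as follows. The sketch substitutes an aggregated word-count profile for whatever representation a trained ML model would learn; the function names and the sample strings are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Toy word-count vector (assumption; not the disclosed model)."""
    return Counter(text.lower().split())


def learn_subject_profile(samples):
    """Aggregate the vectors of all samples into a single subject profile."""
    profile = Counter()
    for sample in samples:
        profile.update(vectorize(sample))
    return profile


def score(text, profile):
    """Cosine similarity between a text and the learned subject profile."""
    vector = vectorize(text)
    dot = sum(vector[token] * profile[token] for token in vector)
    norm = math.sqrt(sum(v * v for v in vector.values())) * math.sqrt(
        sum(v * v for v in profile.values())
    )
    return dot / norm if norm else 0.0


samples = ["secret syrup formula", "formula mixing process", "syrup mixing ratio"]
profile = learn_subject_profile(samples)
s_hit = score("the syrup formula appendix", profile)
s_miss = score("payment due in thirty days", profile)
```

No list of representative entities is supplied in advance; the profile is derived entirely from the samples, so text sharing vocabulary with them scores higher than unrelated text.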

The present disclosure will now be described with reference to the attached drawing figures, wherein like reference numerals are used to refer to like elements throughout, and wherein the illustrated structures and devices are not necessarily drawn to scale. As utilized herein, terms “component,” “system,” “interface,” and the like are intended to refer to a computer-related entity, hardware, software (e.g., in execution), and/or firmware. For example, a component can be a processor (e.g., a microprocessor, a controller, or other processing device), a process running on a processor, a controller, an object, an executable, a program, a storage device, a computer, a tablet PC and/or a user equipment (e.g., mobile phone, etc.) with a processing device. By way of illustration, an application running on a server and the server can also be a component. One or more components can reside within a process, and a component can be localized on one computer and/or distributed between two or more computers. A set of elements or a set of other components can be described herein, in which the term “set” can be interpreted as “one or more.”

Further, these components can execute from various computer readable storage media having various data structures stored thereon such as with a module, for example. The components can communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network, such as, the Internet, a local area network, a wide area network, or similar network with other systems via the signal).

As another example, a component can be an apparatus with specific functionality provided by mechanical parts operated by electric or electronic circuitry, in which the electric or electronic circuitry can be operated by a software application, or a firmware application executed by one or more processors. The one or more processors can be internal or external to the apparatus and can execute at least a part of the software or firmware application. As yet another example, a component can be an apparatus that provides specific functionality through electronic components without mechanical parts; the electronic components can include one or more processors therein to execute software and/or firmware that confer(s), at least in part, the functionality of the electronic components.

Use of the word exemplary is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Furthermore, to the extent that the terms “including”, “includes”, “having”, “has”, “with”, or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.” Additionally, in situations wherein one or more numbered items are discussed (e.g., a “first X”, a “second X”, etc.), in general the one or more numbered items may be distinct, or they may be the same, although in some situations the context may indicate that they are distinct or that they are the same.

As used herein, the term “circuitry” may refer to, be part of, or include a circuit, an integrated circuit (IC), a monolithic IC, a discrete circuit, a hybrid integrated circuit (HIC), an Application Specific Integrated Circuit (ASIC), an electronic circuit, a logic circuit, a microcircuit, a hybrid circuit, a microchip, a chip, a chiplet, a chipset, a multi-chip module (MCM), a semiconductor die, a system on a chip (SoC), a processor (shared, dedicated, or group), a processor circuit, a processing circuit, or associated memory (shared, dedicated, or group) operably coupled to the circuitry that execute one or more software or firmware programs, a combinational logic circuit, or other suitable hardware components that provide the described functionality. In some embodiments, the circuitry may be implemented in, or functions associated with the circuitry may be implemented by, one or more software or firmware modules. In some embodiments, circuitry may include logic, at least partially operable in hardware.

The drawings illustrate an embodiment of a system. The system may be suitable for implementing one or more embodiments as described herein. In one embodiment, for example, the system may comprise an electronic document management platform (EDMP) suitable for managing a collection of electronic documents. An example of an EDMP includes a product or technology offered by DocuSign®, Inc., located in San Francisco, California (“DocuSign”). DocuSign is a company that provides electronic signature technology and digital transaction management services for facilitating electronic exchanges of contracts and signed documents. An example of a DocuSign product is the DocuSign Agreement Cloud, a framework for generating, managing, signing and storing electronic documents on different devices. It may be appreciated that the system may be implemented using other EDMPs, technologies and products as well. For example, the system may be implemented as an online signature system, an online document creation and management system, an online workflow management system, a multi-party communication and interaction platform, a social networking system, a marketplace and financial transaction management system, a customer record management system, and other digital transaction management platforms. Embodiments are not limited in this context.

The system may implement an EDMP as a cloud computing system. Cloud computing is a model for providing on-demand access to a shared pool of computing resources, such as servers, storage, applications, and services, over the Internet. Instead of maintaining their own physical servers and infrastructure, companies can rent or lease computing resources from a cloud service provider. In a cloud computing system, the computing resources are hosted in data centers, which are typically distributed across multiple geographic locations. These data centers are designed to provide high availability, scalability, and reliability, and are connected by a network infrastructure that allows users to access the resources they need. Some examples of cloud computing services include Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS).

The system may implement various search tools and algorithms designed to search for electronic document(s) and/or collections of electronic documents (which may also be referred to as “transaction documents”, “transaction packages”, “document packages” or “packages”) and/or information within an electronic document or across a collection of electronic documents. Within the context of a cloud computing system, the system may implement a cloud search service accessible to users via a web interface or web portal front-end server system. A cloud search service is a managed service that allows developers and businesses to add search capabilities to their applications or websites without the need to build and maintain their own search infrastructure. Cloud search services typically provide powerful search capabilities, such as faceted search, full-text search, and auto-complete suggestions, while also offering features like scalability, availability, and reliability. A cloud search service typically operates in a distributed manner, with indexing and search nodes located across multiple data centers for high availability and faster query responses. These services typically offer application program interfaces (APIs) that allow developers to easily integrate search functionality into their applications or websites. One major advantage of cloud search services is that they are designed to handle large-scale data sets and provide powerful search capabilities that can be difficult to achieve with traditional search engines. Cloud search services can also provide advanced features, such as machine learning-powered search, natural language processing, and personalized recommendations, which can help improve the user experience and make search more efficient. Some examples of popular cloud search services include Amazon CloudSearch, Elasticsearch, and Azure Search. These services are typically offered on a pay-as-you-go basis, allowing businesses to pay only for the resources they use, making them an affordable option for businesses of all sizes.

In general, the system may allow users to generate, revise and electronically sign electronic documents. When implemented as a large-scale cloud computing service, the system may allow entities and organizations to amass a significant number of electronic documents, including both signed electronic documents and unsigned electronic documents. As such, the system may need to manage a large collection of electronic documents for different entities, a task that is sometimes referred to as contract lifecycle management (CLM).

As shown in the drawings, the system may include a server device communicatively coupled to a first set of client devices via a first network. The server device may also be communicatively coupled to a second set of client devices via a second network. Each set of client devices may be associated with a respective set of clients. In one network topology, the server device may represent any server device, such as a server blade in a server rack as part of a cloud computing architecture, while the client devices of either set may represent any client device, such as a smart wearable (e.g., a smart watch), a smart phone, a tablet computer, a laptop computer, a desktop computer, a mobile device, and so forth. The server device may be coupled to a local or remote data store to store document records. It may be appreciated that the system may have more or fewer devices than shown, with a different network topology, as needed for a given implementation. Embodiments are not limited in this context.

In various embodiments, the server device may include various hardware elements, such as processing circuitry, a memory, a network interface, and a set of platform components. The client devices may include similar hardware elements as those depicted for the server device. The server device and the client devices, and associated hardware elements, are described in more detail with reference to a computing architecture depicted in the drawings.

In various embodiments, the server device and/or the client devices may communicate various types of electronic information, including control, data and/or content information, via one or both of the networks. The networks, and associated hardware elements, are described in more detail with reference to a communications architecture depicted in the drawings.

The memory may store a set of software components, such as computer executable instructions, that when executed by the processing circuitry, cause the processing circuitry to implement various operations for an electronic document management platform. As depicted in the drawings, for example, the memory may include a document manager, a signature manager, and a sensitive data identification engine, among other software elements.

The document manager may generally manage a collection of electronic documents stored as document records in the data store. The document manager may receive as input a document container for an electronic document. A document container is a file format that allows multiple data types to be embedded into a single file, sometimes referred to as a “wrapper” or “metafile.” The document container can include, among other types of information, an electronic document and metadata for the electronic document.
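By way of non-limiting illustration, a document container of this kind may be modeled as follows. The `DocumentContainer` class, its field names, and the layout of the metadata dictionary are hypothetical and are not taken from any particular EDMP.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentContainer:
    document: bytes     # raw content of the electronic document
    file_format: str    # e.g., "pdf" or "docx"
    metadata: dict = field(default_factory=dict)  # e.g., signature tag information


container = DocumentContainer(
    document=b"%PDF-1.7 ...",
    file_format="pdf",
    metadata={"stme": [{"type": "signature", "page": 1}]},
)
```

Keeping the document and its metadata in one object reflects the “wrapper” idea: the two travel together through later processing steps.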

A document container may include an electronic document. The electronic document may comprise any electronic multimedia content intended to be used in an electronic form. The electronic document may comprise an electronic file having any given file format. Examples of file formats may include, without limitation, Adobe portable document format (PDF), Microsoft Word, PowerPoint, Excel, text files (.txt, .rtf), and so forth. In one embodiment, for example, the electronic document may comprise a PDF created from a Microsoft Word file with one or more workflows developed by Adobe Systems Incorporated, an American multi-national computer software company headquartered in San Jose, California. Embodiments are not limited to this example.

In addition to the electronic document, the document container may also include metadata for the electronic document. In one embodiment, the metadata may comprise signature tag marker element (STME) information for the electronic document. The STME information may include one or more STME, which are graphical user interface (GUI) elements superimposed on the electronic document. The GUI elements may include textual elements, visual elements, auditory elements, tactile elements, and so forth. In some embodiments, for example, the STME information and STME may be implemented as text tags, such as DocuSign anchor text, Adobe® Acrobat Sign® text tags, and so forth. Text tags are specially formatted text that can be placed anywhere within the content of an electronic document, specifying the location, size, and type of fields such as signature and initial fields, checkboxes, radio buttons, and form fields, as well as advanced optional field processing rules. Text tags can also be used when creating PDFs with form fields. Text tags may be converted into signature form fields when the document is sent for signature or uploaded. Text tags can be placed in any document type such as PDF, Microsoft Word, PowerPoint, Excel, and text files (.txt, .rtf). Text tags offer a flexible mechanism for setting up document templates that allow positioning signature and initial fields, collecting data from multiple parties within an agreement, defining validation rules for the collected data, and adding qualifying conditions. Once a document is correctly set up with text tags, it can be used as a template when sending documents for signatures, ensuring that the data collected for agreements is consistent and valid throughout the organization.

In one embodiment, the STME may be utilized for receiving signing information, such as GUI placeholders for approval, checkbox, date signed, signature, social security number, organizational title, and other custom tags in association with the GUI elements contained in the electronic document. A client may have used the client device and/or the server device to position one or more signature tag markers over the electronic document with tools, applications, and workflows developed by DocuSign or Adobe. For instance, assume the electronic document is a commercial lease associated with STME designed for receiving signing information to memorialize an agreement between a landlord and tenant to lease a parcel of commercial property. In this example, the signing information may include a signature, title, date signed, and other GUI elements.

The document manager may process a document container to generate a document image. The document image is a unified or standard file format for an electronic document used by a given EDMP implemented by the system. For instance, the system may standardize use of a document image having an Adobe portable document format (PDF), which is typically denoted by a “.pdf” file extension. If the electronic document in the document container is in a non-PDF format, such as a Microsoft Word “.doc” or “.docx” file format, the document manager may convert or transform the file format for the electronic document into the PDF file format. Further, if the document container includes an electronic document stored in an electronic file having a PDF format suitable for rendering on a screen size typically associated with a larger form factor device, such as a monitor for a desktop computer, the document manager may transform the electronic document into a PDF format suitable for rendering on a screen size associated with a smaller form factor device, such as a touch screen for a smart phone. The document manager may transform the electronic document to ensure that it adheres to regulatory requirements for electronic signatures, such as a “what you see is what you sign” (WYSIWYS) property, for example.
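By way of non-limiting illustration, the decision of which transforms to apply may be sketched as follows. The `plan_transforms` function, the step names, and the form factor labels are hypothetical; the sketch only plans the steps and performs no actual file conversion.

```python
from pathlib import PurePath


def plan_transforms(filename, source_form_factor, target_form_factor):
    """List the transforms needed to produce a standardized document image."""
    steps = []
    if PurePath(filename).suffix.lower() != ".pdf":
        steps.append("convert_to_pdf")  # non-PDF input is converted first
    if source_form_factor != target_form_factor:
        steps.append(f"reflow_for_{target_form_factor}")  # adapt to the target screen
    return steps
```

For example, `plan_transforms("lease.docx", "desktop", "phone")` would yield a conversion step followed by a reflow step, whereas a PDF already laid out for the target device would need no transform.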

The signature manager may generally manage signing operations for an electronic document, such as the document image. The signature manager may manage an electronic signature process that includes sending the document image to signers, obtaining electronic signatures, verifying electronic signatures, and recording and storing the electronically signed document image. For instance, the signature manager may communicate a document image over the network to one or more client devices for rendering the document image. A client may electronically sign the document image and send the signed document image to the server device for verification, recordation, and storage.

The engine may implement and/or manage various artificial intelligence (AI) and machine learning (ML) agents to assist in various operational tasks for the EDMP of the system. The AI/ML agents and their operation associated with the sensitive data identification engine, and associated software elements, are described in more detail with reference to an artificial intelligence architecture as depicted in the accompanying figures. The sensitive data identification engine, and associated hardware elements, are described in more detail with reference to a computing architecture as depicted in the accompanying figures.

In general operation, assume the server device receives a document container from a client device over the network. The server device processes the document container and makes any necessary modifications or transforms as previously described to generate the document image. The document image may have a file format of an Adobe PDF denoted by a “.pdf” file extension. The server device sends the document image to a client device over the network. The client device renders the document image with the STME in preparation for electronic signing operations to sign the document image.

The document image may further be associated with STME information including one or more STME that were positioned over the document image by the client device and/or the server device. The STME may be utilized for receiving signing information (e.g., approval, checkbox, date signed, signature, social security number, organizational title, etc.) in association with the GUI elements contained in the document image. For instance, a client may use the client device and/or the server device to position the STME over the electronic documents, as shown in the accompanying figures, with tools, applications, and workflows developed by DocuSign. For example, the electronic documents may be a commercial lease that is associated with one or more STME for receiving signing information to memorialize an agreement between a landlord and tenant to lease a parcel of commercial property. For example, the signing information may include a signature, title, date signed, and other GUI elements.

Broadly, a technological process for signing electronic documents may operate as follows. A client may use a client device to upload the document container, over the network, to the server device. The document manager, at the server device, receives and processes the document container. The document manager may confirm or transform the electronic document as a document image that is rendered at a client device to display the original PDF image including multiple and varied visual elements. The document manager may generate the visual elements based on separate and distinct input including the STME information and the STME contained in the document container. In one embodiment, the PDF input in the form of the electronic document may be received from and generated by one or more workflows developed by Adobe Systems Incorporated. The STME input may be received from and generated by workflows developed by DocuSign. Accordingly, the PDF and the STME are separate and distinct input as they are generated by different workflows provided by different providers.

The document manager may generate the document image for rendering visual elements in the form of text images, table images, STME images, and other types of visual elements. The original PDF image information may be generated from the document container, including original document elements included in the electronic document of the document container and the STME information including the STME. Other visual elements for rendering images may include an illustration image, a graphic image, a header image, a footer image, a photograph image, and so forth.

The signature manager may communicate the document image over the network to one or more client devices for rendering the document image. The client devices may be associated with clients, some of which may be signatories or signers targeted for electronically signing the document image from the client of the client device. The client device may have utilized various workflows to identify the signers and associated network addresses (e.g., email address, short message service, multimedia message service, chat message, social message, etc.). For example, the client may utilize workflows to identify multiple parties to the lease including bankers, landlord, and tenant. Further, the client may utilize workflows to identify network addresses (e.g., email address) for each of the signers. The signature manager may further be configured by the client to communicate the document image in series or in parallel. For example, the signature manager may utilize a workflow to configure communication of the document image in series, obtaining the signature of the first party before communicating the document image, including the signature of the first party, to a second party, then obtaining the signature of the second party before communicating the document image, including the signatures of the first and second parties, to a third party, and so forth. As a further example, the client may utilize workflows to configure communication of the document image in parallel to multiple parties including the first party, second party, third party, and so forth, to obtain the signatures of each of the parties irrespective of any temporal order of their signatures.
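The difference between series and parallel routing can be illustrated with a minimal sketch. The `Envelope` class, its fields, and its method names are hypothetical and are not taken from the disclosure or from any DocuSign API; the sketch only models which signers would receive the document image next under each mode.

```python
from dataclasses import dataclass, field


@dataclass
class Envelope:
    """Hypothetical model of a document image routed to signers."""
    signers: list[str]                    # signing order (used in series mode)
    mode: str = "series"                  # "series" or "parallel"
    signed: set[str] = field(default_factory=set)

    def next_recipients(self) -> list[str]:
        """Return the signers who should currently hold the document."""
        pending = [s for s in self.signers if s not in self.signed]
        if self.mode == "parallel":
            return pending        # everyone may sign in any order
        return pending[:1]        # series: only the next signer in line

    def sign(self, signer: str) -> None:
        """Record a signature, enforcing the routing mode."""
        if signer not in self.next_recipients():
            raise ValueError(f"{signer} may not sign yet")
        self.signed.add(signer)


serial = Envelope(["landlord", "tenant", "banker"], mode="series")
serial.sign("landlord")
print(serial.next_recipients())    # the tenant is next in series mode

parallel = Envelope(["landlord", "tenant", "banker"], mode="parallel")
print(parallel.next_recipients())  # all three parties at once
```

In series mode an out-of-order call to `sign` raises an error, which mirrors the requirement that each party sign before the document is forwarded to the next.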

The signature manager may communicate the document image to the one or more parties associated with the client devices in a page format. Communicating in page format, by the signature manager, ensures that entire pages of the document image are rendered on the client devices throughout the signing process. The page format is utilized by the signature manager to address potential legal requirements for binding a signer. The signature manager utilizes the page format because a signer is only bound to a legal document by which the signer intended to be bound. To satisfy the legal requirement of intent, the signature manager generates PDF image information for rendering the document image to the one or more parties with a “what you see is what you sign” (WYSIWYS) property. The WYSIWYS property ensures the semantic interpretation of a digitally signed message is not changed, either by accident or by intent. If the WYSIWYS property is ignored, a digital signature may not be enforceable at law. The WYSIWYS property recognizes that, unlike a paper document, a digital document is not bound by its medium of presentation (e.g., layout, font, font size, etc.) and a medium of presentation may change the semantic interpretation of its content. Accordingly, the signature manager anticipates a possible requirement to show intent in a legal proceeding by generating original PDF image information for rendering the document image in page format. The signature manager presents the document image on a screen of a display device in the same way the signature manager prints the document image on the paper of a printing device.

As previously described, the document manager may process a document container to generate a document image in a standard file format used by the system, such as an Adobe PDF, for example. Additionally, or alternatively, the document manager may also implement processes and workflows to prepare an electronic document stored in the document container. For instance, assume a client uses the client device to prepare an electronic document suitable for receiving an electronic signature, such as the lease agreement in the previous example. The client may use the client device to locally or remotely access document management tools, features, processes, and workflows provided by the document manager of the server device. The client may prepare the electronic document as a brand new originally written document, a modification of a previous electronic document, or from a document template with predefined information content. Once prepared, the signature manager may implement electronic signature (e-sign) tools, features, processes, and workflows provided by the signature manager of the server device to facilitate electronic signing of the electronic document.

In addition, as discussed above, the system may include a sensitive data identification engine. The sensitive data identification engine may implement a set of tools and/or algorithms to identify sensitive subjects in documents and/or portions of documents as candidates for redaction and/or replacement. The engine may be configured to receive one or more electronic documents and/or portions of documents, which may include text, graphics, images, and/or any other type of media. The engine may also be provided with one or more data subjects and/or sensitive data subjects that may need to be redacted and/or replaced within the received electronic documents. For example, the engine may be provided with a sensitive data subject corresponding to personal information (e.g., name, email address, etc.), a trade secret (e.g., a soft drink formula), commercially sensitive information (e.g., pre-initial public offering stock price), and/or any other non-public and/or secret information that is not to be publicly disclosed.

The engine may then process the received electronic documents and identify a plurality of text portions associated with one or more data subjects that it has been provided with. For instance, the engine may identify a portion of a sales agreement that contains a heading “trade secrets” and select that portion as potentially containing a sensitive data subject. The engine may also identify an entire document, which may be titled as or include “personal information”, and determine that it needs to be processed further to determine whether it contains a sensitive data subject that needs to be redacted and/or replaced.
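As a rough illustration of this selection step, the sketch below picks out sections whose heading or body mentions a keyword tied to a provided data subject. Plain keyword matching is an assumption made for brevity; the disclosure does not specify how portions are matched, and the function name `select_portions` and the sample sections are hypothetical.

```python
def select_portions(sections: dict[str, str], subject_keywords: list[str]) -> list[str]:
    """Return headings of sections that mention any data-subject keyword.

    `sections` maps a heading to its body text. Matching is a simple
    case-insensitive substring test (an illustrative stand-in).
    """
    selected = []
    for heading, body in sections.items():
        text = f"{heading}\n{body}".lower()
        if any(keyword.lower() in text for keyword in subject_keywords):
            selected.append(heading)
    return selected


sales_agreement = {
    "Trade Secrets": "The formula shall remain confidential.",
    "Payment Terms": "Rent is due on the first day of each month.",
}
print(select_portions(sales_agreement, ["trade secret"]))
```

Only the selected portions would then be passed on for entity extraction, which keeps the more expensive model-based processing focused on likely candidates.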

Once the specific electronic documents/portions are identified, the sensitive data identification engine may be configured to apply one or more machine learning (ML) model(s) to the identified documents/portions to extract one or more entities representative of one or more sensitive data subjects. The entities may be specific sentences, clauses, words, parties to agreements, individuals, commercial entities, formulas, equations, etc. and/or any other type of entities that may be present in the documents/portions. For example, an entity may be a soft drink formula; an entity may be a name of an individual; etc.

The engine may then group one or more entities into one or more entity groups. The engine may be configured to identify and/or select entities that may be linked to or connected with one entity. For example, an entity “name-person” (e.g., John Smith) may be linked with entities “name-(e) signature (text based)” (representing a text-based electronic signature of John Smith) and/or “name-(e) signature (image based)” (representing an image of the electronic signature of John Smith) into a single grouped entity “name-person”. Entities may be grouped based on semantic similarity and/or distance between entities (e.g., names, signatures, etc.). Further, entities may be grouped based on weights that may be assigned to the entities, which may represent the importance of the entities. For instance, higher weights may be assigned to an entity representing a trade secret soft drink formula and a manufacturing process using the formula, thereby linking the two entities based on the assigned weights. As can be understood, any other way of grouping entities into grouped entities is possible.
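One possible way to picture similarity-based grouping is sketched below. It is not the claimed method: token-overlap (Jaccard) similarity stands in for whatever semantic similarity measure an implementation would use, the greedy assignment does not merge two existing groups that a later entity bridges, and the `Entity` class, `threshold` parameter, and sample weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str            # surface text extracted from the document
    label: str           # entity type, e.g. "name-person"
    weight: float = 1.0  # assumed importance weight


def similarity(a: Entity, b: Entity) -> float:
    """Jaccard overlap of lowercase tokens (stand-in for semantic similarity)."""
    tokens_a = set(a.text.lower().split())
    tokens_b = set(b.text.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def group_entities(entities: list[Entity], threshold: float = 0.5) -> list[list[Entity]]:
    """Greedily place each entity into the first group containing a similar member."""
    groups: list[list[Entity]] = []
    for entity in entities:
        for group in groups:
            if any(similarity(entity, member) >= threshold for member in group):
                group.append(entity)
                break
        else:
            groups.append([entity])  # no similar group found: start a new one
    return groups


def group_weight(group: list[Entity]) -> float:
    """Treat a group as being as important as its most heavily weighted member."""
    return max(entity.weight for entity in group)


entities = [
    Entity("John Smith", "name-person"),
    Entity("john smith", "name-esignature-text"),
    Entity("Cola formula X", "trade-secret", weight=5.0),
]
for group in group_entities(entities):
    print([e.label for e in group], group_weight(group))
```

Here the typed name and the text-based signature fall into one group because their tokens match, while the trade secret entity forms its own, more heavily weighted group.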

Using the grouped entities, the sensitive data identification engine may be configured to identify at least one data subject that may be present in at least one document/portion for replacement or redaction. For instance, documents that include grouped entities of the trade secret soft drink formula and describe a manufacturing process involving the formula may be identified as containing a trade secret sensitive data subject and hence would be candidates for redaction/replacement.
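Once a group of entities has been tied to a data subject, locating and masking its mentions could look like the following minimal sketch. Exact, case-insensitive string matching and the `[REDACTED]` mask are assumptions chosen for illustration; the function names are hypothetical and the disclosure does not prescribe a particular redaction mechanism.

```python
import re


def redaction_spans(text: str, group_terms: list[str]) -> list[tuple[int, int]]:
    """Find (start, end) offsets of every mention of a grouped entity's terms."""
    spans = []
    for term in group_terms:
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end()))
    return sorted(spans)


def redact(text: str, spans: list[tuple[int, int]], mask: str = "[REDACTED]") -> str:
    """Replace each span with the mask, skipping spans that overlap an earlier one."""
    pieces, last = [], 0
    for start, end in spans:
        if start < last:
            continue  # overlaps a span that was already masked
        pieces.append(text[last:start])
        pieces.append(mask)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


clause = "Signed by John Smith. John Smith agrees."
spans = redaction_spans(clause, ["John Smith"])
print(redact(clause, spans))
```

Passing a different `mask` value, such as a pseudonym, would turn the same routine into a replacement step rather than a redaction step.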

The corresponding figure illustrates an example system showing operation of the sensitive data identification engine, according to some embodiments of the current subject matter. The sensitive data identification engine may include an entity extraction engine, an entity grouping engine, and a redaction identification engine. The sensitive data identification engine may also be communicatively coupled to one or more user devices. The engine may also implement one or more machine learning (ML) models. In some embodiments, one or more electronic documents and/or portions of documents (hereinafter, electronic documents) may be received by the engine for analysis and identification of sensitive data subjects for redaction and/or replacement.

One or more components of the system shown in the figure may be communicatively coupled using one or more communications networks. The communications networks may include one or more of the following: a wired network, a wireless network, a metropolitan area network (“MAN”), a local area network (“LAN”), a wide area network (“WAN”), a virtual local area network (“VLAN”), an internet, an extranet, an intranet, and/or any other type of network and/or any combination thereof.

Further, one or more components of the system may include any combination of hardware and/or software. In some embodiments, one or more components of the system may be disposed on one or more computing devices, such as server(s), database(s), personal computer(s), laptop(s), cellular telephone(s), smartphone(s), tablet computer(s), virtual reality devices, and/or any other computing devices and/or any combination thereof. In some example embodiments, one or more components of the system may be disposed on a single computing device and/or may be part of a single communications network. Alternatively, or in addition, such devices may be separately located from one another. A device may be a computing processor, a memory, a software functionality, a routine, a procedure, a call, and/or any combination thereof that may be configured to execute a particular function associated with interface and/or document certification processes disclosed herein.

In some embodiments, one or more components of the system may include network-enabled computers. As referred to herein, a network-enabled computer may include, but is not limited to, a computer device or communications device including, e.g., a server, a network appliance, a personal computer, a workstation, a phone, a smartphone, a handheld PC, a personal digital assistant, a thin client, a fat client, an Internet browser, or other device. One or more components of the system also may be mobile computing devices, for example, an iPhone, iPod, or iPad from Apple® and/or any other suitable device running Apple's iOS® operating system, any device running Microsoft's Windows® Mobile operating system, any device running Google's Android® operating system, and/or any other suitable mobile computing device, such as a smartphone, a tablet, or like wearable mobile device.

One or more components of the system may include a processor and a memory, and it is understood that the processing circuitry may contain additional components, including processors, memories, error and parity/CRC checkers, data encoders, anti-collision algorithms, controllers, command decoders, security primitives, and tamper-proofing hardware, as necessary to perform the interface and/or document certification functions described herein. One or more components of the system may further include one or more displays and/or one or more input devices. The displays may be any type of devices for presenting visual information such as a computer monitor, a flat panel display, and a mobile device screen, including liquid crystal displays, light-emitting diode displays, plasma panels, and cathode ray tube displays. The input devices may include any device for entering information into the user's device that is available and supported by the user's device, such as a touchscreen, keyboard, mouse, cursor-control device, microphone, digital camera, video recorder, or camcorder. These devices may be used to enter information and interact with the software and other devices described herein.

In some example embodiments, one or more components of the system may execute one or more applications, such as software applications, that enable, for example, network communications with one or more components of the system and the transmission and/or receipt of data.

One or more components of the system may include and/or be in communication with one or more servers via one or more networks and may operate as a respective front-end to back-end pair with one or more servers. One or more components of the system may transmit, for example from a mobile device application (e.g., executing on one or more user devices, components, etc.), one or more requests to one or more servers. The requests may be associated with retrieving data from servers (e.g., retrieving one or more electronic documents from one or more document storage sources that may store electronic documents). The servers may receive the requests from the components of the system. Based on the requests, servers may be configured to retrieve the requested data from one or more storage locations. Based on receipt of the requested data from the databases, the servers may be configured to transmit the received data to one or more components of the system, where the received data may be responsive to one or more requests.

Patent Metadata

Filing Date

Unknown

Publication Date

October 9, 2025

Inventors

Unknown

