Systems and methods are disclosed comprising instructions to receive a request to evaluate authorization of a development service that comprises a digital artifact set, each digital artifact in the digital artifact set, access an authorization schema set available for the development service, identify an applicable authorization schema from the authorization schema set via comparing the content embeddings of the digital artifacts and the reference embeddings of the authorization schemas, retrieve a historical artifact attribute set representing tracked development actions for prior development services authorized via the applicable authorization schema, predict an authorization status for the development service using the historical artifact attribute set and the artifact attribute set, configure for display a visual representation of the applicable authorization schema and the mapped at least one digital artifact of the development service.
Legal claims defining the scope of protection, as filed with the USPTO.
1. A computer-implemented method for executing interactive review services of one or more development services, the computer-implemented method comprising: receiving a request to evaluate authorization of a development service that corresponds to a digital artifact comprising a content embedding for an artifact attribute associated with tracked actions of the development service; inputting the content embedding and at least one reference embedding of at least one available authorization schema into a first AI model to identify, from the at least one available authorization schema, an applicable authorization schema for authorizing the development service, comparison of the content embedding of the digital artifact and a reference embedding of the applicable authorization schema satisfying a similarity threshold value; retrieving a historical artifact attribute associated with tracked actions for prior development services authorized via the applicable authorization schema; inputting the historical artifact attribute, the artifact attribute, and attribute threshold values of the applicable authorization schema into a second AI model to generate, for the development service, an authorization status indicating whether the artifact attribute satisfies the attribute threshold values of the applicable authorization schema; automatically generating for transmission, a representation of a formatted export artifact based on an artifact template comprising a required field query that corresponds to an empty input field; and responsive to receiving a positive user indication for authorizing the development service using the applicable authorization schema, updating the representation for the required field query of the formatted export artifact by: inputting the artifact attribute and the historical artifact attribute into a third AI model to generate a human-readable narrative for the required field query; and automatically updating the representation of the formatted export artifact to populate the empty input field of the required field query with the human-readable narrative.
2. The computer-implemented method of claim 1, further comprising: responsive to receiving a negative user indication for the applicable authorization schema: automatically generating a model prediction training sample comprising an input data based on the historical artifact attribute and the artifact attribute of the digital artifact and an output label based on the authorization status of the applicable authorization schema; accessing a stored model prediction training sample corresponding to predicted authorization statuses of prior applicable authorization schemas; and retraining, using the stored model prediction training sample and the model prediction training sample, the first AI model, the second AI model, the third AI model, or a combination thereof.
3. The computer-implemented method of claim 1, further comprising: responsive to receiving a negative user indication identifying an attribute threshold subset of the attribute threshold values not satisfied by the artifact attribute: inputting into a fourth AI model, the historical artifact attribute, the artifact attribute, and the attribute threshold subset to generate an adjusted artifact attribute satisfying the attribute threshold subset; and configuring for transmission, to an authorized editor of the digital artifact, the attribute threshold subset and the adjusted artifact attribute.
4. The computer-implemented method of claim 1, wherein the representation is configured to transmit the artifact attribute of the digital artifact associated with the tracked actions of the development service.
5. The computer-implemented method of claim 4, further comprising: configuring for transmission a distinct visual marking over an artifact attribute subset that corresponds to artifact attributes satisfying the attribute threshold values of the applicable authorization schema.
6. The computer-implemented method of claim 4, wherein the authorization status further indicates whether the attribute threshold values of the applicable authorization schema is partially satisfied.
7. The computer-implemented method of claim 6, further comprising: configuring for transmission a distinct visual marking over an artifact attribute subset that corresponds to artifact attributes partially satisfying the attribute threshold values of the applicable authorization schema.
8. The computer-implemented method of claim 7, further comprising: causing the third AI model to generate a human-readable recommendation for adjusting one or more artifact attributes from the artifact attribute subset to satisfy the attribute threshold values of the applicable authorization schema; and configuring for transmission the human-readable recommendation.
9. The computer-implemented method of claim 1, wherein the applicable authorization schema comprises regulatory policies, predetermined evaluation rulesets, narrative guidelines, fiscal procedures, or any combination thereof.
10. The computer-implemented method of claim 1, wherein the representation of the applicable authorization schema is further configured to transmit a comparative diagram that maps a first mapping of content similarities between the historical artifact attribute and the artifact attribute and a second mapping of content differences between the historical artifact attribute and the artifact attribute.
11. The computer-implemented method of claim 1, further comprising: obtaining a first sequence of intermediary logic operations executed during operation of the first AI model and a second sequence of intermediary logic operations executed during operation of the second AI model; causing the third AI model to generate, using the first sequence of intermediary logic operations and the second sequence of intermediary logic operations, a human-readable narrative explaining a logical sequence resulting in the authorization status of the applicable authorization schema; and configuring for transmission the human-readable narrative alongside the representation of the applicable authorization schema.
12. The computer-implemented method of claim 1, further comprising: monitoring one or more intermediary logic operations from invocation of the first AI model to generate output reference identifiers of the applicable authorization schema; and invoking, during the invocation of the first AI model, a generative AI model to output at least one human-readable explanation based on input comprising the one or more intermediary logic operations, the at least one human-readable explanation indicating incremental logic for the one or more intermediary logic operations.
13. A system for executing interactive review services of one or more development services, the system comprising: at least one hardware processor; and at least one non-transitory memory storing instructions, which, when executed by the at least one hardware processor, causes the system to: receive a request to evaluate authorization of a development service that corresponds to a digital artifact comprising a content embedding for an artifact attribute associated with tracked actions of the development service; input the content embedding and at least one reference embedding of at least one available authorization schema into a first AI model to identify an applicable authorization schema for authorizing the development service, comparison of the content embedding of the digital artifact and a reference embedding of the applicable authorization schema satisfying a similarity threshold value; retrieve a historical artifact attribute associated with tracked actions for prior development services authorized via the applicable authorization schema; input the historical artifact attribute, the artifact attribute, and attribute threshold values of the applicable authorization schema into a second AI model to generate an authorization status indicating whether the artifact attribute satisfies the attribute threshold values of the applicable authorization schema; automatically generate for transmission a representation of a formatted export artifact based on an artifact template comprising required field query that corresponds to an empty input field; and responsive to receiving a positive user indication for authorizing the development service using the applicable authorization schema, update the representation for the required field query of the formatted export artifact by: input the artifact attribute and the historical artifact attribute into a third AI model to generate a human-readable narrative for the required field query; and automatically update the representation of the formatted export artifact to populate the empty input field of the required field query with the human-readable narrative.
14. The system of claim 13, further caused to: responsive to receiving a negative user indication for the applicable authorization schema: automatically generate a model prediction training sample comprising an input data based on the historical artifact attribute and the artifact attribute of the digital artifact and an output label based on the authorization status of the applicable authorization schema; access a stored model prediction training sample corresponding to predicted authorization statuses of prior applicable authorization schemas; and retraining, using the stored model prediction training sample and the model prediction training sample, the first AI model, the second AI model, the third AI model, or a combination thereof.
15. The system of claim 13, further caused to: responsive to receiving a negative user indication identifying an attribute threshold subset of the attribute threshold values not satisfied by the artifact attribute: input into a fourth AI model, the historical artifact attribute, the artifact attribute, and the attribute threshold subset to generate an adjusted artifact attribute satisfying the attribute threshold subset; and configure for transmission, to an authorized editor of the digital artifact, the attribute threshold subset and the adjusted artifact attribute.
16. The system of claim 15, further caused to: obtain a first sequence of intermediary logic operations executed during operation of the first AI model and a second sequence of intermediary logic operations executed during operation of the second AI model; cause the third AI model to generate, using the first sequence of intermediary logic operations and the second sequence of intermediary logic operations, a human-readable narrative explaining a logical sequence resulting in the authorization status of the applicable authorization schema; and configure for transmission a human-readable narrative alongside the representation of the applicable authorization schema.
17. One or more non-transitory, computer-readable storage media comprising instructions recorded thereon, wherein the instructions when executed by at least one data processor of a system for executing interactive review services of one or more development service, cause the system to: receive a request to evaluate authorization of a development service that corresponds to a digital artifact comprising a content embedding for an artifact attribute associated with tracked actions of the development service; input the content embedding and at least one reference embedding of at least one available authorization schema into a first AI model to identify an applicable authorization schema for authorizing the development service, comparison of the content embedding of the digital artifact and a reference embedding of the applicable authorization schema satisfying a similarity threshold value; retrieve a historical artifact attribute associated with tracked actions for prior development services authorized via the applicable authorization schema; and input the historical artifact attribute, the artifact attribute, and attribute threshold values of the applicable authorization schema into a second AI model to generate an authorization status indicating whether the artifact attribute satisfies the attribute threshold values of the applicable authorization schema.
18. The one or more non-transitory, computer-readable storage media of claim 17, wherein the instructions further cause the system to: configure for transmission a representation of the applicable authorization schema and the digital artifact of the development service, wherein the authorization status of the applicable authorization schema indicates satisfaction of the attribute threshold values.
19. The one or more non-transitory, computer-readable storage media of claim 17, wherein the instructions further cause the system to: responsive to receiving a positive user indication for the applicable authorization schema: automatically generate an export artifact based on an artifact template, the artifact template comprising at least one required field query; and input the artifact attribute and the historical artifact attribute into a fourth AI model to generate at least one human-readable narrative for the at least one required field query.
20. The one or more non-transitory, computer-readable storage media of claim 18, wherein the representation is configured to display the artifact attribute of the digital artifact associated with the tracked actions of the development service, and wherein the instructions further cause the system to: configure for display a distinct visual marking over an artifact attribute subset that corresponds to artifact attributes satisfying the attribute threshold values of the applicable authorization schema.
Complete technical specification and implementation details from the patent document.
This application is a continuation of U.S. patent application Ser. No. 19/185,195 entitled “ROBUST ARTIFACTS MAPPING AND AUTHORIZATION SYSTEMS AND METHODS FOR OPERATING THE SAME” and filed Apr. 21, 2025, which is a continuation-in-part of U.S. patent application Ser. No. 19/182,588 entitled “REMEDIATION OF UNSTRUCTURED DATA USING ARTIFICIAL INTELLIGENCE” and filed Apr. 18, 2025, which is a continuation-in-part of U.S. patent application Ser. No. 19/050,084 entitled “DETECTING DATA ANOMALIES USING ARTIFICIAL INTELLIGENCE” and filed Feb. 10, 2025, which is a continuation-in-part of U.S. patent application Ser. No. 18/736,407 entitled “OUT-OF-DISTRIBUTION PREDICTION” and filed Jun. 6, 2024. The content of the foregoing applications is incorporated herein by reference in their entireties.
In response to widespread concerns that the economic performance of the United States had significantly underperformed its potential, Congress enacted the Economic Recovery Tax Act (ERTA) of 1981. The ERTA was designed to function as an economic stimulus to promote investment within the United States. Congress recognized that declines in research spending had adversely affected the nation's economic growth, productivity gains, and competitiveness in the global marketplace, as evidenced by the decline of the U.S. automotive industry. The ERTA included a provision known as the ‘Credit for Increasing Research Activities’ (the Credit), which aimed to reverse the decline in U.S. research spending by incentivizing year-over-year increases in research expenditures.
Initially articulated in House Report No. 97-201 (H.R. 4242) and subsequently codified by the Tax Reform Act of 1986, ‘Qualified Research’ generally refers to private sector or commercially driven development efforts intended to foster innovation within scientific or technological fields. However, administrative challenges and divergent interpretations by the Internal Revenue Service (IRS) and taxpayers have necessitated numerous revisions to the relevant Code Section and associated Treasury Regulations.
In practice, ‘Qualified Research’ is frequently distilled into a “Four Part Test” to provide a reference framework. However, this simplification can be misleading due to the numerous requirements or elements within each “Test” and the extensive Regulations that supplement certain parts of Section 41 with examples. This convention underscores the necessity for detailed evaluations and documentation of taxpayer research efforts over time at the business component level. This evaluation is further complicated by a substantial body of case law and the need to reconcile research activities with allowable expenditures.
The technologies described herein will become more apparent to those skilled in the art from studying the Detailed Description in conjunction with the drawings. Implementations describing aspects of the invention are illustrated by way of example, and the same references can indicate similar elements. While the drawings depict various implementations for the purpose of illustration, those skilled in the art will recognize that alternative implementations can be employed without departing from the principles of the present technologies. Accordingly, while specific implementations are shown in the drawings, the technology is amenable to various modifications.
Experimental projects (e.g., research studies) and/or development-inclined services (“development services” or “development service”) are often eligible for exclusive material benefits (e.g., discounted resource costs, fiscal deductions, and/or the like) offered by large institutions (e.g., a national government, a private enterprise, a taxation authority and/or the like) to promote collective innovation and discovery. Sponsors of such exclusive benefits (e.g., tax credits) typically restrict distributions to select projects and/or services that satisfy a narrow set of qualifications (e.g., a high threshold of innovation, an extensive process of experimentation, and/or the like). Accordingly, proper accreditation and/or authorization of such material benefits requires candidates to submit comprehensive and/or extensive evidence (e.g., development documentation, experimental reports, design plans, and/or the like) indicating that a select development service exhibits these target qualifications. Additionally, these experimental projects and/or development services (e.g., firmware research and/or innovations) are often eligible for refactoring their reported assets and/or expenses as amortized entities that are distributed across multiple time intervals (e.g., capitalization of software development expenditures). Similarly, regulatory institutions validating these reported assets and/or expenses (e.g., a federal revenue service and/or administration) typically require candidates to submit meticulous and consistent documentation of the appropriated capitalization (e.g., software development cost amortization under Section 174 of the Internal Revenue Code (IRC) of the United States of America) for each consecutive time interval (e.g., a fiscal year).
However, conventional methods typically rely on a manual vetting process (e.g., performed by a human analyst) to identify and submit relevant artifacts (e.g., digital documentation) that demonstrate eligibility of the development service (e.g., for a material benefit and/or amortization of assets and expenditures). Manual selection of the necessary supporting documentation is a time- and resource-expensive process that often requires several hours, or days, to complete, and is also subject to the vagaries and errors of human collection activities, storage activities, and analysis. As a result, existing systems are typically slow, inefficient, lacking appropriate and available data, and often inaccurate/insufficient at submitting large volumes of authorization requests (e.g., applications for benefit distribution). To further compound this issue, development projects and/or services (e.g., improvements to computing devices) can naturally generate numerous artifacts (e.g., administrative documentation) that may be inconsequential and/or irrelevant with respect to eligibility for a specific material benefit (e.g., fiscal deductions for development of novel firmware) or amortization of assets and/or expenditures. Accordingly, manual review of excessively large quantities of information may result in oversight of critical documentation and/or artifacts that would otherwise qualify a development service for a specific benefit. As a result, these and other problems of inefficient manual preparation of development service artifacts can significantly diminish the overall amount of entitled benefits, place undue burden on staff support teams, and so forth.
Attempting to create a system that automates management and preparation of documentation supporting authorization (e.g., accreditation of exclusive material benefits and/or amortized entities) for development services presented significant technical challenges. Developing such a system required addressing several limitations in conventional approaches to data management, such as the difficulty in selectively identifying supporting digital artifacts that enable qualification of a select development service. As described herein, traditional data management systems rely on manual vetting processes (e.g., via a human analyst) to identify supporting documentation, which can result in extended manual review periods, oversight of supporting resources (e.g., an applicable digital artifact), and/or erroneous identification of ineligible resources (e.g., a digital artifact incompatible with qualification criteria for select material benefits and/or amortization of assets) for projects and/or services that correspond to large quantities of data records. As a result, conventional methods often lead to inefficient or insufficient resource utilization and inaccuracies (e.g., failure to identify eligible/ineligible digital artifacts) when managing and preparing supporting documentation.
To address these technical challenges, multiple design approaches were evaluated. With respect to generating qualification criteria, one approach included directly inputting unstructured eligibility criterion data (e.g., natural language documents describing qualifications for specified material benefits and/or asset amortization requirements) into an AI model as training data without normalization or additional pre-processing, thus passively relying on the AI model's ability to extract comparative signals (e.g., eligibility rules, qualification thresholds, and/or the like) for evaluating digital artifact contents. However, implicit determination of the eligibility criteria would lead to inconsistent, and often inaccurate, predictions that reduce the overall accuracy and reliability of the AI model, since the AI model maintains a fluid characterization of the eligibility criteria (e.g., new criteria determined for each subsequent evaluation). With respect to evaluating qualification of the development services, a similar approach included direct input of digital artifact contents (e.g., project documentation, recorded experimental results, and/or the like) into the AI model and relying on the AI model's ability to accurately map supporting digital artifacts to eligible benefits (e.g., tax credits, fiscal expenditure capitalization, and/or the like). Although this method streamlined the process for identifying digital artifacts with relevance to qualification criteria for select material benefits, the lack of additional supervisory oversight for the AI model resulted in inaccurate assessments of whether the identified artifacts satisfied the qualification criteria and supported eligibility of the corresponding development service to receive such material benefits and/or amortization of assets. 
Further, this approach entailed indiscriminate and direct input of all digital artifact contents associated with individual projects into the AI model, resulting in significant resource inefficiencies and incremental cost associated with the use of the AI model.
As such, the inventors have developed hybrid systems, methods, and computer-readable media for automatically and/or semi-automatically evaluating authorization of development services (e.g., experimental projects). For example, the system can evaluate an approximate authorization status (e.g., eligibility likelihood) for a development service (e.g., an experimental project) to claim a material benefit (e.g., a fiscal deduction) and/or valid amortization of assets and expenditures based on digital artifacts (e.g., associated project documentation) and/or artifact attributes that track development actions (e.g., experimentation results, design version control, and/or the like) of the development service. The system can identify and/or associate applicable authorization schemas (e.g., relevant fiscal credit policies and/or guidelines) for the development service by comparing embeddings associated with the digital artifacts and the authorization schemas. Accordingly, the system can evaluate digital artifact attributes (e.g., development narratives, financial records, and/or the like) via statistical inference models (e.g., machine learning models, small language models, large language models, and/or the like) to predict the approximate authorization status for the development service. In some implementations, the system can configure a user interface to display detailed contents of the authorization schemas and corresponding digital artifacts via visual representations displayed at a user interface of an authorized user (e.g., an accountant or auditor for benefit claims). In some implementations, the system can first provide a detailed analysis (e.g., via a dashboard within the user interface) as to the likelihood of successfully obtaining a material benefit and/or validation of asset amortization based on the authorization schemas and corresponding digital artifacts. 
In some implementations, the detailed analysis can evaluate, and present, an individualized likelihood of obtaining a material benefit and/or validation of asset amortization from specific sponsors and/or institutions, which can be based on data provided from sources external to the party seeking the material benefit and/or based on prior submissions for material benefits (e.g., submissions based on the same or similar authorization schemas and the same or similar digital artifacts to those under consideration). In some implementations, the system can provide an interactive chatbot (e.g., enabled via generative machine learning models) to enable and respond to user inquiries for information associated with the authorization schemas, corresponding digital artifacts, or detailed analysis. In response to positive user validation (e.g., receiving accountant or auditor approval that the authorization schemas and corresponding digital artifacts are appropriate and desired for submission to a material benefit provider and/or asset amortization validator), the system can automatically generate an export digital artifact (e.g., a fiscal reporting document) that incorporates necessary digital artifact contents to satisfy attribute thresholds associated with the authorization schemas.
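As a rough illustration of the embedding comparison described above, the following minimal Python sketch selects an applicable authorization schema by comparing a content embedding against the reference embeddings of the available schemas. It is a sketch under stated assumptions and not the disclosed implementation: the use of cosine similarity, the 0.8 similarity threshold, the `AuthorizationSchema` class, the `find_applicable_schema` function, and the toy schema names and vectors are all hypothetical, and a simple best-match search stands in for the AI model that performs the identification.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class AuthorizationSchema:
    """Hypothetical container pairing a schema name with its reference embedding."""
    name: str
    reference_embedding: tuple[float, ...]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_applicable_schema(content_embedding, schemas, similarity_threshold=0.8):
    """Return (schema, score) for the best match at or above the threshold, else None.

    The threshold value is an assumption chosen only for illustration.
    """
    best = None
    for schema in schemas:
        score = cosine_similarity(content_embedding, schema.reference_embedding)
        if score >= similarity_threshold and (best is None or score > best[1]):
            best = (schema, score)
    return best


# Toy three-dimensional embeddings; real embeddings would come from an embedding model.
schemas = [
    AuthorizationSchema("research_credit_policy", (0.9, 0.1, 0.0)),
    AuthorizationSchema("software_amortization_policy", (0.0, 0.2, 0.9)),
]
artifact_embedding = (0.8, 0.2, 0.1)

match = find_applicable_schema(artifact_embedding, schemas)
if match is not None:
    schema, score = match
    print(f"applicable schema: {schema.name} (similarity {score:.2f})")
else:
    print("no schema satisfies the similarity threshold")
```

Returning no match when no reference embedding reaches the threshold leaves room for a fallback such as manual review by the authorized user; that fallback is likewise an assumption of this sketch.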
For illustrative purposes, examples are described herein in the context of computer systems for evaluating authorization of requested development services (e.g., assessed eligibility of satisfying a benefit claim). However, a person skilled in the art will appreciate that the disclosed system can be applied in other contexts. For example, the disclosed system can be used within a healthcare-related environment to identify (e.g., via a user initiated prompt for a generative machine learning model or an automatic execution) an applicable diagnosis for a given medical condition based on reference sources (e.g., Gray's Anatomy: The Anatomical Basis of Clinical Practice, Mosby's Dictionary of Medicine, and/or the like) in combination with relevant evidence (e.g., for the body of a given patient, from an environment to which the patient is or was exposed, and/or the like) that supports the identified diagnosis.
The description and associated drawings are illustrative examples and are not to be construed as limiting. This disclosure provides certain details for a thorough understanding and enabling description of these examples. One skilled in the relevant technology will understand, however, that the invention can be practiced without many of these details. Likewise, one skilled in the relevant technology will understand that the invention can include well-known structures or features that are not shown or described in detail, to avoid unnecessarily obscuring the descriptions of examples.
FIG. 1 is a system diagram illustrating an example of a computing environment in which the disclosed system operates in some implementations. In some implementations, environment 100 includes one or more client computing devices 105A-D, examples of which can host the service authorization system 200 of FIG. 2. Client computing devices 105 operate in a networked environment using logical connections through network 130 to one or more remote computers, such as a server computing device.
In some implementations, server 110 is an edge server which receives client requests and coordinates fulfillment of those requests through other servers, such as servers 120A-C. In some implementations, server computing devices 110 and 120 comprise computing systems, such as the service authorization system 200 of FIG. 2. Though each server computing device 110 and 120 is displayed logically as a single server, server computing devices can each be a distributed computing environment encompassing multiple computing devices located at the same or at geographically disparate physical locations. In some implementations, each server 120 corresponds to a group of servers.
Client computing devices 105 and server computing devices 110 and 120 can each act as a server or client to other server or client devices. In some implementations, servers (110, 120A-C) connect to a corresponding database (115, 125A-C). As discussed above, each server 120 can correspond to a group of servers, and each of these servers can share a database or can have its own database. Databases 115 and 125 warehouse (e.g., store) information such as claims data, email data, call transcripts, call logs, policy data and so on. Though databases 115 and 125 are displayed logically as single units, databases 115 and 125 can each be a distributed computing environment encompassing multiple computing devices, can be located within their corresponding server, or can be located at the same or at geographically disparate physical locations.
Network 130 can be a local area network (LAN) or a wide area network (WAN) but can also be other wired or wireless networks. In some implementations, network 130 is the Internet or some other public or private network. Client computing devices 105 are connected to network 130 through a network interface, such as by wired or wireless communication. While the connections between server 110 and servers 120 are shown as separate connections, these connections can be any kind of local, wide area, wired, or wireless network, including network 130 or a separate public or private network.
FIG. 2 is a block diagram that illustrates a service authorization system 200 (“system 200”) that can implement aspects of the present technology. The components shown in FIG. 2 are merely illustrative, and well-known components are omitted for brevity. As shown, the computing server 202 includes a processor 210, a memory 220, wireless communication circuitry 230 to establish wireless communication and/or information channels (e.g., Wi-Fi, internet, APIs, communication standards) with other computing devices and/or services (e.g., servers, databases, cloud infrastructure), and a display 240 (e.g., user interface). The processor 210 can have generic characteristics similar to general-purpose processors, or the processor 210 can be an application-specific integrated circuit (ASIC) that provides arithmetic and control functions to the computing server 202. While not shown, the processor 210 can include a dedicated cache memory. The processor 210 can be coupled to all components of the computing server 202, either directly or indirectly, for data communication. Further, the processor 210 of the computing server 202 can be communicatively coupled to a computing database 204 that is hosted alongside the computing server 202 on the core network 106 described in reference to FIG. 1. As shown, the computing database 204 can include a machine learning (ML) database 250, a digital artifact database 252, an authorization schema database 254, and a historical evaluation database 256.
The memory 220 can comprise any suitable type of storage device including, for example, a static random-access memory (SRAM), dynamic random-access memory (DRAM), electrically erasable programmable read-only memory (EEPROM), flash memory, latches, and/or registers. In addition to storing instructions that can be executed by the processor 210, the memory 220 can also store data generated by the processor 210 (e.g., when executing the modules of an optimization platform). In additional or alternative implementations, the processor 210 can store temporary information onto the memory 220 and store long-term data onto the computing database 204. The memory 220 is merely an abstract representation of a storage environment. Hence, in some implementations, the memory 220 comprises one or more actual memory chips or modules.
As shown in FIG. 2, modules of the memory 220 can include a schematization engine 221, an artifact prediction engine 222, a logic evaluation engine 223, a generative export engine 224, an anomaly evaluation engine 225, and a data management engine 226. Other implementations of the computing server 202 include additional, fewer, or different modules, or distribute functionality differently between the modules. As used herein, the term “module” and/or “engine” refers broadly to software components, firmware components, and/or hardware components. Accordingly, the engines 221, 222, 223, 224, 225, and 226 could each comprise software, firmware, and/or hardware components implemented in, or accessible to, the computing server 202.
The system 200 can be configured to evaluate authorization and/or qualification of candidate development services (e.g., experimental projects, research inquiries, and/or the like) to obtain a limited material benefit (e.g., a fiscal expense reduction) and/or amortize select assets or expenditures (e.g., capitalization of project development costs over multiple fiscal years). For example, the system 200 can evaluate digital artifacts of a candidate development service (e.g., a service pending, or intended for, authorization) to identify supporting documentation and/or historical records that qualify the service for the specified material benefits and/or amortization of assets or expenditures. In a non-limiting manner, examples of a digital artifact (e.g., of a development service) can include a progression report (e.g., tracked development activities), an experimentation result (e.g., a tested hypothesis), a fiscal expenditure record, and/or a combination thereof. The system 200 can identify and/or designate individual digital artifacts associated with a development service as supporting documentation to a claim for a material benefit, an amortization of entities (e.g., assets and/or expenditures), or both. In some implementations, the system 200 can be configured to receive digital artifacts for a select development service from participant users (e.g., inventors, researchers, project managers, and/or individuals of similar roles) via a user interface. For example, the system 200 can periodically transmit a digital form to participant users to request updated information regarding progression and/or current state of the development service. The transmitted digital form can be configured to include one or more queries (e.g., text-based questions) and corresponding input components (e.g., modifiable text boxes) enabling participant users to provide the requested information.
Accordingly, the system 200 can receive, and record, a digital response form (e.g., a completed version of the digital form) as a new digital artifact for the development service. In some implementations, the system 200 can be configured to extract, and reformat, information received from the participant users into a standardized data structure format for the stored digital artifacts. The system 200 can store the received digital artifacts in the digital artifact database 252.
In some implementations, the system 200 can be configured to store digital artifacts designated for supporting claims to a material benefit (e.g., a tax credit) within a separate database from digital artifacts designated for supporting claims for amortization of assets and/or expenditures (e.g., capitalization of software development costs). As an illustrative example, the digital artifact database 252 can include two discrete partitions, such that the first partition corresponds to stored digital artifacts for supporting a material benefit claim and the second partition corresponds to stored digital artifacts for supporting an entity amortization claim. In additional or alternative implementations, the system 200 can be configured to store classification labels (e.g., metadata tags) for each digital artifact, such that the classification label indicates whether the digital artifact is eligible for supporting a material benefit claim and/or an asset amortization claim (e.g., while storing the digital artifacts within a consolidated digital artifact database 252).
The schematization engine 221 can be configured to generate authorization schemas available for evaluating qualifications of development services (e.g., for a restricted process, an entitlement to material benefits, an eligibility for amortization of assets or expenditures, and/or the like). An authorization schema represents data structures that define required evaluation criteria (e.g., mandatory attribute-based thresholds for satisfaction by supporting digital artifacts) to authorize a candidate development service (e.g., for a restricted process, a limited material benefit, an asset/expenditure capitalization, and/or the like). For example, the schematization engine 221 can generate an authorization schema that comprises functional evaluation criteria for determining eligible entitlement of a development service to a limited material benefit (e.g., a fiscal expense reduction), or amortization of select assets and/or expenditures over specified time intervals (e.g., a plurality of fiscal years). In some implementations, an individual authorization schema can correspond to an entitlement to a single material benefit (e.g., an individual tax credit claim) or a grouped plurality of material benefits (e.g., an enumerated list of tax credit claims).
Authorization schemas can include predetermined rules (e.g., a binary criterion, a set of value-based conditions, etc.) that are required to be fulfilled by digital artifacts for authorization. For example, an authorization schema can include a logical condition that requires digital artifacts of a candidate development service to include attributes specifying a jurisdiction, a recordation date, an authentication (e.g., an official notarization), a specified format type (e.g., a standardized template), and/or the like. An authorization schema can include preferred sources of data such as specific databases, report types, and metadata (e.g., project tracking systems, slide decks seeking approval to initiate projects, progress reports on technical developments). Authorization schemas can include embedded content requirements of supporting digital artifacts that correspond to additional context information that is represented via non-standardized rules and/or qualifications (e.g., non-conforming to strict logic-based criteria). For example, an authorization schema can include a text-based rule (e.g., a tax credit policy, an internal enterprise statute) that requires associated digital artifacts to include media contents (e.g., text data, images, audio, and/or the like) that indicate and/or describe a pertinent relationship between the development service and a specified field of interest (e.g., applicability of a software application project for a firmware-focused tax credit policy). In another example, an authorization schema can include a narrative guideline that describes qualifying criteria and/or additional context information of digital artifacts that enables the corresponding development service to be eligible under a text-based rule (e.g., a historical advisory opinion of regulatory institutions, lawyers, or accountants on eligibility for candidate development services based on prior digital artifacts and their contents).
In another example, an authorization schema can include a prior digital artifact that includes exemplary contextual information (e.g., project descriptions, experimental reports, financial documentation, and/or the like) and/or formatting that resulted in a positive (e.g., or negative) evaluation result indicating eligibility of a previous development service under a text-based rule. Accordingly, authorization schemas can represent unified data structures that incorporate both standardized (e.g., predetermined rules) and non-standardized (e.g., regulatory policies, narrative guidelines, fiscal procedures, and/or the like) requirements of digital artifact contents to qualify a development service (e.g., for a restricted process, entitled benefit, and/or the like).
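As a non-limiting illustration, a unified authorization schema that holds both standardized rules and non-standardized narrative criteria could be sketched as follows. The class name `AuthorizationSchema`, its fields, and the example rule values are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A rule receives an artifact's attribute dictionary and returns True when satisfied.
Rule = Callable[[Dict[str, object]], bool]


@dataclass
class AuthorizationSchema:
    """Hypothetical container for one authorization schema (illustrative only)."""
    name: str
    rules: List[Rule] = field(default_factory=list)              # standardized criteria
    narrative_criteria: List[str] = field(default_factory=list)  # non-standardized criteria

    def rules_satisfied(self, attributes: Dict[str, object]) -> bool:
        # Every predetermined rule must hold for the artifact attributes to pass.
        return all(rule(attributes) for rule in self.rules)


# Assumed example: the schema requires a jurisdiction value and a recordation date.
schema = AuthorizationSchema(
    name="example-benefit",
    rules=[
        lambda a: a.get("jurisdiction") == "US",
        lambda a: "recordation_date" in a,
    ],
    narrative_criteria=[
        "Artifact describes a relationship to the specified field of interest."
    ],
)
```

In this sketch the narrative criteria are stored as plain text only; evaluating them would require a content comparison such as the embedding-based approach described elsewhere herein.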
The schematization engine 221 can be configured to receive an input corpus (e.g., regulatory policies, narrative guidelines, and/or the like) of qualification criteria that define eligibility requirements of supporting digital artifacts (e.g., project development reports, experimental results, fiscal documentation, and/or the like) for a candidate development service. In a non-limiting manner, examples of the input corpus can include text-based statutes of regulatory policies (e.g., an institutional tax code), predetermined logic-based rules (e.g., required properties of supporting digital artifacts), narrative guidelines (e.g., an advisory opinion of qualified individuals and/or institutions), and/or additional context information pertinent to qualifying the development service (e.g., internal procedural rules for a private entity). For simplicity of illustration, the input corpus of the qualification criteria is described herein in the form of text-based data sources (e.g., written documentation, alphanumeric data structures, and/or the like). However, a person of ordinary skill in the art will appreciate that the input corpus can include alternative forms and/or media for representing eligibility details for a development service, such as an image (e.g., visual examples of eligibility requirements), audio (e.g., recorded conversations of eligibility requirements), and/or the like.
Using the input corpus, the schematization engine 221 can generate an authorization schema data structure that comprises functional evaluation criteria for determining qualification of the candidate development service. For example, the schematization engine 221 can configure the authorization schema to include attribute-based thresholds (e.g., a binary value condition, a content similarity threshold, and/or the like) that are used to assess satisfaction of the defined qualification parameters by supporting digital artifacts of the candidate development service. In some implementations, the schematization engine 221 can analyze a subset of the input corpus to generate standardized evaluation criteria for assessing candidate development services. As an illustrative example, the schematization engine 221 can input portions of the corpus associated with structured and/or quantitative eligibility requirements (e.g., an enumerated legal statute of limited qualifying conditions) to a machine learning model (e.g., a semantic natural language model, a generative machine learning model, and/or the like) to reduce the qualification criteria to simplified logic-based rules for identifying and assessing target attributes (e.g., legal jurisdiction, record datetime, identifiable information, summative variable values, structured metadata, and/or the like) of the supporting digital artifacts. As another example, the schematization engine 221 can receive (e.g., via a user interface) an external user input that specifies additional standardized evaluation criteria (e.g., logic-based rules, procedural analysis scripts, and/or the like) for assessing the candidate development service.
In other implementations, the schematization engine 221 can analyze a subset of the input corpus to generate non-standardized evaluation criteria for assessing candidate development services. The schematization engine 221 can use a machine learning model (e.g., a semantic natural language model, a generative machine learning model, and/or the like) to embed portions of the corpus associated with qualitative eligibility requirements to determine a reference identifier (e.g., a uniform embedding vector) that represents the corpus subset. For example, the schematization engine 221 can be configured to embed portions of the corpus corresponding to abstract regulatory policies (e.g., a requirement for narrative contents of data artifacts relating to a specific field of development), narrative guidelines, historical evaluation results (e.g., positive and/or negative examples of data artifact attributes enabling and/or preventing qualification of the candidate development service), and/or similarly qualitative eligibility criteria to determine the reference identifier. Accordingly, the system 200 can compare the reference identifier to content identifiers of data artifact attributes (e.g., via cosine similarity, Euclidean distance, model-based inference, and/or the like) to evaluate content similarities between portions of the corpus and the supporting data artifacts for the development service. In some implementations, the schematization engine 221 can embed (e.g., via a machine learning model) contents of the authorization schema to determine a summative identifier (e.g., a uniform embedding vector) for the generated authorization schema. The schematization engine 221 can be configured to store generated authorization schemas in the authorization schema database 254.
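As a non-limiting illustration of one of the comparison options named above, cosine similarity between a reference identifier and a content identifier could be computed as sketched below. The function name and the example vectors are hypothetical; in practice the vectors would be produced by an embedding model, which is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat a zero vector as having no similarity rather than dividing by zero.
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed toy vectors standing in for model-generated embeddings.
reference_identifier = [0.2, 0.8, 0.1]
content_identifier = [0.25, 0.7, 0.05]
score = cosine_similarity(reference_identifier, content_identifier)
```

A score near 1 indicates closely aligned content, while a score near 0 indicates little alignment.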
The system 200 can evaluate authorization (e.g., eligibility and/or qualification) of a candidate development service (e.g., to claim entitlement to a restricted process, a material benefit, an asset amortization, and/or the like) based on a corresponding set of supporting digital artifacts. For example, the system 200 can receive a user request (e.g., via a user interface) to evaluate eligibility of a candidate development service for one or more entitled benefits (e.g., a fiscal expense reduction) and/or valid amortization of assets or expenditures. In some implementations, the system 200 can be configured to receive a customized user request that specifies a set of candidate development services to be evaluated for eligibility. For example, the system 200 can display (e.g., at the user interface) a set of filtering options that enable the external user (e.g., an authorized user, a licensed auditor, and/or the like) to specify evaluation of a custom selection of development services (e.g., an individual development service, all development services, development services pertaining to a specific category, and/or the like).
The artifact prediction engine 222 can be configured to identify applicable authorization schemas (e.g., available material benefits, eligible asset amortization conditions, and/or the like) for a candidate development service specified in the user request. For example, the artifact prediction engine 222 can retrieve a set of digital artifacts (e.g., project development reports, experimental results, fiscal expense documentation, and/or the like) associated with the candidate development service from the digital artifact database 252 and a set of available authorization schemas (e.g., accessible benefits and/or asset amortization at time of evaluation request) from the authorization schema database 254. Accordingly, the artifact prediction engine 222 can assess the set of digital artifacts using the evaluation criteria of each available authorization schema to identify a subset of applicable authorization schemas (e.g., possible qualification for select benefits and/or asset amortization) for the candidate development service. For example, the artifact prediction engine 222 can evaluate compliance of concrete digital artifact attributes (e.g., recorded datetime, applied jurisdiction, fiscal expenses accrued, and/or the like) with predetermined logic-based rules (e.g., limited and/or quantitative eligibility conditions) of each authorization schema to determine an initial set of applicable authorization schemas for the candidate development service. In some implementations, the artifact prediction engine 222 can compare content similarities between the set of data artifacts and the set of available authorization schemas to identify the subset of applicable authorization schemas. For example, the artifact prediction engine 222 can generate (e.g., via a machine learning model) a content identifier (e.g., an embedding vector) for a data artifact of the candidate development service.
Accordingly, the artifact prediction engine 222 can compare the content identifier of each data artifact to the reference identifier of each authorization schema. As an example, the artifact prediction engine 222 can input pairs of the content identifiers (e.g., of the set of data artifacts) and the reference identifiers (e.g., of the set of authorization schemas) into a machine learning model (e.g., a statistical inference algorithm, a generative machine learning model, a semantic natural language model, and/or the like) to determine an approximate content similarity score between the corresponding data artifacts and the evaluation criteria (e.g., or components thereof) of the authorization schemas. As a result, the artifact prediction engine 222 can identify a subset of applicable authorization schemas that correspond to a content similarity score exceeding a predefined threshold. In some implementations, the artifact prediction engine 222 can be configured to evaluate compliance of the digital artifacts with predetermined logic-based rules prior to evaluating content similarities between the digital artifacts and the authorization schemas.
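As a non-limiting illustration, the two-stage selection described above (logic-based rules first, content similarity second) could be sketched as follows. The function name, the schema names, the rule definitions, the precomputed scores, and the 0.8 threshold are all hypothetical values chosen for the example.

```python
from typing import Callable, Dict, List

Rule = Callable[[Dict[str, object]], bool]


def applicable_schemas(
    artifact_attributes: Dict[str, object],
    schema_rules: Dict[str, List[Rule]],
    similarity_scores: Dict[str, float],
    threshold: float,
) -> List[str]:
    """Return schema names that pass their rules and exceed the similarity threshold."""
    selected = []
    for name, rules in schema_rules.items():
        # Stage 1: discard schemas whose predetermined rules are not satisfied.
        if not all(rule(artifact_attributes) for rule in rules):
            continue
        # Stage 2: keep schemas whose content similarity score exceeds the threshold.
        if similarity_scores.get(name, 0.0) > threshold:
            selected.append(name)
    return selected


# Assumed inputs; the scores would normally come from comparing embeddings.
rules = {
    "credit_a": [lambda a: a.get("jurisdiction") == "US"],
    "credit_b": [lambda a: a.get("jurisdiction") == "CA"],
    "credit_c": [lambda a: True],
}
scores = {"credit_a": 0.91, "credit_b": 0.95, "credit_c": 0.40}
result = applicable_schemas({"jurisdiction": "US"}, rules, scores, threshold=0.8)
```

Running the rule check first avoids computing or consulting similarity scores for schemas that are already excluded on concrete attributes.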
In some implementations, the artifact prediction engine 222 can be configured to generate a mapping that correlates data artifacts (e.g., of a development service) to the applicable authorization schemas. For example, the artifact prediction engine 222 can dynamically create a stored mapping (e.g., a reference table, a hash map, and/or the like) that links a select data artifact (e.g., of the candidate development service) to an authorization schema when a determined content similarity score (e.g., via comparison of the content and reference identifiers) exceeds the predetermined threshold. In another example, the artifact prediction engine 222 can create a stored mapping that links the select data artifact to an authorization schema when a predetermined logic-based evaluation rule for the schema is satisfied by content attributes of the data artifact. Accordingly, the artifact prediction engine 222 can identify, for each applicable authorization schema for the candidate development service, a subset of digital artifacts (e.g., of the development service) that share high content similarities with the eligibility requirements of the authorization schema (e.g., and thus support qualification of the candidate development service). In some implementations, the artifact prediction engine 222 can be configured to generate a granular mapping that further correlates individual attributes of each data artifact (e.g., select paragraphs of narrative text, fiscal documentation for an expense category, and/or the like) to specific evaluation criteria (e.g., a legal statute, an individual qualification condition, and/or the like) of the applicable authorization schema. In additional or alternative implementations, the artifact prediction engine 222 can generate a plurality of stored mappings that correlate different categories of data artifact attributes to component evaluation criteria of the applicable authorization schemas.
As an illustrative example, the artifact prediction engine 222 can generate a first stored mapping that correlates attributes of a data artifact to evaluation criteria (e.g., of an authorization schema) relating to a technical qualification (e.g., eligible jurisdiction, datetime, and/or the like) of the candidate development service. In the above example, the artifact prediction engine 222 can also generate a second stored mapping that correlates attributes of the data artifact to evaluation criteria relating to itemized fiscal qualifications (e.g., individual tax credit claims) of the candidate development service.
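As a non-limiting illustration, a stored mapping implemented as a hash map could be built from pairwise similarity scores as sketched below. The function name `build_mapping`, the identifiers, and the scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_mapping(
    scores: Dict[Tuple[str, str], float], threshold: float
) -> Dict[str, List[str]]:
    """Link each schema to the artifacts whose similarity score exceeds the threshold.

    `scores` is keyed by (artifact_id, schema_id) pairs.
    """
    mapping: Dict[str, List[str]] = defaultdict(list)
    for (artifact_id, schema_id), score in scores.items():
        if score > threshold:
            mapping[schema_id].append(artifact_id)
    return dict(mapping)


# Assumed example scores for three artifacts and two schemas.
pair_scores = {
    ("report-1", "schema-a"): 0.92,
    ("report-2", "schema-a"): 0.55,
    ("ledger-1", "schema-a"): 0.88,
    ("report-1", "schema-b"): 0.30,
}
mapping = build_mapping(pair_scores, threshold=0.8)
```

A granular mapping could follow the same pattern with keys formed from (artifact attribute, evaluation criterion) pairs instead of whole artifacts and schemas.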
The artifact prediction engine 222 can be configured to determine an approximate authorization status (e.g., a categorical likelihood and/or confidence score) for each identified applicable authorization schema for the candidate development service. For each applicable authorization schema, the artifact prediction engine 222 can identify data artifacts of the candidate development service that correlate to the evaluation criteria of the schema. For example, the artifact prediction engine 222 can use the stored mapping between data artifacts (e.g., or attributes thereof) and applicable authorization schemas (e.g., or component evaluation criteria thereof) to identify a subset of data artifacts (e.g., of the candidate development service) pertinent to the applicable authorization schema. Accordingly, the artifact prediction engine 222 can evaluate the attributes of the data artifacts to predict an authorization score, or status, that indicates relative compliance strength (e.g., a probability score, a categorical label, a confidence range, and/or the like) of the data artifacts for the evaluation criteria of the applicable authorization schema. For example, the artifact prediction engine 222 can input the data artifact attributes into a machine learning model (e.g., a semantic natural language model, a generative machine learning model) to determine a categorical label (e.g., “fully compliant,” “likely compliant,” “partially compliant,” “not compliant,” and/or the like) indicating relative compliance strength for the evaluation criteria. In some implementations, the artifact prediction engine 222 can use historical data artifacts (e.g., of tracked development actions for prior development services) and/or evaluation results (e.g., authorization/denial of claimed benefit and/or asset amortization) to further qualify the compliance assessment of data artifact attributes of the candidate development service.
For example, the artifact prediction engine 222 can access (e.g., from the historical evaluation database 256) a set of historical artifact attributes corresponding to previous data artifacts used to support an authorization request for a prior development service. Accordingly, the artifact prediction engine 222 can input attributes of the data artifact set (e.g., of the candidate development service) and attributes of the historical data artifact set into the machine learning model to determine the categorical label and/or compliance score.
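As a non-limiting illustration, a numeric compliance score could be translated into one of the categorical labels named above using fixed cut-offs, as sketched below. This substitutes a simple threshold rule for the machine learning model described above; the function name and the cut-off values (0.9, 0.7, 0.4) are assumptions made for the example and are not specified in the disclosure.

```python
def compliance_label(score: float) -> str:
    """Map a compliance score in [0, 1] to a categorical label (assumed cut-offs)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.9:
        return "fully compliant"
    if score >= 0.7:
        return "likely compliant"
    if score >= 0.4:
        return "partially compliant"
    return "not compliant"
```

For instance, under these assumed cut-offs a score of 0.75 would be labeled "likely compliant".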
The system 200 can be configured to generate, and display, a custom visual interface that presents the identified set of applicable authorization schemas (e.g., possible claims to fiscal benefits and/or asset amortization) for the candidate development service and indicates the corresponding data artifacts (e.g., or component attributes thereof) that support eligibility of the development service for each authorization schema. For example, the system 200 can generate a custom dashboard view that provides a structured visual representation (e.g., an enumerated list, a tabular widget, and/or the like) of the identified applicable authorization schemas. Additionally, the system 200 can configure the structured visual representation to display alongside each applicable authorization schema a subset of data artifacts that support qualification of the candidate development service for that schema. The system 200 can also configure the visual representation to further display the approximated authorization status (e.g., categorization label, likelihood probability, and/or the like) for each schema, indicating estimated compliance and/or qualification strength of the candidate development service for the authorization schema based on attributes of the supporting subset of data artifacts.
In some implementations, the system 200 can configure the structured visual representation to include a comparative diagram (e.g., a column-wise table, a tree network, and/or the like) that visually maps and/or connects attributes of data artifacts (e.g., of the candidate development service) to correlated (e.g., high content similarities) evaluation criteria for each applicable authorization schema (e.g., or component criteria thereof). In other implementations, the system 200 can further configure the structured visual representation to include a secondary comparative diagram that visually maps and/or connects attributes of historical data artifacts (e.g., of previously evaluated development services) to correlated evaluation criteria of the authorization schemas. Likewise, the structured visual representation can also be configured to identify, and map, content similarities between attributes of the data artifacts (e.g., of the candidate development service) and the historical data artifacts (e.g., of the previous development services).
In some implementations, the system 200 can configure the custom visual interface to include a notification component that alerts users (e.g., authorized reviewers, licensed auditors, and/or the like) to portions of the evaluation results that may require further attention and/or review. The system can identify and highlight evaluation criteria for which the system has not identified any relevant digital artifacts (e.g., no project technology identified, no purpose related to the enterprise's commercial objectives). For example, the system 200 can review historical usage records (e.g., from the historical evaluation database 256) for the presented digital artifacts (e.g., for supporting qualification of the candidate development service) to determine a subset of overlapping digital artifacts and/or overlapping attributes of overlapping digital artifacts that were previously used to qualify other development services. Accordingly, the system 200 can display a visual indicator (e.g., an icon) in the vicinity of each overlapping digital artifact displayed.
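As a non-limiting illustration, the overlap check against historical usage records could be sketched as follows. The function name, the shape of the usage history (artifact identifier mapped to the set of services that previously relied on it), and the identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def overlapping_artifacts(
    presented_ids: Iterable[str],
    usage_history: Dict[str, Set[str]],
    current_service: str,
) -> Dict[str, List[str]]:
    """Return presented artifacts previously used to qualify other services."""
    flagged: Dict[str, List[str]] = {}
    for artifact_id in presented_ids:
        # Ignore prior use by the service currently under evaluation.
        other_services = usage_history.get(artifact_id, set()) - {current_service}
        if other_services:
            flagged[artifact_id] = sorted(other_services)
    return flagged


# Assumed usage history: which services each artifact has supported before.
history = {
    "report-1": {"service-x", "service-y"},
    "ledger-1": {"service-y"},
}
flags = overlapping_artifacts(["report-1", "ledger-1", "memo-3"], history, "service-y")
```

Each key of the returned dictionary identifies an artifact that would receive a visual indicator in the interface.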
The logic evaluation engine 223 can be configured to generate human-readable narratives that summarize logical processes performed by the system 200 to obtain intermediate results during evaluation of the candidate development service. For example, the logic evaluation engine 223 can actively monitor (e.g., in real-time) intermediary logic operations executed during operation of a machine learning model to identify applicable authorization schemas and/or approximate authorization statuses for the candidate development service. The intermediary logic operations can define intermediary input data (e.g., data retrieved from a prior logic operation), an intermediary model (e.g., a component layer of the complete machine learning model), and intermediary output data (e.g., data generated via inputting the intermediary input data into the intermediary model layer). The logic evaluation engine 223 can further cause a generative machine learning model (e.g., a large language model, a semantic natural language algorithm, and/or the like) to create a human-readable narrative (e.g., a text-based paragraph) that explains how the incremental logic of the intermediary logic operations results in the intermediary results (e.g., identified authorization schemas, approximated authorization statuses, and/or the like). Accordingly, the system 200 can further configure the custom visual interface to display the generated human-readable narrative to provide additional context information on the evaluation results for the candidate development service.
In some implementations, the logic evaluation engine 223 can be configured to iteratively cause the generative machine learning model to create the human-readable narrative explaining the intermediary logic operations (e.g., performing a chain-of-thought evaluation via generative models). For example, the system 200 can input the monitored intermediary logic operations alongside a first prompt (e.g., request for high-level abstract details) into the generative model to create a first response comprising a human-readable explanation of the incremental logic for the intermediary logic operations. Further, the system 200 can input the monitored intermediary logic operations alongside a second prompt (e.g., request for low-level specific details) and the first response into the generative model to create a second response comprising a modified human-readable explanation of the incremental logic. The system 200 can further configure the custom visual interface to display the intermediary results from the iterative generation of the human-readable narrative, including the intermediate prompts (e.g., the first and/or the second prompts) and/or responses (e.g., the first and/or the second responses).
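As a non-limiting illustration, the two-pass prompting sequence described above could be sketched as follows. The generative model is represented by a caller-supplied `generate` function; the prompt wording, the function names, and the placeholder `echo_model` are hypothetical and stand in for an actual model call.

```python
from typing import Callable, Dict, List


def iterative_narrative(
    operations: List[str], generate: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Run a high-level pass, then a detailed pass that also sees the first response."""
    trace = "\n".join(operations)
    first_prompt = (
        "Summarize at a high level how these operations produced the result:\n" + trace
    )
    first_response = generate(first_prompt)
    second_prompt = (
        "Refine the explanation below with specific, low-level details.\n"
        "Operations:\n" + trace + "\nPrior explanation:\n" + first_response
    )
    second_response = generate(second_prompt)
    # Keep both prompts and responses so the interface can display each step.
    return [
        {"prompt": first_prompt, "response": first_response},
        {"prompt": second_prompt, "response": second_response},
    ]


def echo_model(prompt: str) -> str:
    # Placeholder for a generative model call; returns a deterministic string.
    return f"[explanation based on {len(prompt)} characters of prompt]"


steps = iterative_narrative(
    ["embed artifact contents", "compare to schema reference identifier"], echo_model
)
```

Returning the full list of prompt and response pairs, rather than only the final narrative, mirrors the option of displaying the intermediate prompts and responses.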
In some implementations, the system 200 can be configured to perform additional operations in response to user feedback via the custom visual interface. For example, an authorized user (e.g., an accountant or auditor) can submit a positive validation signal (e.g., via the custom interface) indicating vetted approval to submit an authorization request for the candidate development service under the identified applicable authorization schemas using the evaluated supporting digital artifacts. In response to the positive validation signal, the generative export engine 224 can automatically generate an export digital artifact (e.g., a fiscal expense reduction request) for each applicable authorization schema based on an artifact template (e.g., a tax credit request form) that comprises a required set of field queries (e.g., claims for entitled material benefit or amortization of assets/expenditures, claimed amount of material benefit or amortization of assets/expenditures, and/or the like). The generative export engine 224 can further cause a generative machine learning model (e.g., a large language model) to generate human-readable entries for each required field query (e.g., of the artifact template) based on information stored in the identified supporting digital artifacts for the applicable authorization schemas.
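As a non-limiting illustration, populating the empty required fields of an artifact template could be sketched as follows. The template is modeled as a dictionary in which `None` marks an empty input field; the function names, field names, and the placeholder `stub_model` are hypothetical and stand in for an actual generative model.

```python
from typing import Callable, Dict, Optional


def populate_template(
    template_fields: Dict[str, Optional[str]],
    supporting_text: str,
    generate: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Fill each empty field with a generated entry; leave existing values unchanged."""
    filled = dict(template_fields)
    for field_name, value in template_fields.items():
        if value is None:
            prompt = f"Field: {field_name}\nSupporting artifacts:\n{supporting_text}"
            filled[field_name] = generate(prompt)
    return filled


def stub_model(prompt: str) -> str:
    # Placeholder for a generative model call; echoes the requested field name.
    return "Generated entry for " + prompt.splitlines()[0]


template = {"claimant": "Example Co.", "project_description": None}
export_artifact = populate_template(
    template, "Progress report: prototype tested.", stub_model
)
```

The input template is copied rather than modified in place, so the original template can be reused for each applicable authorization schema.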
In other examples, the system 200 can receive a negative validation signal (e.g., via the custom interface) indicating disapproval of using the displayed digital artifacts (e.g., or components thereof) to support qualification of the candidate development service for the identified applicable authorization schemas. In another example, the system can indicate missing or incomplete qualification documentation for use by the AI model and/or human reviewers. In response, the anomaly evaluation engine 225 can perform an automatic update to one or more machine learning models used during evaluation and identification of the supporting digital artifacts. For example, the anomaly evaluation engine 225 can generate a model prediction training sample comprising input data based on attributes of the presented digital artifacts, the contextual historical digital artifacts, and/or the contextual information from the user feedback response, and an output label based on the predicted authorization status of the applicable authorization schema using the presented digital artifacts. In some implementations, the anomaly detection engine can further augment the prediction training sample using a stored model prediction training sample set that corresponds to predicted authorization statuses of prior applicable authorization schemas. Accordingly, the anomaly evaluation engine 225 can use the combined prediction training samples to retrain the machine learning models used to generate intermediary results during evaluation of candidate development services.
In some implementations, the system 200 can receive (e.g., via the custom visual interface) a user feedback response that can include contextual information for the negative validation signal, such as explanatory narratives (e.g., embedded commentary), visual indicators (e.g., manual identification of abnormal mappings between digital artifacts and authorization schemas), and/or the like. The system 200 can use the additional context information provided via the user feedback response to modify and/or update one or more attributes of the digital contents for the candidate development service. For example, the system 200 can cause a generative machine learning model (e.g., a semantic natural language processing model, a large language model, and/or the like) to generate an adjustment to contents and/or attributes of the supporting digital artifacts (e.g., an augmented text-based narrative and/or description) based on the user feedback information. Accordingly, the system 200 can iteratively re-evaluate the authorization of the candidate development service (e.g., via one or more processes described herein) using the adjusted digital artifact attributes until a positive validation signal is received from the authorized user. In additional or alternative implementations, the system 200 can be configured to display contents of the user feedback response to a second user interface corresponding to one or more participant users associated with the candidate development service. The system 200 can also receive (e.g., via the second interface) a secondary user feedback response that can include manual adjustments to the supporting digital artifacts (e.g., modifications to text-based descriptions) submitted by a participant user. As a result, the system 200 can incorporate both the manual adjustments (e.g., of the participant user) and generative adjustments to re-evaluate authorization of the candidate development service.
Such adjustments can be delivered by any of multiple methods including text, icons, buttons, voice input, or predetermined templates.
FIG. 3A is a block diagram illustrating an example configuration of an evaluation interface 300 of the service authorization system 200 of FIG. 2 in accordance with some implementations of the present technology. The evaluation interface 300 (“interface 300”) is implemented using components of the example computer system 100 illustrated and described in more detail with reference to FIG. 1. Likewise, implementations of interface 300 can include different and/or additional components or can be connected in different ways. Interface 300 is a visual interface that allows users (e.g., an authorized user, a participant user, and/or the like) to interact with electronic devices using graphical elements (e.g., windows, icons, buttons, and/or the like) rather than text-based commands. As shown in FIG. 3A, the interface 300 includes a request configuration component 310, an evaluation summary component 320, and an interactive review component 330.
The system 200 can be configured to transmit instructions (e.g., for the interface 300) to generate and/or display content via interactive graphical user interface (GUI) components. As shown in FIG. 3A, the interface 300 can be configured to display a request configuration component 310 that comprises one or more user-adjustable evaluation modifiers 312 enabling an authorized user (e.g., an accountant or auditor) to create customized service authorization evaluation requests. For example, the evaluation modifiers 312 can include service type filtering options (e.g., assigned technical field, fiscal expenditure quantities, and/or the like) that enable the authorized user to request authorization evaluation for a subset of available candidate development services. In another example, the evaluation modifiers 312 can include a service quantity modifier that enables the authorized user to specify the total number of services that are evaluated for authorization (e.g., an individual service, a group of selected services, all available services, and/or the like).
In some implementations, the interface 300 can be configured to display an evaluation summary component 320 that comprises an aggregated view of authorization evaluation results for a candidate development service. For example, the evaluation summary component 320 can display an enumerated list 322 of identified applicable authorization schemas (e.g., possible authorization claims with supporting digital artifacts) for the candidate development service. As shown in FIG. 3A, each applicable authorization schema within the enumerated list 322 can include a visible authorization status 324 that indicates an approximate compliance strength of the supporting digital artifact attributes in qualifying the candidate development service. The system can compare and display the application of multiple approaches to documentation and qualification. The system can permit a user to select between such multiple approaches.
In some implementations, the interface 300 can be configured to display an interactive review component 330 that comprises a detailed visualization of the correlative relationships (e.g., connective links representing relational mappings) between supporting digital artifact attributes 332 (e.g., of the candidate development service) and evaluation criterions 334 of the selected applicable authorization schema. As shown in FIG. 3A, the interactive review component 330 can include a secondary view of exemplary historical digital artifact attributes 336 that correspond to digital artifacts previously used to successfully (e.g., or unsuccessfully) authorize prior development services under the same applicable authorization schema. Additionally, the interactive review component 330 can be configured to include a user feedback module 338 that enables the authorized user to transmit a positive (e.g., or negative) validation signal indicating approval (e.g., or disapproval) of using the presented supporting digital artifacts to submit an authorization request for the candidate development service under the applicable authorization schema. The user feedback module 338 can also include custom interface tools that allow authorized users to submit additional context information (e.g., commentary narratives, custom annotations, and/or the like) that further explains the transmitted positive (e.g., or negative) validation signal.
FIG. 3B is a block diagram illustrating an example visual representation 340 produced by the service authorization system 200 of FIG. 2 in accordance with some implementations of the present technology. For example, the system 200 can generate a digital visualization (e.g., a webpage, an application window, a report, and/or the like) that captures and/or summarizes detailed results of a service authorization evaluation request. In some implementations, the system 200 can be configured to dynamically update, or re-generate, the visual representation 340 in response to additional modifications to material and/or contextual contents of the service authorization evaluation (e.g., adjustment of digital artifact attributes, user commentary on evaluation logic, user edits (additions, deletions, or revisions) to such evaluation, and/or the like).
As shown in FIG. 3B, the visual representation 340 can include one or more criterion components 342, attribute summaries 344 (e.g., narrative text descriptions, visual images, embedded audio files, and/or the like), references 346, and/or evaluation stimuli 348. In some implementations, the system 200 can be configured to generate a portable export of the visual representation 340 (e.g., a plaintext file, a PDF, and/or the like). For purposes of illustration, the visual representation 340 of FIG. 3B is depicted as a tabular data structure. However, a person of ordinary skill in the art will appreciate that the visual representation 340 can be customized in both format and medium to integrate into existing digital interfaces (e.g., touchscreens, monitors, and/or the like), applications (e.g., websites, desktop programs, internal portals, and/or the like), and/or data processing pipelines. In one example, the system 200 can configure the visual representation 340 as a modifiable digital asset (e.g., a collaborative user interface) that is shared between two or more users. In another example, the system 200 can export the visual representation 340 as a formatted document (e.g., a PDF) as an output result of evaluating the user request for service authorization, as described herein.
In some implementations, the visual representation 340 can be configured to present results of evaluating authorization of development services (e.g., with respect to authorization schemas) in a modular format. For example, the visual representation 340 can include discrete information modules associated with individual evaluation criterions (e.g., identifiable service features, observed challenges of the development service, and/or the like) associated with an authorization schema (e.g., qualification criteria for financial benefits and/or asset amortization). As shown in FIG. 3B, the visual representation 340 includes row-wise tabular cell groups corresponding to a criterion component A 342-1, a criterion component B 342-2, and a criterion component C 342-3. In some implementations, the visual representation 340 can group a plurality of criterion components 342 into a single representation unit, such as the criterion component B 342-2 and component C 342-3, as shown in FIG. 3B. Although FIG. 3B presents three criterion components 342, a person of ordinary skill in the art will appreciate that the visual representation 340 can be extended to include an arbitrary number of components 342.
In some implementations, the visual representation 340 can be configured to present results within an individual criterion component 342 via a modular format. For example, the visual representation 340 can configure the individual criterion components 342 to include discrete information modules corresponding to an attribute summary 344, one or more references 346, and/or a guiding evaluation stimulus 348. As shown in FIG. 3B, the visual representation 340 can include column-wise tabular cell groups corresponding to attribute summaries 344 and references 346. The visual representation 340 of FIG. 3B also includes a header component within each criterion component 342 for the evaluation stimulus 348.
In some implementations, an attribute summary 344 of a criterion component 342 can include human-readable (e.g., alphanumeric text, visual images, and/or the like) and/or interactable elements (e.g., embedded audio recordings) based on attributes of digital artifacts (e.g., documentation files, graphic figures, video and/or audio recordings) indicating tracked development actions for a monitored development service. For example, the system 200 can configure the attribute summary 344 to present one or more pertinent digital artifact attributes used to predict an authorization status of the development service for a specified evaluation criterion (e.g., satisfaction of required artifact attribute thresholds of an authorization schema). In some implementations, the system 200 can be configured to use a machine learning model (e.g., a large language model, a natural language processing algorithm, and/or the like) to generate the content information (e.g., human-readable narratives) presented in the attribute summaries 344.
In some implementations, the references 346 of a criterion component 342 can include navigational and/or executable components (e.g., redirect links, file/system shortcuts, embedded attachment executables, and/or the like) that are configured to retrieve, or redirect the end user to, a digital artifact associated with the development service and/or attribute summary 344. For example, the system 200 can configure the references 346 of a criterion component 342 to include a web-based hyperlink, an embedded text document, and/or an embedded graphical element, as shown in FIG. 3B.
In some implementations, the evaluation stimulus 348 of a criterion component 342 can include human-readable elements that characterize an evaluation criterion of an authorization schema (e.g., the criterion corresponding to component 342). For example, the system 200 can configure the criterion component 342 to present a narrative prompt (e.g., an instructional text description, a clarification query, and/or the like) for the evaluation criterion that is answered via content presented in the attribute summary 344 and/or references 346. In some implementations, the system 200 can be configured to use a machine learning model (e.g., a large language model, a natural language processing algorithm, and/or the like) to generate the narrative prompt for the evaluation stimulus 348 based on contents of the authorization schema.
FIG. 4 is a flow diagram that illustrates an example process 400 (e.g., a computer-implemented method) for evaluating authorization of development services in accordance with some implementations of the disclosed technology. The process 400 can be performed by a system (e.g., service authorization system 200) configured to identify and report applicable authorization schemas for a development service and the corresponding digital artifacts to support the authorization. In one example, the system includes at least one hardware processor and at least one non-transitory memory storing instructions, which, when executed by the at least one hardware processor, cause the system to perform the process 400. In another example, the system includes a non-transitory, computer-readable storage medium comprising instructions recorded thereon, which, when executed by at least one data processor, cause the system to perform the process 400.
At block 402, the system can receive (e.g., via a user interface) a request to evaluate authorization of at least one development service that comprises a digital artifact set. For example, the system can receive a request to evaluate authorization of at least one development service such that each corresponding digital artifact in the digital artifact set comprises a content embedding for an artifact attribute set representing tracked development actions of the at least one development service. A content embedding for a digital artifact (e.g., a natural language document) can include a quantitative, and often standardized, representation (e.g., a normalized vector of numerical features in a high-dimensional vector space) of contents and/or attributes associated with the digital artifact (e.g., alphanumeric text, rasterized image data, recorded audio signals, and/or the like). The content embeddings of each digital artifact enable the system and/or components thereof (e.g., machine learning models, natural language processing algorithms, and/or the like) to numerically compare multiple digital artifacts and identify similar (or disparate) content attributes between them. As an illustrative example, the system can use a semantic encoder (e.g., a natural language model) to convert contents of a digital artifact (e.g., alphanumeric text strings) into a formatted numerical identifier (e.g., a vector array) comprising one or more feature attributes (e.g., numerical values of the vector array) corresponding to embedded attributes of the digital artifact. In some implementations, the system can be configured to generate the content embeddings for each digital artifact corresponding to the at least one development service. In other implementations, the system can generate and/or retrieve content embeddings for digital artifacts as a component function of a broader knowledge base and/or framework (e.g., a Retrieval-Augmented Generation framework for large language models).
At block 404, the system can access an authorization schema set available for the at least one development service. For example, the system can access an authorization schema set such that each authorization schema in the authorization schema set comprises a reference embedding (e.g., a numerical vector identifier) for required qualitative evaluation criterions to authorize the at least one development service. Each authorization schema in the authorization schema set for the at least one development service can comprise required artifact attribute thresholds representative of the evaluation criterions, such as regulatory policies (e.g., natural language rules), predetermined evaluation rulesets (e.g., conditional criterion mappings), narrative guidelines (e.g., alphanumeric descriptions), fiscal procedures, and/or a combination thereof. In some implementations, the system can generate the reference embedding for an authorization schema via converting contents of the qualitative evaluation criterions into one or more comparative numerical features. For example, the system can use a semantic encoder (e.g., a natural language model) to convert natural language rules and/or criterions of text-based regulatory policies into a set of formatted numerical identifiers (e.g., a vector matrix, or a set of vector arrays), such that each identifier comprises feature attributes (e.g., numerical values of the array) corresponding to individual policy rules. By converting the qualitative evaluation criterions into the numerical identifiers, the system can quantitatively compare (e.g., via cosine similarity, Euclidean distance, and/or the like) the content embedding of a digital artifact (e.g., representation of artifact contents) and the reference embedding of the evaluation criterion to determine an approximate content similarity score (e.g., a likelihood indicating satisfaction of the evaluation criterion by the digital artifact contents).
In some implementations, the system can configure each required artifact attribute threshold of the authorization schema as a required content similarity threshold (e.g., a static, or dynamic, numerical value) between content embeddings of digital artifacts and reference embeddings of the evaluation criterions.
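As an illustrative, non-limiting sketch, the comparison of a content embedding against a reference embedding via cosine similarity, followed by a check against a required content similarity threshold, can be written as shown below. The function names and the default threshold value of 0.8 are assumptions chosen for illustration; the actual threshold can be static or dynamic as described above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector carries no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def satisfies_threshold(
    content: Sequence[float],
    reference: Sequence[float],
    threshold: float = 0.8,  # illustrative value only
) -> bool:
    """True when the content similarity score meets the required threshold."""
    return cosine_similarity(content, reference) >= threshold
```

Euclidean distance or another metric could be substituted for the cosine measure without changing the threshold check.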
At block 406, the system can identify an applicable authorization schema subset from the authorization schema set via comparing the content embeddings of the digital artifacts and the reference embeddings of the authorization schemas. For example, the system can input, into a first machine learning model, the content embeddings of the digital artifacts and the reference embeddings of the authorization schemas to output a content correlation score (e.g., a likelihood measure indicating digital artifact content satisfaction of an evaluation criterion). In some implementations, the system can compare the content embeddings of the digital artifacts and the reference embeddings of the authorization schemas such that each applicable authorization schema in the subset is mapped to at least one digital artifact of the at least one development service, and such that the reference embedding of the applicable authorization schema and the content embedding of the at least one digital artifact satisfy a similarity threshold.
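As an illustrative, non-limiting sketch, the selection of the applicable authorization schema subset can be expressed as a mapping from each schema to the digital artifacts whose scores satisfy the similarity threshold. The identifiers below are hypothetical, and the default `dot` scoring function (which assumes normalized embeddings) is only a stand-in for the first machine learning model, whose internals are not specified here.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity when both vectors are normalized."""
    return sum(x * y for x, y in zip(a, b))


def applicable_schemas(
    artifacts: Dict[str, Vector],
    schemas: Dict[str, Vector],
    threshold: float,
    score: Callable[[Vector, Vector], float] = dot,
) -> Dict[str, List[str]]:
    """Map each schema to the artifacts that satisfy the similarity threshold.

    Schemas with no qualifying artifact are omitted, so every schema in the
    result is mapped to at least one digital artifact.
    """
    mapping: Dict[str, List[str]] = {}
    for schema_id, reference in schemas.items():
        matched = [
            artifact_id
            for artifact_id, content in artifacts.items()
            if score(content, reference) >= threshold
        ]
        if matched:
            mapping[schema_id] = matched
    return mapping
```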
At block 408, the system can retrieve one or more historical artifact attribute sets representing tracked development actions for prior development services authorized via the applicable authorization schema. For example, the system can, for each applicable authorization schema in the applicable authorization schema subset, retrieve from a remote database a historical artifact attribute set representing tracked development actions for prior development services authorized via the applicable authorization schema.
At block 410, the system can predict an authorization status for the at least one development service using the historical artifact attribute set and the artifact attribute set. For example, the system can input, into a second machine learning model, the historical artifact attribute set and the artifact attribute set to predict an authorization status for the at least one development service. In some implementations, the authorization status indicates whether the required artifact attribute thresholds of the applicable authorization schema are satisfied.
At block 412, the system can configure for display a visual representation of at least one applicable authorization schema and the mapped at least one digital artifact of the at least one development service (for example, as illustrated in FIG. 3B). For example, the system can configure for display (e.g., at a user interface) a visual representation of at least one applicable authorization schema and the mapped at least one digital artifact such that the authorization status of the at least one applicable authorization schema indicates satisfaction of the required artifact attribute thresholds. In some implementations, the system can configure the visual representation to display the artifact attribute set of the at least one digital artifact that represents the tracked development actions of the at least one development service. In other implementations, the system can configure the visual representation of the at least one applicable authorization schema to display a comparative diagram that presents a first mapping of content similarities between the historical artifact attribute set and the artifact attribute set and a second mapping of content differences between the historical artifact attribute set and the artifact attribute set.
In some implementations, the system configures for display (e.g., at the user interface) a distinct visual marking over an artifact attribute subset that corresponds to artifact attributes satisfying the required artifact attribute thresholds of the applicable authorization schema. In other implementations, the authorization status further indicates whether the required artifact attribute thresholds of the applicable authorization schema are partially satisfied. Accordingly, the system can configure for display (e.g., at the user interface) a distinct visual marking over an artifact attribute subset that corresponds to artifact attributes partially satisfying the required artifact attribute thresholds of the applicable authorization schema.
In some implementations, the system causes the generative machine learning model to generate a human-readable recommendation for adjusting at least one artifact attribute from the displayed artifact attribute subset to satisfy the required artifact attribute thresholds of the applicable authorization schema. Accordingly, the system can configure for display (e.g., at the user interface) the generated human-readable recommendation.
In some implementations, the system obtains a first sequence of intermediary logic operations executed during operation of the first machine learning model and a second sequence of intermediary logic operations executed during operation of the second machine learning model. The system can further cause the generative machine learning model to generate, using the first and the second sequence of intermediary logic operations, a human-readable narrative explaining a logical sequence resulting in the authorization status of the displayed at least one applicable authorization schema. Accordingly, the system can configure for display (e.g., at the user interface) the generated human-readable narrative alongside the visual representation of the at least one applicable authorization schema and the mapped at least one digital artifact.
In some implementations, the system automatically generates an export digital artifact in response to receiving (e.g., via the user interface) a positive user indication for the at least one applicable authorization schema. For example, the system can generate an export digital artifact based on an artifact template (e.g., stored in a database or obtained from a third-party service) that comprises a required field query set. In additional or alternative implementations, the system can cause a generative machine learning model to generate human-readable narratives for each required field query using the artifact attribute set and the historical artifact attribute set associated with the at least one digital artifact mapped to the at least one applicable authorization schema. As an illustrative example, the system can generate, and populate, a digital form (e.g., a fillable PDF, a formatted text document, and/or the like) that indicates an eligibility claim to one or more material benefits (e.g., tax credits, fiscal rewards, and/or the like) for the at least one development service (e.g., an experimental research project). Further, the system can generate the digital form based on a pre-defined template (e.g., an unfilled PDF) obtained from a third-party service (e.g., a government institution, a private organization, an internal assessment group, and/or the like) associated with the provisioning of the one or more material benefits and/or the evaluation of the at least one development service for claim eligibility. The digital form can include separate input fields (e.g., fillable text boxes) that indicate required and/or applicable information for evaluating qualification of the development service (e.g., attributes associated with the authorization schema).
Examples of applicable information can include qualifying fiscal expenses, such as third-party software contracts, costs of qualifying tools, software expenses, software services, and/or other related expenditures eligible for inclusion with respect to the authorization schema of the material benefit and/or asset amortization. Accordingly, the system can be configured to automatically populate each available input field of the digital form using content attributes of the retrieved digital artifacts and/or human-readable narratives generated via the generative model. In another example, the system can generate a digital form that includes input fields that correspond to cumulative information (e.g., qualifying fiscal expenses for a selected group of development services, total contract wages for digital artifacts corresponding to a pre-determined category, and/or the like) across multiple digital artifacts (e.g., of an individual development service) and/or multiple development services.
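As an illustrative, non-limiting sketch, populating the input fields of a digital form can be modeled as filling each required field query either from a known artifact attribute or from a generated narrative, with cumulative fields computed across multiple artifacts. The field names, the `narrate` callable (a stand-in for the generative model), and the `qualifying_expense` attribute key are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, Mapping, Sequence


def populate_form(
    required_fields: Sequence[str],
    attributes: Mapping[str, object],
    narrate: Callable[[str], str],
) -> Dict[str, str]:
    """Fill each required field from artifact attributes, else from a narrative.

    `narrate` is a hypothetical stand-in for a generative model call that
    returns a human-readable entry for the named field query.
    """
    filled: Dict[str, str] = {}
    for field in required_fields:
        if field in attributes:
            filled[field] = str(attributes[field])
        else:
            filled[field] = narrate(field)
    return filled


def total_qualifying_expenses(artifacts: Iterable[Mapping[str, float]]) -> float:
    """Sum the (hypothetical) `qualifying_expense` attribute across artifacts."""
    return sum(a.get("qualifying_expense", 0.0) for a in artifacts)
```

For example, a cumulative field can be computed with `total_qualifying_expenses` and passed to `populate_form` as one of the attributes, while free-text fields fall through to the narrative generator.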
In some implementations, the system automatically generates a model prediction training sample comprising input data in response to receiving (e.g., via the user interface) a negative and/or positive user indication for the at least one applicable authorization schema. For example, the system can generate a model prediction training sample comprising input data based on the historical artifact attribute set and the artifact attribute set of the at least one digital artifact and an output label based on the authorization status of the at least one applicable authorization schema. In additional or alternative implementations, the system can access (e.g., from the remote database) a stored model prediction training sample set such that each model prediction training sample corresponds to predicted authorization statuses of prior applicable authorization schemas. Accordingly, the system can use the stored model prediction training sample set and the generated model prediction training sample to retrain the first machine learning model, the second machine learning model, the generative machine learning model, and/or a combination thereof.
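As an illustrative, non-limiting sketch, a model prediction training sample can be represented as a record pairing input data with an output label, and then combined with the stored sample set before retraining. The `TrainingSample` type, its field names, and the label strings are assumptions for illustration; the retraining step itself depends on the model and is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping, Sequence


@dataclass
class TrainingSample:
    # Input data built from current and historical artifact attributes.
    features: Dict[str, Dict[str, object]]
    # Output label derived from the authorization status.
    label: str


def build_training_set(
    artifact_attributes: Mapping[str, object],
    historical_attributes: Mapping[str, object],
    authorization_status: str,
    stored_samples: Sequence[TrainingSample],
) -> List[TrainingSample]:
    """Create a new sample and append it to a copy of the stored sample set."""
    sample = TrainingSample(
        features={
            "artifact": dict(artifact_attributes),
            "historical": dict(historical_attributes),
        },
        label=authorization_status,
    )
    return [*stored_samples, sample]
```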
In some implementations, the user interface can be a first user interface, and the system can be configured to generate an adjusted artifact attribute set for the at least one digital artifact in response to receiving (e.g., via the first user interface) a negative user indication for the at least one applicable authorization schema. For example, the system can receive a negative user indication that identifies a required artifact attribute threshold subset not satisfied by the artifact attribute set. The system can input into a third machine learning model the historical artifact attribute set, the artifact attribute set, and the required artifact attribute threshold subset to generate an adjusted artifact attribute set for the at least one digital artifact that satisfies the required artifact attribute threshold subset. Accordingly, the system can configure for display (e.g., at a second user interface) the identified required artifact attribute threshold subset and the adjusted artifact attribute set such that the second user interface corresponds to an authorized editor of the at least one digital artifact.
FIG. 5 illustrates an example environment 500 of a data management engine 226 for improving data quality of a dataset. Environment 500 includes dataset 502, data management engine 226, compliance report 520, and modified dataset 522. Implementations of example environment 500 can include different and/or additional components or can be connected in different ways.
The dataset 502 can include structured and/or unstructured data. Structured data refers to data organized in a predefined manner, such as databases or spreadsheets (e.g., in rows and columns, in a graph, and so forth), while unstructured data refers to data without a predefined data model, such as emails, multimedia files, and other free-form documents. For example, a company's customer database can include structured data, such as customer identifiers and transaction records, while unstructured data includes customer feedback emails. The data management engine 226 ingests the dataset 502 and performs one or more validation checks on the dataset 502. The data management engine 226 can be cloud-based or stored on a local server. The validation actions performed by the data management engine 226 in FIGS. 2 and 5 can be executed by data profiling engine 506, threshold modeling engine 508, anomaly detection engine 510, root cause evaluation engine 512, rule generation engine 514, remediation engine 516, and/or information extraction engine 518.
The data profiling engine 506 can identify the structure and data types of dataset 502, and/or indicate one or more attributes/features of the dataset 502 (e.g., typos, wrong format, out-of-range values). The data profiling engine 506 can, using the variables and observations within dataset 502, automatically identify attributes of the dataset 502, such as the number of records, field types (e.g., integers, floats, strings), variables, variable values, and/or frequency distributions. In some implementations, the data profiling engine 506 determines the features of each variable (i.e., univariate). For numerical data, the data profiling engine 506 can calculate mean, median, standard deviation (SD), interquartile range, and so forth. For categorical data, the data profiling engine 506 can calculate the number of categories, the number of observations in each category, and so forth. Using the identified features, the data profiling engine 506 can, in some implementations, identify one or more anomalies of the dataset 502 in one or more variables. For example, the data profiling engine 506 can identify values more than a certain number of SDs from the mean. The thresholds used in determining anomalies can be configurable by a user (e.g., by defining the threshold SD, threshold variance, combination threshold that requires satisfying both the threshold SD and the threshold variance, etc.). As another example, the data profiling engine 506 can detect that customer birth dates are missing in a certain percentage of records.
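As an illustrative, non-limiting sketch, univariate profiling of a numerical variable and SD-based anomaly flagging can be written as shown below. The function names, the use of the population standard deviation, and the default of three SDs are assumptions for illustration; as noted above, the threshold can be user-configurable. Missing values are represented as `None`, and at least one non-missing value is assumed.

```python
import statistics
from typing import Dict, List, Optional, Sequence


def profile_numeric(values: Sequence[Optional[float]]) -> Dict[str, float]:
    """Summarize one numerical variable, including its share of missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing_fraction": (len(values) - len(present)) / len(values),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "stdev": statistics.pstdev(present),
    }


def flag_outliers(values: Sequence[Optional[float]], max_sd: float = 3.0) -> List[float]:
    """Return the values lying more than `max_sd` standard deviations from the mean."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)
    if sd == 0.0:
        return []  # all values identical, so nothing deviates
    return [v for v in present if abs(v - mean) > max_sd * sd]
```

Because a single extreme value inflates both the mean and the SD, a lower `max_sd` or a robust measure such as the interquartile range may be preferable for small samples.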
In some implementations, data profiling engine 506 can identify metadata within dataset 502 associated with data lineage and/or versioning to monitor transformations of data within dataset 502. Further methods of identifying attributes/features of the dataset are discussed with reference to FIG. 15. In some implementations, the data profiling engine 506 can generate an output file (e.g., text, image, audio, video, multi-modal) indicating the identified structure, data types, and/or one or more features of the dataset 502 (e.g., on a graphical user interface). The data profiling engine 506 can be data agnostic, meaning that the data profiling engine 506 does not use prior context or knowledge about the dataset 502 to identify the features of the dataset 502. In some alternative implementations, the data profiling engine 506 could be data discerning, whereby the data profiling engine applies prior context or knowledge about the dataset 502 to more rapidly identify the dataset's features.
The threshold modeling engine 508 can identify one or more anomalies by dynamically generating thresholds and/or setting static thresholds for particular data attributes (e.g., variable values, means, SD, interquartile range, and so forth) determined by the data profiling engine 506. For example, the threshold modeling engine 508 can identify anomalies in seasonal attributes based on historical data using univariate analysis by determining thresholds (e.g., ranges of variable values) during different times. The threshold modeling engine 508 can use historical data to establish baseline patterns (e.g., using an autoregressive integrated moving average (ARIMA) model) and continuously update the thresholds at various intervals, e.g., as new data (e.g., dataset 502) is ingested, at preset time intervals, or at preset data quantities. By using historical data, the threshold modeling engine 508 can account for expected variations and seasonal trends, reducing the likelihood of false positives.
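As a rough illustration of season-aware thresholds, the sketch below derives one value range per calendar month from historical observations. It deliberately swaps in a simple mean plus or minus k standard deviations band in place of an ARIMA baseline; the function names, the `k` parameter, and the sample history are hypothetical.

```python
import statistics
from collections import defaultdict

def seasonal_thresholds(history, k=3.0):
    """history: iterable of (month, value) pairs. Returns {month: (low, high)}."""
    by_month = defaultdict(list)
    for month, value in history:
        by_month[month].append(value)
    bands = {}
    for month, values in by_month.items():
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        bands[month] = (mean - k * sd, mean + k * sd)
    return bands

def is_anomalous(bands, month, value):
    """True when the value falls outside the band learned for that month."""
    low, high = bands[month]
    return value < low or value > high

# Illustrative history: December volumes are roughly ten times June volumes.
history = [(12, 900), (12, 1000), (12, 1100), (6, 90), (6, 100), (6, 110)]
bands = seasonal_thresholds(history)
print(is_anomalous(bands, 12, 1000), is_anomalous(bands, 6, 1000))
```

The same value of 1000 sits inside the December band but far outside the June band, which is the kind of false positive a single static threshold would not avoid. Recomputing `bands` whenever new data is ingested corresponds to the continuous update described above.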
The anomaly detection engine 510 can detect univariate and/or multivariate anomalies within dataset 502. The anomaly detection engine 510 can flag transactions that deviate significantly from established thresholds or exhibit unusual correlations (e.g., indicating potential errors) using methods discussed with reference to FIG. 15. The anomaly detection engine 510 can use one or more anomaly detection modeling techniques, such as clustering, regression analysis, anomaly score computation, and so forth, to identify outliers. The anomaly detection engine 510 can assign one or more anomaly scores for each data point in dataset 502 and compare the score against the established thresholds to determine if an anomaly exists. In some implementations, the anomaly detection engine 510 uses a majority vote between multiple models to assign the anomaly score. Methods of detecting anomalies within unstructured data are discussed with further reference to FIGS. 6-8. Methods of using an out-of-distribution prediction engine within the anomaly detection engine 510 that trains a machine learning model to identify whether a data object is out-of-distribution are discussed with further reference to FIGS. 15-19.
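The majority-vote idea can be sketched as follows, assuming three simple stand-in detectors that each return a yes/no flag rather than a learned anomaly score. The detector functions, their limits, and the baseline values are illustrative assumptions, not the models the engine uses.

```python
import statistics

def zscore_detector(values, x, limit=3.0):
    """Flag x when it is more than `limit` standard deviations from the mean."""
    mean, sd = statistics.fmean(values), statistics.pstdev(values)
    return sd > 0 and abs(x - mean) / sd > limit

def iqr_detector(values, x, k=1.5):
    """Flag x when it falls outside the interquartile fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return x < q1 - k * spread or x > q3 + k * spread

def range_detector(values, x):
    """Flag x when it lies outside the range seen in the baseline."""
    return x < min(values) or x > max(values)

def majority_vote(values, x, detectors):
    """x is anomalous when more than half of the detectors flag it."""
    votes = sum(1 for detect in detectors if detect(values, x))
    return votes * 2 > len(detectors)

baseline = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
detectors = [zscore_detector, iqr_detector, range_detector]
print(majority_vote(baseline, 50, detectors), majority_vote(baseline, 19.5, detectors))
```

Here 50 is flagged by all three detectors, while 19.5 is flagged only by the range check and is therefore not reported, which shows how voting dampens a single over-sensitive model.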
In some implementations, the anomaly detection engine 510 can detect anomalies using predefined context or knowledge bases. The context or knowledge bases can be tailored to the specific use case or application of dataset 502, such as appending dataset 502 to another dataset. A use case refers to a specific situation or scenario in which the dataset 502 is applied to achieve a particular goal (e.g., resolving missing values) or solve a specific problem (e.g., whether two datasets belong to the same corpus). For instance, an anomaly in dataset 502 can be identified if the data of dataset 502 exceeds a certain standard deviation threshold value from a reference dataset, indicating that the dataset 502 potentially fails to belong to the same corpus (e.g., group of artifacts, group of documents) as the reference dataset. In some implementations, the threshold value is configurable by a user of the data management engine 226. For example, the user can select how many standard deviations are allowed when determining if an observed set of values belongs to the same corpus as another set of values. If the standard deviation of both the observed set of values and the other set of values exceeds the user-defined standard deviation threshold, the observed values can be raised as an anomaly.
The root cause evaluation engine 512 can identify one or more events associated with (e.g., causing, linked to, mapped to) the anomalies using correlations between or among values of different data variables in the dataset 502 and identifying sequence patterns that precede anomalies. For instance, the root cause evaluation engine 512 can identify that a particular system error during data entry leads to inconsistencies in the dataset 502. The root cause evaluation engine 512 can use techniques such as causal inference, dependency analysis, and/or sequence mining to trace the anomaly back to its source. The source of an anomaly can be a specific variable or multiple variables within the dataset. For instance, an anomaly can be traced back to a single variable that is significantly higher or lower than the expected range. Alternatively, the source can include multiple variables that together form a pattern indicative of, for example, data entry errors, system errors, hardware malfunctions, and so forth.
In some implementations, the rule generation engine 514 can automatically generate/formulate association rules based on historical data patterns and observations. The association rules define expected data behaviors and relationships of dataset 502. For example, an association rule can state that if a value of a variable exceeds a certain threshold, the value of a different variable is of a certain category. Further methods of determining root causes of detected anomalies are discussed with reference to FIG. 12 and FIG. 15.
The remediation engine 516 can generate one or more actions (e.g., workflows, computer-executable tasks) to remediate anomalies. The actions can include data correction, alert generation, or the performance of one or more computer-executable tasks to rectify data inconsistencies. For instance, the remediation engine 516 can automatically correct data mismatches by referencing a master data source or filling in missing values of a dataset using predicted values. In conjunction or alternatively, the remediation engine 516 can use one or more predefined rules, machine learning models, and so forth to recommend and/or implement remediation actions upon user authorization. In some implementations, remediation engine 516 can integrate with external workflow management systems to automate remediation processes involving multiple tasks.
The modified dataset 522 refers to the dataset 502 after the remediation actions are performed on the dataset 502. In some implementations, modified dataset 522 can include enriched data, where missing values are imputed, or additional context is added based on reference data sources. The data management engine 226 can track changes to maintain a history of data modifications for audit purposes. The compliance report 520 can be generated by the data management engine 226 to document the compliance status of dataset 502 with specified data quality standards/guidelines/regulations. The compliance report 520 can include identified anomalies, remediation actions, data quality metrics, version, and so forth.
In some implementations, the information extraction engine 518 can extract data from unstructured sources and use dataset 502 to determine anomalies within the unstructured source and/or the dataset 502. For example, the information extraction engine 518 can use natural language processing (NLP) techniques and other methods discussed with reference to FIG. 15 to parse text, recognize entities, and transform unstructured data into a structured format. In some implementations, information extraction engine 518 can ingest text, audio, images, videos, and so forth.
FIG. 6 illustrates an example environment 600 of the data management engine 226 of FIG. 2 for remediating unstructured data. Environment 600 includes input documents 602, a summarization engine 604 that uses an AI model 606 to output summaries 608, a categorization engine 610 that outputs categories 612, a duplicate detection engine 614 that outputs duplicates 616, a knowledge conflict check engine 618 that outputs knowledge conflicts 620, a linkage detection engine 622 that outputs linked documents 624, an organizational reference engine 626 that outputs organizational references 628, results 630, a user interface 632, and user feedback 634. Implementations of example environment 600 can include different and/or additional components or can be connected in different ways. The different engines of environment 600 can run in parallel using, for example, separate AI models (e.g., agentic models). Though example environment 600 describes remediating unstructured documents, the data management engine 226 can similarly remediate unstructured data of any sort, including, but not limited to, audio data, image data, video data, and so forth. Methods of remediating unstructured data of different modalities are discussed in further detail with reference to FIG. 8.
The input documents 602 can represent a collection of unstructured data (or a mix of structured and unstructured data) that the data management engine 226 ingests via, for example, a user interface (e.g., the user interface 632). In some implementations, the input documents 602 are received from a computer system separate from one associated with the user interface 632. The input documents 602 can include various types of data such as text files, emails, chat logs, images, voice recordings, and so forth. In some implementations, the input documents 602 can include multimedia files and other free-form documents that lack a predefined data model.
The summarization engine 604 within the data management engine 226 uses an AI model 606 (e.g., a non-generative AI model, a generative AI model, a machine learning model, an LLM, and so forth) to generate summaries 608 of the input documents 602. The summarization engine 604 can, for example, categorize the input documents 602 into clusters based on vector comparisons of content within the documents. Methods of summarizing the documents are discussed in further detail with reference to FIG. 8. The summaries 608 provide a condensed version of the content to enable the data management engine 226 to remediate or otherwise process large volumes of unstructured data. In some implementations, the summarization engine 604 can use different AI models for different types of documents (e.g., a text-based model for text documents, an image-based model for images, and so forth).
The categorization engine 610 within the data management engine 226 can use the summaries 608 generated by the summarization engine 604 and categorize the summaries 608 into categories 612. The categorization engine 610 can group the summaries 608 based on their respective content. In some implementations, the categorization engine 610 uses predefined categories. In other implementations, the categorization engine 610 dynamically generates categories that reflect the themes present in the documents. The categorization engine 610 uses, for example, one or more AI models (e.g., ML models) to identify patterns and similarities in the documents (e.g., by determining a distance between vector representations of the documents) and group related summaries 608 together.
The categorization engine 610 can use one or more generative AI models (e.g., large language models (LLMs)) and/or term frequency-inverse document frequency (TF-IDF) algorithms to categorize the documents. Generative AI models can identify, for the input documents 602, concepts, entities, and the relationships between them and thus suggest categories based on the context and content of the documents. The data management engine 226 can additionally or alternatively determine the term frequency (TF), which is the number of times a term appears in a document, weighted against the inverse document frequency (IDF), which measures how common or rare a term is across the entire dataset. By multiplying these two metrics, TF-IDF identifies terms that are particularly “important” within individual documents while diminishing the weight of common terms that appear frequently across multiple documents.
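The TF-IDF weighting can be written out in a few lines. The sketch below uses the common textbook form, relative term frequency multiplied by log(N / document frequency); other variants add smoothing, and the description does not say which one is used. The function name and sample sentences are illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: non-empty list of token lists.

    Returns one {term: weight} dict per document, where
    weight = (count / document length) * log(N / documents containing term).
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the card was reported stolen".split(),
    "the account balance was updated".split(),
    "the transfer was approved".split(),
]
scores = tf_idf(docs)
print(scores[0]["the"], scores[0]["stolen"])
```

Terms such as "the" and "was" occur in every document, so their IDF is log(1) = 0 and they receive no weight, while "stolen" occurs in only one document and keeps a positive weight.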
The data management engine 226 can further perform category reduction (e.g., using text similarity algorithms) to ensure that the categories are not overly granular. In some implementations, the categorization engine 610 can create hierarchical categories, where a particular document is assigned to multiple categories. This hierarchical categorization allows for more nuanced organization and retrieval of documents based on multiple facets of their content. Methods of categorizing input documents 602 are discussed in further detail with reference to FIG. 8.
The duplicate detection engine 614 within the data management engine 226 identifies duplicates 616 within the unstructured data by generating intermediate and overall similarity values for pairs of summaries 608 and setting thresholds to detect duplicates. The duplicate detection engine 614 can compare the content of each summary to identify documents that contain similar or identical information. The duplicate detection engine 614 can, for example, determine vector similarities and apply predefined thresholds to determine if two summaries are duplicates. The data management engine 226 can embed document summaries into a vector database and use an embeddings model to detect similar documents by summary and full text through vector similarity search and text similarity algorithms. In some implementations, the results of the duplicate detection process can be classified into three actions: reject (true duplicate), accept (false duplicate), or review. In some implementations, the duplicate detection engine 614 can identify duplicates in different languages or a mix of languages by, for example, converting all documents to a common language. Methods of detecting duplicates are discussed in further detail with reference to FIG. 8.
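One way to picture the three-way outcome is a pair of cut-offs on the similarity score. The cut-off values and the function name below are hypothetical; the description does not state how the reject, accept, and review outcomes are separated.

```python
def triage_duplicate(similarity, reject_at=0.95, accept_below=0.60):
    """Map a similarity score in [0, 1] to one of the three actions.

    Both cut-offs are illustrative assumptions.
    """
    if similarity >= reject_at:
        return "reject"   # treated as a true duplicate
    if similarity < accept_below:
        return "accept"   # treated as a false duplicate
    return "review"       # ambiguous, route to a person

for score in (0.99, 0.80, 0.30):
    print(score, triage_duplicate(score))
```

Keeping a middle band for review means that only clear-cut cases are handled automatically.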
The knowledge conflict check engine 618 within the data management engine 226 detects knowledge conflicts 620 between pairs of summaries 608. For example, the engine maps summaries 608 to topics and information sets and compares vector representations of information sets that share a common topic to identify contradictions. For example, one document that suggests a different action than another creates a knowledge conflict. The knowledge conflict check engine 618 can use one or more AI models to extract the semantic context to detect inconsistencies. Additionally, once knowledge conflicts 620 are identified, the knowledge conflict check engine 618 can resolve them by flagging the knowledge conflicts 620 for human review (human in the loop), suggesting potential resolutions based on predefined rules, or automatically resolving the conflict if the predefined rules or AI confidence thresholds are met. The knowledge conflict check engine 618 can, for example, update the conflicting summaries, reorganize the affected document categories, and/or provide additional context to resolve the contradiction. The knowledge conflict check engine 618 can identify and resolve the knowledge conflicts 620 using methods discussed with further reference to FIG. 4. In some implementations, the knowledge conflict check engine 618 can automatically resolve conflicts by automatically executing one or more computer-executable instructions on one or more applications associated with the input documents 602 based on subsequently received user input (e.g., clicking an “approve” button, turning on a setting to enable the data management engine 226 to automatically correct the input documents 602, and so forth).
Thresholds used in the environment 600 can be dynamically determined using, for example, a separate AI model that evaluates how well a threshold satisfies a set of criteria or performance metrics. For example, if applying a threshold consumes computing resources above a certain limit, the model can automatically increase or decrease the threshold to reduce the amount of computing resources used. In some implementations, the thresholds are determined using a panel of AI models (e.g., LLMs) by, for example, taking a majority vote of the models.
The linkage detection engine 622 within the data management engine 226 indicates the evolution and/or lineage of documents defined by the linked documents 624. The linkage detection engine 622 can use document parser libraries to search for embedded links within the documents. The linkage detection engine 622 can track changes and updates across different versions of documents to provide a history of modifications, parent-child relationships, and so forth. In some implementations, the engine can generate visual representations of document linkages in the form of, for example, a knowledge graph, a tree structure, a table, or another data structure. In some implementations, the linkage detection engine 622 can dynamically update the visual representations as new versions of documents are created or existing documents are modified. Users thus have access to the most current depiction of document relationships. The linkage detection engine 622 can associate and display metadata for each linkage, including timestamps of changes, the author of modifications, the nature of the changes made, and so forth.
The organizational reference engine 626 within the data management engine 226 outputs organizational references 628 by mapping documents back into an organizational system of an organization. The organizational reference engine 626 can parse data within the organizational system to identify corresponding reference numbers or other metadata associated with the documents. In some implementations, the organizational reference engine 626 uses a customized small language model (SLM) that applies domain-specific data (e.g., organizational-specific data) to search for the organizational references 628. In some implementations, the organizational reference engine 626 can integrate with external systems to fetch additional metadata (e.g., references from regulatory authorities).
The results 630 can indicate the outputs of the data management engine 226 in the form of a report, a graphical representation, an image, a video, an audio file, and so forth. The data management engine 226 can compile the input documents 602, the summaries 608, the categories 612, the duplicates 616, the knowledge conflicts 620, the linked documents 624, and/or the organizational references 628 into a dataset that can be, in some implementations, exported to downstream systems through application programming interfaces.
The user interface 632 can display or otherwise indicate the results 630 and enables users to interact with the data management engine 226. The user interface 632 can provide different views and filters to aid users in navigating the data. Users can view summaries, categories, duplicates, knowledge conflicts, document linkages, and/or organizational references through the user interface 632. In some implementations, the user interface 632 can enable the input of the user feedback 634. The data management engine 226 can use the user feedback 634 to improve the accuracy and relevance of the summaries, categories, duplicates, knowledge conflicts, document linkages, and/or organizational references using methods discussed with further reference to FIG. 8. In some implementations, the user feedback 634 can be used to train the AI models within the platform. For example, if the user continuously re-uploads the same unstructured data, the data management engine 226 can modify one or more elements of environment 600 based on evaluating the user feedback 634 using, for example, the root cause evaluation engine 512 in FIG. 5.
FIG. 7A illustrates a snapshot of a user interface 700 of the data management engine 226 for displaying detected duplicates 704 of the unstructured data. The user interface 700 includes data 702 (e.g., input documents 602). The user interface 700 of the data management engine 226 can include a navigation menu providing options, such as “About,” “Tools,” and “Workspace,” to enable users to access various functionalities and additional information of the data management engine 226. The user interface 700 can include one or more indicators of an originating location of the data 702 (e.g., a file path input field that displays the path to the current dataset illustrated as “//windowshare/data/duplicates_found.json” in FIG. 7A).
The data 702 can include a visual representation of the input documents 602 in FIG. 6. The user interface 700 can present the data as, for example, a table organized into columns for different variables (e.g., “Friendly_id,” “Legacy_friend_id,” “Title” in FIG. 7A). The user interface 700 can include listed entries (e.g., within the table) that indicate the values of the variables. Duplicates 704 in the user interface 700 of the data management engine 226 refer to a section that identifies duplicate entries (e.g., detected by the duplicate detection engine 614 in FIG. 6) within the dataset. The user interface 700 can indicate a similarity score of the duplicates 704, which can quantitatively express the degree of similarity to identify procedural overlaps or redundancies. For example, FIG. 7A illustrates a similarity score of 79.6628749815041 between two procedures both associated with balance transfer checks. In some implementations, the user interface 700 indicates the degree of similarity using a binary indicator, a categorical indicator, multiple indicators, a hierarchical indicator, and so forth.
FIG. 7B illustrates a snapshot of the user interface 700 of the data management engine 226 for displaying identified categories 706 of the unstructured data. Categories 706 in the user interface 700 of the data management engine 226 represent the classified groups that organize and manage the unstructured data (e.g., generated by the categorization engine 610 of FIG. 6). For example, categories 706 in FIG. 7B include different operational areas such as Fraud Management, Account Management, First Track Operations, Customer Service, Account Updates, and so forth. Each category can be mapped to procedures and documents sharing common activities.
FIG. 8 is a flow diagram illustrating an example process 800 of remediating anomalies using the data management engine 226 of FIG. 2. In some implementations, the process 800 is performed by components of example devices 2300 illustrated and described in more detail with reference to FIG. 23. Particular entities, for example, the AI model(s), are illustrated and described in more detail with reference to AI system 2100 in FIG. 21. Implementations of process 800 can include different and/or additional operations or can perform the operations in different orders.
In operation 802, the data management engine 226 can obtain (e.g., receive via a user input to a user interface) a plurality of unstructured data (e.g., documents, images, video, audio, emails, chat logs, and so forth). For example, a user interface can enable users to upload various types of unstructured data. The unstructured data can be obtained using features such as drag-and-drop interfaces, file selection dialogs, and/or direct integrations with cloud storage services. One or more unstructured data of the plurality of unstructured data can include a content set. Each uploaded file can be parsed to extract the actual (e.g., existing) content within the document, which can be in various formats.
In operation 804, the data management engine 226 can generate, using a first AI model set, multiple summaries defining the plurality of unstructured data. For example, the data management engine 226 can categorize each unstructured document of the plurality of unstructured data into one or more clusters by comparing respective vector representations of content sets of pairs of unstructured data within the plurality of unstructured data. Each unstructured document can be converted into a vector representation that captures its semantic content. Techniques like Word2Vec, GloVe, or transformer-based models such as BERT can be used to produce these vector representations. A first distance set between vector representations corresponding to pairs of unstructured data categorized into a common cluster can be less than a second distance between vector representations corresponding to pairs of unstructured data categorized into different clusters (i.e., similar documents are grouped together).
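The clustering step can be sketched with toy vectors standing in for learned embeddings. The greedy, seed-based grouping, the `max_distance` cut-off, and the three-dimensional vectors below are assumptions chosen for brevity; in practice the vectors would come from an embedding model such as those named above, and a different clustering algorithm could be used.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def cluster(vectors, max_distance=0.2):
    """Greedy clustering: join the first cluster whose seed is close enough."""
    clusters = []  # each cluster is a list of indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_distance(vec, vectors[members[0]]) <= max_distance:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

embeddings = [
    [0.9, 0.1, 0.0],  # e.g., a fraud procedure
    [0.8, 0.2, 0.0],  # another fraud procedure
    [0.0, 0.1, 0.9],  # e.g., an account update memo
    [0.1, 0.0, 0.8],  # another account update memo
]
groups = cluster(embeddings)
print(groups)
```

Documents 0 and 1 end up in one cluster and documents 2 and 3 in another, so distances within a cluster are smaller than distances across clusters, matching the property described above.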
For each particular cluster, the data management engine 226 can summarize the content set corresponding to respective unstructured data of the particular cluster. For example, within each cluster, the data management engine 226 generates summaries by identifying and selecting the most representative sentences or documents that capture the main ideas of that cluster's content. This selection is made by ranking sentences, portions of sentences, or documents based on their importance, such as the frequency of particular terms, the presence of phrases where the vector representation of the phrases is the closest in distance to the vector representation of the topic, and so forth. For example, the data management engine 226 can select and extract sentences directly from the original data or generate new sentences that encapsulate the highest-ranked terms of the original data.
In some implementations, the data management engine 226 detects knowledge conflicts. For example, the second AI model set can identify at least one content conflict between the one or more pairs of summaries within the multiple summaries by mapping a first summary of the multiple summaries to (1) a topic and (2) a first information set and mapping a second summary of the multiple summaries to (1) the topic and (2) a second information set. The data management engine 226 can determine an associated topic by using one or more NLP techniques to identify themes within the summary. The data management engine 226 can use, for example, the frequency and distribution of terms to detect particular keywords and phrases to highlight terms that characterize the document's content. The identified keywords and phrases are mapped to predefined or dynamically generated topics. The data management engine 226 can, for example, compare the terms in the summary to a database of topic models (which can be curated using domain-specific data). The data management engine 226 can assign the summary to the most relevant topic based on the highest similarity scores with these models.
The data management engine 226 can extract the information set by defining the entities and their relationships within the summary. The data management engine 226 can identify and categorize entities such as names, dates, organizations, and other elements within the summary. The extracted information sets and topics can be transformed into vector representations to numerically encode the semantic content of the summary. If the first and second information sets are different (e.g., by comparing vector representations corresponding to the first and second summaries and determining that a degree of similarity between the vector representations fails a predefined threshold), the data management engine 226 can identify the content as a knowledge conflict. For example, the data management engine 226 compares these vector representations of summaries that share the same topic. The cosine similarity measure can be used to quantify how similar or different these vectors are. If the similarity between the vectors falls below a predefined threshold (i.e., the information sets are determined to be dissimilar despite sharing a similar topic), the data management engine 226 flags the information sets as a knowledge conflict.
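The conflict test described above reduces to two conditions: the summaries share a topic, and their information-set vectors are dissimilar. A minimal sketch, assuming the topic labels and vectors have already been produced upstream; the tuple layout, the `min_similarity` value, and the sample data are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_conflicts(summaries, min_similarity=0.5):
    """summaries: list of (summary_id, topic, info_vector) tuples.

    Returns the pairs that share a topic but whose information-set
    vectors fall below the similarity threshold.
    """
    conflicts = []
    for i in range(len(summaries)):
        for j in range(i + 1, len(summaries)):
            id_a, topic_a, vec_a = summaries[i]
            id_b, topic_b, vec_b = summaries[j]
            if topic_a == topic_b and cosine_similarity(vec_a, vec_b) < min_similarity:
                conflicts.append((id_a, id_b))
    return conflicts

summaries = [
    ("doc-1", "card replacement", [0.9, 0.1]),
    ("doc-2", "card replacement", [0.1, 0.9]),
    ("doc-3", "card replacement", [0.8, 0.2]),
    ("doc-4", "address change", [0.1, 0.9]),
]
conflicts = find_conflicts(summaries)
print(conflicts)
```

Here doc-2 conflicts with both doc-1 and doc-3, while doc-4 is never compared against them because it covers a different topic, even though its vector equals that of doc-2.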
In some implementations, the data management engine 226 categorizes the unstructured data by generating an intermediate category set. For example, the data management engine 226 can categorize each summary in the set of summaries into one or more categories using a respective content set of the summary. To refine the intermediate categories further, the data management engine 226 can generate vector representations of each category. The data management engine 226 can calculate a degree of similarity between the vector representations of different categories and thus generate an overall category set by combining one or more categories in the intermediate category set using the degree of similarity between vector representations of the one or more categories. Categories with high similarity scores can be grouped together since this indicates a semantic similarity between the categories.
In operation 806, the data management engine 226 can identify, using a second AI model set (same as or different from the first set of AI models), at least one duplicate content set between one or more pairs of summaries within the multiple summaries. For example, the data management engine 226 can detect similar documents (or other modalities of data) using the summaries (e.g., vector similarity search). The data management engine 226 can perform a vector similarity search to detect similar documents by comparing vector representations of the one or more pairs of summaries, measuring the distance between their corresponding vector representations to obtain an intermediate similarity value. Distance metrics such as cosine similarity, Euclidean distance, and the like can be used to quantify these distances. If the intermediate similarity value for a pair of summaries satisfies (e.g., meets or exceeds) a first predefined threshold, the data management engine 226 can detect similar documents using the full text (e.g., text similarity algorithms). For example, the data management engine 226 generates an overall similarity value by comparing the content sets corresponding to the pairs of summaries. The data management engine 226 can compare the full text of the documents corresponding to the pairs of summaries. Techniques such as TF-IDF or other NLP methods can be used to compare the content sets of the documents and generate the overall similarity value based on this detailed full-text comparison. Duplicative content can be identified by determining that the overall similarity value of the one or more pairs of summaries satisfies a second predefined threshold. If the overall similarity value meets or exceeds this threshold, the documents are flagged as duplicates.
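The two-stage check can be sketched as a cheap vector comparison on the summaries followed, only when that passes, by a full-text comparison. In the sketch below a token-set Jaccard overlap stands in for the full-text similarity technique, and the dictionary keys, thresholds, and sample documents are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def jaccard(text_a, text_b):
    """Share of distinct lowercase tokens that the two texts have in common."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)

def is_duplicate(doc_a, doc_b, first_threshold=0.9, second_threshold=0.7):
    """Each doc is a dict with 'summary_vec' and 'text' keys (assumed shape)."""
    intermediate = cosine_similarity(doc_a["summary_vec"], doc_b["summary_vec"])
    if intermediate < first_threshold:
        return False  # stage 1 failed, so the full-text comparison is skipped
    overall = jaccard(doc_a["text"], doc_b["text"])
    return overall >= second_threshold

a = {"summary_vec": [0.9, 0.1], "text": "verify the balance transfer check before posting"}
b = {"summary_vec": [0.88, 0.12], "text": "verify the balance transfer check before posting it"}
c = {"summary_vec": [0.9, 0.1], "text": "close the account after the customer confirms by phone"}
print(is_duplicate(a, b), is_duplicate(a, c))
```

Documents a and b pass both stages. Documents a and c have near-identical summary vectors but share almost no text, so the second stage rejects the pair, which is the reason for running the more detailed comparison after the vector search.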
In some implementations, the data management engine 226 generates, using a third AI model set (same as or different from the first and second sets of AI models), a reconfiguration command set configured to remove the at least one duplicate content set and/or content sets associated with knowledge conflicts from the content sets within the plurality of unstructured documents. For example, the data management engine 226 can identify one or more unstructured documents within the unstructured document set that corresponds to the at least one duplicate content and/or knowledge conflict and select a portion of the one or more unstructured documents by mapping the one or more unstructured documents to a predefined ranked rule set. The predefined ranked rule set can rank the one or more unstructured documents using a timestamp of a corresponding document, an author of a corresponding document, a version of a corresponding document, a status of a corresponding document, and so forth. For example, the data management engine 226 can select the most recent version of a document (determined by the timestamp), select documents authored by recognized experts, or select documents marked as the latest version.
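A ranked rule set of this kind can be expressed as an ordered sort key. The sketch below ranks by status, then version, then timestamp; that ordering, the status labels, the document fields, and the `("delete", id)` command format are all illustrative assumptions rather than the disclosed rule set.

```python
from datetime import date

# Hypothetical status labels; a higher rank wins.
STATUS_RANK = {"approved": 2, "draft": 1, "retired": 0}

def select_authoritative(documents):
    """Pick the document to keep from a group of duplicates.

    Rules are applied in order: status first, then version, then timestamp.
    """
    return max(
        documents,
        key=lambda d: (STATUS_RANK.get(d["status"], -1), d["version"], d["timestamp"]),
    )

def removal_commands(documents):
    """Emit one delete command for every document that is not kept."""
    keep = select_authoritative(documents)
    return [("delete", d["id"]) for d in documents if d["id"] != keep["id"]]

group = [
    {"id": "A", "status": "approved", "version": 2, "timestamp": date(2023, 5, 1)},
    {"id": "B", "status": "approved", "version": 3, "timestamp": date(2024, 1, 15)},
    {"id": "C", "status": "draft", "version": 4, "timestamp": date(2024, 3, 2)},
]
commands = removal_commands(group)
print(select_authoritative(group)["id"], commands)
```

Document B is kept because it is the highest approved version, even though draft C is newer. Changing the order of the tuple in the sort key changes which rule takes precedence.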
In some implementations, the AI model assigns a priority score to each document. Subsequently, the data management engine 226 can generate a command set that defines the actions used to remove the duplicate content. The actions can include deleting the duplicate sections, merging information from multiple versions, or consolidating data into a single authoritative document. The data management engine 226 can, in some implementations, automatically execute the reconfiguration command set on the plurality of unstructured documents to modify the portion of the one or more unstructured documents to remove the at least one duplicate content from the content sets within the plurality of unstructured documents.
The reconfiguration commands can include computer-executable instructions to perform an automatic execution of one or more workflows for a first type of duplicate content and/or trigger a notification of one or more alerts for a second type of duplicate content. For instance, the data management engine 226 can identify the specific sections of documents that contain duplicate content, and the reconfiguration commands can instruct the data management engine 226 to either merge or delete these sections based on predefined rules. Additionally, the data management engine 226 can trigger alerts for duplicates that require user review. The modified data can be automatically displayed on a user interface.
In some implementations, the data management engine 226 generates and displays, on the user interface, a compliance report indicating (i) the identified at least one duplicate content and (ii) the reconfiguration command set. The compliance report can be generated automatically after the execution of the reconfiguration commands and provides a summary of the actions taken. The report can include, for example, information such as the document IDs, the sections affected, the nature of the duplicate content, the specific modifications made, and so forth.
FIG. 9 illustrates an example environment 900 of the data profiling engine 506 of the data management engine 226 of FIG. 2 for automatically detecting features of an ingested dataset. Environment 900 includes variables 902 and observations 904. Implementations of example environment 900 can include different and/or additional components or can be connected in different ways.
The data profiling engine 506 can identify variables 902, which represent the different attributes or fields within the dataset (e.g., dataset 502 in FIG. 5). For example, in FIG. 9, the attributes can include “Identifier,” “count,” and “length.” The variables 902 can include numerical data, categorical data, dates, and other types of data points that define the structure of the dataset. For example, in a customer database, variables 902 can include customer ID, name, birth date, transaction amount, and product category. In some implementations, variables 902 can include derived attributes, such as calculated fields or aggregated metrics.
Further, the data profiling engine 506 can identify observations 904, which refer to the individual records or entries within the dataset that contain values for each of the variables 902. Each observation 904 represents a single instance of data, such as a row in a database table. Observations 904 can be a single value or multiple values. For example, in a sales dataset, an observation 904 can represent a single transaction, including details such as the transaction ID, date, customer ID, and amount. In some implementations, observations 904 can include time-series data, where each observation represents a data point in a sequence over time.
FIG. 10 illustrates an example chart 1000 of a threshold modeling engine 508 of the data management engine 226 of FIG. 2 for dynamically detecting univariate anomalies of the dataset. Chart 1000 includes observations 1002 and anomalies 1004. Implementations of example chart 1000 can include different and/or additional components or can be connected in different ways.
The observations 1002 can be the same as or similar to observations 904. The observations 1002 can refer to the individual data points or records within the dataset that are analyzed to detect anomalies. Each observation contains values for one or more variables, representing a single instance of data. The anomalies 1004 are a subset of the observations 1002 that deviate significantly from the expected patterns or thresholds established by the threshold modeling engine 508. The deviations can indicate potential errors or other unusual activities. The threshold modeling engine 508 can operate within chart 1000 to dynamically detect univariate anomalies by identifying the distribution and variability of observations 1002. The threshold modeling engine 508 can establish dynamic thresholds that adapt to changes in the data over time. For instance, the threshold modeling engine 508 can adjust the threshold for acceptable observation values based on historical data, accounting for seasonal variations. In some implementations, the threshold modeling engine 508 can use autoregressive integrated moving average (ARIMA) models to forecast future values and detect anomalies based on predicted trends.
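As a simplified, non-limiting sketch of a dynamic threshold, each observation can be compared against a band computed from a trailing window of earlier observations. The rolling mean and standard deviation are assumptions made for this example; they stand in for whatever threshold model (e.g., a seasonal or ARIMA-based model) a given implementation uses, and the function name, window size, and multiplier `k` are hypothetical.

```python
import statistics


def dynamic_anomalies(values, window=5, k=3.0):
    """Return indices whose value falls outside mean +/- k*std of the
    preceding `window` observations, so the band adapts over time."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        std = statistics.pstdev(history)
        if std == 0:
            continue  # flat history: no variability to scale against
        if abs(values[i] - mean) > k * std:
            flagged.append(i)
    return flagged
```

Because the band is recomputed at every step, a gradual shift in the data moves the threshold with it, while an abrupt spike falls outside the band.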
FIG. 11 illustrates an example environment 1100 of an anomaly detection engine 510 of the data management engine 226 of FIG. 2 for dynamically detecting multivariate anomalies of the dataset. Environment 1100 includes anomaly detection model 1102, binary tree 1104, non-flagged observation 1106, flagged observation 1108, and anomaly 1110. Implementations of example environment 1100 can include different and/or additional components or can be connected in different ways.
The anomaly detection model 1102, such as an isolation forest, can be used to identify anomalies within a dataset by isolating observations that deviate significantly from the norm. The anomaly detection model 1102 can, for example, construct multiple binary trees (isolation trees) to partition the data. Observations that require fewer splits to isolate can be considered anomalies. In some implementations, anomaly detection model 1102 can use other techniques such as clustering-based methods (e.g., DBSCAN), statistical methods (e.g., Z-score), or neural networks (e.g., autoencoders) to detect anomalies. The binary tree 1104 within the anomaly detection model 1102 is a data structure that can be used by the anomaly detection model 1102 to recursively partition the dataset into smaller subsets. Each node in the binary tree can represent a decision based on a feature value, and the branches can represent the possible outcomes of the decision. The partitioning continues until each observation is isolated in a leaf node. In some implementations, binary tree 1104 can be replaced with other non-tree or tree-based structures, such as decision trees or random forests, which can also be used for anomaly detection by evaluating the depth of the nodes where observations are isolated.
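The principle that observations requiring fewer splits to isolate are more likely anomalous can be illustrated with the simplified sketch below. It follows a single point down randomly generated partitions and averages the number of splits over many trials. It omits the subsampling and score normalization of a full isolation forest, and the function names, depth cap, and sample data are assumptions for illustration only.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=10):
    """Number of random axis-aligned splits needed to isolate `point`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))
    low = min(row[feature] for row in data)
    high = max(row[feature] for row in data)
    if low == high:
        return depth  # cannot split further on this feature
    split = rng.uniform(low, high)
    # Keep only the rows on the same side of the split as the point.
    if point[feature] < split:
        subset = [row for row in data if row[feature] < split]
    else:
        subset = [row for row in data if row[feature] >= split]
    return path_length(point, subset, rng, depth + 1, max_depth)


def average_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth of `point` over `n_trees` random trees."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical data: a tight cluster of 20 points plus one distant point.
cluster = [(x / 10, y / 10) for x in range(5) for y in range(4)]
outlier = (10.0, 10.0)
dataset = cluster + [outlier]
```

With this data, `average_path_length(outlier, dataset)` is expected to be noticeably smaller than `average_path_length(cluster[0], dataset)`, since most random splits separate the distant point immediately.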
The non-flagged observation 1106 refers to data points within the dataset that are not identified as anomalies by the anomaly detection model 1102. These observations fall within the expected range of values and patterns established by the model. The flagged observation 1108 refers to data points that are identified as potential anomalies by the anomaly detection model 1102. The flagged observation 1108 exhibits unusual patterns or values that deviate from the norm and are flagged for further investigation. The degree of deviation can be customizable by a user. In some implementations, flagged observation 1108 can be prioritized based on the severity of the deviation or other user-provided context (e.g., type of deviation, extent of deviation). The anomaly 1110 can refer to a specific type of flagged observation 1108 that has been confirmed as an anomaly. Anomalies 1110 represent significant deviations (e.g., above a certain threshold) from the expected patterns and can indicate errors or other unusual activities.
FIG. 12 illustrates an example environment 1200 of a root cause evaluation engine 512 of the data management engine 226 of FIG. 2 for identifying root causes of the anomalies of the dataset. Environment 1200 includes antecedent 1202 and consequent 1204. Implementations of example environment 1200 can include different and/or additional components or can be connected in different ways.
The antecedent 1202 refers to the condition or set of conditions that precede and potentially cause an observed anomaly in the dataset. For an association rule, the antecedent is the “if” portion of the rule, representing the combination of factors that lead to a particular outcome. On the other hand, the consequent 1204 is the outcome or result that follows from the antecedent 1202 in an association rule. The consequent 1204 represents the “then” portion of the rule, indicating the effect or anomaly that occurs when the antecedent conditions are met. The root cause evaluation engine 512 operates within environment 1200 to identify the antecedent 1202 and consequent 1204 relationships that explain the root causes of anomalies in the dataset. The root cause evaluation engine 512 can use association rule mining to discover patterns and correlations between different variables. Further methods of determining the root cause are discussed with reference to FIG. 10.
FIG. 13 illustrates an example environment 1300 of a remediation engine 516 of the data management engine 226 of FIG. 2 for remediating the anomalies of the dataset. Environment 1300 includes association rule 1302, observation 1304, observed variable values 1306, and recommended variable values 1308. Implementations of example environment 1300 can include different and/or additional components or can be connected in different ways.
The association rule 1302 refers to a rule derived from data mining techniques that identifies relationships between variables in the dataset. The association rule 1302 can include an antecedent (if portion) and a consequent (then portion), indicating that when certain conditions are met, a specific outcome is likely to occur. The observation 1304 can be the same as or similar to observations 904 and/or observation 1002. The observed variable values 1306 refer to the values of the variables in an observation 1304. The values are used to evaluate the observation against the association rules 1302 to determine if any anomalies are present (e.g., missing values). The recommended variable values 1308 are the suggested values for the variables in an observation 1304 that would align the observation with the expected patterns or rules. The recommendations can be generated by the remediation engine 516 based on the association rules 1302 and the identified anomalies. The remediation engine 516 compares the association rules 1302 against expected association rules and generates recommended variable values 1308 to address identified anomalies. For example, if a particular association rule indicates a particular bias not within the operative boundaries of the dataset's use case (e.g., a social bias in a financial risk assessment use case), the remediation engine 516 can identify the particular association rule as an anomaly.
FIG. 14 is a flow diagram illustrating an example process 1400 of remediating anomalies using the data management engine 226 of FIG. 2. In some implementations, the process 1400 is performed by components of example devices 2300 illustrated and described in more detail with reference to FIG. 23. Particular entities, for example, the AI model(s), are illustrated and described in more detail with reference to AI system 2100 in FIG. 21. Implementations of process 1400 can include different and/or additional operations or can perform the operations in different orders.
In operation 1402, the data management engine 226 can receive a dataset (structured or non-structured) including an observed set of values for one or more variables in a set of variables. In some implementations, the data management engine 226 can receive the dataset through various data ingestion methods, such as integrating with one or more application programming interface(s) (API(s)). The dataset can be sourced from multiple origins, including databases, data lakes, cloud storage, or external APIs. For structured data, the dataset can be in formats such as CSV, JSON, XML, or SQL tables, while unstructured data can include text files, images, audio recordings, or video files.
In operation 1404, the data management engine 226 can identify, using a first set of AI models, a set of anomalies in the observed set of values of one or more variables in the structured dataset. The data management engine 226 can determine multiple reference patterns that correspond to an expected set of values for the set of variables and/or compare an observed set of patterns in the observed set of values against the multiple reference patterns.
In some implementations, the models can include supervised learning models, such as decision trees, support vector machines, and neural networks, which are trained on labeled datasets to recognize normal and anomalous patterns. The models can additionally or alternatively include unsupervised learning models, such as clustering algorithms (e.g., K-means, DBSCAN) to group similar observations together based on their features and anomaly detection models (e.g., isolation forests, one-class SVMs), which do not require labeled data and can detect anomalies based on deviations from the learned patterns of the dataset. For example, an isolation forest model (e.g., anomaly detection model 1102 in FIG. 11) can construct multiple binary trees to partition the data and isolate observations that deviate beyond a certain threshold from the norm. Observations that require fewer splits to isolate can be considered anomalies. In some implementations, the engine can use ensemble methods, combining the outputs of multiple models. For example, the engine can use a combination of statistical tests, machine learning models, and clustering algorithms and aggregate the results of the multiple models by using a weighted score or using a majority vote.
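The two aggregation strategies mentioned above, a majority vote and a weighted score, can be sketched as follows. The function names, the strict-majority rule, and the assumption that each detector emits either a boolean flag or a score between 0 and 1 are illustrative choices only.

```python
def majority_vote(votes):
    """True when more than half of the detectors flag the observation."""
    return sum(1 for vote in votes if vote) > len(votes) / 2


def weighted_score(scores, weights):
    """Weighted average of per-detector anomaly scores."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total
```

For instance, `weighted_score([1.0, 0.0], [3.0, 1.0])` gives the first detector three times the influence of the second, which may be appropriate when one model has been more reliable historically.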
The data management engine 226 can determine multiple reference patterns that correspond to an expected set of values for the set of variables. The reference patterns can be derived from historical data, statistical analysis, and/or domain-specific knowledge. For example, the data management engine 226 can use time-series analysis to identify seasonal trends and patterns in the data, or the data management engine 226 can calculate expected ranges and distributions for the variables (e.g., using chi-square goodness of fit tests). In some implementations, the engine can use dynamic threshold models to adaptively set thresholds based on historical behavior. For example, the data management engine 226 can evaluate past data to determine the typical range of values for a variable during different times of the year and set dynamic thresholds that account for seasonal variations. Thus, the data management engine 226 can detect anomalies that deviate from the expected reference patterns while accounting for natural fluctuations in the data.
The data management engine 226 can compare the observed set of patterns in the observed set of values against the multiple reference patterns to identify anomalies. For example, the data management engine 226 can calculate the Z-score (e.g., how many standard deviations an element is from the mean of the dataset) for each observed value to determine whether the value significantly deviates from the mean. Values whose Z-score magnitude exceeds a certain threshold (e.g., an absolute value greater than 3) can be considered anomalies, indicating that they are rare and unusual compared to the rest of the data. In another example, the data management engine 226 can calculate the Mahalanobis distance, which measures the distance between a point and a distribution, to identify multivariate anomalies (i.e., data points that are beyond a certain threshold from the center of the distribution).
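As a non-limiting sketch, the Z-score test and a two-variable Mahalanobis distance can be computed as shown below. The function names are hypothetical, the population standard deviation and covariance are assumed, and the Mahalanobis example is restricted to two variables so that the covariance matrix can be inverted in closed form.

```python
import math
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Indices of values more than `threshold` standard deviations
    away from the mean (in either direction)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2-D point from the distribution of `data`."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / n
    syy = sum((y - my) ** 2 for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data) / n
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T for the 2x2 case.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(d2, 0.0))
```

Note that the Z-score test needs a reasonably large sample: with a population standard deviation, no value among n observations can have a Z-score magnitude above the square root of n - 1, so a threshold of 3 can never fire on ten or fewer values.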
In operation 1406, using a second set of AI models, the data management engine 226 can evaluate the identified set of anomalies by dynamically generating an observed set of association rules configured to cause the second set of AI models to generate the observed set of values in the structured dataset and/or compare the observed set of association rules with an expected set of association rules to determine one or more observed association rules corresponding to the set of anomalies. The generated association rules can describe the relationships between different variables in the dataset. For example, an association rule can state that if variable A has a certain value, then variable B is likely to have a specific value. The data management engine 226 can identify frequent itemsets, which are combinations of variable values that occur frequently within the dataset, by counting the occurrences of different itemsets and determining which itemsets meet a predefined support threshold, indicating that they are frequent.
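The counting step described above can be illustrated with the brute-force sketch below, which enumerates every itemset up to a maximum size and keeps those meeting the support threshold. This is a simplified stand-in for pruning algorithms such as Apriori; the function name, the `max_size` limit, and the "variable=value" item encoding are assumptions for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(records, min_support, max_size=2):
    """Map each itemset (a sorted tuple of items) to its support,
    keeping only itemsets whose support is at least `min_support`."""
    counts = Counter()
    for record in records:
        items = sorted(record)
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(records)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}
```

Each record is treated as a set of items, so a row with region EU and currency EUR would be encoded as `{"region=EU", "currency=EUR"}`.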
From these frequent itemsets, the data management engine 226 can generate association rules that describe the relationships between different variables. Each rule can have the form “If antecedent, then consequent,” where the antecedent and consequent are subsets of the itemset. The data management engine 226 can calculate metrics such as confidence, which measures the proportion of records containing the antecedent that also contain the consequent, and/or lift, which quantifies how much more likely the consequent (the outcome) is to occur when the antecedent (the condition) is present compared to when the antecedent is not present (i.e., the degree to which the occurrence of the antecedent increases the likelihood of the consequent occurring).
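For illustration, the support, confidence, and lift of a single rule can be computed as follows. Lift is computed here using the common definition, confidence divided by the overall support of the consequent, which is an assumption about the intended metric; the function name and the record encoding are likewise hypothetical.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for 'if antecedent then consequent'."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(records)
    count_a = sum(1 for r in records if antecedent <= r)
    count_c = sum(1 for r in records if consequent <= r)
    count_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = count_both / n
    # Share of antecedent-containing records that also contain the consequent.
    confidence = count_both / count_a if count_a else 0.0
    # Confidence relative to the consequent's baseline frequency.
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift
```

A lift above 1 indicates that the consequent occurs more often alongside the antecedent than its baseline frequency would suggest, and a lift near 1 indicates little association.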
The expected set of association rules can be derived from historical data, domain knowledge, or predefined guidelines. By comparing the observed rules with the expected rules, the engine can identify which rules deviate from the norm and are associated with the anomalies. For example, if an observed association rule is not found in the expected association rules, the observed association rule can be flagged as a potential anomaly. Additionally, or alternatively, if an observed association rule shows a significantly higher lift value than a corresponding expected association rule, it may indicate a stronger-than-expected association between the variables, potentially signaling an anomaly. Conversely, if an observed association rule has a much lower support or confidence value than the expected association rule, it may indicate that the expected pattern is not occurring as frequently as anticipated, which could also be a sign of an anomaly.
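The three comparisons described above (a rule absent from the expected set, a lift well above the expected lift, and a confidence well below the expected confidence) can be sketched as follows. The dictionary layout, the function name, and the ratio cut-offs 1.5 and 0.5 are hypothetical values chosen for the example.

```python
def flag_rules(observed, expected, lift_ratio=1.5, confidence_ratio=0.5):
    """Return {rule: reason} for observed rules that deviate from expectation.

    Both arguments map a rule key to a dict with 'lift' and 'confidence'.
    """
    flags = {}
    for rule, metrics in observed.items():
        reference = expected.get(rule)
        if reference is None:
            flags[rule] = "not in expected rules"
        elif metrics["lift"] > lift_ratio * reference["lift"]:
            flags[rule] = "lift higher than expected"
        elif metrics["confidence"] < confidence_ratio * reference["confidence"]:
            flags[rule] = "confidence lower than expected"
    return flags
```

Rules that match their expected counterparts within the tolerances are simply omitted from the result.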
In operation 1408, using a third set of AI models, the data management engine 226 can generate a set of reconfiguration commands to remove the identified set of anomalies. For example, the data management engine 226 can identify a portion of the observed sets of values corresponding to the one or more observed association rules corresponding to the set of anomalies and map the portion of the observed sets of values to one or more expected association rules configured to adjust the portion of the observed set of values to a corresponding expected set of values. For example, if an observed association rule indicates that a certain combination of variable values is anomalous, the data management engine 226 can refer to the corresponding expected association rule to identify the expected values. The data management engine 226 can generate reconfiguration commands that specify the adjustments to transform the observed values to the expected values. In some implementations, the data management engine 226 can select the third set of AI models from multiple AI models using a respective set of performance metric values (e.g., accuracy, precision, recall, F1 score, mean squared error, and so forth) of each of the multiple AI models.
In operation 1410, the data management engine 226 can automatically execute the set of reconfiguration commands on the structured dataset to modify the one or more observed association rules corresponding to the set of anomalies to align with the one or more expected association rules. The data management engine 226 can use SQL queries to select the observations specified in the reconfiguration commands. The data management engine 226 can update the values of specific variables, recalculate derived fields, adjust the relationships between variables, and so forth. For example, if an observed association rule indicates that a certain combination of variable values is anomalous, the engine updates the values of the affected variables to match the expected combination specified by the corresponding expected association rule.
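As a minimal sketch of executing reconfiguration commands through SQL, each command below is applied as one parameterized UPDATE against an SQLite database. The table name `observations`, its `region` and `currency` columns, and the command keys `region`, `observed`, and `expected` are hypothetical and serve only to make the example concrete.

```python
import sqlite3


def execute_reconfiguration(conn, commands):
    """Apply each command as a parameterized UPDATE; return rows changed."""
    changed = 0
    with conn:  # commits on success, rolls back if an error is raised
        for command in commands:
            cursor = conn.execute(
                "UPDATE observations SET currency = :expected "
                "WHERE region = :region AND currency = :observed",
                command,
            )
            changed += cursor.rowcount
    return changed
```

A command such as `{"region": "EU", "observed": "USD", "expected": "EUR"}` selects the observations matching the anomalous combination and rewrites them to the expected combination. Running all commands inside one transaction keeps the dataset unchanged if any single update fails.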
In some implementations, the data management engine 226 can receive an unstructured dataset from one or more of text documents, emails, chat logs, images, or voice recordings. Using a fourth set of AI models, the data management engine 226 can evaluate the unstructured dataset against a set of predefined criteria. For example, the data management engine 226 can extract a set of information from the unstructured dataset, which can include a set of entities. The data management engine 226 can evaluate the set of information against a set of threshold values of the set of predefined criteria by measuring, for example, a degree of completeness of the extracted information, a degree of accuracy of the extracted information, a degree of satisfaction of the extracted information with specific formats of the set of predefined criteria, and so forth.
Using the evaluation, the fourth set of AI models can identify a portion of the extracted information failing to satisfy the set of threshold values. The data management engine 226 can generate a set of actions (e.g., reconfiguration commands) to increase the degree of satisfaction of the extracted information against a set of predefined criteria. The reconfiguration commands can include, for example, instructions to automatically execute a set of workflows for a first type of anomaly, triggering one or more alerts for a second type of anomaly, and so forth.
The data management engine 226 can display an artifact such as a compliance report indicating one or more of (i) the identified set of anomalies, (ii) the set of actions, (iii) a degree of satisfaction of the unstructured dataset with the set of predefined criteria, and so forth. The reports can be presented in various formats, such as dashboards, charts, tables, chatbots, and so forth.
FIG. 15 illustrates an aspect of an environment 1500 for an out-of-distribution prediction engine 1540 of the data management engine 226 (e.g., within the service authorization system 200) in which an implementation may be practiced. In some implementations, users 1502 of this environment 1500 include but are not limited to client users of the out-of-distribution prediction engine 1540. In at least one implementation, as illustrated in FIG. 15, the environment 1500 includes an out-of-distribution prediction engine 1540, as described herein, that receives a training document of training data 1506 that may be used to train a machine learning model 1514. In at least one implementation, a feature extraction module 1510 identifies and extracts relevant features of the training data 1506 or input data, such as documents 1508, to be further processed (e.g., encoding, embedding, and/or masking) by a pre-processing module 1512 and then provided to the machine learning model 1514. In at least one implementation, the out-of-distribution prediction engine 1540 receives documents 1508 as input data to the machine learning model 1514 and generates, as an output of the machine learning model 1514, an out-of-distribution prediction 1516. The terms “documents” and “document” may be used interchangeably in the present disclosure, where the scope of the implementation can include “one or more documents.”
In at least one implementation, the user 1502 of this environment 1500 includes but is not limited to client users of the out-of-distribution prediction engine 1540. In at least one implementation, the user 1502 may be an individual, a computing system, an executing software application, a computing service, a computing resource, or other entity capable of controlling input to and receiving output from the out-of-distribution prediction engine 1540. The user 1502 may have access to a set of user records and/or a profile with the out-of-distribution prediction engine 1540 and may have a set of credentials (e.g., username, password, etc.) registered with the out-of-distribution prediction engine 1540. In at least one implementation, user 1502 presents, or otherwise proves, the possession of security credentials, such as by inputting a password, access key, and/or digital signature, to gain access to out-of-distribution prediction. In at least one implementation, the user 1502 creates, using a user device or other computing device, an account with the out-of-distribution prediction engine 1540. In at least one implementation, user 1502 uploads documents 1508 to the out-of-distribution prediction engine 1540, causing the machine learning model 1514 to generate a prediction 1516 of whether the documents 1508 are in-distribution or out-of-distribution. For example, the machine learning model expects a specific type of data when it is being trained to perform operations. In at least one implementation, if a user 1502 uploads a document that is an “unexpected” document (e.g., a driver's license when the model is being trained to distinguish passports from national identity documents (IDs)), the machine learning model 1514 may generate an out-of-distribution prediction 1516 that the unexpected document is an outlier or an unknown document to in-distribution documents.
In at least one implementation, the document system 1504 includes a training data store 1518 and document data store 1520. In at least one implementation, the document system 1504 is a repository providing non-transitory and persistent (non-volatile) storage for data objects. Examples of data stores include file systems, relational databases, non-relational databases, object-oriented databases, comma-delimited files, and other files. In some implementations, the document system 1504 is a distributed data store. In at least one implementation, the training data store 1518 may store training data 1506 and information related to in-distribution data and out-of-distribution data. In at least one implementation, the document data store 1520 may store documents 1508 and information related to user documents (e.g., IDs, passports, or driver's licenses).
In at least one implementation, training data 1506 may be maintained in the training data store 1518 and located, processed, and provided for use in processing by the out-of-distribution prediction engine 1540 for training the machine learning model 1514. For example, training data 1506 may include, but is not limited to, document bundles, national identification, driver's license, or passports. In at least one implementation, each page of training data 1506 may be independently processed separately from other pages. In at least one implementation, each page of training data 1506 may be processed as a whole with all pages included.
In at least one implementation, documents 1508 may be maintained in the document data store 1520 and located, processed, and provided for use in processing by the out-of-distribution prediction engine 1540, as input, to the machine learning model 1514 to perform inferencing operations. For example, documents 1508 may include, but are not limited to, document bundles, national identification, driver's license, or passports. In at least one implementation, each page of a document, such as document 1508, may be independently processed separately from other pages. In at least one implementation, each document, such as document 1508, may be processed as a whole with all pages included.
In at least one implementation, a feature extraction module 1510 may include an encoder that encodes input data to a machine learning model 1514, such as training data 1506 or documents 1508, into one or more feature vectors. In at least one implementation, an encoder of the feature extraction module 1510 encodes training data 1506 and/or document 1508 into a sentence embedding vector. In at least one implementation, a processor uses this sentence embedding vector to perform a nearest neighbor search to generate one or more neighbors. In at least one implementation, one or more neighbors is a value corresponding to a key comprising training data 1506 or documents 1508. In at least one implementation, one or more neighbors comprise plaintext data. In at least one implementation, an encoder of the feature extraction module 1510 encodes one or more neighbors into a text embedding vector. In at least one implementation, an encoder of the feature extraction module 1510 encodes one or more neighbors into a sentence embedding vector. In at least one implementation, machine learning model 1514 uses training data 1506 and/or documents 1508 to generate a prediction, such as out-of-distribution prediction 1516. In at least one implementation, a processor of a client device interfaces with an application of the out-of-distribution prediction engine 1540 using a machine learning (ML) model application programming interface(s) (API(s)). In at least one implementation, the processor accesses the machine learning model 1514 using the machine learning model application programming interface(s) (API(s)).
In at least one implementation, the pre-processing module 1512 may be a computing system, software, software program, hardware device, module, or component capable of performing the masking of training data 1506 and/or input data, such as documents 1508, to generate masked training data and/or masked input data, respectively. In at least one implementation, the masked training data is provided to the machine learning model 1514 to perform training operations of the machine learning model 1514, and the masked input data is provided to the machine learning model 1514 to perform inferencing operations associated with classifications and predictions of whether documents 1508 are out-of-distribution, such as out-of-distribution prediction 1516.
In at least one implementation, parts, methods, and/or systems described in connection with FIG. 15 are as further illustrated nonexclusively in any of FIGS. 15-23.
FIG. 16 illustrates an example of a classification system of an out-of-distribution prediction engine of the data management engine of FIG. 2, in accordance with an implementation. As illustrated in FIG. 16, the example 1600 includes a classification system 1640, such as the out-of-distribution prediction engine 1540, that receives documents 1608 (including documents #1-4) and makes a prediction, such as an out-of-distribution prediction 1616, with a machine learning model, such as machine learning model 1514 in FIG. 15. In at least one implementation, if the out-of-distribution prediction 1616 of a document is an unexpected prediction, for example, document #4 of the documents 1608 is unknown in the in-distribution documents, this document may be sent for manual review.
In at least one implementation, the classification system 1640 generates a classification of a document. For example, the classification system may be used to distinguish between national identifications (IDs) and a passport. In at least one implementation, if the classification system 1640 receives documents 1608, from a user of the system, such as user 1502 in FIG. 15, the classification system 1640 may classify the documents as a passport or an ID and obtain an associated confidence score with that decision. In at least one implementation, a processor of the classification system 1640 performs operations to compare the confidence score to a threshold value. In at least one implementation, the threshold value is determined by using training data, such as training data 1506 in FIG. 15.
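One hypothetical way to derive the threshold value from training data and apply it is sketched below: the threshold is taken as a low quantile of the confidence scores the classifier produced on in-distribution training documents, and any prediction scoring below it is routed to manual review. The quantile rule, the function names, and the `"manual_review"` label are assumptions for illustration and are not specified by the disclosure.

```python
def calibrate_threshold(training_confidences, quantile=0.2):
    """Pick the confidence value at the given quantile of training scores."""
    ordered = sorted(training_confidences)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]


def route(label, confidence, threshold):
    """Accept the predicted class, or defer when confidence is too low."""
    return label if confidence >= threshold else "manual_review"
```

A lower quantile accepts more documents automatically at the cost of admitting more out-of-distribution documents, so the choice reflects how expensive a manual review is relative to a misclassification.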
In at least one implementation, the classification system 1640 generates a prediction of the classification of the documents 1608. In at least one implementation, the classification system 1640 is an automated classification library that enables multi-class classification 1622. In at least one implementation, the automated classification library is data agnostic. In at least one implementation, the classification system 1640 classifies documents 1608 by simultaneously performing image patch and text token masking during the training of a machine learning model, such as machine learning model 1514 in FIG. 15. In at least one implementation, as a result of simultaneous image patch and text token masking during training, the machine learning model may learn the majority of important features for each class. In at least one implementation, the prediction may be expected or unexpected. In at least one implementation, if the prediction is expected, the document is consistent with the in-domain data. In at least one implementation, if the prediction is unexpected, the document is consistent with the out-of-domain data and may be sent out for manual review. In at least one implementation, the classification system 1640 may cause a user of the system, such as user 1502 in FIG. 15, to perform a manual review of the unexpected document or outlier.
In at least one implementation, parts, methods, and/or systems described in connection with FIG. 16 are as further illustrated nonexclusively in any of FIGS. 15-23.
FIG. 17 illustrates an example 1700 of visual token mask masking 1760, in accordance with an implementation. In at least one implementation, this visual token masking includes in-distribution class one 1706A, in-distribution class two 1706B, out-of-distribution document 1706C, and out-of-distribution document 1706D that are used to train a machine learning model to distinguish between an in-domain document and an out-of-distribution document (or outlier document). Each of the in-distribution class one 1706A, the in-distribution class two 1706B, the out-of-distribution document 1706C, and the out-of-distribution document 1706D includes various shapes (e.g., an oval, a square, and a triangle) that represent features (e.g., tokens) of documents, such as training data 1506 and/or documents 1508 in FIG. 15, that are to be translated into dense vector embeddings for training the machine learning model.
In at least one implementation, an out-of-distribution prediction engine may translate each of the features of the in-distribution class one 1706A and the features of the in-distribution class two 1706B into a dense vector that is used to train a machine learning model. In at least one implementation, in-distribution class one 1706A represents a document including features that correspond to a classification of a document that is in-domain or alternatively known as in-distribution. As an example, this classification may identify a document as a passport. In at least one implementation, in-distribution class two 1706B represents a document including features that correspond to a different classification of another document that is in-domain. In this example, this different classification may identify a document as a national identification.
In at least one implementation, the out-of-distribution prediction engine may translate each of the features of the out-of-distribution document 1706C and the features of the out-of-distribution document 1706D into a dense vector that is used to train a machine learning model. In at least one implementation, the out-of-distribution document 1706C represents a document including features that correspond to a document that is out-of-distribution. As an example, the out-of-distribution document 1706C may be used as input to a machine learning model that outputs a prediction that this out-of-distribution document 1706C is not in-domain. In at least one implementation, out-of-distribution document 1706D represents another document including a different set of features that correspond to a document that is out-of-distribution.
In at least one implementation, the in-distribution class one 1706A and the in-distribution class two 1706B represent documents of in-domain data. For example, in-domain data may be data that a machine learning model is being trained to classify (e.g., passports versus a national identity document). In at least one implementation, the out-of-distribution document 1706C and out-of-distribution document 1706D represent a “foreign” or unknown document relative to the in-domain documents that the machine learning model is being trained to classify. In at least one implementation, as a result of the masking, the machine learning model may be more robust at identifying in-domain documents (e.g., in-distribution class one 1706A and the in-distribution class two 1706B). For example, the machine learning model is able to classify documents as in-domain or in-distribution when they have more features in common with the original in-distribution documents used to train the model than with the original out-of-distribution documents used to train the model.
In at least one implementation, a processor of the out-of-distribution prediction engine masks image data during training to make the machine learning model more robust to a variety of features, such as described above. In at least one implementation, the processor masks image data of input data (e.g., a passport or national identity document) during inferencing.
In at least one implementation, parts, methods, and/or systems described in connection with FIG. 17 are as further illustrated nonexclusively in any of FIGS. 15-23.
FIG. 18 illustrates an example 1800 of visual patch mask masking 1860, in accordance with an implementation. In at least one implementation, this visual patch mask masking includes in-distribution class one 1806A, in-distribution class two 1806B, and out-of-distribution document 1806C that are used to train a machine learning model to distinguish between an in-domain document and an out-of-distribution document or outlier document. Each of the in-distribution class one 1806A, the in-distribution class two 1806B, and the out-of-distribution document 1806C includes various shapes that represent features (e.g., tokens) of documents, and some of the shapes are overlaid with a “patch” to mask or omit the corresponding features from those features to be used for training the machine learning model. In at least one implementation, each feature map pixel may be a token. In at least one implementation, the patch that overlays one or more features of a training document or document to be classified is a computer-generated geometric shape. In at least one implementation, the computer-generated shape obfuscates one or more features of a training document or document to be classified by the machine learning model. In at least one implementation, the system translates the features into dense vector embeddings for training the machine learning model, the features lacking those that were omitted by using the patch mask masking.
In at least one implementation, the out-of-distribution prediction engine may translate each of the features of the in-distribution class one 1806A and the features of the in-distribution class two 1806B into a dense vector that is used to train the machine learning model. In at least one implementation, the system uses masking of features in training documents (and documents for inferencing, not shown in FIG. 18) to increase the distance between the learned dense embeddings of out-of-distribution data and those of in-distribution data. In at least one implementation, masking the features that resemble an oval and an equilateral triangle in out-of-distribution document 1806C results in in-distribution classes and out-of-distribution documents that do not share any features in common. In at least one implementation, the system omits or masks features in documents for training machine learning models to create more robust trained machine learning models. In at least one implementation, in-distribution class one 1806A represents a document including features that correspond to a classification of a document that is in-domain. In at least one implementation, in-distribution class two 1806B represents a document including features that correspond to a different classification of another document that is in-domain.
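As a non-limiting illustration, the patch masking described above can be sketched in Python. The sketch assumes the document image is a small grid of pixel values and that the patch is a square placed at a pseudorandom location; the function name `apply_patch_mask` and its parameters are hypothetical and are not taken from the disclosure.

```python
import random


def apply_patch_mask(image, patch_size, seed=None, fill=0):
    """Overlay one square patch on a copy of `image` (a list of pixel rows).

    Assumption: a single square patch at a pseudorandom position; the
    disclosure only requires a computer-generated geometric shape.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    # Pick the top-left corner so the patch fits inside the image.
    top = rng.randrange(height - patch_size + 1)
    left = rng.randrange(width - patch_size + 1)
    masked = [row[:] for row in image]  # copy so the original is unchanged
    for r in range(top, top + patch_size):
        for c in range(left, left + patch_size):
            masked[r][c] = fill  # obfuscate the covered feature
    return masked, (top, left)


image = [[1] * 6 for _ in range(4)]
masked, (top, left) = apply_patch_mask(image, patch_size=2, seed=0)
```

In this sketch, the masked copy rather than the original image would be passed to feature extraction, so the covered features never reach the dense vector embedding.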
Not shown in FIGS. 17 and 18 is token “text” masking. For example, the features (e.g., shapes) may represent tokens from a random sentence to be used in an array. In at least one implementation, token text masking may implement feature extraction and feature masking to train a machine learning model to distinguish in-domain documents from out-of-domain documents. In at least one implementation, the system performs image patch masking and text token masking simultaneously during training of the machine learning model. The simultaneous patch and text token masking allows for more separation in the extracted dense vectors between the in-domain and out-of-distribution data, as out-of-distribution data is dissimilar to the in-domain data and thus has fewer relevant features. In at least one implementation, token text masking comprises attention masking to inform the machine learning model which tokens are padding and which tokens are to be processed.
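As a non-limiting illustration, attention masking of padded text tokens can be sketched as follows. The padding symbol `[PAD]`, the function name `pad_and_mask`, and the 1/0 encoding of the mask are assumptions made for this sketch and are not specified by the disclosure.

```python
PAD = "[PAD]"  # hypothetical padding token


def pad_and_mask(tokens, max_len):
    """Pad (or truncate) `tokens` to `max_len` and build an attention mask.

    The mask holds 1 for tokens to be processed and 0 for padding.
    """
    tokens = tokens[:max_len]
    n_pad = max_len - len(tokens)
    padded = tokens + [PAD] * n_pad
    attention_mask = [1] * len(tokens) + [0] * n_pad
    return padded, attention_mask


padded, mask = pad_and_mask(["passport", "no", "123"], max_len=5)
```

Here `mask` marks the first three positions as real tokens and the last two as padding, which is the information the model needs in order to ignore the padded positions.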
In at least one implementation, a processor of a computer system of the out-of-distribution prediction engine, such as out-of-distribution prediction engine 1540 in FIG. 15, may perform masking of image data or text image (not shown in FIG. 17 or 14). In at least one implementation, parts, methods, and/or systems described in connection with FIG. 18 are as further illustrated nonexclusively in any of FIGS. 15-23.
FIG. 19 illustrates an example 1900 of an out-of-distribution (and outlier) prediction system, in accordance with an implementation. In at least one implementation, this out-of-distribution prediction engine, which is similar to out-of-distribution prediction engine 1540 in FIG. 15, includes masked training data 1906 and masked input data 1908 that are translated into dense vector embeddings, such as dense vector training (data) 1922 and dense vector input (data) 1924, which are used to train a machine learning model 1914. In at least one implementation, the machine learning model generates a prediction 1916 of whether a document or input data is an in-domain document or an out-of-distribution document (or outlier document).
In at least one implementation, the system performs masked feature learning to train a machine learning model to detect out-of-distribution documents or outlier documents. In at least one implementation, the system extracts a set of features from a training document, such as training data 1506 in FIG. 15, to generate the masked training data 1906. As described above, the system may perform visual token masking, visual patch masking, and token text masking to perform contrastive learning techniques. For example, contrastive learning is a deep learning technique that contrasts data samples against each other to learn attributes that are common between data classifications and attributes that set a data classification apart from others (e.g., a representation of data in which similar instances are close together in a distribution space and dissimilar instances are set far apart).
In at least one implementation, as a result of performing feature masking, the system generates the masked training data 1906. In at least one implementation, the masked training data 1906 may include features from pixel image data, plaintext data, or layout data, or a combination of image, plaintext, and layout data. In at least one implementation, these features include a set of features that result from omitting some features from both in-distribution training documents and out-of-distribution documents. In at least one implementation, some features that are omitted from training material to generate the masked training data 1906 may include features that are common to both in-distribution training documents and out-of-distribution documents. For example, if some of these features that are common to both in-distribution and out-of-distribution documents were left in the training material, they may serve little purpose in learning contrasting features of various classifications of training documents.
In at least one implementation, the system translates the masked training data 1906 into dense vector training data 1922 to train the machine learning model 1914. In at least one implementation, the dense vector training data 1922 may be an array of numbers with each element having a significant value. For example, in a random sentence, each word will have a significant value represented in a dense vector and may be used to learn other words in the sentence (“neighbors”). In at least one implementation, a training document (or input document) that may include plaintext data, image data, or layout data (or combination thereof) goes through an embedding layer and is converted into this dense vector training 1922, alternatively known as a dense embedding vector. In at least one implementation, the masked training data 1906 includes features of a training document that are concatenated together to generate the dense vector training data 1922. In at least one implementation, the dense (embedding) vector training data 1922 is encoded and processed in the machine learning model 1914.
In at least one implementation, the dense vector training data 1922 may be a training forward propagation used to train the machine learning model 1914. In at least one implementation, the training forward propagation may include a storage of variables for input to the machine learning model 1914. In at least one implementation, the training forward propagation may include output of the machine learning model 1914.
In at least one implementation, the system extracts a set of features from an input document to generate the masked training data 1906. The input document is similar to documents 1508 in FIG. 15 and documents 1608 in FIG. 16. In at least one implementation, the system receives the input document to be processed by the machine learning model 1914 to generate the prediction 1916. In at least one implementation, the system translates the masked input data 1908 into dense vector input data 1924 to be used by the machine learning model 1914 to generate an inference. Here, the machine learning model 1914 generates a prediction 1916 of whether the input document is an in-distribution or out-of-distribution document. In at least one implementation, the dense vector input data 1924 is similar to the dense vector training data 1922, described above.
In at least one implementation, the prediction 1916 is an output of the machine learning model 1914. In at least one implementation, the prediction 1916 may be a classification of an input document, such as documents 1508 in FIG. 15, that the machine learning model is trained to classify. In at least one implementation, the prediction 1916 may be generated by the machine learning model 1914 by using a threshold value on model confidence scores as a decision boundary to classify an unknown document into in-domain or out-of-distribution. The confidence scores may be generated during training of the machine learning model. In at least one implementation, the prediction 1916 may be generated by calculating a distance score according to a Mahalanobis distance method, such as by calculating the distance between an extracted dense vector, such as dense vector input data 1924 of the document associated with the masked input data 1908, and classification conditional Gaussian distributions learned by the machine learning model 1914 during training. In at least one implementation, the prediction 1916 is generated by using a combination of the threshold value of the confidence scores and the distance score.
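As a non-limiting illustration, the combination of a confidence threshold and a Mahalanobis distance score can be sketched in Python. The sketch makes several assumptions that are not stated in the disclosure: each class-conditional Gaussian has a diagonal covariance (so the distance reduces to a variance-scaled Euclidean distance), a document is flagged as out-of-distribution when either check fails, and all names and numeric values (`mahalanobis_diag`, `predict`, `CLASS_STATS`, the thresholds) are hypothetical.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance assuming a diagonal covariance matrix."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def predict(embedding, confidence, class_stats, conf_threshold, dist_threshold):
    """Return (nearest_class, is_out_of_distribution).

    Assumed decision rule: flag the document when the model confidence is
    below its threshold OR the distance to the nearest class is too large.
    """
    distances = {
        name: mahalanobis_diag(embedding, mean, var)
        for name, (mean, var) in class_stats.items()
    }
    nearest = min(distances, key=distances.get)
    is_ood = confidence < conf_threshold or distances[nearest] > dist_threshold
    return nearest, is_ood


# Hypothetical per-class (mean, variance) pairs learned during training.
CLASS_STATS = {
    "passport": ([0.0, 0.0], [1.0, 1.0]),
    "national_id": ([4.0, 4.0], [1.0, 1.0]),
}

label, is_ood = predict([0.1, -0.2], 0.95, CLASS_STATS,
                        conf_threshold=0.8, dist_threshold=3.0)
```

A document flagged by this rule would be a candidate for the manual review described elsewhere in this disclosure. A full covariance matrix could be substituted where features are correlated.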
In at least one implementation, parts, methods, and/or systems described in connection with FIG. 19 are as further illustrated nonexclusively in any of FIGS. 15-23.
FIG. 20 is a flowchart illustrating an example of an out-of-distribution prediction engine that trains a machine learning model to identify whether a data object is out-of-distribution, in accordance with an implementation. Some or all of the process 2000 (or any other processes described or variations and/or combinations of those processes) may be performed by one or more computer systems configured with executable instructions and/or other data and may be implemented as executable instructions executing collectively on one or more processors. The executable instructions and/or other data may be stored on a non-transitory, computer-readable storage medium (e.g., a computer program persistently stored on magnetic, optical, or flash media). For example, some or all of process 2000 may be performed by any suitable system, such as the computing device 2300 of FIG. 23. The process 2000 includes a series of operations wherein the system is performing process 2000 to extract features from a training document, select features to mask from the training document to create masked training data, and train a machine learning model using the masked training data to detect an out-of-distribution document.
In 2002, in at least one implementation, one or more processors of the out-of-distribution prediction engine, alternatively known as a computing system or system, extract features from a training document for training a machine learning model. In at least one implementation, the features are extracted from the training document using a feature extraction module such as the feature extraction module 1510 in FIG. 15. In at least one implementation, the features may include plaintext, image, and/or layout data.
In 2004, in at least one implementation, one or more processors of the out-of-distribution prediction engine select a subset of features to omit from a training forward propagation. In at least one implementation, the one or more processors select the subset of features from the set of features extracted from the training document. In at least one implementation, the subset of features to omit or mask may be determined based on a pseudorandom process. In at least one implementation, a pseudorandom process to omit features may include masking plaintext data, input data, layout data, or a combination thereof in a stochastically distributed manner. In at least one implementation, the pseudorandom process to omit features includes pseudorandomly determining data in a training document to mask for training the machine learning model. In at least one implementation, the pseudorandom process to omit features includes pseudorandomly determining data in a document to mask that is to be classified during inferencing operations. In at least one implementation, the pseudorandom process to omit features includes pseudorandomly determining data to mask in training operations of the machine learning model and in inferencing operations of the machine learning model. In this disclosure, for example, the system masks different parts of a document in a statistically random manner so that masking performed over time results in predictions of documents with features that are expected for a given in-domain classification and remaining features are unknown, creating greater separation between in-domain and out-of-distribution data.
In some implementations, the pseudorandom process to omit features results in more robust predictions of in-domain documents by training the machine learning model with in-domain documents that have much more relevant features (for what the model is trained to predict) than out-of-distribution documents. In some implementations, the pseudorandom process to omit features includes pseudorandomly selecting features to mask that are common to in-domain and out-of-distribution documents. For example, to train a model to predict whether a document is a passport or a national identification (both in-domain classifications), the system may mask features of name and date of birth, which are features also found in a driver's license that, in this example, is out-of-distribution. This masking of common features would result in a greater separation between features remaining in “in-domain” documents and features in out-of-distribution documents that are irrelevant for passports or national identifications (e.g., a license #, a medical condition, or whether the person is registered as an organ donor).
In at least one implementation, the subset of features to omit may be determined based on selecting features of a training document or new document (e.g., input data) at a consistent (e.g., approximately the same) location in the documents. In at least one implementation, the subset of features to omit may be determined by using a percentage or number (e.g., a parameter) specified by a user, client device, computer system, hardware, or software application of the system.
In 2006, in at least one implementation, one or more processors of the out-of-distribution prediction engine train the machine learning model to produce a trained machine learning model, by using another subset of the features, from the training document, in the training forward propagation. In at least one implementation, the other subset of the features is different from the subset of features that is omitted from the training forward propagation (e.g., the other subset of features is disjoint from the omitted subset of features). In some implementations, a subset of features is disjoint from another subset of features when the two subsets have no features in common. In some implementations, the subset of features is disjoint from another subset of features if there is no “intersection” or “overlap” between the two subsets of features. For example, a set of features {1, 3, 5, 7} is disjoint from another set of features {2, 4, 6, 8}, as none of the features or elements of the two sets of features are in common. In at least one implementation, a training forward propagation includes a process of passing (“propagating”) input data through a network (e.g., neural network) and generating an output (e.g., prediction). In at least one implementation, the trained machine learning model outputs information usable to classify documents, such as documents 1508 in FIG. 15. In at least one implementation, the trained machine learning model outputs information usable to differentiate between an out-of-distribution document and an in-distribution document (alternatively known as an in-domain document). In at least one implementation, the system trains the machine learning model using a masked training document to produce a trained machine learning model.
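As a non-limiting illustration, the pseudorandom selection of an omitted subset and the disjoint subset used for training can be sketched in Python. The function name `split_features`, the `mask_fraction` parameter (standing in for the user-specified percentage), and the example feature names are assumptions made for this sketch only.

```python
import random


def split_features(features, mask_fraction, seed=None):
    """Pseudorandomly split `features` into (kept, masked) disjoint subsets.

    `masked` is omitted from the training forward propagation; `kept` is
    the other subset that is actually used for training.
    """
    rng = random.Random(seed)
    n_mask = int(len(features) * mask_fraction)
    masked_idx = set(rng.sample(range(len(features)), n_mask))
    kept = [f for i, f in enumerate(features) if i not in masked_idx]
    masked = [f for i, f in enumerate(features) if i in masked_idx]
    return kept, masked


# Hypothetical features extracted from a training document.
features = ["name", "date_of_birth", "mrz_line", "photo",
            "issuing_state", "signature", "expiry", "doc_number"]
kept, masked = split_features(features, mask_fraction=0.25, seed=7)
```

Because the split is made by index, the two subsets never overlap, which mirrors the disjointness of {1, 3, 5, 7} and {2, 4, 6, 8} in the example above. Calling the function with a different seed on each training pass would vary which features are masked over time.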
The dashed line indicates a separation in the process 2000 between training the machine learning model and using the machine learning model.
In 2008, in at least one implementation, one or more processors of the out-of-distribution prediction engine receive a document as input data to the machine learning model. In at least one implementation, a processor of the out-of-distribution prediction engine performs operations to mask at least a portion of the input data to produce masked input data. In at least one implementation, the processor performs operations to provide the masked input data to the trained machine learning model as input.
In 2010, in at least one implementation, one or more processors of the out-of-distribution prediction engine perform operations to receive a classification of the document as an output of the machine learning model. In at least one implementation, the classification is generated by the system extracting a dense vector embedding of the document, comparing it to an in-domain dense vector embedding to obtain a confidence score, and then comparing that confidence score to a threshold value of confidence scores obtained during the training of the machine learning model.
In 2012, in at least one implementation, one or more processors of the out-of-distribution prediction engine perform instructions to determine that the document is an out-of-distribution document. In at least one implementation, the processor may perform operations to cause the out-of-distribution document to be sent for manual review. In at least one implementation, the manual review may be performed by a user of the out-of-distribution prediction engine, such as user 1502 in FIG. 15, or by any entity designated as an in-domain data expert.
In at least one implementation, parts, methods, and/or systems described in connection with FIG. 20 are as further illustrated nonexclusively in any of FIGS. 15-23. Note that one or more of the operations performed in 2002-14 may be performed in various orders and combinations, including in parallel.
Note that, in the context of describing disclosed implementations, unless otherwise specified, use of expressions regarding executable instructions (also referred to as code, applications, agents, etc.) performing operations that “instructions” do not ordinarily perform unaided (e.g., transmission of data, calculations, etc.) denotes that the instructions are being executed by a machine, thereby causing the machine to perform the specified operations.
FIG. 21 illustrates a layered architecture of an AI system 2100 that can implement the ML models of the service authorization system 200 of FIG. 2, in accordance with some implementations of the present technology. Example ML models can include the models executed by the service authorization system 200, such as remediation models, anomaly detection models, and so forth. Accordingly, the AI models of the service authorization system 200 can include one or more components of the AI system 2100.
As shown, the AI system 2100 can include a set of layers, which conceptually organize elements within an example network topology for the AI system's architecture to implement a particular AI model. Generally, an AI model is a computer-executable program implemented by the AI system 2100 that analyzes data to make predictions. Information can pass through each layer of the AI system 2100 to generate outputs for the AI model. The layers can include a data layer 2102, a structure layer 2104, a model layer 2106, and an application layer 2108. The algorithm 2116 of the structure layer 2104 and the model structure 2120 and model parameters 2122 of the model layer 2106 together form an example AI model. The optimizer 2126, loss function engine 2124, and regularization engine 2128 work to refine and optimize the AI model, and the data layer 2102 provides resources and support for application of the AI model by the application layer 2108.
The data layer 2102 acts as the foundation of the AI system 2100 by preparing data for the AI model. As shown, the data layer 2102 can include two sub-layers: a hardware platform 2110 and one or more software libraries 2112. The hardware platform 2110 can be designed to perform operations for the AI model and include computing resources for storage, memory, logic and networking, such as the resources described in relation to FIGS. 1 and 23. The hardware platform 2110 can process amounts of data using one or more servers. The servers can perform backend operations such as matrix calculations, parallel calculations, machine learning (ML) training, and the like. Examples of servers used by the hardware platform 2110 include central processing units (CPUs) and graphics processing units (GPUs). CPUs are electronic circuitry designed to execute instructions for computer programs, such as arithmetic, logic, controlling, and input/output (I/O) operations, and can be implemented on integrated circuit (IC) microprocessors, such as application specific integrated circuits (ASIC). GPUs are electric circuits that were originally designed for graphics manipulation and output but may be used for AI applications due to their vast computing and memory resources. GPUs use a parallel structure that generally makes their processing more efficient than that of CPUs. In some instances, the hardware platform 2110 can include computing resources (e.g., servers, memory, etc.) offered by a cloud services provider. The hardware platform 2110 can also include computer memory for storing data about the AI model, application of the AI model, and training data for the AI model. The computer memory can be a form of random-access memory (RAM), such as dynamic RAM, static RAM, and non-volatile RAM.
The software libraries 2112 can be thought of as suites of data and programming code, including executables, used to control the computing resources of the hardware platform 2110. The programming code can include low-level primitives (e.g., fundamental language elements) that form the foundation of one or more low-level programming languages, such that servers of the hardware platform 2110 can use the low-level primitives to carry out specific operations. The low-level programming languages do not require much, if any, abstraction from a computing resource's instruction set architecture, enabling them to run quickly with a small memory footprint. Examples of software libraries 2112 that can be included in the AI system 2100 include INTEL Math Kernel Library, NVIDIA cuDNN, EIGEN, and OpenBLAS.
The structure layer 2104 can include an ML framework 2114 and an algorithm 2116. The ML framework 2114 can be thought of as an interface, library, or tool that enables users to build and deploy the AI model. The ML framework 2114 can include an open-source library, an application programming interface (API), a gradient-boosting library, an ensemble method, and/or a deep learning toolkit that work with the layers of the AI system to facilitate development of the AI model. For example, the ML framework 2114 can distribute processes for application or training of the AI model across multiple resources in the hardware platform 2110. The ML framework 2114 can also include a set of pre-built components that have the functionality to implement and train the AI model and enable users to use pre-built functions and classes to construct and train the AI model. Thus, the ML framework 2114 can be used to facilitate data engineering, development, hyperparameter tuning, testing, and training for the AI model. Examples of ML frameworks 2114 that can be used in the AI system 2100 include TENSORFLOW, PYTORCH, SCIKIT-LEARN, KERAS, LightGBM, RANDOM FOREST, and AMAZON WEB SERVICES.
The algorithm 2116 can be an organized set of computer-executable operations used to generate output data from a set of input data and can be described using pseudocode. The algorithm 2116 can include complex code that enables the computing resources to learn from new input data and create new/modified outputs based on what was learned. In some implementations, the algorithm 2116 can build the AI model through being trained while running computing resources of the hardware platform 2110. This training enables the algorithm 2116 to make predictions or decisions without being explicitly programmed to do so. Once trained, the algorithm 2116 can run at the computing resources as part of the AI model to make predictions or decisions, improve computing resource performance, or perform tasks. The algorithm 2116 can be trained using supervised learning, unsupervised learning, semi-supervised learning, and/or reinforcement learning.
Using supervised learning, the algorithm 2116 can be trained to learn patterns (e.g., map input data to output data) based on labeled training data. The training data may be labeled by an external user or operator. For instance, a user may collect a set of training data, such as by capturing data from sensors, images from a camera, outputs from a model, and the like. In an example implementation, the training data can include native-format data collected (e.g., in the form of dataset 502 of FIG. 5) from various source computing systems described in relation to FIG. 5. Furthermore, training data can include pre-processed data generated by various engines of the service authorization system 200 described in relation to FIG. 2. The user may label the training data based on one or more classes and train the AI model by inputting the training data to the algorithm 2116. The algorithm determines how to label the new data based on the labeled training data. The user can facilitate collection, labeling, and/or input via the ML framework 2114. In some instances, the user may convert the training data to a set of feature vectors for input to the algorithm 2116. Once the algorithm 2116 is trained, the user can test it on new data to determine if the algorithm 2116 is predicting accurate labels for the new data. For example, the user can use cross-validation methods to test the accuracy of the algorithm 2116 and retrain the algorithm 2116 on new training data if the results of the cross-validation are below an accuracy threshold.
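As a non-limiting illustration, the cross-validation check described above can be sketched in Python. To keep the sketch self-contained, a one-dimensional nearest-centroid classifier stands in for the algorithm; that classifier, the interleaved fold assignment, and all function names and sample values are assumptions made for this sketch and are not taken from the disclosure.

```python
def k_fold_indices(n, k):
    """Assign sample indices to k folds in an interleaved fashion."""
    return [list(range(i, n, k)) for i in range(k)]


def train_centroids(samples):
    """Learn the mean feature value per label from (feature, label) pairs."""
    sums = {}
    for x, label in samples:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def classify(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def cross_validate(samples, k):
    """Return the mean held-out accuracy over k folds."""
    accuracies = []
    for fold in k_fold_indices(len(samples), k):
        held_out = set(fold)
        train = [s for i, s in enumerate(samples) if i not in held_out]
        model = train_centroids(train)
        correct = sum(classify(model, samples[i][0]) == samples[i][1] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)


def needs_retraining(samples, k, accuracy_threshold):
    """True when cross-validated accuracy falls below the threshold."""
    return cross_validate(samples, k) < accuracy_threshold


samples = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
retrain = needs_retraining(samples, k=3, accuracy_threshold=0.9)
```

In practice, an ML framework would typically supply the fold splitting and scoring, and the threshold would be chosen for the application.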
Supervised learning can include classification and/or regression. Classification techniques involve teaching the algorithm 2116 to identify a category of new observations based on training data and are used when input data for the algorithm 2116 is discrete. Said differently, when learning through classification techniques, the algorithm 2116 receives training data labeled with categories (e.g., classes) and determines how features observed in the training data (e.g., various claim elements, policy identifiers, tokens extracted from unstructured data) relate to the categories (e.g., risk propensity categories, claim leakage propensity categories, complaint propensity categories). Once trained, the algorithm 2116 can categorize new data by analyzing the new data for features that map to the categories. Examples of classification techniques include boosting, decision tree learning, genetic programming, learning vector quantization, k-nearest neighbor (k-NN) algorithm, and statistical classification.
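As a non-limiting illustration of one of the listed classification techniques, a k-nearest neighbor classifier can be sketched in a few lines. The feature values and the category names "low" and "high" are hypothetical placeholders, not categories defined by the disclosure.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label a query point by majority vote among its k nearest labeled points."""
    nearest = sorted(training, key=lambda sample: math.dist(sample[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training data: (feature vector, category).
training = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.2), "high"), ((5.1, 4.9), "high"), ((4.8, 5.0), "high"),
]

category_a = knn_classify(training, (1.1, 1.0))
category_b = knn_classify(training, (5.0, 5.0))
```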
Regression techniques include estimating relationships between independent and dependent variables and are used when input data to the algorithm 2116 is continuous. Regression techniques can be used to train the algorithm 2116 to predict or forecast relationships between variables. To train the algorithm 2116 using regression techniques, a user can select a regression method for estimating the parameters of the model. The user collects and labels training data that is input to the algorithm 2116 such that the algorithm 2116 is trained to understand the relationship between data features and the dependent variable(s). Once trained, the algorithm 2116 can predict missing historic data or future outcomes based on input data. Examples of regression methods include linear regression, multiple linear regression, logistic regression, regression tree analysis, least squares method, and gradient descent. In an example implementation, regression techniques can be used, for example, to estimate and fill in missing data for machine learning-based pre-processing operations.
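As a non-limiting illustration, filling in a missing value with simple linear regression fitted by the least squares method can be sketched as follows. The data points and variable names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations; the value at x = 4.0 is missing.
xs = [1.0, 2.0, 3.0, 5.0]
ys = [3.0, 5.0, 7.0, 11.0]

slope, intercept = fit_line(xs, ys)
filled_value = slope * 4.0 + intercept  # estimate for the missing observation
```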
Under unsupervised learning, the algorithm 2116 learns patterns from unlabeled training data. In particular, the algorithm 2116 is trained to learn hidden patterns and insights of input data, which can be used for data exploration or for generating new data. Here, the algorithm 2116 does not have a predefined output, unlike the labels output when the algorithm 2116 is trained using supervised learning. Said another way, unsupervised learning is used to train the algorithm 2116 to find an underlying structure of a set of data, group the data according to similarities, and represent that set of data in a compressed format. The service authorization system 200 (e.g., data management engine 226) can use unsupervised learning to identify patterns in claim history (e.g., to identify particular event sequences) and so forth. In some implementations, performance of the AI models of the data management engine 226 that can use unsupervised learning is improved because the incoming dataset 502 is pre-processed and reduced, based on the relevant triggers, as described herein.
A few techniques can be used in unsupervised learning: clustering, anomaly detection, and techniques for learning latent variable models. Clustering techniques include grouping data into different clusters that include similar data, such that other clusters contain dissimilar data. For example, during clustering, data with possible similarities remain in a group that has fewer or no similarities to another group. Examples of clustering techniques include density-based methods, hierarchical-based methods, partitioning methods, and grid-based methods. In one example, the algorithm 2116 may be trained to be a k-means clustering algorithm, which partitions n observations into k clusters such that each observation belongs to the cluster with the nearest mean serving as a prototype of the cluster. Anomaly detection techniques are used to detect previously unseen rare objects or events represented in data without prior knowledge of these objects or events. Anomalies can include data that occur rarely in a set, a deviation from other observations, outliers that are inconsistent with the rest of the data, patterns that do not conform to well-defined normal behavior, and the like. When using anomaly detection techniques, the algorithm 2116 may be trained to be an Isolation Forest, local outlier factor (LOF) algorithm, or k-nearest neighbor (k-NN) algorithm. Latent variable techniques include relating observable variables to a set of latent variables. These techniques assume that the observable variables are the result of an individual's position on the latent variables and that the observable variables have nothing in common after controlling for the latent variables. Examples of latent variable techniques that may be used by the algorithm 2116 include factor analysis, item response theory, latent profile analysis, and latent class analysis.
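As a non-limiting illustration of the k-means example, the assign-then-update loop can be sketched as follows. The points, the initial centroids, and the fixed iteration count are hypothetical choices made for the sketch; practical implementations typically also test for convergence and choose initial centroids more carefully.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled observations forming two visible groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```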
The model layer 2106 implements the AI model using data from the data layer and the algorithm 2116 and ML framework 2114 from the structure layer 2104, thus enabling decision-making capabilities of the AI system 2100. The model layer 2106 includes a model structure 2120, model parameters 2122, a loss function engine 2124, an optimizer 2126, and a regularization engine 2128.
The model structure 2120 describes the architecture of the AI model of the AI system 2100. The model structure 2120 defines the complexity of the pattern/relationship that the AI model expresses. Examples of structures that can be used as the model structure 2120 include decision trees, support vector machines, regression analyses, Bayesian networks, Gaussian processes, genetic algorithms, and artificial neural networks (or, simply, neural networks). The model structure 2120 can include a number of structure layers, a number of nodes (or neurons) at each structure layer, and activation functions of each node. Each node's activation function defines how the node converts received data to output data. The structure layers may include an input layer of nodes that receive input data and an output layer of nodes that produce output data. The model structure 2120 may include one or more hidden layers of nodes between the input and output layers. The model structure 2120 can be an Artificial Neural Network (or, simply, neural network) that connects the nodes in the structured layers such that the nodes are interconnected. Examples of neural networks include Feedforward Neural Networks, convolutional neural networks (CNNs), Recurrent Neural Networks (RNNs), Autoencoders, and Generative Adversarial Networks (GANs).
The model parameters 2122 represent the relationships learned during training and can be used to make predictions and decisions based on input data. The model parameters 2122 can weight and bias the nodes and connections of the model structure 2120. For instance, when the model structure 2120 is a neural network, the model parameters 2122 can weight and bias the nodes in each layer of the neural network, such that the weights determine the strength of the nodes and the biases determine the thresholds for the activation functions of each node. The model parameters 2122, in conjunction with the activation functions of the nodes, determine how input data is transformed into desired outputs. The model parameters 2122 can be determined and/or altered during training of the algorithm 2116.
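As a non-limiting illustration of how weights, biases, and activation functions together transform input data into an output, a forward pass through a small feedforward network can be sketched as follows. The layer sizes, the parameter values, and the choice of ReLU and sigmoid activations are hypothetical.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: each node applies its activation to a weighted sum plus bias."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, params):
    """Input layer -> one hidden layer (ReLU) -> output layer (sigmoid)."""
    hidden = dense(inputs, params["w_hidden"], params["b_hidden"], relu)
    return dense(hidden, params["w_out"], params["b_out"], sigmoid)


# Hypothetical model parameters (in practice these are learned in training).
params = {
    "w_hidden": [[1.0, -1.0], [0.5, 0.5]],
    "b_hidden": [0.0, -1.0],
    "w_out": [[1.0, 1.0]],
    "b_out": [0.0],
}

hidden = dense([2.0, 1.0], params["w_hidden"], params["b_hidden"], relu)
output = forward([2.0, 1.0], params)
```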
The loss function engine 2124 can determine a loss function, which is a metric used to evaluate the AI model's performance during training. For instance, the loss function engine 2124 can measure the difference between a predicted output of the AI model and the actual output of the AI model and is used to guide optimization of the AI model during training to minimize the loss function. The loss function may be presented via the ML framework 2114, such that a user can determine whether to retrain or otherwise alter the algorithm 2116 if the loss function is over a threshold. In some instances, the algorithm 2116 can be retrained automatically if the loss function is over the threshold. Examples of loss functions include a binary cross-entropy function, hinge loss function, regression loss function (e.g., mean square error, quadratic loss, etc.), mean absolute error function, smooth mean absolute error function, log-cosh loss function, and quantile loss function.
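As a non-limiting illustration, three of the listed loss functions and the threshold comparison can be sketched as follows. The `LOSS_THRESHOLD` value and the sample targets and predictions are hypothetical.

```python
import math


def mean_squared_error(targets, predictions):
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def mean_absolute_error(targets, predictions):
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def binary_cross_entropy(targets, probabilities, eps=1e-12):
    """Targets are 0 or 1; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


LOSS_THRESHOLD = 0.5  # hypothetical threshold above which retraining is flagged

mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
bce = binary_cross_entropy([1, 0], [0.5, 0.5])
retrain = bce > LOSS_THRESHOLD
```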
The optimizer 2126 adjusts the model parameters 2122 to minimize the loss function during training of the algorithm 2116. In other words, the optimizer 2126 uses the loss function generated by the loss function engine 2124 as a guide to determine what model parameters lead to the most accurate AI model. Examples of optimizers include Gradient Descent (GD), Adaptive Gradient Algorithm (AdaGrad), Adaptive Moment Estimation (Adam), Root Mean Square Propagation (RMSprop), Radial Base Function (RBF), and Limited-memory BFGS (L-BFGS). The type of optimizer 2126 used may be determined based on the type of model structure 2120 and the size of data and the computing resources available in the data layer 2102.
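As a non-limiting illustration of the first listed optimizer, plain gradient descent minimizing a mean squared error loss for a two-parameter linear model can be sketched as follows. The learning rate, step count, and data are hypothetical.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the loss gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training data generated from y = 2x + 1.
w, b = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```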
The regularization engine 2128 executes regularization operations. Regularization is a technique that prevents over- and under-fitting of the AI model. Overfitting occurs when the algorithm 2116 is overly complex and too adapted to the training data, which can result in poor performance of the AI model. Under-fitting occurs when the algorithm 2116 is unable to recognize even basic patterns from the training data such that it cannot perform well on training data or on validation data. The optimizer 2126 can apply one or more regularization techniques to fit the algorithm 2116 to the training data properly, which helps constrain the resulting AI model and improves its ability for generalized application. Examples of regularization techniques include lasso (L1) regularization, ridge (L2) regularization, and elastic net (L1 and L2) regularization.
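As a non-limiting illustration, the L1, L2, and combined penalties can be expressed as terms added to the data loss, so that large weights increase the quantity being minimized. The function names and strength values are hypothetical.

```python
def l1_penalty(weights, strength):
    """Lasso-style penalty: proportional to the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Ridge-style penalty: proportional to the sum of squared weights."""
    return strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1_strength=0.0, l2_strength=0.0):
    """Combined L1 and L2 penalty added to the unregularized loss."""
    return (data_loss
            + l1_penalty(weights, l1_strength)
            + l2_penalty(weights, l2_strength))


weights = [1.0, -2.0, 0.5]  # hypothetical model weights
total = regularized_loss(1.0, weights, l1_strength=0.1, l2_strength=0.1)
```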
The application layer 2108 describes how the AI system 2100 is used to solve problems or perform tasks. In an example implementation, the application layer 2108 can be communicatively coupled (e.g., display application data, receive user input, and/or the like) to an interactable user interface of the service authorization system 200 of FIG. 2.
To assist in understanding the present disclosure, some concepts relevant to neural networks and machine learning (ML) are discussed herein. Generally, a neural network comprises a number of computation units (sometimes referred to as “neurons”). Each neuron receives an input value and applies a function to the input to generate an output value. The function typically includes a parameter (also referred to as a “weight”) whose value is learned through the process of training. A plurality of neurons may be organized into a neural network layer (or simply “layer”) and there may be multiple such layers in a neural network. The output of one layer may be provided as input to a subsequent layer. Thus, input to a neural network may be processed through a succession of layers until an output of the neural network is generated by a final layer. This is a simplistic discussion of neural networks and there may be more complex neural network designs that include feedback connections, skip connections, and/or other such possible connections between neurons and/or layers, which are not discussed in detail here.
A deep neural network (DNN) is a type of neural network having multiple layers and/or a large number of neurons. The term DNN may encompass any neural network having multiple layers, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), multilayer perceptrons (MLPs), Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Auto-regressive Models, among others.
DNNs are often used as ML-based models for modeling complex behaviors (e.g., human language, image recognition, object classification) in order to improve the accuracy of outputs (e.g., more accurate predictions) such as, for example, as compared with models with fewer layers. In the present disclosure, the term “ML-based model” or more simply “ML model” may be understood to refer to a DNN. Training an ML model refers to a process of learning the values of the parameters (or weights) of the neurons in the layers such that the ML model is able to model the target behavior to a desired degree of accuracy. Training typically requires the use of a training dataset, which is a set of data that is relevant to the target behavior of the ML model.
As an example, to train an ML model that is intended to model human language (also referred to as a language model), the training dataset may be a collection of text documents, referred to as a text corpus (or simply referred to as a corpus). The corpus may represent a language domain (e.g., a single language), a subject domain (e.g., scientific papers), and/or may encompass another domain or domains, be they larger or smaller than a single language or subject domain. For example, a relatively large, multilingual and non-subject-specific corpus may be created by extracting text from online webpages and/or publicly available social media posts. Training data may be annotated with ground truth labels (e.g., each data entry in the training dataset may be paired with a label), or may be unlabeled.
Training an ML model generally involves inputting into an ML model (e.g., an untrained ML model) training data to be processed by the ML model, processing the training data using the ML model, collecting the output generated by the ML model (e.g., based on the inputted training data), and comparing the output to a desired set of target values. If the training data is labeled, the desired target values may be, e.g., the ground truth labels of the training data. If the training data is unlabeled, the desired target value may be a reconstructed (or otherwise processed) version of the corresponding ML model input (e.g., in the case of an autoencoder), or can be a measure of some target observable effect on the environment (e.g., in the case of a reinforcement learning agent). The parameters of the ML model are updated based on a difference between the generated output value and the desired target value. For example, if the value outputted by the ML model is excessively high, the parameters may be adjusted so as to lower the output value in future training iterations. An objective function is a way to quantitatively represent how close the output value is to the target value. An objective function represents a quantity (or one or more quantities) to be optimized (e.g., minimize a loss or maximize a reward) in order to bring the output value as close to the target value as possible. The goal of training the ML model typically is to minimize a loss function or maximize a reward function.
The training data may be a subset of a larger data set. For example, a data set may be split into three mutually exclusive subsets: a training set, a validation (or cross-validation) set, and a testing set. The three subsets of data may be used sequentially during ML model training. For example, the training set may be first used to train one or more ML models, each ML model, e.g., having a particular architecture, having a particular training procedure, being describable by a set of model hyperparameters, and/or otherwise being varied from the other of the one or more ML models. The validation (or cross-validation) set may then be used as input data into the trained ML models to, e.g., measure the performance of the trained ML models and/or compare performance between them. Where hyperparameters are used, a new set of hyperparameters may be determined based on the measured performance of one or more of the trained ML models, and the first step of training (i.e., with the training set) may begin again on a different ML model described by the new set of determined hyperparameters. In this way, these steps may be repeated to produce a more performant trained ML model. Once such a trained ML model is obtained (e.g., after the hyperparameters have been adjusted to achieve a desired level of performance), a third step of collecting the output generated by the trained ML model applied to the third subset (the testing set) may begin. The output generated from the testing set may be compared with the corresponding desired target values to give a final assessment of the trained ML model's accuracy. Other segmentations of the larger data set and/or schemes for using the segments for training one or more ML models are possible.
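As a non-limiting illustration, splitting a data set into three mutually exclusive subsets can be sketched as follows. The 70/15/15 proportions, the fixed shuffle seed, and the integer records are hypothetical choices.

```python
import random


def split_dataset(records, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle once, then cut into training, validation, and testing subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    testing = shuffled[n_train + n_validation:]  # whatever remains
    return training, validation, testing


training_set, validation_set, testing_set = split_dataset(range(100))
```

Because the three subsets are slices of one shuffled list, no record appears in more than one subset.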
Backpropagation is an algorithm for training an ML model. Backpropagation is used to adjust (also referred to as update) the value of the parameters in the ML model, with the goal of optimizing the objective function. For example, a defined loss function is calculated by forward propagation of an input to obtain an output of the ML model and a comparison of the output value with the target value. Backpropagation calculates a gradient of the loss function with respect to the parameters of the ML model, and a gradient algorithm (e.g., gradient descent) is used to update (i.e., “learn”) the parameters to reduce the loss function. Backpropagation is performed iteratively so that the loss function is converged or minimized. Other techniques for learning the parameters of the ML model may be used. The process of updating (or learning) the parameters over many iterations is referred to as training. Training may be carried out iteratively until a convergence condition is met (e.g., a predefined maximum number of iterations has been performed, or the value outputted by the ML model is sufficiently converged with the desired target value), after which the ML model is considered to be sufficiently trained. The values of the learned parameters may then be fixed and the ML model may be deployed to generate output in real-world applications (also referred to as “inference”).
In some examples, a trained ML model may be fine-tuned, meaning that the values of the learned parameters may be adjusted slightly in order for the ML model to better model a specific task. Fine-tuning of an ML model typically involves further training the ML model on a number of data samples (which may be smaller in number/cardinality than those used to train the model initially) that closely target the specific task. For example, an ML model for generating natural language that has been trained generically on publicly available text corpora may be, e.g., fine-tuned by further training using specific training samples. The specific training samples can be used to generate language in a certain style or in a certain format. For example, the ML model can be trained to generate a blog post having a particular style and structure with a given topic.
Some concepts in ML-based language models are now discussed. It may be noted that, while the term “language model” has been commonly used to refer to a ML-based language model, there could exist non-ML language models. In the present disclosure, the term “language model” may be used as shorthand for an ML-based language model (i.e., a language model that is implemented using a neural network or other ML architecture), unless stated otherwise. For example, unless stated otherwise, the “language model” encompasses LLMs.
A language model may use a neural network (typically a DNN) to perform natural language processing (NLP) tasks. A language model may be trained to model how words relate to each other in a textual sequence, based on probabilities. A language model may contain hundreds of thousands of learned parameters or, in the case of a large language model (LLM), may contain millions or billions of learned parameters or more. As non-limiting examples, a language model can generate text, translate text, summarize text, answer questions, write code (e.g., Python, JavaScript, or other programming languages), classify text (e.g., to identify spam emails), create content for various purposes (e.g., social media content, factual content, or marketing content), or create personalized content for a particular individual or group of individuals. Language models can also be used for chatbots (e.g., virtual assistants).
In recent years, there has been interest in a type of neural network architecture, referred to as a transformer, for use as language models. For example, the Bidirectional Encoder Representations from Transformers (BERT) model, the Transformer-XL model, and the Generative Pre-trained Transformer (GPT) models are types of transformers. A transformer is a type of neural network architecture that uses self-attention mechanisms in order to generate predicted output based on input data that has some sequential meaning (i.e., the order of the input data is meaningful, which is the case for most text input). Although transformer-based language models are described herein, it should be understood that the present disclosure may be applicable to any ML-based language model, including language models based on other neural network architectures such as recurrent neural network (RNN)-based language models.
FIG. 22 is a block diagram of an example transformer 2212 that can implement aspects of the present technology. A transformer is a type of neural network architecture that uses self-attention mechanisms to generate predicted output based on input data that has some sequential meaning (i.e., the order of the input data is meaningful, which is the case for most text input). Self-attention is a mechanism that relates different positions of a single sequence to compute a representation of the same sequence. Although transformer-based language models are described herein, it should be understood that the present disclosure may be applicable to any machine learning (ML)-based language model, including language models based on other neural network architectures such as recurrent neural network (RNN)-based language models.
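As a non-limiting illustration of self-attention, each position in a sequence can be re-expressed as a weighted average of all positions, with weights derived from pairwise similarity. The sketch below assumes scaled dot-product attention and, for brevity, uses the input vectors directly as queries, keys, and values; the disclosure does not specify this variant, and practical transformers apply learned projections first.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Return one output vector per input position."""
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        # Similarity of this position to every position in the same sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, sequence))
            for d in range(dim)
        ])
    return outputs


# Hypothetical three-position sequence of two-dimensional vectors.
attended = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```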
The transformer 2212 includes an encoder 2208 (which can comprise one or more encoder layers/blocks connected in series) and a decoder 2210 (which can comprise one or more decoder layers/blocks connected in series). Generally, the encoder 2208 and the decoder 2210 each include a plurality of neural network layers, at least one of which can be a self-attention layer. The parameters of the neural network layers can be referred to as the parameters of the language model.
The transformer 2212 can be trained to perform certain functions on a natural language input. For example, the functions include summarizing existing content, brainstorming ideas, writing a rough draft, fixing spelling and grammar, and translating content. Summarizing can include extracting key points from existing content into a high-level summary. Brainstorming ideas can include generating a list of ideas based on provided input. For example, the ML model can generate a list of names for a startup or costumes for an upcoming party. Writing a rough draft can include generating writing in a particular style that could be useful as a starting point for the user's writing. The style can be identified as, e.g., an email, a blog post, a social media post, or a poem. Fixing spelling and grammar can include correcting errors in an existing input text. Translating can include converting an existing input text into a variety of different languages. In some implementations, the transformer 2212 is trained to perform certain functions on input formats other than natural language input. For example, the input can include objects, images, audio content, or video content, or a combination thereof.
The transformer 2212 can be trained on a text corpus that is labeled (e.g., annotated to indicate verbs, nouns) or unlabeled. Large language models (LLMs) can be trained on a large unlabeled corpus. The term “language model,” as used herein, can include an ML-based language model (e.g., a language model that is implemented using a neural network or other ML architecture), unless stated otherwise. Some LLMs can be trained on a large multi-language, multi-domain corpus to enable the model to be versatile at a variety of language-based tasks such as generative tasks (e.g., generating human-like natural language responses to natural language input). FIG. 22 illustrates an example of how the transformer 2212 can process textual input data. Input to a language model (whether transformer-based or otherwise) typically is in the form of natural language that can be parsed into tokens. It should be appreciated that the term “token” in the context of language models and Natural Language Processing (NLP) has a different meaning from the use of the same term in other contexts such as data security. Tokenization, in the context of language models and NLP, refers to the process of parsing textual input (e.g., a character, a word, a phrase, a sentence, a paragraph) into a sequence of shorter segments that are converted to numerical representations referred to as tokens (or “compute tokens”). Typically, a token can be an integer that corresponds to the index of a text segment (e.g., a word) in a vocabulary dataset. Often, the vocabulary dataset is arranged by frequency of use. Commonly occurring text, such as punctuation, can have a lower vocabulary index in the dataset and thus be represented by a token having a smaller integer value than less commonly occurring text. Tokens frequently correspond to words, with or without white space appended. In some examples, a token can correspond to a portion of a word.
For example, the word “greater” can be represented by a token for [great] and a second token for [er]. In another example, the text sequence “write a summary” can be parsed into the segments [write], [a], and [summary], each of which can be represented by a respective numerical token. In addition to tokens that are parsed from the textual sequence (e.g., tokens that correspond to words and punctuation), there can also be special tokens to encode non-textual information. For example, a [CLASS] token can be a special token that corresponds to a classification of the textual sequence (e.g., can classify the textual sequence as a list, a paragraph), an [EOT] token can be another special token that indicates the end of the textual sequence, other tokens can provide formatting information, etc.
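As a non-limiting illustration, a frequency-ordered vocabulary and the conversion of text into integer tokens can be sketched as follows. The sketch assumes simple whitespace splitting into whole words rather than a sub-word scheme, and the corpus and function names are hypothetical; a word absent from the vocabulary would raise a `KeyError` here.

```python
from collections import Counter


def build_vocabulary(corpus, special_tokens=("[EOT]",)):
    """Map each text segment to an index, most frequent segments first."""
    counts = Counter(word for text in corpus for word in text.split())
    ordered = [word for word, _ in counts.most_common()]
    return {segment: index
            for index, segment in enumerate(ordered + list(special_tokens))}


def tokenize(text, vocabulary):
    """Convert text to integer tokens and append the end-of-text token."""
    return [vocabulary[word] for word in text.split()] + [vocabulary["[EOT]"]]


# Hypothetical corpus used only to derive segment frequencies.
corpus = ["write a summary", "write a note", "a summary"]
vocabulary = build_vocabulary(corpus)
tokens = tokenize("write a summary", vocabulary)
```

In this sketch the most common word, "a", receives the smallest index, which mirrors the frequency-ordered vocabulary described above.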
In FIG. 22, a short sequence of tokens 2202 corresponding to the input text is illustrated as input to the transformer 2212. Tokenization of the text sequence into the tokens 2202 can be performed by some pre-processing tokenization module such as, for example, a byte-pair encoding tokenizer (the “pre” referring to the tokenization occurring prior to the processing of the tokenized input by the LLM), which is not shown in FIG. 22 for simplicity. In general, the token sequence that is inputted to the transformer 2212 can be of any length up to a maximum length defined based on the dimensions of the transformer 2212. Each token 2202 in the token sequence is converted into an embedding vector 2206 (also referred to simply as an embedding). An embedding 2206 is a learned numerical representation (such as, for example, a vector) of a token that captures some semantic meaning of the text segment represented by the token. The embedding 2206 represents the text segment corresponding to the token 2202 in a way such that embeddings corresponding to semantically related text are closer to each other in a vector space than embeddings corresponding to semantically unrelated text. For example, assuming that the words “write,” “a,” and “summary” each correspond to, respectively, a “write” token, an “a” token, and a “summary” token when tokenized, the embedding 2206 corresponding to the “write” token will be closer to another embedding corresponding to the “jot down” token in the vector space as compared to the distance between the embedding 2206 corresponding to the “write” token and another embedding corresponding to the “summary” token.
The vector space can be defined by the dimensions and values of the embedding vectors. Various techniques can be used to convert a token 2202 to an embedding 2206. For example, another trained ML model can be used to convert the token 2202 into an embedding 2206. In particular, another trained ML model can be used to convert the token 2202 into an embedding 2206 in a way that encodes additional information into the embedding 2206 (e.g., a trained ML model can encode positional information about the position of the token 2202 in the text sequence into the embedding 2206). In some examples, the numerical value of the token 2202 can be used to look up the corresponding embedding in an embedding matrix 2204 (which can be learned during training of the transformer 2212).
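As a non-limiting illustration, looking up embeddings by token value and comparing them in the vector space can be sketched as follows. The three-dimensional embedding values and the token assignments are hypothetical, and cosine similarity is assumed here as the closeness measure; other distance measures could be used.

```python
import math

# Hypothetical embedding matrix: row index = token value.
EMBEDDING_MATRIX = [
    [0.9, 0.1, 0.0],  # token 0: "write"
    [0.8, 0.2, 0.1],  # token 1: "jot down"
    [0.0, 0.1, 0.9],  # token 2: "summary"
]


def embed(token):
    """Look up the embedding for a token by using its value as a row index."""
    return EMBEDDING_MATRIX[token]


def cosine_similarity(a, b):
    """Closeness of two embeddings; values near 1.0 indicate similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


related = cosine_similarity(embed(0), embed(1))    # "write" vs "jot down"
unrelated = cosine_similarity(embed(0), embed(2))  # "write" vs "summary"
```

With these illustrative values, the semantically related pair scores higher than the unrelated pair.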
The generated embeddings 2206 are input into the encoder 2208. The encoder 2208 serves to encode the embeddings 2206 into feature vectors 2214 that represent the latent features of the embeddings 2206. The encoder 2208 can encode positional information (i.e., information about the sequence of the input) in the feature vectors 2214. The feature vectors 2214 can have very high dimensionality (e.g., on the order of thousands or tens of thousands), with each element in a feature vector 2214 corresponding to a respective feature. The numerical weight of each element in a feature vector 2214 represents the importance of the corresponding feature. The space of all possible feature vectors 2214 that can be generated by the encoder 2208 can be referred to as the latent space or feature space.
Conceptually, the decoder 2210 is designed to map the features represented by the feature vectors 2214 into meaningful output, which can depend on the task that was assigned to the transformer 2212. For example, if the transformer 2212 is used for a translation task, the decoder 2210 can map the feature vectors 2214 into text output in a target language different from the language of the original tokens 2202. Generally, in a generative language model, the decoder 2210 serves to decode the feature vectors 2214 into a sequence of tokens. The decoder 2210 can generate output tokens 2216 one by one. Each output token 2216 can be fed back as input to the decoder 2210 in order to generate the next output token 2216. By feeding back the generated output and applying self-attention, the decoder 2210 is able to generate a sequence of output tokens 2216 that has sequential meaning (e.g., the resulting output text sequence is understandable as a sentence and obeys grammatical rules). The decoder 2210 can generate output tokens 2216 until a special [EOT] token (indicating the end of the text) is generated. The resulting sequence of output tokens 2216 can then be converted to a text sequence in post-processing. For example, each output token 2216 can be an integer number that corresponds to a vocabulary index. By looking up the text segment using the vocabulary index, the text segment corresponding to each output token 2216 can be retrieved, the text segments can be concatenated together, and the final output text sequence can be obtained.
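As a non-limiting illustration, the token-by-token generation loop and the post-processing lookup can be sketched as follows. The `toy_next_token` function is a hypothetical stand-in that emits a fixed sequence and is not a trained decoder; the vocabulary and the `max_tokens` limit are likewise assumptions.

```python
from typing import Callable, List

VOCABULARY = ["[EOT]", "the", "summary", "is", "ready"]  # index = token value
EOT = 0


def toy_next_token(generated: List[int]) -> int:
    """Stand-in for a decoder: emits a fixed sentence, then the [EOT] token."""
    script = [1, 2, 3, 4, EOT]
    return script[min(len(generated), len(script) - 1)]


def generate(next_token: Callable[[List[int]], int],
             max_tokens: int = 20) -> List[int]:
    """Generate output tokens one by one until [EOT] or the length limit."""
    output: List[int] = []
    while len(output) < max_tokens:
        token = next_token(output)  # previously generated tokens are fed back
        if token == EOT:
            break
        output.append(token)
    return output


def detokenize(tokens: List[int]) -> str:
    """Look up each token's text segment and concatenate the segments."""
    return " ".join(VOCABULARY[token] for token in tokens)


output_tokens = generate(toy_next_token)
text = detokenize(output_tokens)
```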
In some examples, the input provided to the transformer 2212 includes instructions to perform a function on an existing text. The output can include, for example, a modified version of the input text and instructions to modify the text. The modification can include summarizing, translating, correcting grammar or spelling, changing the style of the input text, lengthening or shortening the text, or changing the format of the text. For example, the input can include the question “What is the weather like in Australia?” and the output can include a description of the weather in Australia.
Although a general transformer architecture for a language model and its theory of operation have been described above, this is not intended to be limiting. Existing language models include language models that are based only on the encoder of the transformer or only on the decoder of the transformer. An encoder-only language model encodes the input text sequence into feature vectors that can then be further processed by a task-specific layer (e.g., a classification layer). BERT is an example of a language model that can be considered to be an encoder-only language model. A decoder-only language model accepts embeddings as input and can use auto-regression to generate an output text sequence. Transformer-XL and GPT-type models can be language models that are considered to be decoder-only language models.
Because GPT-type language models tend to have a large number of parameters, these language models can be considered LLMs. An example of a GPT-type LLM is GPT-3. GPT-3 is a type of GPT language model that has been trained (in an unsupervised manner) on a large corpus derived from documents available to the public online. GPT-3 has a very large number of learned parameters (on the order of hundreds of billions), is able to accept a large number of tokens as input (e.g., up to 2,048 input tokens), and is able to generate a large number of tokens as output (e.g., up to 2,048 tokens). GPT-3 has been trained as a generative model, meaning that it can process input text sequences to predictively generate a meaningful output text sequence. ChatGPT is built on top of a GPT-type LLM and has been fine-tuned with training datasets based on text-based chats (e.g., chatbot conversations). ChatGPT is designed for processing natural language, receiving chat-like inputs, and generating chat-like outputs.
A computer system can access a remote language model (e.g., a cloud-based language model), such as ChatGPT or GPT-3, via a software interface (e.g., an API). Additionally or alternatively, such a remote language model can be accessed via a network such as, for example, the Internet. In some implementations, such as, for example, potentially in the case of a cloud-based language model, a remote language model can be hosted by a computer system that can include a plurality of cooperating (e.g., cooperating via a network) computer systems that can be in, for example, a distributed arrangement. Notably, a remote language model can employ a plurality of processors (e.g., hardware processors such as, for example, processors of cooperating computer systems). Indeed, processing of inputs by an LLM can be computationally expensive/can involve a large number of operations (e.g., many instructions can be executed/large data structures can be accessed from memory), and providing output in a required timeframe (e.g., real time or near real time) can require the use of a plurality of processors/cooperating computing devices as discussed above.
Inputs to an LLM can be referred to as a prompt, which is a natural language input that includes instructions to the LLM to generate a desired output. A computer system can generate a prompt that is provided as input to the LLM via its API. As described above, the prompt can optionally be processed or pre-processed into a token sequence prior to being provided as input to the LLM via its API. A prompt can include one or more examples of the desired output, which provides the LLM with additional information to enable the LLM to generate output according to the desired output. Additionally or alternatively, the examples included in a prompt can provide inputs (e.g., example inputs) corresponding to/as can be expected to result in the desired outputs provided. A one-shot prompt refers to a prompt that includes one example, and a few-shot prompt refers to a prompt that includes multiple examples. A prompt that includes no examples can be referred to as a zero-shot prompt.
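As an illustration of the zero-shot, one-shot, and few-shot prompts described above, the following sketch assembles a prompt from an instruction, optional examples, and a query. The function names and the "Input:/Output:" layout are assumptions made for this example; the disclosure does not prescribe a particular prompt format.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (example input, desired output)


def shot_type(examples: List[Example]) -> str:
    """Classify a prompt by the number of examples it includes."""
    if not examples:
        return "zero-shot"
    if len(examples) == 1:
        return "one-shot"
    return "few-shot"


def build_prompt(instruction: str, examples: List[Example], query: str) -> str:
    """Join the instruction, any examples, and the query into one prompt.

    The "Input:/Output:" labels are a hypothetical layout chosen for clarity.
    """
    parts = [instruction]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {query}\nOutput:")  # left open for the LLM to complete
    return "\n\n".join(parts)


examples = [("The report are late.", "The report is late.")]
prompt = build_prompt("Correct the grammar of the input text.",
                      examples,
                      "She go to work every day.")
kind = shot_type(examples)
```

The resulting string could then be tokenized and provided to an LLM through whatever interface the deployment exposes.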
FIG. 23 is a block diagram that illustrates an example of a computer system 2300 in which at least some operations described herein can be implemented. As shown, the computer system 2300 can include: one or more processors 2302, main memory 2306, non-volatile memory 2310, a network interface device 2312, a video display device 2318, an input/output device 2320, a control device 2322 (e.g., keyboard and pointing device), a drive unit 2324 that includes a machine-readable (storage) medium 2326, and a signal generation device 2330 that are communicatively connected to a bus 2316. The bus 2316 represents one or more physical buses and/or point-to-point connections that are connected by appropriate bridges, adapters, or controllers. Various common components (e.g., cache memory) are omitted from FIG. 23 for brevity. Instead, the computer system 2300 is intended to illustrate a hardware device on which components illustrated or described relative to the examples of the figures and any other components described in this specification can be implemented.
The computer system 2300 can take any suitable physical form. For example, the computing system 2300 can share a similar architecture as that of a server computer, personal computer (PC), tablet computer, mobile telephone, game console, music player, wearable electronic device, network-connected (“smart”) device (e.g., a television or home assistant device), AR/VR systems (e.g., head-mounted display), or any electronic device capable of executing a set of instructions that specify action(s) to be taken by the computing system 2300. In some implementations, the computer system 2300 can be an embedded computer system, a system-on-chip (SOC), a single-board computer system (SBC), or a distributed system such as a mesh of computer systems, or it can include one or more cloud components in one or more networks. Where appropriate, one or more computer systems 2300 can perform operations in real time, in near real time, or in batch mode.
The network interface device 2312 enables the computing system 2300 to mediate data in a network 2314 with an entity that is external to the computing system 2300 through any communication protocol supported by the computing system 2300 and the external entity. Examples of the network interface device 2312 include a network adapter card, a wireless network interface card, a router, an access point, a wireless router, a switch, a multilayer switch, a protocol converter, a gateway, a bridge, a bridge router, a hub, a digital media receiver, and/or a repeater, as well as all wireless elements noted herein.
The memory (e.g., main memory 2306, non-volatile memory 2310, machine-readable medium 2326) can be local, remote, or distributed. Although shown as a single medium, the machine-readable medium 2326 can include multiple media (e.g., a centralized/distributed database and/or associated caches and servers) that store one or more sets of instructions 2328. The machine-readable medium 2326 can include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the computing system 2300. The machine-readable medium 2326 can be non-transitory or comprise a non-transitory device. In this context, a non-transitory storage medium can include a device that is tangible, meaning that the device has a concrete physical form, although the device can change its physical state. Thus, for example, non-transitory refers to a device remaining tangible despite this change in state.
Although implementations have been described in the context of fully functioning computing devices, the various examples are capable of being distributed as a program product in a variety of forms. Examples of machine-readable storage media, machine-readable media, or computer-readable media include recordable-type media such as volatile and non-volatile memory 2310, removable flash memory, hard disk drives, optical disks, and transmission-type media such as digital and analog communication links.
In general, the routines executed to implement examples herein can be implemented as part of an operating system or a specific application, component, program, object, module, or sequence of instructions (collectively referred to as “computer programs”). The computer programs typically comprise one or more instructions (e.g., instructions 2304, 2308, 2328) set at various times in various memory and storage devices in computing device(s). When read and executed by the processor 2302, the instruction(s) cause the computing system 2300 to perform operations to execute elements involving the various aspects of the disclosure.
The terms “example,” “embodiment,” and “implementation” are used interchangeably. For example, references to “one example” or “an example” in the disclosure can be, but not necessarily are, references to the same implementation; and such references mean at least one of the implementations. The appearances of the phrase “in one example” are not necessarily all referring to the same example, nor are separate or alternative examples mutually exclusive of other examples. A feature, structure, or characteristic described in connection with an example can be included in another example of the disclosure. Moreover, various features are described that can be exhibited by some examples and not by others. Similarly, various requirements are described that can be requirements for some examples but not for other examples.
The terminology used herein should be interpreted in its broadest reasonable manner, even though it is being used in conjunction with certain specific examples of the invention. The terms used in the disclosure generally have their ordinary meanings in the relevant technical art, within the context of the disclosure, and in the specific context where each term is used. A recital of alternative language or synonyms does not exclude the use of other synonyms. Special significance should not be placed upon whether or not a term is elaborated or discussed herein. The use of highlighting has no influence on the scope and meaning of a term. Further, it will be appreciated that the same thing can be said in more than one way.
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense—that is to say, in the sense of “including, but not limited to.” As used herein, the terms “connected,” “coupled,” and any variants thereof mean any connection or coupling, either direct or indirect, between two or more elements; the coupling or connection between the elements can be physical, logical, or a combination thereof. Additionally, the words “herein,” “above,” “below,” and words of similar import can refer to this application as a whole and not to any specific portions of this application. Where context permits, words in the above Detailed Description using the singular or plural number may also include the plural or singular number, respectively. The word “or” in reference to a list of two or more items covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list. The term “module” refers broadly to software components, firmware components, and/or hardware components.
While specific examples of technology are described above for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. For example, while processes or blocks are presented in a given order, alternative implementations can perform routines having steps, or employ systems having blocks, in a different order, and some processes or blocks may be deleted, moved, added, subdivided, combined, and/or modified to provide alternative or sub-combinations. Each of these processes or blocks can be implemented in a variety of different ways. Also, while processes or blocks are at times shown as being performed in series, these processes or blocks can instead be performed or implemented in parallel, or can be performed at different times. Further, any specific numbers noted herein are only examples such that alternative implementations can employ differing values or ranges.
Details of the disclosed implementations can vary considerably in specific implementations while still being encompassed by the disclosed teachings. As noted above, particular terminology used when describing features or aspects of the invention should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the invention with which that terminology is associated. In general, the terms used in the following claims should not be construed to limit the invention to the specific examples disclosed herein, unless the above Detailed Description explicitly defines such terms. Accordingly, the actual scope of the invention encompasses not only the disclosed examples but also all equivalent ways of practicing or implementing the invention under the claims. Some alternative implementations can include additional elements to those implementations described above or include fewer elements.
Any patents and applications and other references noted above, and any that may be listed in accompanying filing papers, are incorporated herein by reference in their entireties, except for any subject matter disclaimers or disavowals, and except to the extent that the incorporated material is inconsistent with the express disclosure herein, in which case the language in this disclosure controls. Aspects of the invention can be modified to employ the systems, functions, and concepts of the various references described above to provide yet further implementations of the invention.
To reduce the number of claims, certain implementations are presented below in certain claim forms, but the applicant contemplates various aspects of an invention in other forms. For example, aspects of a claim can be recited in a means-plus-function form or in other forms, such as being embodied in a computer-readable medium. A claim intended to be interpreted as a means-plus-function claim will use the words “means for.” However, the use of the term “for” in any other context is not intended to invoke a similar interpretation. The applicant reserves the right to pursue such additional claim forms either in this application or in a continuing application.
December 12, 2025
April 16, 2026