US-20260140855-A1

Automated Agent Testing and Evaluation Using Large Language Models

Published: May 21, 2026
Assignee: not available in USPTO data we have
Technical Abstract

A system may utilize a large language model (LLM). The system may train and test agents via an LLM by using the LLM to generate a set of scenarios and a set of training data and then training the agents using the training data and executing the agents accordingly. The system may also perform evaluations of data within a data table via LLMs and evaluations of prompt templates using a set of data from the data table. In one example, the system may receive a set of instructions for data evaluation, evaluate the set of data via the LLM, and obtain a set of output data from the LLM. In another example, the system may receive a prompt template for the LLM, execute the prompt template via the LLM, obtain a set of output data from the LLM, and evaluate the performance of the prompt template via the LLM.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for agent testing generation via a large language model (LLM), comprising: receiving, from a first user via a first user interface, a selection of one or more agents of a plurality of agents to perform a first task, the plurality of agents being associated with one or more artificial intelligence or machine learning (AI/ML) models to execute natural language queries, wherein each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models; generating, via the LLM, a plurality of scenarios for testing the one or more agents, the plurality of scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based at least in part on the set of parameters of each agent; generating, via the LLM and based at least in part on generation of the plurality of scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based at least in part on the set of parameters of the one or more agents and the plurality of scenarios; training the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM; and executing the one or more agents to perform the first task based at least in part on training the one or more AI/ML models associated with the one or more agents.

2. The method of claim 1, further comprising: obtaining, in response to receiving the selection of the one or more agents, the set of parameters for each agent of the one or more agents, wherein generating the plurality of scenarios is based at least in part on obtaining the set of parameters for the one or more agents.

3. The method of claim 1, wherein receiving the selection of the one or more agents comprises: receiving, via the selection of the one or more agents, an indication of the first task for the one or more agents to perform, wherein executing the one or more agents to perform the first task is based at least in part on receiving the indication of the first task.

4. The method of claim 1, wherein training the one or more agents comprises: obtaining, from the one or more AI/ML models associated with the one or more agents, a set of output data in response to the set of training data being provided as input to the one or more AI/ML models, the set of output data being associated with the first task; and evaluating the set of output data to obtain a performance metric indication of the one or more agents.

5. The method of claim 4, wherein the set of output data is evaluated via code-based testing, via the LLM, via user feedback, or any combination thereof.

6. The method of claim 1, wherein the one or more agents are executed sequentially to perform the first task.

7. A method for data evaluation via large language models (LLMs), comprising: obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table comprising a plurality of fields and the plurality of fields comprising a plurality of data records from the set of data; receiving a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions comprising an indication of one or more fields of the plurality of fields and an output format; evaluating, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions; and obtaining, from the LLM and based at least in part on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation.

8. The method of claim 7, further comprising: obtaining, from one or more user inputs or from the LLM, an evaluation of the set of output data based at least in part on the set of output data from the LLM.

9. The method of claim 7, wherein obtaining the set of output data comprises: displaying, via the first user interface, the set of output data within an output field of the plurality of fields of the interactive data table.

10. The method of claim 7, wherein obtaining the set of data comprises: obtaining, via the first user interface, the set of data from a first user, from the LLM, or both.

11. The method of claim 10, wherein obtaining the set of data from the first user comprises: obtaining, from the first user and via the first user interface, the set of data via one or more user inputs comprising the set of data, via an indication of a file comprising the set of data, or a combination thereof.

12. The method of claim 10, wherein obtaining the set of data from the LLM comprises: transmitting, to the LLM, a query to generate the set of data, the query comprising one or more LLM prompt parameters for the LLM to utilize to generate the set of data.

13. The method of claim 7, wherein the set of data comprises text data, a single select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof.

14. The method of claim 7, wherein the set of output data comprises textual data, numerical data, single tag data, multiple tag data, currency data, percentage data, or any combination thereof based at least in part on the output format indicated via the set of instructions for the evaluation of the set of data.

15. The method of claim 7, wherein the plurality of fields of the interactive data table are associated with a set of columns of the interactive data table.

16. A method for large language model (LLM) prompt evaluation, comprising: obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table comprising a plurality of fields and the plurality of fields comprising a plurality of data records from the set of data; receiving a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template comprising a set of instructions for the LLM to evaluate the set of data, wherein the set of instructions comprises an indication of one or more fields of the plurality of fields of the interactive data table; executing, via the LLM and based at least in part on receiving the prompt template, the set of instructions of the prompt template using the set of data of within the interactive data table to obtain a set of output data; and evaluating, via the LLM and in response to executing the prompt template, the set of output data according to obtain one or more indications of a performance of the prompt template.

17. The method of claim 16, wherein the one or more indications of the performance of the prompt template comprise a coherency indication, a relevance indication, an accuracy indication, or any combination thereof.

18. The method of claim 16, wherein executing the set of instructions of the prompt template comprises: obtaining, from the set of data within the interactive data table, a subset of data associated with the one or more fields indicated via the set of instructions; and transmitting, to the LLM, the set of instructions of the prompt template with the subset of data included within the set of instructions in place of the indication of the one or more fields, wherein the set of output data is obtained based at least in part on the subset of data being included within the set of instructions.

19. The method of claim 16, wherein executing the set of instructions of the prompt template comprises: displaying, via the first user interface, the set of output data within an output field of the plurality of fields of the interactive data table.

20. The method of claim 16, wherein the set of data comprises text data, a single select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof.

Detailed Description

Complete technical specification and implementation details from the patent document.

The present Application for Patent claims the benefit of U.S. Provisional Ser. No. 63/722,388 by Singh et al., entitled “AUTOMATED AGENT TESTING AND EVALUATION USING LARGE LANGUAGE MODELS,” filed Nov. 19, 2024, assigned to the assignee hereof, and expressly incorporated by reference herein.

The present disclosure relates generally to database systems and data processing, and more specifically to automated agent testing and evaluation using large language models (LLMs).

A cloud platform (i.e., a computing platform for cloud computing) may be employed by multiple users to store, manage, and process data using a shared network of remote servers. Users may develop applications on the cloud platform to handle the storage, management, and processing of data. In some cases, the cloud platform may utilize a multi-tenant database system. Users may access the cloud platform using various user devices (e.g., desktop computers, laptops, smartphones, tablets, or other computing systems, etc.).

In one example, the cloud platform may support customer relationship management (CRM) solutions. This may include support for sales, service, marketing, community, analytics, applications, and the Internet of Things. A user may utilize the cloud platform to help manage contacts of the user. For example, managing contacts of the user may include analyzing data, storing and preparing communications, and tracking opportunities and sales.

In some examples, users or organizations may use a relatively large quantity of agents to perform a relatively large quantity of tasks, including tasks that involve multiple agents interacting with each other. However, training and testing agents to perform tasks may be relatively complex. For example, users may have to manually generate training data and testing scenarios, which may be relatively time-consuming. Further, as the agents may be updated and retrained, having a user perform such updating and (re)training may be relatively time-consuming, resource-intensive, computationally expensive, and inefficient, thus resulting in relatively unreliable agents. Moreover, training and testing agents using relatively large sets of data, different prompts, or different models may be relatively time-consuming, computationally expensive, and complex. Additionally, or alternatively, data tables may be used for evaluating data, and users may be restricted in how they input data within data tables, use the input data, and evaluate the input data, resulting in relatively inefficient data evaluation procedures.

In some examples, users may use generative artificial intelligence (AI) systems to perform tasks such as content creation. In some cases, generative AI models such as large language models (LLMs) may be a subset of AI models and machine learning (ML) models (e.g., AI/ML models) that utilize algorithms and models to create content, such as text, images, audio, and video, by learning patterns from existing data. In some cases, users may also use agents (e.g., AI agents or non-human identities (NHIs)), which may be autonomous software entities designed to perform specific tasks or actions by processing and interpreting data. In some examples, an agent may use one or more AI/ML models (e.g., LLMs) to execute natural language queries and interact with users or systems. In some cases, users or organizations may use a relatively large quantity of agents to perform a relatively large quantity of tasks, including tasks that involve multiple agents interacting with each other. However, training and testing agents to perform tasks may be relatively complex. For example, users may have to manually generate training data and testing scenarios, which may be relatively time-consuming. Further, as the agents may be updated and retrained, having a user perform such updating and (re)training may be relatively time-consuming, resource-intensive, computationally expensive, and inefficient, thus resulting in relatively unreliable agents. Additionally, or alternatively, training and testing agents using relatively large sets of data, different prompts, or different models may be relatively time-consuming, computationally expensive, and complex. These technical challenges create specific computational inefficiencies, including exponential increases in training time as agent complexity grows, memory bottlenecks when processing large multi-agent interaction datasets, and system failures when agents encounter edge cases not covered by manually generated scenarios. Traditional approaches fail to provide the technical scalability and reliability required for enterprise-level agent deployment.

To enhance the training and utilization of agents, the techniques of the present disclosure may enable the generation of testing scenarios and training data for agents and the utilization of an interactive data table. For example, in accordance with the techniques of the present disclosure, an agent testing generation service that is associated with an LLM may receive a selection of one or more agents to perform a first task, and each agent may be associated with a set of parameters for performing one or more actions via AI/ML models. The LLM may then generate a set of scenarios for testing the one or more agents that correspond to the one or more actions each agent is instructed to perform. Moreover, the LLM may generate a set of training data, based on generating the set of scenarios, for training the one or more agents to perform the first task. In response, the agent testing generation service may train the AI/ML models of the agents using the set of training data generated by the LLM and execute the agents to perform the first task based on the training. The set of parameters may include: (i) agent configuration data comprising model type identifiers, temperature settings, token limits, and API endpoint specifications; (ii) behavioral parameters defining response patterns, interaction protocols, and decision tree structures; (iii) task-specific metadata including input/output schemas, data validation rules, and performance thresholds; (iv) execution parameters such as timeout values, retry logic, error handling procedures, and resource allocation limits; and (v) inter-agent communication protocols defining data exchange formats, message routing algorithms, and synchronization mechanisms that ensure deterministic multi-agent interactions.
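
By way of a non-limiting illustration, the flow of generating scenarios, generating training data, training, and executing may be sketched in Python as follows. Every name here (AgentParameters, Agent, generate_scenarios, generate_training_data, train_and_execute, fake_llm) is hypothetical and does not appear in the disclosure. The LLM is replaced by a deterministic stub, only a subset of the parameter groups is modeled, and "training" is reduced to storing scenario and expected-output pairs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: prompt in, text out.
LLM = Callable[[str], str]


@dataclass
class AgentParameters:
    """Illustrative subset of the parameter groups listed in the text."""
    model_type: str
    temperature: float
    token_limit: int
    actions: List[str]
    timeout_seconds: float = 30.0
    max_retries: int = 2


@dataclass
class Agent:
    name: str
    params: AgentParameters
    # "Training" is reduced here to memorizing input -> expected output pairs.
    learned: Dict[str, str] = field(default_factory=dict)

    def run(self, task_input: str) -> str:
        return self.learned.get(task_input, "unknown")


def generate_scenarios(llm: LLM, agent: Agent) -> List[str]:
    """Ask the LLM for one test scenario per action the agent may perform."""
    return [llm(f"scenario for action: {a}") for a in agent.params.actions]


def generate_training_data(llm: LLM, scenarios: List[str]) -> Dict[str, str]:
    """Ask the LLM for an expected output for each generated scenario."""
    return {s: llm(f"expected output for: {s}") for s in scenarios}


def train_and_execute(llm: LLM, agents: List[Agent], task_input: str) -> List[str]:
    """Generate scenarios and training data per agent, train, then execute."""
    results = []
    for agent in agents:
        scenarios = generate_scenarios(llm, agent)
        agent.learned.update(generate_training_data(llm, scenarios))
        results.append(agent.run(task_input))
    return results


def fake_llm(prompt: str) -> str:
    """Deterministic stub so the sketch runs without any model."""
    return f"<{prompt}>"
```

In a real deployment the stub would be replaced by calls to an actual model and the training step by fine-tuning or a comparable procedure; the sketch only shows the order of operations.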

Further, in accordance with the techniques of the present disclosure, an LLM service may utilize an interactive data table for data evaluation and LLM prompt evaluation to ensure reliability and accuracy of the agents. For example, for data evaluation, the LLM service may obtain a set of data within the interactive data table and receive a set of instructions for evaluation of the set of data within the interactive data table by an LLM. In response, the LLM may use the set of instructions to evaluate at least a subset of the set of data to obtain a set of output data that is displayed within the interactive data table. Further, for prompt evaluation, the LLM service may receive a prompt template that includes a set of instructions for the LLM to evaluate the set of data obtained from the interactive data table. The LLM service may then execute the prompt template to obtain a set of output data and use the LLM to evaluate the set of output data to obtain an indication of a performance of the prompt template. Therefore, the techniques of the present disclosure may enable relatively more reliable and efficient techniques for training agents, utilizing agents to evaluate data and perform actions, and evaluating the performance of prompts for agents, thereby improving the performance of agents and increasing the efficiency and reliability of utilizing agents within a system.
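
As a minimal sketch of the two evaluations described above, the following Python example evaluates selected fields of a table with a stand-in LLM, writes the result to an output field, and then scores the outputs with a stand-in judge. The function names (evaluate_table, score_template, stub_llm, stub_judge), the dictionary-based table, and the 0-to-1 rating scale are assumptions made for illustration only.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, text out


def evaluate_table(llm: LLM, table: List[Record], fields: List[str],
                   instruction: str, output_field: str) -> List[Record]:
    """Evaluate only the indicated fields of each record and store the
    LLM output in an output field of the same row."""
    evaluated = []
    for record in table:
        subset = {name: record[name] for name in fields}
        prompt = f"{instruction}\n{subset}"
        evaluated.append({**record, output_field: llm(prompt)})
    return evaluated


def score_template(judge: LLM, outputs: List[str]) -> float:
    """Use a (stub) LLM judge to rate each output, then average the ratings."""
    ratings = [float(judge(f"Rate from 0 to 1: {text}")) for text in outputs]
    return sum(ratings) / len(ratings) if ratings else 0.0


def stub_llm(prompt: str) -> str:
    """Toy classifier: tags a row 'negative' if the prompt mentions 'late'."""
    return "negative" if "late" in prompt else "positive"


def stub_judge(prompt: str) -> str:
    """Toy judge: always returns the same rating."""
    return "0.5"
```

The output format here is a single tag per row; other formats named in the text (numerical, currency, percentage) would only change how the LLM response is parsed before it is stored.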

In some examples, the generation of the training data and testing scenarios for the selected agents may be for training the selected agents to perform the first task, where each of the agents performs actions sequentially. For example, the first task may involve the selected agents interacting with each other. That is, an output of a first agent may be given as an input to a second agent to perform a respective task. In such examples, training the agents to perform one or more actions and to perform actions as part of a task that involves multiple agents may be relatively complex. Thus, the techniques of the present disclosure may reduce the complexity of training agents for tasks that involve interactions between multiple agents. Further, the set of data within the interactive data table may be obtained from user inputs, from an imported file that includes the set of data, or from an LLM. For example, the LLM service may enable users to generate data for testing agents on-demand via an LLM, thus improving the agent testing capabilities. Additionally, or alternatively, users may use prompt templates to perform actions on relatively large sets of data and then evaluate the output of an LLM using the prompt templates to determine the performance of the prompt template. Based on the performance of the prompt template, a user or system may further determine whether a prompt template should be updated or changed. Therefore, the techniques of the present disclosure may enable automatic agent testing and training, sequential agent execution, and utilization of an interactive data table for data evaluation and prompt evaluation to improve the performance, accuracy, efficiency, and reliability of agents within a system.
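
The sequential execution described above, in which the output of a first agent is given as the input to a second agent, may be sketched as a simple fold over a list of agents. The agent interface (a function from text to text) and the toy agents summarizer and tagger are hypothetical and serve only to make the example runnable.

```python
from functools import reduce
from typing import Callable, List

AgentFn = Callable[[str], str]  # hypothetical agent interface: text in, text out


def run_sequentially(agents: List[AgentFn], task_input: str) -> str:
    """Feed the task input to the first agent, then pass each agent's
    output to the next agent in order."""
    return reduce(lambda data, agent: agent(data), agents, task_input)


def summarizer(text: str) -> str:
    """Toy first agent: keeps only the first sentence."""
    return text.split(".")[0].strip()


def tagger(text: str) -> str:
    """Toy second agent: prefixes a label to whatever it receives."""
    return f"[summary] {text}"
```

For example, run_sequentially([summarizer, tagger], "Order delayed. Customer upset.") passes the summarizer's output to the tagger and returns "[summary] Order delayed".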

Aspects of the disclosure are initially described in the context of an environment supporting an on-demand database service. Additional aspects of the disclosure are described with reference to a computing system, a flowchart, user interfaces, and process flow. Aspects of the disclosure are further illustrated by and described with reference to apparatus diagrams, system diagrams, and flowcharts that relate to automated agent testing and evaluation using LLMs.

FIG. 1 illustrates an example of a system 100 for cloud computing that supports automated agent testing and evaluation using LLMs in accordance with various aspects of the present disclosure. The system 100 includes cloud clients 105, contacts 110, cloud platform 115, and data center 120. Cloud platform 115 may be an example of a public or private cloud network. A cloud client 105 may access cloud platform 115 over network connection 135. The network may implement transmission control protocol and internet protocol (TCP/IP), such as the Internet, or may implement other network protocols. A cloud client 105 may be an example of a user device, such as a server (e.g., cloud client 105-a), a smartphone (e.g., cloud client 105-b), or a laptop (e.g., cloud client 105-c). In other examples, a cloud client 105 may be a desktop computer, a tablet, a sensor, or another computing device or system capable of generating, analyzing, transmitting, or receiving communications. In some examples, a cloud client 105 may be operated by a user that is part of a business, an enterprise, a non-profit, a startup, or any other organization type.

A cloud client 105 may interact with multiple contacts 110. The interactions 130 may include communications, opportunities, purchases, sales, or any other interaction between a cloud client 105 and a contact 110. Data may be associated with the interactions 130. A cloud client 105 may access cloud platform 115 to store, manage, and process the data associated with the interactions 130. In some cases, the cloud client 105 may have an associated security or permission level. A cloud client 105 may have access to certain applications, data, and database information within cloud platform 115 based on the associated security or permission level and may not have access to others.

Contacts 110 may interact with the cloud client 105 in person or via phone, email, web, text messages, mail, or any other appropriate form of interaction (e.g., interactions 130-a, 130-b, 130-c, and 130-d). The interaction 130 may be a business-to-business (B2B) interaction or a business-to-consumer (B2C) interaction. A contact 110 may also be referred to as a customer, a potential customer, a lead, a client, or some other suitable terminology. In some cases, the contact 110 may be an example of a user device, such as a server (e.g., contact 110-a), a laptop (e.g., contact 110-b), a smartphone (e.g., contact 110-c), or a sensor (e.g., contact 110-d). In other cases, the contact 110 may be another computing system. In some cases, the contact 110 may be operated by a user or group of users. The user or group of users may be associated with a business, a manufacturer, or any other appropriate organization.

Cloud platform 115 may offer an on-demand database service to the cloud client 105. In some cases, cloud platform 115 may be an example of a multi-tenant database system. In this case, cloud platform 115 may serve multiple cloud clients 105 with a single instance of software. However, other types of systems may be implemented, including, but not limited to, client-server systems, mobile device systems, and mobile network systems. In some cases, cloud platform 115 may support CRM solutions. This may include support for sales, service, marketing, community, analytics, applications, and the Internet of Things. Cloud platform 115 may receive data associated with contact interactions 130 from the cloud client 105 over network connection 135, and may store and analyze the data. In some cases, cloud platform 115 may receive data directly from an interaction 130 between a contact 110 and the cloud client 105. In some cases, the cloud client 105 may develop applications to run on cloud platform 115. Cloud platform 115 may be implemented using remote servers. In some cases, the remote servers may be located at one or more data centers 120.

Data center 120 may include multiple servers. The multiple servers may be used for data storage, management, and processing. Data center 120 may receive data from cloud platform 115 via connection 140, or directly from the cloud client 105 or an interaction 130 between a contact 110 and the cloud client 105. Data center 120 may utilize multiple redundancies for security purposes. In some cases, the data stored at data center 120 may be backed up by copies of the data at a different data center (not pictured).

Subsystem 125 may include cloud clients 105, cloud platform 115, and data center 120. In some cases, data processing may occur at any of the components of subsystem 125, or at a combination of these components. In some cases, servers may perform the data processing. The servers may be a cloud client 105 or located at data center 120.

The system 100 may be an example of a multi-tenant system. For example, the system 100 may store data and provide applications, solutions, or any other functionality for multiple tenants concurrently. A tenant may be an example of a group of users (e.g., an organization) associated with a same tenant identifier (ID) who share access, privileges, or both for the system 100. The system 100 may effectively separate data and processes for a first tenant from data and processes for other tenants using a system architecture, logic, or both that support secure multi-tenancy. In some examples, the system 100 may include or be an example of a multi-tenant database system. A multi-tenant database system may store data for different tenants in a single database or a single set of databases. For example, the multi-tenant database system may store data for multiple tenants within a single table (e.g., in different rows) of a database. To support multi-tenant security, the multi-tenant database system may prohibit (e.g., restrict) a first tenant from accessing, viewing, or interacting in any way with data or rows associated with a different tenant. As such, tenant data for the first tenant may be isolated (e.g., logically isolated) from tenant data for a second tenant, and the tenant data for the first tenant may be invisible (or otherwise transparent) to the second tenant. The multi-tenant database system may additionally use encryption techniques to further protect tenant-specific data from unauthorized access (e.g., by another tenant).
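
As one possible illustration of storing data for multiple tenants in different rows of a single table while keeping the tenants logically isolated, the following Python sketch uses the standard-library sqlite3 module and scopes every read by a tenant identifier. The table name, column names, and helper functions (make_db, rows_for_tenant) are assumptions for illustration; the disclosure does not specify a schema or an access-control mechanism.

```python
import sqlite3
from typing import List


def make_db() -> sqlite3.Connection:
    """Create an in-memory table holding rows for several tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (tenant_id TEXT NOT NULL, payload TEXT)")
    conn.executemany(
        "INSERT INTO records (tenant_id, payload) VALUES (?, ?)",
        [("tenant_a", "a-1"), ("tenant_a", "a-2"), ("tenant_b", "b-1")],
    )
    return conn


def rows_for_tenant(conn: sqlite3.Connection, tenant_id: str) -> List[str]:
    """Return only the rows that belong to the requesting tenant.
    The tenant ID is bound as a query parameter, never interpolated."""
    cursor = conn.execute(
        "SELECT payload FROM records WHERE tenant_id = ? ORDER BY payload",
        (tenant_id,),
    )
    return [row[0] for row in cursor.fetchall()]
```

A production system would typically enforce this filter below the application layer (for example, through database-level policies) and combine it with encryption, as the text notes.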

Additionally, or alternatively, the multi-tenant system may support multi-tenancy for software applications and infrastructure. In some cases, the multi-tenant system may maintain a single instance of a software application and architecture supporting the software application in order to serve multiple different tenants (e.g., organizations, customers). For example, multiple tenants may share the same software application, the same underlying architecture, the same resources (e.g., compute resources, memory resources), the same database, the same servers or cloud-based resources, or any combination thereof. For example, the system 100 may run a single instance of software on a processing device (e.g., a server, server cluster, virtual machine) to serve multiple tenants. Such a multi-tenant system may provide for efficient integrations (e.g., using application programming interfaces (APIs)) by applying the integrations to the same software application and underlying architectures supporting multiple tenants. In some cases, processing resources, memory resources, or both may be shared by multiple tenants.

As described herein, the system 100 may support any configuration for providing multi-tenant functionality. For example, the system 100 may organize resources (e.g., processing resources, memory resources) to support tenant isolation (e.g., tenant-specific resources), tenant isolation within a shared resource (e.g., within a single instance of a resource), tenant-specific resources in a resource group, tenant-specific resource groups corresponding to a same subscription, tenant-specific subscriptions, or any combination thereof. The system 100 may support scaling of tenants within the multi-tenant system, for example, using scale triggers, automatic scaling procedures, scaling requests, or any combination thereof. In some cases, the system 100 may implement one or more scaling rules to enable relatively fair sharing of resources across tenants. For example, a tenant may have a threshold quantity of processing resources, memory resources, or both to use, which in some cases may be tied to a subscription by the tenant.
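
A per-tenant threshold of the kind mentioned above may be sketched as a simple admission check. The function name try_allocate, the use of abstract resource units, and the choice to reject (rather than queue or scale) a request that would exceed the threshold are all assumptions for illustration.

```python
from typing import Dict


def try_allocate(usage: Dict[str, int], limits: Dict[str, int],
                 tenant: str, requested: int) -> bool:
    """Grant the request only if the tenant stays within its threshold.
    Tenants without a configured limit are treated as having a limit of 0."""
    limit = limits.get(tenant, 0)
    current = usage.get(tenant, 0)
    if current + requested > limit:
        return False
    usage[tenant] = current + requested
    return True
```

In practice the limits dictionary could be derived from each tenant's subscription, and a rejected request could instead trigger one of the scaling procedures described in the text.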

In some examples, the system 100 may include a generative artificial intelligence (AI) component 145. The generative AI component 145 may be an example or a component of an LLM, such as a generative AI model. In some examples, the generative AI component 145 may additionally, or alternatively, be referred to as any of an AI, a generative AI (GAI), a GAI model, an LLM, a machine learning model, or any similar terminology. The generative AI component 145 may be a model that is trained on a corpus of input data, which may include text, images, video, audio, structured data, or any combination thereof. Such data may represent general-purpose data, domain-specific data, or any combination thereof. Further, the generative AI component 145 may be supplemented with additional training on data associated with a role, function, or generation outcome to further specialize the generative AI component 145 and increase the accuracy and relevance of information generated with the generative AI component 145.

In some examples, the cloud platform 115 may receive a query from a cloud client 105 that may include a request to produce a response (e.g., text, images, video, audio, or other information) to the query using the generative AI component 145. The cloud platform 115 may input a prompt to the generative AI component 145 that includes, or otherwise indicates, the query (or information included therein). The generative AI component 145 may generate an output (e.g., text, images, video, audio, or other information) that is responsive to the prompt. In some examples, the cloud platform 115 may modify or supplement one or more aspects of the query to increase the quality of the response. In some examples, such modification or supplementation may be referred to as grounding.
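
Grounding, as described above, may be illustrated by supplementing a query with context before it is passed to a model. The prompt layout and the helper names ground_query and answer are assumptions for illustration, and the model is represented by any function from prompt text to response text.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, text out


def ground_query(query: str, context: Dict[str, str]) -> str:
    """Build a prompt that supplements the query with known context.
    Keys are sorted so the prompt is deterministic."""
    lines = [f"{key}: {value}" for key, value in sorted(context.items())]
    return "Context:\n" + "\n".join(lines) + f"\n\nQuery: {query}"


def answer(llm: LLM, query: str, context: Dict[str, str]) -> str:
    """Send the grounded prompt, rather than the raw query, to the model."""
    return llm(ground_query(query, context))
```

The context here might come from records the platform already stores for the requesting client, which is one way the response could be made more relevant than a response to the bare query.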

The system 100 may support any configuration for the use of generative AI models. In FIG. 1, the generative AI component 145 is depicted as being located external to the subsystem 125. However, the generative AI component 145 may be hosted on the cloud platform 115, elsewhere within the subsystem 125, or outside the subsystem 125 (e.g., a publicly-hosted platform). Additionally, or alternatively, multiple generative AI components 145 may be employed to perform one or more of the actions described as being performed by a single generative AI component 145. Further, in some examples, the generative AI component 145 may communicate with one or more other elements, such as a contact 110, the data center 120, one or more other elements, or any combination thereof, to receive additional information (e.g., that may be indicated in the query or the prompt) that is to be considered for performing generative processes.

In various implementations, the models and/or modules described herein (e.g., including, but not limited to, the generative AI component 145) may be classification, predictive, generative, conversational, or another form of AI technology, such as AI model(s), agents, etc., implementing one or more forms of machine learning, a neural network, statistical modeling, deep learning, automation, natural language processing, or other similar technology. The AI technology may be included as part of a network or system comprising a hardware- or software-based framework for training, processing, fine-tuning, or performing any other implementation steps. Furthermore, the AI technology may include a hardware- or software-based framework that performs one or more functions, such as retrieving, generating, accessing, transmitting, etc. The AI technology may be implemented by a computer including a register coupled with a processor or a central processing unit (CPU).

Moreover, the AI technology may be trained or fine-tuned using supervised, unsupervised, or other AI training techniques. In various implementations, the AI technology may be trained or fine-tuned using a set of general datasets or a set of datasets directed to a particular field or task. Additionally, or alternatively, the AI technology may be intermittently updated at a set interval or in real time based on resulting output or additional data to further train the AI technology. The AI technology may offer a variety of capabilities including text, audio, image, and other content generation, translation, summarization, classification, prediction, recommendation, time-series forecasting, searching, matching, pairing, and more. These capabilities may be provided in the form of output produced by the AI technology in response to a particular prompt or other input. Furthermore, the AI technology may implement Retrieval-Augmented Generation (RAG) or other techniques after training or fine-tuning by accessing a set of documents or knowledge base directed to a particular field or website other than the training or fine-tuning data to influence the AI technology's output with the set of documents or knowledge base.

To further guide and train output of the AI technology, one or more input prompts may be provided to the AI technology for the purpose of eliciting particular responses. In various implementations, the input prompts may correspond to the particular field or task to which the AI technology is trained. Additionally, or alternatively, the AI technology may be implemented along with one or more additional AI technologies. For example, a first AI model may produce a first output, which is used as input for a second AI model to produce a second output. These AI technologies may be used in succession of one another, in parallel with one another, or a combination of both. Furthermore, the AI technologies may be merged in a variety of implementations, for example, by bagging, boosting, or stacking the AI technologies.

In some examples of the system 100, users operating on contacts 110 or cloud clients 105 may use agents associated with the generative AI component 145 to perform tasks or actions by processing and interpreting data (e.g., data from the user, the cloud platform 115, the data center 120, other cloud clients 105, other contacts 110, or any combination thereof). Further, in some cases, users or organizations may use a relatively large quantity of agents to perform a relatively large quantity of tasks, including tasks that involve multiple agents interacting with each other. However, training and testing agents to perform tasks may be relatively complex. For example, users may have to manually generate training data and testing scenarios, which may be relatively time-consuming. Further, as the agents may be expected to be updated and retrained, having a user perform such training may be relatively time-consuming, resource-intensive, computationally expensive, and inefficient, thus resulting in unreliable agents. Additionally, or alternatively, training and testing agents using large sets of data, different prompts, and different models may be relatively time-consuming, computationally expensive, and complex. Therefore, in some examples, the training and testing of agents associated with the generative AI component 145 may result in the agents being inaccurate and incapable of performing tasks for users.

For example, a user may select one or more agents to utilize for determining whether a customer should be part of a marketing campaign based on a likelihood that a user will purchase a product of an organization. In some cases, such a task may involve interaction between the one or more agents. For example, a first agent may be used to obtain information about a user, a second agent may be used to obtain data associated with how the user has historically responded or interacted with marketing campaigns, and a third agent may be used to obtain a semantic score for the user and the organization (e.g., a score indicating a user's opinions or feelings for the organization). In such cases, the third agent may use the output of the first agent and the second agent as input, the second agent may use the output of the first agent as input, and the output of the first agent, the second agent, and the third agent may be used as input to a fourth agent to generate a likelihood score indicating a likelihood that the user purchases a product. In such cases, training such agents may be relatively complex. For example, the fourth agent should be trained and tested on various different sets of data and for different scenarios. However, generating enough test cases and corresponding training data to effectively and efficiently train the agents to generate an accurate likelihood score may be relatively time-consuming and computationally expensive. Therefore, in accordance with the techniques of the present disclosure, an agent training service may use a testing generation service that is associated with the generative AI component 145 to obtain a selection of one or more agents to perform a task and then use an LLM to generate a set of scenarios for testing the one or more agents and to generate a set of training data, based on generating the set of scenarios, for training the one or more agents. 
The agent training service may then train the ML models of the agents using the set of training data generated by the LLM and execute the agents to perform the first task based on the training.
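
The four-agent dependency pattern in the marketing example can be illustrated with a minimal sketch. It is not the disclosed implementation: the agent names, the placeholder lambdas, and the scoring formula are all hypothetical, and the only library feature relied on is Python's standard `graphlib.TopologicalSorter`, which orders each agent after the agents whose outputs it consumes.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: agent name -> agents whose outputs it consumes.
dependencies = {
    "profile": set(),
    "campaign_history": {"profile"},
    "sentiment": {"profile", "campaign_history"},
    "likelihood": {"profile", "campaign_history", "sentiment"},
}

# Placeholder agents (assumptions for illustration). Each takes a user id and
# the outputs of its upstream agents, and returns its own output.
agents = {
    "profile": lambda uid, up: {"user": uid},
    "campaign_history": lambda uid, up: {"opened": 4, "sent": 10},
    "sentiment": lambda uid, up: 0.5,
    "likelihood": lambda uid, up: round(
        0.5 * up["sentiment"]
        + 0.5 * up["campaign_history"]["opened"] / up["campaign_history"]["sent"],
        2,
    ),
}


def run_agents(dependencies, agents, user_id):
    """Execute every agent after all agents it depends on have produced output."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies[name]}
        results[name] = agents[name](user_id, upstream)
    return results


results = run_agents(dependencies, agents, "user-1")
```

Because the fourth agent depends on every other agent, a test scenario for it has to supply, or generate, plausible outputs for all three upstream agents, which is one reason manual test preparation scales poorly.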

Further, in some examples of the system 100, agents may utilize data within data tables (e.g., data within the cloud platform 115, the data center 120, or both) for testing and training. In some cases, the data within the data tables may be manually inputted or imported into the data table and may be static within the data tables (e.g., the data may not change). However, utilizing static data may result in the actions performed by the agents being inaccurate as conditions change. Therefore, the techniques of the present disclosure may describe utilizing an interactive data table for testing agents associated with the generative AI component 145. For example, in accordance with the techniques of the present disclosure, an LLM service that is associated with the generative AI component 145 may utilize an interactive data table for data evaluation and LLM prompt evaluation to ensure reliability and accuracy of the agents.

The interactive data table may implement a multi-modal LLM integration architecture that enables real-time processing of heterogeneous data types. The system may employ: (i) a dynamic schema inference engine that automatically detects data types and relationships within the interactive data table; (ii) streaming data processors that handle large datasets through chunked processing and incremental updates; (iii) a context-aware prompt generation system that dynamically constructs LLM queries based on data characteristics and user instructions; (iv) a result caching and memoization layer that optimizes repeated operations on similar data; and (v) a conflict resolution system for handling concurrent user modifications and LLM updates. This multi-modal integration architecture specifically improves computer functionality by reducing memory fragmentation through optimized data chunking, minimizing API call overhead through intelligent result caching, and limiting system deadlocks through the conflict resolution mechanisms. These improvements address specific technical limitations of conventional data processing systems when handling concurrent LLM operations and large-scale data manipulation tasks.
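
As a rough illustration of items (ii) and (iv), the sketch below combines chunked processing with a cache so that repeated cell values do not trigger repeated model calls. It is an assumption-laden sketch rather than the disclosed architecture: `evaluate_cell` is a hypothetical placeholder for an LLM call, `call_count` exists only to make the caching observable, and the cache is Python's standard `functools.lru_cache`.

```python
from functools import lru_cache

call_count = 0  # counts simulated LLM calls; illustration only


def chunked(values, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


@lru_cache(maxsize=1024)
def evaluate_cell(instruction, value):
    """Hypothetical stand-in for an LLM call; cached per (instruction, value)."""
    global call_count
    call_count += 1
    return f"{instruction}:{value.upper()}"


def evaluate_column(instruction, values, chunk_size=2):
    """Process a column chunk by chunk, reusing cached results for repeated values."""
    results = []
    for chunk in chunked(values, chunk_size):
        results.extend(evaluate_cell(instruction, value) for value in chunk)
    return results


labels = evaluate_column("tag", ["a", "b", "a", "a"])
```

Here four cells are evaluated with only two simulated model calls, because the repeated value is served from the cache.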

In some examples, for data evaluation, the LLM service may evaluate at least a subset of a set of data within an interactive data table to obtain a set of output data that is displayed within the interactive data table. For example, the interactive data table may include a set of columns or fields that include data and the LLM service (e.g., the generative AI component 145) may receive a set of instructions to evaluate the data within one or more fields of the set of fields of the interactive data table. Based on receiving the instructions, the LLM service may execute the set of instructions to generate a set of output data that indicates the data evaluation, and the set of output data may be displayed within the interactive data table. The technical innovation lies in the seamless integration of structured data operations with unstructured LLM processing, enabling users to perform complex analytical operations that would traditionally require separate tools and manual data transformation steps. Moreover, utilizing the interactive data table for the data evaluation may enable users to evaluate different types of data relatively efficiently and view the evaluations for additional control.

The disclosed system provides measurable technical improvements over conventional agent training approaches. Specifically, the automated scenario generation reduces training data preparation time by eliminating manual data entry bottlenecks, while the LLM-driven approach ensures comprehensive edge case coverage that manual methods typically miss. The interactive data table architecture enables real-time data validation and processing that prevents the data inconsistency errors common in static table implementations. These technical improvements result in more reliable agent deployment with reduced system failures and improved computational efficiency in multi-agent environments.

Further, for prompt evaluation, the LLM service may receive a prompt template that includes a set of instructions for the LLM to evaluate the set of data obtained from the interactive data table. In some cases, the prompt template may be used to generate an output from a relatively large set of data. For example, the interactive data table may indicate multiple columns or fields that include data and, to perform actions such as content generation using the data within multiple fields, a user may generate the prompt template to reference the multiple fields. Thus, the prompt template may be capable of accessing the data within the multiple fields and using an LLM to evaluate the data and perform actions on the data for each row of the interactive data table. Moreover, once the content is generated by the LLM, the user may utilize the LLM to evaluate a performance of the prompt template in generating accurate and reliable content. Therefore, the techniques of the present disclosure may improve the performance of agents and increase the efficiency and reliability of utilizing agents within the system 100 by enabling relatively more reliable and efficient techniques for training agents, utilizing agents to evaluate data and perform actions, and for evaluating the performance of prompts for agents.
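
A prompt template that references several fields of a table can be sketched as follows. This is a hedged illustration, not the disclosed template syntax: the `{field}` placeholder convention, the `render_prompts` helper, and the sample column names are assumptions, and only Python's standard `string.Formatter` and `str.format` are used.

```python
import string


def render_prompts(template, rows):
    """Fill a template once per table row; fail loudly if a referenced field is absent."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion).
    field_names = [name for _, name, _, _ in string.Formatter().parse(template) if name]
    prompts = []
    for index, row in enumerate(rows):
        missing = [name for name in field_names if name not in row]
        if missing:
            raise KeyError(f"row {index} is missing fields: {missing}")
        prompts.append(template.format(**row))
    return prompts


# Hypothetical table rows and template.
rows = [
    {"name": "Ada", "product": "Notebook", "region": "EU"},
    {"name": "Bob", "product": "Pen", "region": "US"},
]
prompts = render_prompts("Write an email to {name} about {product}.", rows)
```

Each rendered prompt would then be sent to the model, and the resulting outputs could be scored to evaluate how well the template performs across rows.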

It should be appreciated by a person skilled in the art that one or more aspects of the disclosure may be implemented in a system 100 to additionally, or alternatively, solve other problems than those described above. Furthermore, aspects of the disclosure may provide technical improvements to “conventional” systems or processes as described herein. However, the description and appended drawings only include example technical improvements resulting from implementing aspects of the disclosure, and accordingly do not represent all of the technical improvements provided within the scope of the claims.

FIG. 2 shows an example of a computing system 200 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. In some examples, the computing system 200 implements or may be implemented by the system 100. For example, the computing system 200 may illustrate a computing device 205, a user interface 210, an LLM service 215, one or more agents 220 (e.g., an agent 220-a, an agent 220-b, and an agent 220-c), and a data table 225, that may be implemented by devices or services described with reference to FIG. 1.

In some examples, users may use generative AI services such as the LLM service 215 to generate content based on natural language queries (e.g., a query 230). For example, a user may transmit, via a computing device 205, the query 230 to a user interface 210 of the LLM service 215 that is a natural language query requesting the LLM service 215 to generate a set of content for the user. In response, the user may receive a query response 235 that includes the content requested by the user via the query 230. For example, the query 230 may be a request to generate a set of text for an electronic communication message (e.g., an email or text message) and the LLM service 215 may process the query 230 to generate the query response 235. In some examples, the query 230 may include various types of requests, such as data retrieval, content generation, or task execution. Further, the query response 235 may include text, images, audio, or other types of content (e.g., multi-modal content) based on the nature of the query 230. Moreover, in some cases, the LLM service 215 may transmit the query response 235 back to the user of the computing device 205 for the user to review. The query response 235 may also be used as input for further processing or actions by other components of the computing system 200.

In some cases, users may use the LLM service 215 to control one or more agents 220. The one or more agents 220 may represent autonomous software entities configured to perform specific tasks or actions by processing and interpreting data. In some examples, the one or more agents 220 may interact with or utilize one or more AI/ML models (e.g., LLMs) to execute natural language queries and interact with users or systems. Further, the one or more agents 220 may be associated with a set of parameters that indicate the behavior and actions of the one or more agents 220. Additionally, or alternatively, the one or more agents 220 may interact with other agents 220 to perform various tasks. For example, the agent 220-a may interact with the agent 220-b and the agent 220-c to perform a single task. Thus, the output of one agent 220 may serve as input for another agent 220, enabling complex interactions and task execution. For example, the agent 220-a may receive a first input (e.g., an initial input) and generate a first output which is further utilized as an input to the agent 220-b. In such cases, the agent 220-b may receive a second input from the agent 220-a where the second input corresponds to, is associated with, or is the same as the first output generated by the agent 220-a. The agent 220-b may utilize the second input to generate a second output which is further given to the agent 220-c as a third input for the agent 220-c to generate a third output (e.g., a final result to the initial input). In such cases, the first output and the second input may be the same and the second output and the third input may be the same, as the output of one agent 220 may serve as input for another agent 220.
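
The hand-off between the three agents described above, where each output becomes the next input, can be sketched as a simple sequential chain. The three agent functions below are hypothetical placeholders chosen only to make the data flow visible; real agents would call AI/ML models.

```python
from typing import Callable, List

Agent = Callable[[str], str]


def run_chain(agents: List[Agent], initial_input: str) -> List[str]:
    """Run agents in order; each agent's output is passed as the next agent's input."""
    outputs = []
    current = initial_input
    for agent in agents:
        current = agent(current)
        outputs.append(current)
    return outputs


# Hypothetical stand-ins for the first, second, and third agents.
def first_agent(text: str) -> str:
    return text.strip().lower()


def second_agent(text: str) -> str:
    return text.replace(" ", "_")


def third_agent(text: str) -> str:
    return f"result:{text}"


chain_outputs = run_chain([first_agent, second_agent, third_agent], "  Hello World ")
```

Keeping every intermediate output, rather than only the final one, makes it possible to test each agent in the chain separately.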

To enable the one or more agents 220 to perform tasks (e.g., to perform tasks individually or collectively or to perform tasks sequentially or in parallel), the one or more agents 220 may be trained and tested to perform various tasks. In some examples, a user may use a data table 225 to store data for training and testing the one or more agents 220. In some cases, users may manually enter data into the data table 225 or users may import data into the data table 225 from one or more files. As such, the one or more agents 220 may be trained and tested on the data within the data table 225. Moreover, to efficiently train and test the one or more agents 220, the one or more agents 220 may be expected to be trained and tested on a relatively large set of data.

In some cases, organizations may train and test the one or more agents 220 on relatively large sets of data to ensure that the outputs of the one or more agents 220 are accurate and reliable. For example, some organizations or industries may expect a level of accuracy and a lack of failure in order to use agents 220 to perform tasks on behalf of a user due to regulations associated with the industry. For example, the medical industry is highly regulated and may expect a level of accuracy and reliability of the one or more agents 220. However, the testing and training of the one or more agents 220 may be manual and can be relatively time-consuming. For example, a relatively large set of data may have to be generated and manually input into the data table 225 or into a file that can be imported into the data table 225 such that the one or more agents 220 can be trained and tested. Further, the testing of the one or more agents 220 may include a relatively large set of testing scenarios and the testing may be expected to change as the agents 220 are further developed (e.g., testing may have to change throughout a development lifecycle). Such testing procedures may result in the testing of the one or more agents 220 and the generation of testing scenarios being relatively complex. Additionally, or alternatively, testing (e.g., batch testing) multiple different prompts or models via the LLM service 215, multiple different agents 220, or both may be relatively complex, time-consuming, and computationally expensive.

In accordance with the techniques of the present disclosure, the LLM service 215 and the one or more agents 220 may be utilized to aid in training and testing the one or more agents 220, evaluating data via the one or more agents 220, and evaluating the performance of prompts used by an LLM for the one or more agents 220. For example, as described elsewhere herein, such as with reference to FIG. 3, the techniques of the present disclosure may describe the LLM service 215 generating testing scenarios and training data based on information associated with one or more agents 220 that are selected for a task. For example, a user may select one or more agents 220 to perform a first task and the LLM service 215 may automatically generate data for testing and training the one or more agents 220 to perform the respective task. Further, as described elsewhere herein, such as with reference to FIGS. 4 through 8, the techniques of the present disclosure may describe the data table 225 being an interactive data table to provide flexibility and efficiency in inputting data for training and testing the one or more agents or generating, via the LLM service 215, data for training and testing the one or more agents. Further, in accordance with the techniques of the present disclosure, the interactive data table and the LLM service 215 may be utilized for evaluating outputs from the one or more agents 220 utilizing the LLM service 215 and the data of the interactive data table, evaluating the performance of prompts for the one or more agents 220 utilizing the LLM service 215, or any combination thereof.

FIG. 3 shows an example of a flowchart 300 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The operations of the flowchart 300 may be implemented by an agent training service or its components as described herein. In some examples, an agent training service may execute a set of instructions to control the functional elements of the agent training service to perform the described functions. Additionally, or alternatively, the agent training service may perform aspects of the described functions using special-purpose hardware.

At 310, an LLM service 215 may receive, from a first user operating a computing device 305 and via a first user interface, a selection of one or more agents from a set of agents to perform a first task. The set of agents may be associated with one or more AI/ML models to execute natural language queries. Each agent may be associated with a set of parameters for performing one or more actions via the one or more AI/ML models. In some examples, the LLM service 215 may receive, via the selection of the one or more agents, an indication of the first task for the one or more agents to perform. Further, in response to receiving the selection of the one or more agents, the LLM service 215 may obtain the set of parameters for each agent of the one or more agents. In some examples, the set of parameters for the agents may include agent details, topics, instructions, actions, metadata, or any combination thereof associated with the agents. For example, the set of parameters may indicate identifiers associated with the agent, topics that the agent is configured for, instructions for the ML models of the agent, actions that the agent is configured to perform, metadata associated with the agent, or any combination thereof.

At 315, the LLM service 215 may generate a set of scenarios for testing the one or more agents. The set of scenarios may correspond to the one or more actions each agent of the one or more agents is instructed to perform based on the set of parameters of each agent. That is, the LLM service 215 may automatically create a testing plan for testing the one or more agents based on the parameters of the one or more agents. In some examples, the set of scenarios may be based on an agent job to be done (JBTD) or the actions that the agent is configured to perform as part of execution of the first task. Moreover, in accordance with the techniques of the present disclosure, the LLM service 215 may use an LLM to generate a list of diverse test cases to ensure that the one or more agents are tested for a relatively wide variety of different scenarios. The LLM may generate diverse test cases by: (i) analyzing agent parameter vectors to identify capability boundaries; (ii) applying combinatorial testing algorithms to generate edge case scenarios; (iii) utilizing Monte Carlo sampling techniques to support statistical coverage across parameter spaces; (iv) implementing constraint satisfaction algorithms to generate valid multi-agent interaction sequences; and/or (v) employing adversarial generation techniques to create challenging test scenarios that stress-test agent performance limits.
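
Items (ii) and (iii) can be illustrated without an LLM by enumerating and sampling a small parameter space. The sketch below is a simplified stand-in, not the disclosed generator: the parameter names and values are hypothetical, the exhaustive product is the most basic form of combinatorial testing, and the seeded random sample is a minimal Monte Carlo style selection.

```python
import itertools
import random


def combinatorial_scenarios(parameter_space):
    """Yield every combination of parameter values as a scenario dictionary."""
    names = sorted(parameter_space)
    for values in itertools.product(*(parameter_space[name] for name in names)):
        yield dict(zip(names, values))


def sample_scenarios(parameter_space, k, seed=0):
    """Draw up to k distinct scenarios at random (seeded for repeatable test plans)."""
    rng = random.Random(seed)
    all_scenarios = list(combinatorial_scenarios(parameter_space))
    return rng.sample(all_scenarios, min(k, len(all_scenarios)))


# Hypothetical agent parameters: topics, actions, and input styles to cover.
parameter_space = {
    "topic": ["billing", "returns"],
    "action": ["lookup", "update"],
    "input_style": ["terse", "verbose", "malformed"],
}

all_scenarios = list(combinatorial_scenarios(parameter_space))
sampled = sample_scenarios(parameter_space, k=5)
```

In a pipeline like the one described, each scenario dictionary could be inserted into an LLM prompt so that the generated test cases cover the parameter space rather than clustering around common inputs.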

At 320, the LLM service 215 may generate a set of training data for training the one or more agents to perform the first task. The generation of the set of training data may be based on the generation of the set of scenarios and the set of parameters of the one or more agents. In some examples, generating the set of training data may include the LLM service 215 automatically creating the test cases for the training of the one or more agents. For example, the LLM service 215 may generate a set of test scripts for training the one or more agents on the set of training data. Thus, at 325, an agent training service may train the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM service 215. At 330, in response to the one or more AI/ML models of the one or more agents being trained, the agent training service may then execute the one or more agents to perform the first task. In some examples, executing the one or more agents to perform the first task may be based on receiving the indication of the first task. Further, in some cases, the one or more agents may be executed sequentially to perform the first task. For example, an output of a first agent may be used as an input to a second agent in order to perform the first task using the one or more agents.

Further, in some examples, based on executing the one or more agents, the LLM service 215 may obtain, from the one or more AI/ML models associated with the one or more agents, a set of output data in response to the set of training data being provided as input to the one or more AI/ML models. Moreover, the set of output data may be associated with the first task. Using the set of output data, the LLM service 215 may evaluate the set of output data to obtain a performance metric indication of the one or more agents. In some cases, the set of output data may be evaluated via code-based testing, via the LLM service 215, via user feedback, or any combination thereof. For example, the set of output data may be compared to a set of expected output data, a user may rate the set of output data (e.g., a user indicates a positive or negative indication to the output data), or an LLM of the LLM service 215 may evaluate the data based on a set of instructions of an LLM prompt. In some cases, when an LLM is being used to evaluate the set of output data, a user may set an evaluation scale for the LLM to evaluate the set of output data. Further descriptions of evaluating data with an LLM associated with the LLM service 215 may be described elsewhere herein, such as with reference to FIGS. 4 through 8.
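
The code-based comparison against expected output data can be sketched as a simple match rate. This is one possible metric under stated assumptions, not the metric of the disclosure: the function name is hypothetical, and matching is case-insensitive and whitespace-insensitive by choice.

```python
def exact_match_rate(outputs, expected):
    """Return the fraction of outputs equal to the expected value, ignoring case and edge whitespace."""
    if len(outputs) != len(expected):
        raise ValueError("outputs and expected must have the same length")
    if not outputs:
        return 0.0
    matches = sum(
        out.strip().lower() == exp.strip().lower()
        for out, exp in zip(outputs, expected)
    )
    return matches / len(outputs)


# Hypothetical agent outputs and reference answers.
rate = exact_match_rate(["Yes", "no ", "maybe"], ["yes", "no", "no"])
```

Exact matching suits short categorical outputs; free-form text would more plausibly be scored by an LLM judge or by user ratings, as the paragraph above notes.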

FIG. 4 shows an example of a user interface 400 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. In some examples, the user interface 400 may implement or may be implemented by the system 100, the computing system 200, or both. For example, the user interface 400 may illustrate a user interface of an interactive data table 405 that is used for training and testing one or more agents as described elsewhere herein including with reference to FIGS. 1 and 2.

In some examples, data tables may be fundamental tools used across various industries for data management and analysis. In some cases, data tables may rely on manual data entry and users importing data from files. However, using AI models (e.g., LLMs), the techniques of the present disclosure may provide enhancements to the functionality and usability of data tables by enabling a relatively more versatile and efficient data input system that leverages the use of LLMs. As illustrated herein and in accordance with the techniques of the present disclosure, the interactive data table 405 may be configured to improve data input and data manipulation procedures. For example, the interactive data table 405 may support multiple data entry techniques such as manual user input, file imports, and data generation using an LLM, thus making data entry relatively more efficient and flexible.

For manual user inputs, users may be capable of manually entering data directly into table cells of the interactive data table 405. In some examples, the interactive data table 405 may also support or include features for data validation, auto-completion, and error correction to assist users in entering accurate and consistent data. Further, the manual user input may be facilitated by the user interface 400 of the interactive data table 405 and may ensure accessible use by users with varying levels of technical expertise. For file imports, users may import data from various different file formats (e.g., comma separated value (CSV) files, spreadsheet files, JavaScript object notation (JSON) files, and the like). In some cases, the interactive data table 405 may automatically parse the imported files and map the data to one or more fields or columns of the interactive data table 405. In some examples, the import functionality may include error handling mechanisms to manage issues such as missing data, incorrect formats, and duplicate entries. Additionally, or alternatively, users may be capable of previewing and editing the imported data within the interactive data table 405 before finalizing the import process to ensure data integrity within the interactive data table 405.
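
The import path with error handling for missing data and duplicate entries can be sketched for the CSV case. The helper name, column names, and error messages below are assumptions for illustration; the parsing itself uses Python's standard `csv.DictReader`.

```python
import csv
import io


def import_csv(text, columns):
    """Parse CSV text into rows for the given columns, collecting per-line errors."""
    reader = csv.DictReader(io.StringIO(text))
    rows, errors, seen = [], [], set()
    # The header is line 1, so data rows start at line 2.
    for line_no, record in enumerate(reader, start=2):
        row = {col: (record.get(col) or "").strip() for col in columns}
        missing = [col for col in columns if not row[col]]
        key = tuple(row[col] for col in columns)
        if missing:
            errors.append((line_no, f"missing: {', '.join(missing)}"))
        elif key in seen:
            errors.append((line_no, "duplicate row"))
        else:
            seen.add(key)
            rows.append(row)
    return rows, errors


# Hypothetical file contents with one incomplete row and one duplicate.
sample = "name,email\nAda,ada@example.com\nBob,\nAda,ada@example.com\n"
rows, errors = import_csv(sample, ["name", "email"])
```

Returning the rejected lines alongside the accepted rows supports the preview step described above, since a user can review and correct the flagged entries before finalizing the import.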

Further, to generate data utilizing LLMs, the interactive data table 405 may interact with an LLM to generate data based on user-defined parameters and prompts. For example, users may indicate a type of data to be generated (e.g., synthetic data for testing, data augmentation for AI/ML models, and the like) and may provide contextual prompts for the LLM to generate the data. Thus, the LLM may analyze the prompts and generate data that is contextually relevant and adheres to criteria indicated by a user. Moreover, the generated data may be directly inserted into the interactive data table 405 for the user, thus providing an efficient data input method for users.

In some examples, when inputting data into the interactive data table 405, the user may utilize a menu 410 to select a field from a set of fields 415 for a column of the interactive data table 405. For example, a user may select to input data for a text field, a single select field, a multi-select field, a numerical field, a currency field, a percentage field, an attachment field, a formula field, a data record field, or an evaluation field. The menu 410 may also give a user an option to pull (e.g., retrieve) data from a cloud platform 115 or from a uniform resource locator (URL) that indicates an address of a unique resource on the internet. Further, using a menu 420, a user may select an input type 425 from a set of input types. For example, a user may select a drop down menu that lists the set of input types 425 for the user to select from. If a user selects user input as the input type 425 for the respective column, the user may then be prompted by the user interface 400 to begin entering data into the interactive data table 405. If the user selects import data as the input type 425, a menu may be displayed via the user interface 400 for the user to select a file from a file manager of a computing device that the user interface 400 is displayed on and is being operated by the user. Further, if the user selects LLM as the input type 425, an LLM configuration display 430 may be displayed within the user interface 400. In some cases, the LLM configuration display 430 may be an extension of the menu 420 or the LLM configuration display 430 may be separate from the menu 420.

Within the LLM configuration display 430, the user may also be capable of selecting a model configuration 435 from a set of model configurations 435. For example, a model configuration 435 field may be displayed via the LLM configuration display 430 and the user may select a drop down arrow to display a drop down menu that lists the set of model configurations 435 that a user is capable of selecting and utilizing to generate data for the respective column. In some cases, the set of model configurations 435 displayed to the user for selection via the LLM configuration display 430 may be based on a field 415 type selected via the menu 410. Further, within the LLM configuration display 430, the user may also input a set of LLM instructions 440 within a text input field of the LLM configuration display 430 to instruct the LLM to generate the set of data for the interactive data table 405.

Therefore, by enabling the integration of the LLM for data generation, the time consumption and user effort associated with creating data and populating the interactive data table 405 may be reduced. Further, the LLM may also be used to perform automated data validation and error correction when data is input manually by users or imported via files to improve the data quality of the interactive data table 405 and reduce the quantity of manual corrections expected. Additionally, or alternatively, in accordance with the techniques of the present disclosure, a user may utilize LLM data generation in combination with manual data entry, importing data from a file, or both to improve the data population of the interactive data table 405. For example, a user may import or manually input a relatively small quantity of data into the interactive data table 405 and then use the LLM to generate a relatively larger set of data that includes the initial data. In such cases, if the interactive data table 405 is used for training and testing agents, as described elsewhere herein such as with reference to FIGS. 2 and 3, a user may be capable of creating a relatively large data set for training. Moreover, having the interactive data table 405 allow users to input data manually, import from files, or generate the data via an LLM may increase the flexibility of the interactive data table 405 and allow the user to select a data entry technique that is most convenient and efficient for the user. Further, in some cases, the user interface 400 of the interactive data table 405 may also be designed with a user-centric approach to create a user-friendly interface that ensures an ease of use and accessibility for users of different technical skill levels. 
Additionally, or alternatively, using LLMs for data generation within the interactive data table 405 may ensure that users are capable of producing relatively high quality and contextually relevant data on-demand. Further descriptions of the techniques of the present disclosure such as users generating evaluations for data within the interactive data table 405 using LLMs may be provided elsewhere herein, such as with reference to FIG. 5.

FIG. 5 shows an example of a user interface 500 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. In some examples, the user interface 500 may implement or may be implemented by the system 100, the computing system 200, the user interface 400, or any combination thereof. For example, the user interface 500 may illustrate a user interface for configuring a set of instructions for evaluating a set of data within an interactive data table as described elsewhere herein including with reference to FIGS. 1 and 2. In some cases, the user interface 500 may include a column name field 505, a field type field 510, an input type field 515, a model configuration field 520, an LLM instructions field 525, and an output option field 530.

In some examples, when using an interactive data table (e.g., an interactive data table 405 described with reference to FIG. 4), users may interact with or otherwise utilize LLMs to perform evaluations on various types of data within the interactive data table, such as text data, image data, audio data, and the like. Further, the interactive data table may enable users to input data into a spreadsheet type interface and run (e.g., execute) evaluations by generating instructions for an LLM that reference one or more fields or columns of the interactive data table. Moreover, the LLM may process the data according to a set of specifications indicated by a user and then output a result within a variety of different formats.

For example, as illustrated herein, the user interface 500 may enable a user to evaluate data within an interactive data sheet in accordance with the techniques of the present disclosure. For example, users may be capable of creating evaluations of data using LLMs by inputting data into an interactive data sheet and generating a set of instructions for the LLM to evaluate the data. In some cases, the set of instructions for the LLM to evaluate the data within the interactive data table may include an indication of one or more columns within the interactive data table for the evaluation. Further, a user may also indicate an output format of the LLM evaluation. For example, the results of the LLM evaluations may be output in various formats to provide users with the flexibility of tailoring an output to the expectations of a user. In some cases, the output formats that an LLM may support may include a text format for plain text outputs, a number format for numerical outputs, a single tag format for labeling the output with a single categorical label, a multiple tag format for labeling the output with multiple categorical labels, a currency format for financial figure outputs, a percentage format for percentage value outputs, or any combination thereof. Additionally, or alternatively, using LLMs for evaluations may enable users to perform various types of data evaluations with an LLM acting as the judge, thus enhancing the utility of interactive data sheets across different industries, organizations, applications, and the like.

In some cases, to configure the LLM to generate data evaluations, a user may configure the LLM via the user interface 500. For example, a user may indicate a column name in the column name field 505 of the user interface 500 to indicate where the set of output data from the LLM should be displayed or input within an interactive data table. Further, the user may select a field type for the indicated column via the field type field 510 and an input type via the input type field 515. Moreover, in some cases, the field type indicated via the field type field 510 of the user interface 500 may refer to an output type for the set of output data generated by the LLM and the input type field 515 may indicate that the data for the respective column (e.g., the output data) is being generated and input by an LLM. Thus, in response to selecting an LLM input type via the input type field 515, the user interface 500 may then enable the user to select a corresponding model configuration via the model configuration field 520. Moreover, as described elsewhere herein, the model configurations displayed as options for a user to select may be based on the field type of a respective column. Further, the user may also indicate a set of instructions for an LLM to evaluate a set of data within the LLM instructions field 525 and a format for the data evaluation output via the output option field 530.

In some examples, as illustrated herein, the user interface 500 may indicate a configuration for evaluating data for engagement using an LLM. For example, the set of instructions within the LLM instructions field 525 may instruct the LLM to evaluate data within a response field by referencing the response field within the set of instructions. Further, a user may then indicate an output format via the field type field 510 as a single select type such that the output option field 530 indicates two possible labels for the output data. For example, as illustrated herein, the output option field 530 may indicate an engaging tag and a not engaging tag such that the individual data items within the responses column can be evaluated by the LLM and labeled as engaging or not engaging within an engagement column of the interactive data table. In another example, the LLM may evaluate the politeness of responses within a response column of the interactive data table and may label the data as polite or impolite in a politeness column. Further, in some cases, the LLM may evaluate image data within an image column of the interactive data table. For example, in the context of a medical industry scenario, the image column may include medical x-ray images and the LLM may evaluate the images to detect items such as whether fractures are visible (e.g., via a fracture visible label or a fracture not visible label), whether tumors are visible (e.g., via a tumor visible label or a no tumor visible label), or the like.
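
As an illustration of the column configuration and output formats described above, the following is a minimal sketch and not the disclosed implementation. The names `OutputFormat`, `ColumnConfig`, and `coerce_output` are hypothetical, and the parsing rules (for example, stripping a leading dollar sign for currency) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Union


class OutputFormat(Enum):
    """Output formats named in the description (identifiers are hypothetical)."""
    TEXT = "text"
    NUMBER = "number"
    SINGLE_TAG = "single_tag"
    MULTIPLE_TAG = "multiple_tag"
    CURRENCY = "currency"
    PERCENTAGE = "percentage"


@dataclass
class ColumnConfig:
    """Mirrors the fields of the configuration interface (hypothetical record)."""
    column_name: str
    field_type: OutputFormat
    input_type: str
    model_configuration: str
    llm_instructions: str
    output_options: List[str] = field(default_factory=list)


def coerce_output(raw: str, config: ColumnConfig) -> Union[str, float, List[str]]:
    """Convert a raw LLM reply into the format declared for the column."""
    text = raw.strip()
    fmt = config.field_type
    if fmt is OutputFormat.TEXT:
        return text
    if fmt is OutputFormat.NUMBER:
        return float(text)
    if fmt is OutputFormat.CURRENCY:
        # Assumption: replies look like "$1,234.50".
        return round(float(text.replace("$", "").replace(",", "")), 2)
    if fmt is OutputFormat.PERCENTAGE:
        # Assumption: replies look like "45%"; stored as a fraction.
        return float(text.rstrip("%")) / 100.0
    # Tag formats: only labels listed in the output options are accepted.
    allowed = {option.lower(): option for option in config.output_options}
    tags = [part.strip().lower() for part in text.split(",") if part.strip()]
    unknown = [tag for tag in tags if tag not in allowed]
    if unknown or not tags:
        raise ValueError(f"reply {raw!r} is not among {config.output_options}")
    if fmt is OutputFormat.SINGLE_TAG:
        if len(tags) != 1:
            raise ValueError("single tag format expects exactly one label")
        return allowed[tags[0]]
    return [allowed[tag] for tag in tags]
```

Rejecting labels outside the configured options, rather than storing them, is one possible design choice; it keeps a single select column limited to the tags the user defined.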

Therefore, in accordance with the techniques of the present disclosure, the user interface 500 may be used for users to define and perform various types of evaluations. For example, the techniques of the present disclosure may support users using various data types for evaluation and configuring the LLM evaluations to the expectations and requirements of the user. Moreover, the techniques of the present disclosure may also enable users to generate data for multiple different output formats to ensure that the evaluation results are valuable to the user. Thus, the interactive data table that is configured for intuitive and simple use may allow users to perform data evaluations relatively more efficiently and accurately. Further descriptions of the techniques of the present disclosure may be provided elsewhere herein such as with reference to FIG. 6. Moreover, descriptions of the techniques of the present disclosure related to using prompt templates for data evaluations may be provided elsewhere herein such as with reference to FIG. 7.

FIG. 6 shows an example of a process flow 600 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. In some aspects, the process flow 600 may implement or may be implemented by the system 100, the computing system 200, the user interface 400, the user interface 500, or any combination thereof. The process flow 600 may include a computing device 605, an interactive data table 610, and an LLM 615, which may be examples of devices or services described elsewhere herein including with reference to FIGS. 1 and 2.

In the following description of the process flow 600, the operations may be performed by the computing device 605, the interactive data table 610, and the LLM 615 in different orders or at different times. Some operations may also be left out of the process flow 600, or other operations may be added. Although the process flow 600 may be described as being performed by the computing device 605, the interactive data table 610, and the LLM 615, some aspects of some operations may also be performed by other devices, services, or models described elsewhere herein including with reference to FIGS. 1 through 2.

At 620, the interactive data table 610 may obtain, from a user of the computing device 605, the LLM 615, or both, a set of data via a first user interface within the interactive data table 610. The interactive data table 610 may include a set of fields and the set of fields may include a set of data records from the set of data. In some cases, the interactive data table 610 may obtain the set of data from the first user of the computing device 605 via one or more user inputs comprising the set of data, via an indication of a file comprising the set of data, or a combination thereof. In some other cases, to obtain the set of data from the LLM 615, the interactive data table 610 may transmit, to the LLM 615, a query to generate the set of data, the query comprising one or more LLM prompt parameters for the LLM 615 to utilize to generate the set of data. Additionally, or alternatively, the set of data may include text data, single select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof. Further, the set of fields of the interactive data table may be associated with a set of columns of the interactive data table.

At 625, the interactive data table 610 may receive a set of instructions for evaluation of the set of data by the LLM 615. The set of instructions may include an indication of one or more fields of the set of fields and an output format. Thus, at 630, the LLM 615 may evaluate a subset of data of the set of data within the interactive data table 610 in accordance with the set of instructions for the evaluation. The subset of data may be associated with the one or more fields of the interactive data table indicated via the set of instructions. At 635, the interactive data table 610 may obtain, from the LLM 615 and based on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation. In some examples, the set of output data may include textual data, numerical data, single tag data, multiple tag data, currency data, percentage data, or any combination thereof based on the output format indicated via the set of instructions for the evaluation of the set of data. In some cases, the set of output data may be displayed via the first user interface of the computing device 605 within an output field of the set of fields of the interactive data table. In some other cases, the interactive data table 610 may obtain, from one or more user inputs or from the LLM 615, an evaluation of the set of output data.
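
The operations at 625 through 635 can be sketched as a loop over table rows. This is an illustrative sketch under stated assumptions: the table is modeled as a list of dictionaries, `evaluate_rows` is a hypothetical helper, and `stub_llm` is a deterministic stand-in for a real LLM call so that the example runs offline.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]


def evaluate_rows(
    rows: List[Row],
    fields: List[str],
    instructions: str,
    allowed_tags: List[str],
    output_field: str,
    llm: Callable[[str], str],
) -> List[Row]:
    """Evaluate the indicated fields of each row and store a single tag result."""
    for row in rows:
        # Only the subset of data in the indicated fields is sent to the model.
        subset = "\n".join(f"{name}: {row[name]}" for name in fields)
        prompt = (
            f"{instructions}\n"
            f"Answer with exactly one of: {', '.join(allowed_tags)}.\n"
            f"{subset}"
        )
        answer = llm(prompt).strip()
        # Replies outside the requested output format are stored as None.
        row[output_field] = answer if answer in allowed_tags else None
    return rows


def stub_llm(prompt: str) -> str:
    """Stand-in judge (assumption): inspects only the last line of the prompt."""
    last_line = prompt.splitlines()[-1].lower()
    return "Polite" if "thank" in last_line else "Impolite"


table = [
    {"Response": "Thank you for your patience."},
    {"Response": "Not my problem."},
]
evaluate_rows(table, ["Response"], "Judge the politeness of the response.",
              ["Polite", "Impolite"], "Politeness", stub_llm)
```

After the call, each row carries a `Politeness` entry, which corresponds to the output field displayed in the interactive data table.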

FIG. 7 shows an example of a user interface 700 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. In some examples, the user interface 700 may implement or may be implemented by the system 100, the computing system 200, or both. For example, the user interface 700 may illustrate a user interface of an interactive data table 705 that is used for training and testing one or more agents as described elsewhere herein including with reference to FIGS. 1 and 2.

In some examples, the interactive data table 705 may be used for data processing and natural language processing using LLMs. In some cases, the interactive data table 705 may allow for users to input data and organize data into various columns where each column can include different types of data, such as text data, numerical data, dates, and other relevant information. Further, in accordance with the techniques of the present disclosure, the interactive data table 705 may be used for batch testing prompt templates with LLMs. In some cases, to batch test prompt templates for LLMs using the interactive data table 705, users may indicate or reference merge field data of different columns of the interactive data table 705 within a prompt template. A merge field may refer to a field or column of the interactive data table 705 that is referenced within a set of instructions for an LLM with a given symbol or non-letter and non-numerical character (e.g., the ‘@’ symbol) to indicate that the column indication should be replaced with the data from the column. Further, using prompt templates with varied inputs and outputs may allow users to evaluate the effectiveness and accuracy of the prompts.

In accordance with the techniques of the present disclosure, to batch test prompt templates using the interactive data table 705, data may be input and organized into columns of the interactive data table 705. For example, a first set of data may be input into the column 710-a and a second set of data may be input into the column 710-b. A user may then generate prompt templates that reference one or more columns in the interactive data table 705 (e.g., the column 710-a and the column 710-b) as merge tags. An LLM may further process the prompt templates using the indicated data and a user or LLM may evaluate the output of the LLM responses across various data inputs to determine the performance and accuracy of the prompts. Moreover, the techniques of the present disclosure may be adaptable for a wide range of applications involving text data and other types of input data as described elsewhere herein.
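
The merge tag substitution described above can be illustrated with a short sketch. It assumes that the ‘@’ symbol marks a merge tag and that column names consist of letters, digits, and underscores; `render_template`, `batch_render`, and the sample column names are hypothetical.

```python
import re
from typing import Dict, List

# Assumption: a merge tag is '@' followed by an identifier-like column name.
MERGE_TAG = re.compile(r"@([A-Za-z_][A-Za-z0-9_]*)")


def render_template(template: str, row: Dict[str, str]) -> str:
    """Replace each merge tag with the value of the referenced column."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in row:
            raise KeyError(f"template references unknown column: {name}")
        return str(row[name])

    return MERGE_TAG.sub(substitute, template)


def batch_render(template: str, rows: List[Dict[str, str]]) -> List[str]:
    """Produce one completed prompt per row for batch testing."""
    return [render_template(template, row) for row in rows]


prompts = batch_render(
    "Summarize the order placed by @CustomerName for @ProductName.",
    [
        {"CustomerName": "Ada", "ProductName": "a desk lamp"},
        {"CustomerName": "Lin", "ProductName": "a kettle"},
    ],
)
```

A pattern this simple would also match the domain part of an e-mail address appearing in a template, so a production parser would likely need an escape convention; that detail is outside what the description specifies.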

In some examples, within a column configuration display 715, a user may configure an LLM to generate a set of data that is an output of an evaluation of the data within the column 710-a and the column 710-b of the interactive data table 705. For example, within the column configuration display 715, a user may generate a column name within a column name field 720, select a field type for the column within a field type field 725, and select an input type for the column within an input type field 730. In response to an LLM input type being selected within the input type field 730, the user may select, via a model configuration field 735, a model configuration for the LLM based on the field type indicated via the field type field 725, and the user may input a set of instructions for a prompt template within an LLM instructions field 740.

In some examples, as illustrated herein, the column 710-a may represent a column indicating a set of customer records and the column 710-b may represent a column indicating a set of product records associated with the corresponding set of customer records. Moreover, the column configuration display 715 may indicate a configuration of a prompt response column where data for the column is to be generated via a set of LLM instructions indicated via the LLM instructions field 740. Further, the set of LLM instructions may reference the column 710-a and the column 710-b as merge tags. For example, the set of LLM instructions may indicate: “Please draft an email to @CustomerName thanking them for their purchase of @ProductName” where “@CustomerName” and “@ProductName” are merge tags referencing columns in the interactive data table 705. In response, the LLM may generate the data using the prompt template and input the data (e.g., a set of output data 745) into a column 710-c that may represent a column indicating prompt responses. Further, the templates may be processed by merging the data from the indicated columns and transmitting the completed prompts to the LLM. The LLM may further generate responses based on the merged prompts, thus enabling users to perform batch testing by evaluating how the LLM responds to different inputs. Additionally, or alternatively, the LLM responses may be output back into the interactive data table 705 (e.g., into the column 710-c) to allow users to review and evaluate the performance of the prompt templates. In some examples, such evaluation may include evaluating metrics such as coherence, relevance, and accuracy of the responses. Further, in some examples, the LLM or a different LLM may be used to evaluate the performance of the prompt templates. Thus, the interactive data table 705 may be used to input and organize data for use by LLM prompt templates and the output of the LLM prompt templates can be displayed within the interactive data table 705 for further evaluation to improve the performance of the prompt templates. Further descriptions of techniques of the present disclosure may be described elsewhere herein, such as with reference to FIG. 8.

The merge field processing system may implement a template parsing and data substitution engine. The system may use: (i) a lexical analyzer that tokenizes prompt templates and identifies merge field references using configurable delimiter patterns; (ii) a dependency graph builder that analyzes inter-field relationships and determines optimal processing order; (iii) a type-safe data binding system that validates data compatibility between merge fields and target LLM input requirements; (iv) a batch optimization engine that groups similar merge operations to reduce LLM API calls; and (v) a rollback mechanism that handles partial failures in multi-field substitution operations. The lexical analyzer may employ finite state machine algorithms for efficient token recognition, while the dependency graph builder may implement topological sorting algorithms to optimize processing sequences. The type-safe data binding system may prevent runtime errors by validating data compatibility before a template is executed, and the batch optimization engine may reduce computational overhead through operation grouping and parallel processing techniques. The system may maintain referential integrity by tracking data lineage from source fields through template processing to final LLM output, supporting precise error attribution and debugging capabilities.
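
Two of the components listed above, the lexical analyzer and the dependency graph builder, can be sketched briefly. This is an assumption-laden illustration rather than the disclosed engine: `tokenize` is a hypothetical two-state scanner using ‘@’ as the default delimiter, and `processing_order` relies on the Python standard library `graphlib.TopologicalSorter` (Python 3.9 or later) as one way to perform the topological sort.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (kind, value) where kind is "TEXT" or "FIELD"


def tokenize(template: str, delimiter: str = "@") -> List[Token]:
    """Scan a template with a two-state machine: literal text or field name."""
    tokens: List[Token] = []
    buffer: List[str] = []
    in_field = False
    for ch in template:
        if in_field:
            if ch.isalnum() or ch == "_":
                buffer.append(ch)
                continue
            # The field name ended; a bare delimiter is kept as literal text.
            tokens.append(("FIELD", "".join(buffer)) if buffer else ("TEXT", delimiter))
            buffer = []
            in_field = False
        if ch == delimiter:
            if buffer:
                tokens.append(("TEXT", "".join(buffer)))
                buffer = []
            in_field = True
        else:
            buffer.append(ch)
    if in_field:
        tokens.append(("FIELD", "".join(buffer)) if buffer else ("TEXT", delimiter))
    elif buffer:
        tokens.append(("TEXT", "".join(buffer)))
    return tokens


def processing_order(column_templates: Dict[str, str]) -> List[str]:
    """Order LLM-generated columns so each runs after the columns it references.

    Keys are generated columns; referenced names that are not keys are treated
    as plain input columns. Raises CycleError for circular references.
    """
    graph = {
        column: {value for kind, value in tokenize(template) if kind == "FIELD"}
        for column, template in column_templates.items()
    }
    ordered = TopologicalSorter(graph).static_order()
    return [column for column in ordered if column in column_templates]


order = processing_order({
    "Tone": "Rate the tone of @Email",
    "Email": "Write to @CustomerName about @ProductName",
})
```

Here `order` places `Email` before `Tone`, because the `Tone` template references the `Email` column. Circular references surface as `CycleError`, which is one place where the rollback and error attribution behavior mentioned above could attach.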

FIG. 8 shows an example of a process flow 800 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. In some aspects, the process flow 800 may implement or may be implemented by the system 100, the computing system 200, the user interface 700, or any combination thereof. The process flow 800 may include a computing device 805, an interactive data table 810, and an LLM 815, which may be examples of devices or services described elsewhere herein including with reference to FIGS. 1 and 2.

In the following description of the process flow 800, the operations may be performed by the computing device 805, the interactive data table 810, and the LLM 815 in different orders or at different times. Some operations may also be left out of the process flow 800, or other operations may be added. Although the process flow 800 may be described as being performed by the computing device 805, the interactive data table 810, and the LLM 815, some aspects of some operations may also be performed by other devices, services, or models described elsewhere herein including with reference to FIGS. 1 through 2.

At 820, the interactive data table 810 may obtain, from a user of the computing device 805, the LLM 815, or both, a set of data via a first user interface within the interactive data table 810. The interactive data table 810 may include a set of fields and the set of fields may include a set of data records from the set of data. In some examples, the set of data may include text data, single select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof.

At 825, the interactive data table 810 may receive, from the computing device 805, a prompt template for the LLM 815 that is associated with evaluation of the set of data obtained via the first user interface. The prompt template may include a set of instructions for the LLM to evaluate the set of data. The set of instructions may include an indication of one or more fields of the set of fields of the interactive data table.

At 830, the interactive data table 810 may execute, via the LLM 815 and based on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data. In some cases, to execute the prompt template, the interactive data table 810 may obtain, from the set of data within the interactive data table 810, a subset of data associated with the one or more fields indicated via the set of instructions. Further, the interactive data table 810 may transmit, to the LLM 815, the set of instructions of the prompt template with the subset of data included within the set of instructions in place of the indication of the one or more fields. Moreover, in some examples, the interactive data table 810 may obtain the set of output data based on the subset of data being included within the set of instructions. In some other cases, the set of output data may be displayed via the first user interface within an output field of the set of fields of the interactive data table 810. Further, at 835, the interactive data table 810 may evaluate, via the LLM 815, the set of output data in response to executing the prompt template to obtain one or more indications of a performance of the prompt template. In some examples, the one or more indications of the performance of the prompt template may include a coherency indication, a relevance indication, an accuracy indication, or any combination thereof.
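
The operations at 830 and 835 can be sketched as two passes over the table: one that executes the prompt template per row and one that scores the outputs. This is a minimal sketch under stated assumptions: the helper names, the ‘@’ merge tag syntax, the 0.0 to 1.0 score scale, and both stub callables are hypothetical stand-ins for real LLM calls.

```python
import re
from statistics import mean
from typing import Callable, Dict, List

METRICS = ("coherence", "relevance", "accuracy")


def run_prompt_template(
    rows: List[Dict[str, str]],
    template: str,
    output_field: str,
    llm: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Execute the template once per row and store the reply in the output field."""
    for row in rows:
        # Field indications are replaced by the row's data before transmission.
        prompt = re.sub(r"@(\w+)", lambda m: str(row[m.group(1)]), template)
        row[output_field] = llm(prompt)
    return rows


def evaluate_template(
    rows: List[Dict[str, str]],
    output_field: str,
    judge: Callable[[str], Dict[str, float]],
) -> Dict[str, float]:
    """Average per-row judge scores into one performance indication per metric."""
    scores = [judge(row[output_field]) for row in rows]
    return {metric: mean(score[metric] for score in scores) for metric in METRICS}


def stub_llm(prompt: str) -> str:
    """Stand-in generator (assumption): echoes the completed prompt."""
    return f"DRAFT: {prompt}"


def stub_judge(output: str) -> Dict[str, float]:
    """Stand-in LLM judge (assumption): fixed scores plus a keyword check."""
    return {
        "coherence": 1.0,
        "relevance": 1.0 if "lamp" in output else 0.0,
        "accuracy": 0.5,
    }


table = [
    {"CustomerName": "Ada", "ProductName": "desk lamp"},
    {"CustomerName": "Lin", "ProductName": "kettle"},
]
run_prompt_template(table, "Thank @CustomerName for buying the @ProductName.",
                    "PromptResponse", stub_llm)
performance = evaluate_template(table, "PromptResponse", stub_judge)
```

Keeping execution and evaluation as separate passes means the same stored outputs can be scored again by a different judge, such as a different LLM or a user.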

FIG. 9 shows a block diagram 900 of a device 905 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The device 905 may include an input module 910, an output module 915, and a data generation module 920. The device 905, or one or more components of the device 905 (e.g., the input module 910, the output module 915, the data generation module 920), may include at least one processor, which may be coupled with at least one memory, to support the described techniques. Each of these components may be in communication with one another (e.g., via one or more buses).

The input module 910 may manage input signals for the device 905. For example, the input module 910 may identify input signals based on an interaction with a modem, a keyboard, a mouse, a touchscreen, or a similar device. These input signals may be associated with user input or processing at other components or devices. In some cases, the input module 910 may utilize an operating system such as iOS®, ANDROID®, MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, LINUX®, or another known operating system to handle input signals. The input module 910 may send aspects of these input signals to other components of the device 905 for processing. For example, the input module 910 may transmit input signals to the data generation module 920 to support automated agent testing and evaluation using LLMs. In some cases, the input module 910 may be a component of an input/output (I/O) controller 1110 as described with reference to FIG. 11.

The output module 915 may manage output signals for the device 905. For example, the output module 915 may receive signals from other components of the device 905, such as the data generation module 920, and may transmit these signals to other components or devices. In some examples, the output module 915 may transmit output signals for display in a user interface, for storage in a database or data store, for further processing at a server or server cluster, or for any other processes at any number of devices or systems. In some cases, the output module 915 may be a component of an I/O controller 1110 as described with reference to FIG. 11.

For example, the data generation module 920 may include an agent selection component 925, a testing scenario generation component 930, a training data generation component 935, an agent training component 940, an agent execution component 945, or any combination thereof. In some examples, the data generation module 920, or various components thereof, may be configured to perform various operations (e.g., receiving, monitoring, transmitting) using or otherwise in cooperation with the input module 910, the output module 915, or both. For example, the data generation module 920 may receive information from the input module 910, send information to the output module 915, or be integrated in combination with the input module 910, the output module 915, or both to receive information, transmit information, or perform various other operations as described herein.

The data generation module 920 may support agent testing generation via an LLM in accordance with examples as disclosed herein. The agent selection component 925 may be configured to support receiving, from a first user via a first user interface, a selection of one or more agents of a set of multiple agents to perform a first task, the set of multiple agents being associated with one or more AI/ML models to execute natural language queries, where each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models. The testing scenario generation component 930 may be configured to support generating, via the LLM, a set of multiple scenarios for testing the one or more agents, the set of multiple scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based on the set of parameters of each agent. The training data generation component 935 may be configured to support generating, via the LLM and based on generation of the set of multiple scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based on the set of parameters of the one or more agents and the set of multiple scenarios. The agent training component 940 may be configured to support training the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM. The agent execution component 945 may be configured to support executing the one or more agents to perform the first task based on training the one or more AI/ML models associated with the one or more agents.

FIG. 10 shows a block diagram 1000 of a data generation module 1020 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The data generation module 1020 may be an example of aspects of a data generation module or a data generation module 920, or both, as described herein. The data generation module 1020, or various components thereof, may be an example of means for performing various aspects of automated agent testing and evaluation using LLMs as described herein. For example, the data generation module 1020 may include an agent selection component 1025, a testing scenario generation component 1030, a training data generation component 1035, an agent training component 1040, an agent execution component 1045, an agent parameter acquisition component 1050, or any combination thereof. Each of these components, or components of subcomponents thereof (e.g., one or more processors, one or more memories), may communicate, directly or indirectly, with one another (e.g., via one or more buses).

The data generation module 1020 may support agent testing generation via an LLM in accordance with examples as disclosed herein. The agent selection component 1025 may be configured to support receiving, from a first user via a first user interface, a selection of one or more agents of a set of multiple agents to perform a first task, the set of multiple agents being associated with one or more AI/ML models to execute natural language queries, where each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models. The testing scenario generation component 1030 may be configured to support generating, via the LLM, a set of multiple scenarios for testing the one or more agents, the set of multiple scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based on the set of parameters of each agent. The training data generation component 1035 may be configured to support generating, via the LLM and based on generation of the set of multiple scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based on the set of parameters of the one or more agents and the set of multiple scenarios. The agent training component 1040 may be configured to support training the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM. The agent execution component 1045 may be configured to support executing the one or more agents to perform the first task based on training the one or more AI/ML models associated with the one or more agents.

In some examples, the agent parameter acquisition component 1050 may be configured to support obtaining, in response to receiving the selection of the one or more agents, the set of parameters for each agent of the one or more agents, where generating the set of multiple scenarios is based on obtaining the set of parameters for the one or more agents.

In some examples, to support receiving the selection of the one or more agents, the agent selection component 1025 may be configured to support receiving, via the selection of the one or more agents, an indication of the first task for the one or more agents to perform, where executing the one or more agents to perform the first task is based on receiving the indication of the first task.

In some examples, to support training the one or more agents, the agent training component 1040 may be configured to support obtaining, from the one or more AI/ML models associated with the one or more agents, a set of output data in response to the set of training data being provided as input to the one or more AI/ML models, the set of output data being associated with the first task. In some examples, to support training the one or more agents, the agent training component 1040 may be configured to support evaluating the set of output data to obtain a performance metric indication of the one or more agents.

In some examples, the set of output data is evaluated via code-based testing, via the LLM, via user feedback, or any combination thereof.

In some examples, the one or more agents are executed sequentially to perform the first task.
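
The component responsibilities described above (scenario generation, training data generation, output evaluation, and sequential execution) can be sketched with stand-in callables. Every name below is hypothetical. The training step itself is omitted because the description does not specify a training algorithm, and passing each agent's output to the next agent is an assumption about what sequential execution means.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:
    name: str
    parameters: Dict[str, List[str]]  # assumption: includes an "actions" list
    model: Callable[[str], str]       # stand-in for the associated AI/ML model


def generate_scenarios(agent: Agent, llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM for one test scenario per action the agent may perform."""
    return [
        llm(f"Write a test scenario for action '{action}' of agent '{agent.name}'.")
        for action in agent.parameters["actions"]
    ]


def generate_training_data(
    scenarios: List[str], llm: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Ask the LLM for one example input per generated scenario."""
    return [
        {"scenario": scenario, "input": llm(f"Write a user query for: {scenario}")}
        for scenario in scenarios
    ]


def evaluate_outputs(outputs: List[str], checks: List[Callable[[str], bool]]) -> float:
    """Return the fraction of outputs that pass every check.

    A check may wrap a code-based test, an LLM judge, or recorded user feedback.
    """
    if not outputs:
        return 0.0
    passed = sum(all(check(output) for check in checks) for output in outputs)
    return passed / len(outputs)


def run_sequentially(agents: List[Agent], task: str) -> str:
    """Execute agents in order, feeding each result to the next (assumption)."""
    result = task
    for agent in agents:
        result = agent.model(result)
    return result
```

For example, the inputs produced by `generate_training_data` could be passed through an agent's `model`, and the resulting outputs scored with `evaluate_outputs` to obtain a performance metric indication for that agent.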

FIG. 11 shows a diagram of a system 1100 including a device 1105 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The device 1105 may be an example of or include components of a device 905 as described herein. The device 1105 may include components for bi-directional data communications including components for transmitting and receiving communications, such as a data generation module 1120, an I/O controller, such as an I/O controller 1110, a database controller 1115, at least one memory 1125, at least one processor 1130, and a database 1135. These components may be in electronic communication or otherwise coupled (e.g., operatively, communicatively, functionally, electronically, electrically) via one or more buses (e.g., a bus 1140).

The I/O controller 1110 may manage input signals 1145 and output signals 1150 for the device 1105. The I/O controller 1110 may also manage peripherals not integrated into the device 1105. In some cases, the I/O controller 1110 may represent a physical connection or port to an external peripheral. In some cases, the I/O controller 1110 may utilize an operating system such as iOS®, ANDROID®, MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, LINUX®, or another known operating system. In other cases, the I/O controller 1110 may represent or interact with a modem, a keyboard, a mouse, a touchscreen, or a similar device. In some cases, the I/O controller 1110 may be implemented as part of a processor 1130. In some examples, a user may interact with the device 1105 via the I/O controller 1110 or via hardware components controlled by the I/O controller 1110.

The database controller 1115 may manage data storage and processing in a database 1135. In some cases, a user may interact with the database controller 1115. In other cases, the database controller 1115 may operate automatically without user interaction. The database 1135 may be an example of a single database, a distributed database, multiple distributed databases, a data store, a data lake, or an emergency backup database.

Memory 1125 may include random-access memory (RAM) and read-only memory (ROM). The memory 1125 may store computer-readable, computer-executable software including instructions that, when executed, cause at least one processor 1130 to perform various functions described herein. In some cases, the memory 1125 may contain, among other things, a basic I/O system (BIOS) which may control basic hardware or software operation such as the interaction with peripheral components or devices. The memory 1125 may be an example of a single memory or multiple memories. For example, the device 1105 may include one or more memories 1125.

The processor 1130 may include an intelligent hardware device (e.g., a general-purpose processor, a digital signal processor (DSP), a central processing unit (CPU), a microcontroller, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or any combination thereof). In some cases, the processor 1130 may be configured to operate a memory array using a memory controller. In other cases, a memory controller may be integrated into the processor 1130. The processor 1130 may be configured to execute computer-readable instructions stored in at least one memory 1125 to perform various functions (e.g., functions or tasks supporting automated agent testing and evaluation using LLMs). The processor 1130 may be an example of a single processor or multiple processors. For example, the device 1105 may include one or more processors 1130.

The data generation module 1120 may support agent testing generation via an LLM in accordance with examples as disclosed herein. For example, the data generation module 1120 may be configured to support receiving, from a first user via a first user interface, a selection of one or more agents of a set of multiple agents to perform a first task, the set of multiple agents being associated with one or more AI/ML models to execute natural language queries, where each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models. The data generation module 1120 may be configured to support generating, via the LLM, a set of multiple scenarios for testing the one or more agents, the set of multiple scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based on the set of parameters of each agent. The data generation module 1120 may be configured to support generating, via the LLM and based on generation of the set of multiple scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based on the set of parameters of the one or more agents and the set of multiple scenarios. The data generation module 1120 may be configured to support training the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM. The data generation module 1120 may be configured to support executing the one or more agents to perform the first task based on training the one or more AI/ML models associated with the one or more agents.

By including or configuring the data generation module 1120 in accordance with examples as described herein, the device 1105 may support techniques for testing and training agents to support improved testing of agents to execute tasks where agents interact with other agents, improved coordination between agents, more efficient utilization of computational and time resources, and reduced complexity for training various agents to perform tasks.

FIG. 12 shows a block diagram 1200 of a device 1205 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The device 1205 may include an input module 1210, an output module 1215, and a data sheet module 1220. The device 1205, or one or more components of the device 1205 (e.g., the input module 1210, the output module 1215, the data sheet module 1220), may include at least one processor, which may be coupled with at least one memory, to support the described techniques. Each of these components may be in communication with one another (e.g., via one or more buses).

The input module 1210 may manage input signals for the device 1205. For example, the input module 1210 may identify input signals based on an interaction with a modem, a keyboard, a mouse, a touchscreen, or a similar device. These input signals may be associated with user input or processing at other components or devices. In some cases, the input module 1210 may utilize an operating system such as iOS®, ANDROID®, MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, LINUX®, or another known operating system to handle input signals. The input module 1210 may send aspects of these input signals to other components of the device 1205 for processing. For example, the input module 1210 may transmit input signals to the data sheet module 1220 to support automated agent testing and evaluation using LLMs. In some cases, the input module 1210 may be a component of an input/output (I/O) controller 1410 as described with reference to FIG. 14.

The output module 1215 may manage output signals for the device 1205. For example, the output module 1215 may receive signals from other components of the device 1205, such as the data sheet module 1220, and may transmit these signals to other components or devices. In some examples, the output module 1215 may transmit output signals for display in a user interface, for storage in a database or data store, for further processing at a server or server cluster, or for any other processes at any number of devices or systems. In some cases, the output module 1215 may be a component of an I/O controller 1410 as described with reference to FIG. 14.

For example, the data sheet module 1220 may include a data acquisition component 1225, a data evaluation instructions receiver 1230, a data evaluation component 1235, an output data acquisition component 1240, a prompt template receiver 1245, a prompt template execution component 1250, an output data evaluation acquisition component 1255, or any combination thereof. In some examples, the data sheet module 1220, or various components thereof, may be configured to perform various operations (e.g., receiving, monitoring, transmitting) using or otherwise in cooperation with the input module 1210, the output module 1215, or both. For example, the data sheet module 1220 may receive information from the input module 1210, send information to the output module 1215, or be integrated in combination with the input module 1210, the output module 1215, or both to receive information, transmit information, or perform various other operations as described herein.

The data sheet module 1220 may support data evaluation via LLMs in accordance with examples as disclosed herein. The data acquisition component 1225 may be configured to support obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data. The data evaluation instructions receiver 1230 may be configured to support receiving a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions including an indication of one or more fields of the set of multiple fields and an output format. The data evaluation component 1235 may be configured to support evaluating, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions. The output data acquisition component 1240 may be configured to support obtaining, from the LLM and based on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation.

Additionally, or alternatively, the data sheet module 1220 may support LLM prompt evaluation in accordance with examples as disclosed herein. The data acquisition component 1225 may be configured to support obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data. The prompt template receiver 1245 may be configured to support receiving a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template including a set of instructions for the LLM to evaluate the set of data, where the set of instructions includes an indication of one or more fields of the set of multiple fields of the interactive data table. The prompt template execution component 1250 may be configured to support executing, via the LLM and based on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data. The output data evaluation acquisition component 1255 may be configured to support evaluating, via the LLM and in response to executing the prompt template, the set of output data to obtain one or more indications of a performance of the prompt template.

FIG. 13 shows a block diagram 1300 of a data sheet module 1320 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The data sheet module 1320 may be an example of aspects of a data sheet module or a data sheet module 1220, or both, as described herein. The data sheet module 1320, or various components thereof, may be an example of means for performing various aspects of automated agent testing and evaluation using LLMs as described herein. For example, the data sheet module 1320 may include a data acquisition component 1325, a data evaluation instructions receiver 1330, a data evaluation component 1335, an output data acquisition component 1340, a prompt template receiver 1345, a prompt template execution component 1350, an output data evaluation acquisition component 1355, or any combination thereof. Each of these components, or components of subcomponents thereof (e.g., one or more processors, one or more memories), may communicate, directly or indirectly, with one another (e.g., via one or more buses).

The data sheet module 1320 may support data evaluation via LLMs in accordance with examples as disclosed herein. The data acquisition component 1325 may be configured to support obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data. The data evaluation instructions receiver 1330 may be configured to support receiving a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions including an indication of one or more fields of the set of multiple fields and an output format. The data evaluation component 1335 may be configured to support evaluating, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions. The output data acquisition component 1340 may be configured to support obtaining, from the LLM and based on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation.

In some examples, the output data evaluation acquisition component 1355 may be configured to support obtaining, from one or more user inputs or from the LLM, an evaluation of the set of output data based on the set of output data from the LLM.

In some examples, to support obtaining the set of output data, the output data acquisition component 1340 may be configured to support displaying, via the first user interface, the set of output data within an output field of the set of multiple fields of the interactive data table.

In some examples, to support obtaining the set of data, the output data acquisition component 1340 may be configured to support obtaining, via the first user interface, the set of data from a first user, from the LLM, or both.

In some examples, to support obtaining the set of data from the first user, the output data acquisition component 1340 may be configured to support obtaining, from the first user and via the first user interface, the set of data via one or more user inputs including the set of data, via an indication of a file including the set of data, or a combination thereof.

In some examples, to support obtaining the set of data from the LLM, the output data acquisition component 1340 may be configured to support transmitting, to the LLM, a query to generate the set of data, the query including one or more LLM prompt parameters for the LLM to utilize to generate the set of data.
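As a rough illustration of a data-generation query of this kind, the following Python sketch builds a prompt from a few parameters and parses a JSON reply into table records. The `GenerationQuery` class, its parameter names (`num_records`, `temperature`), and the JSON reply shape are all hypothetical assumptions made for the example; the disclosure does not specify which prompt parameters are used or how the reply is encoded.

```python
import json
from dataclasses import dataclass


@dataclass
class GenerationQuery:
    """Hypothetical container for LLM prompt parameters (names are assumptions)."""
    description: str
    fields: list[str]
    num_records: int = 5
    temperature: float = 0.7  # would be passed to the model call, not the prompt

    def to_prompt(self) -> str:
        # Ask for a JSON list so the reply can be parsed into table rows.
        return (
            f"Generate {self.num_records} records as a JSON list of objects "
            f"with keys {self.fields}. Topic: {self.description}"
        )


def parse_records(reply: str, fields: list[str]) -> list[dict[str, str]]:
    """Turn the LLM's JSON reply into rows, filling missing fields with ''."""
    rows = json.loads(reply)
    return [{name: str(row.get(name, "")) for name in fields} for row in rows]
```

A caller would send `query.to_prompt()` to whichever model is in use and pass the reply to `parse_records`; the empty-string default for missing keys is one possible design choice, not something the source prescribes.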

In some examples, the set of data includes text data, single select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof.

In some examples, the set of output data includes textual data, numerical data, single tag data, multiple tag data, currency data, percentage data, or any combination thereof based on the output format indicated via the set of instructions for the evaluation of the set of data.

In some examples, the set of multiple fields of the interactive data table are associated with a set of columns of the interactive data table.

Additionally, or alternatively, the data sheet module 1320 may support LLM prompt evaluation in accordance with examples as disclosed herein. In some examples, the data acquisition component 1325 may be configured to support obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data. The prompt template receiver 1345 may be configured to support receiving a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template including a set of instructions for the LLM to evaluate the set of data, where the set of instructions includes an indication of one or more fields of the set of multiple fields of the interactive data table. The prompt template execution component 1350 may be configured to support executing, via the LLM and based on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data. The output data evaluation acquisition component 1355 may be configured to support evaluating, via the LLM and in response to executing the prompt template, the set of output data to obtain one or more indications of a performance of the prompt template.

In some examples, the one or more indications of the performance of the prompt template include a coherency indication, a relevance indication, an accuracy indication, or any combination thereof.

In some examples, to support executing the set of instructions of the prompt template, the prompt template execution component 1350 may be configured to support obtaining, from the set of data within the interactive data table, a subset of data associated with the one or more fields indicated via the set of instructions. In some examples, to support executing the set of instructions of the prompt template, the prompt template execution component 1350 may be configured to support transmitting, to the LLM, the set of instructions of the prompt template with the subset of data included within the set of instructions in place of the indication of the one or more fields, where the set of output data is obtained based on the subset of data being included within the set of instructions.
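The substitution of field values in place of field indications can be sketched in Python as follows. The `{{field}}` placeholder syntax and the helper names `referenced_fields` and `render` are assumptions chosen for illustration; the disclosure does not state how a prompt template marks the fields it refers to.

```python
import re

# Assumed placeholder syntax: {{field_name}}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def referenced_fields(template: str) -> list[str]:
    """Return the field names a template indicates, de-duplicated, in order."""
    return list(dict.fromkeys(PLACEHOLDER.findall(template)))


def render(template: str, record: dict[str, object]) -> str:
    """Replace each field indication with that field's value from one record."""
    return PLACEHOLDER.sub(lambda m: str(record[m.group(1)]), template)
```

For a record `{"title": "T", "body": "B", "extra": "x"}`, rendering `"Summarize {{ title }}: {{body}}"` yields `"Summarize T: B"`; the `extra` field is never sent because the template does not indicate it. A field named in the template but missing from the record raises `KeyError` in this sketch.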

In some examples, to support executing the set of instructions of the prompt template, the prompt template execution component 1350 may be configured to support displaying, via the first user interface, the set of output data within an output field of the set of multiple fields of the interactive data table.

In some examples, the set of data includes text data, single select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof.

FIG. 14 shows a diagram of a system 1400 including a device 1405 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The device 1405 may be an example of or include components of a device 1205 as described herein. The device 1405 may include components for bi-directional data communications including components for transmitting and receiving communications, such as a data sheet module 1420, an I/O controller 1410, a database controller 1415, at least one memory 1425, at least one processor 1430, and a database 1435. These components may be in electronic communication or otherwise coupled (e.g., operatively, communicatively, functionally, electronically, electrically) via one or more buses (e.g., a bus 1440).

The I/O controller 1410 may manage input signals 1445 and output signals 1450 for the device 1405. The I/O controller 1410 may also manage peripherals not integrated into the device 1405. In some cases, the I/O controller 1410 may represent a physical connection or port to an external peripheral. In some cases, the I/O controller 1410 may utilize an operating system such as iOS®, ANDROID®, MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, LINUX®, or another known operating system. In other cases, the I/O controller 1410 may represent or interact with a modem, a keyboard, a mouse, a touchscreen, or a similar device. In some cases, the I/O controller 1410 may be implemented as part of a processor 1430. In some examples, a user may interact with the device 1405 via the I/O controller 1410 or via hardware components controlled by the I/O controller 1410.

The database controller 1415 may manage data storage and processing in a database 1435. In some cases, a user may interact with the database controller 1415. In other cases, the database controller 1415 may operate automatically without user interaction. The database 1435 may be an example of a single database, a distributed database, multiple distributed databases, a data store, a data lake, or an emergency backup database.

Memory 1425 may include random-access memory (RAM) and read-only memory (ROM). The memory 1425 may store computer-readable, computer-executable software including instructions that, when executed, cause at least one processor 1430 to perform various functions described herein. In some cases, the memory 1425 may contain, among other things, a basic I/O system (BIOS) which may control basic hardware or software operation such as the interaction with peripheral components or devices. The memory 1425 may be an example of a single memory or multiple memories. For example, the device 1405 may include one or more memories 1425.

The processor 1430 may include an intelligent hardware device (e.g., a general-purpose processor, a digital signal processor (DSP), a central processing unit (CPU), a microcontroller, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a programmable logic device, a discrete gate or transistor logic component, a discrete hardware component, or any combination thereof). In some cases, the processor 1430 may be configured to operate a memory array using a memory controller. In other cases, a memory controller may be integrated into the processor 1430. The processor 1430 may be configured to execute computer-readable instructions stored in at least one memory 1425 to perform various functions (e.g., functions or tasks supporting automated agent testing and evaluation using LLMs). The processor 1430 may be an example of a single processor or multiple processors. For example, the device 1405 may include one or more processors 1430.

The data sheet module 1420 may support data evaluation via LLMs in accordance with examples as disclosed herein. For example, the data sheet module 1420 may be configured to support obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data. The data sheet module 1420 may be configured to support receiving a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions including an indication of one or more fields of the set of multiple fields and an output format. The data sheet module 1420 may be configured to support evaluating, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions. The data sheet module 1420 may be configured to support obtaining, from the LLM and based on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation.

Additionally, or alternatively, the data sheet module 1420 may support LLM prompt evaluation in accordance with examples as disclosed herein. For example, the data sheet module 1420 may be configured to support obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data. The data sheet module 1420 may be configured to support receiving a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template including a set of instructions for the LLM to evaluate the set of data, where the set of instructions includes an indication of one or more fields of the set of multiple fields of the interactive data table. The data sheet module 1420 may be configured to support executing, via the LLM and based on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data. The data sheet module 1420 may be configured to support evaluating, via the LLM and in response to executing the prompt template, the set of output data to obtain one or more indications of a performance of the prompt template.

By including or configuring the data sheet module 1420 in accordance with examples as described herein, the device 1405 may support techniques for data evaluation and LLM prompt evaluation to support improved function of data tables, improved accuracy of LLM generations, and improved efficiency for data input, data evaluation, and for prompt template evaluations.

FIG. 15 shows a flowchart illustrating a method 1500 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The operations of the method 1500 may be implemented by an agent training service or its components as described herein. For example, the operations of the method 1500 may be performed by an agent training service as described with reference to FIGS. 1 through 11. In some examples, an agent training service may execute a set of instructions to control the functional elements of the agent training service to perform the described functions. Additionally, or alternatively, the agent training service may perform aspects of the described functions using special-purpose hardware.

At 1505, the method may include receiving, from a first user via a first user interface, a selection of one or more agents of a set of multiple agents to perform a first task, the set of multiple agents being associated with one or more AI/ML models to execute natural language queries, where each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models. The operations of 1505 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1505 may be performed by an agent selection component 1025 as described with reference to FIG. 10.

At 1510, the method may include generating, via the LLM, a set of multiple scenarios for testing the one or more agents, the set of multiple scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based on the set of parameters of each agent. The operations of 1510 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1510 may be performed by a testing scenario generation component 1030 as described with reference to FIG. 10.

At 1515, the method may include generating, via the LLM and based on generation of the set of multiple scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based on the set of parameters of the one or more agents and the set of multiple scenarios. The operations of 1515 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1515 may be performed by a training data generation component 1035 as described with reference to FIG. 10.

At 1520, the method may include training the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM. The operations of 1520 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1520 may be performed by an agent training component 1040 as described with reference to FIG. 10.

At 1525, the method may include executing the one or more agents to perform the first task based on training the one or more AI/ML models associated with the one or more agents. The operations of 1525 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1525 may be performed by an agent execution component 1045 as described with reference to FIG. 10.
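The five operations of the agent testing method (selection, scenario generation, training data generation, training, and execution) can be sketched in Python as a single pipeline. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `Agent` class, the `LLM` callable type, the prompt wording, and `fake_llm` are all hypothetical, and "training" is reduced to storing scenario/response pairs so that the example runs without a real model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical LLM interface: takes a prompt, returns a list of completions.
LLM = Callable[[str], list[str]]


@dataclass
class Agent:
    name: str
    parameters: dict[str, str]
    memory: dict[str, str] = field(default_factory=dict)

    def train(self, training_data: list[tuple[str, str]]) -> None:
        # Stand-in for model training: store scenario -> response pairs.
        self.memory.update(training_data)

    def execute(self, task: str) -> str:
        # Stand-in for inference: look the task up, else report no answer.
        return self.memory.get(task, "<no learned response>")


def run_pipeline(agents: list[Agent], llm: LLM, task: str) -> dict[str, str]:
    results = {}
    for agent in agents:  # the agents selected by the user
        actions = agent.parameters.get("actions", "")
        # Generate scenarios from the agent's parameters.
        scenarios = llm(f"List test scenarios for actions: {actions}")
        # Generate training data from the scenarios.
        training_data = [
            (s, llm(f"Write the ideal response of {agent.name} to: {s}")[0])
            for s in scenarios
        ]
        agent.train(training_data)                 # train
        results[agent.name] = agent.execute(task)  # execute the first task
    return results


def fake_llm(prompt: str) -> list[str]:
    # Deterministic stub so the sketch runs without a real model.
    if prompt.startswith("List test scenarios"):
        return ["customer asks for a refund", "customer asks for order status"]
    return [f"response to [{prompt.split(': ', 1)[1]}]"]
```

Calling `run_pipeline([Agent("support", {"actions": "refund, status"})], fake_llm, "customer asks for a refund")` exercises each stage in order; in a real system the stubbed `train` and `execute` methods would wrap the AI/ML models associated with each agent.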

FIG. 16 shows a flowchart illustrating a method 1600 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The operations of the method 1600 may be implemented by an LLM service or its components as described herein. For example, the operations of the method 1600 may be performed by an LLM service as described with reference to FIGS. 1 through 8 and 12 through 14. In some examples, an LLM service may execute a set of instructions to control the functional elements of the LLM service to perform the described functions. Additionally, or alternatively, the LLM service may perform aspects of the described functions using special-purpose hardware.

At 1605, the method may include obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data. The operations of 1605 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1605 may be performed by a data acquisition component 1325 as described with reference to FIG. 13.

At 1610, the method may include receiving a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions including an indication of one or more fields of the set of multiple fields and an output format. The operations of 1610 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1610 may be performed by a data evaluation instructions receiver 1330 as described with reference to FIG. 13.

At 1615, the method may include evaluating, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions. The operations of 1615 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1615 may be performed by a data evaluation component 1335 as described with reference to FIG. 13.

At 1620, the method may include obtaining, from the LLM and based on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation. The operations of 1620 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1620 may be performed by an output data acquisition component 1340 as described with reference to FIG. 13.
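The data evaluation method can be illustrated with a short Python sketch that selects the indicated fields from each record, sends them to an LLM with the instructions, and coerces the reply into the requested output format. The function names, the format labels (`"number"`, `"multi_tag"`, `"single_tag"`), the prompt layout, and `fake_llm` are hypothetical assumptions; only the overall flow follows the description above.

```python
from typing import Callable, Union

Record = dict[str, str]
Output = Union[str, float, list[str]]


def coerce(raw: str, output_format: str) -> Output:
    """Convert the LLM's text reply into the requested output format."""
    raw = raw.strip()
    if output_format == "number":
        return float(raw)
    if output_format == "multi_tag":
        return [tag.strip() for tag in raw.split(",") if tag.strip()]
    return raw  # text and single-tag outputs stay as strings


def evaluate_table(
    records: list[Record],
    fields: list[str],
    instruction: str,
    output_format: str,
    llm: Callable[[str], str],
) -> list[Output]:
    results = []
    for record in records:
        # Only the fields indicated by the instructions are sent to the LLM.
        subset = {name: record[name] for name in fields}
        prompt = f"{instruction}\nFields: {subset}\nOutput format: {output_format}"
        results.append(coerce(llm(prompt), output_format))
    return results


def fake_llm(prompt: str) -> str:
    # Deterministic stub standing in for a real model call.
    return "negative" if "broken" in prompt else "positive"
```

The returned list has one entry per record, so it could populate an output column of the table row by row.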

FIG. 17 shows a flowchart illustrating a method 1700 that supports automated agent testing and evaluation using LLMs in accordance with aspects of the present disclosure. The operations of the method 1700 may be implemented by an LLM service or its components as described herein. For example, the operations of the method 1700 may be performed by an LLM service as described with reference to FIGS. 1 through 8 and 12 through 14. In some examples, an LLM service may execute a set of instructions to control the functional elements of the LLM service to perform the described functions. Additionally, or alternatively, the LLM service may perform aspects of the described functions using special-purpose hardware.

At 1705, the method may include obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data. The operations of 1705 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1705 may be performed by a data acquisition component 1325 as described with reference to FIG. 13.

At 1710, the method may include receiving a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template including a set of instructions for the LLM to evaluate the set of data, where the set of instructions includes an indication of one or more fields of the set of multiple fields of the interactive data table. The operations of 1710 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1710 may be performed by a prompt template receiver 1345 as described with reference to FIG. 13.

At 1715, the method may include executing, via the LLM and based on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data. The operations of 1715 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1715 may be performed by a prompt template execution component 1350 as described with reference to FIG. 13.

At 1720, the method may include evaluating, via the LLM and in response to executing the prompt template, the set of output data to obtain one or more indications of a performance of the prompt template. The operations of 1720 may be performed in accordance with examples as disclosed herein. In some examples, aspects of the operations of 1720 may be performed by an output data evaluation acquisition component 1355 as described with reference to FIG. 13.
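One way to picture the final evaluation step is an LLM acting as a judge over each output and the per-output scores being averaged into indications of coherency, relevance, and accuracy. The Python sketch below assumes a 1-to-5 scale, a JSON reply from the judge, and simple averaging; none of these details come from the disclosure, and `fake_judge` is a stub used only so the example runs.

```python
import json
from statistics import mean
from typing import Callable

# Assumed metric names, matching the indications mentioned in the description.
METRICS = ("coherency", "relevance", "accuracy")


def judge_output(output: str, judge: Callable[[str], str]) -> dict[str, float]:
    """Ask a judge LLM to score one output; the reply is assumed to be JSON."""
    reply = judge(
        f"Rate the following on {', '.join(METRICS)} from 1 to 5 as JSON: {output}"
    )
    scores = json.loads(reply)
    return {metric: float(scores[metric]) for metric in METRICS}


def score_template(outputs: list[str], judge: Callable[[str], str]) -> dict[str, float]:
    """Average per-output scores into one indication per metric."""
    per_output = [judge_output(o, judge) for o in outputs]
    return {metric: mean(s[metric] for s in per_output) for metric in METRICS}


def fake_judge(prompt: str) -> str:
    # Deterministic stub: outputs mentioning "Paris" are treated as accurate.
    accuracy = 5 if "Paris" in prompt else 1
    return json.dumps({"coherency": 4, "relevance": 5, "accuracy": accuracy})
```

Averaging is only one possible aggregation; a system could equally report per-record scores or flag low-scoring outputs for review.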

A method for agent testing generation via an LLM by an apparatus is described. The method may include receiving, from a first user via a first user interface, a selection of one or more agents of a set of multiple agents to perform a first task, the set of multiple agents being associated with one or more AI/ML models to execute natural language queries, where each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models, generating, via the LLM, a set of multiple scenarios for testing the one or more agents, the set of multiple scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based on the set of parameters of each agent, generating, via the LLM and based on generation of the set of multiple scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based on the set of parameters of the one or more agents and the set of multiple scenarios, training the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM, and executing the one or more agents to perform the first task based on training the one or more AI/ML models associated with the one or more agents.

An apparatus for agent testing generation via an LLM is described. The apparatus may include one or more memories storing processor executable code, and one or more processors coupled with the one or more memories. The one or more processors may individually or collectively be operable to execute the code to cause the apparatus to receive, from a first user via a first user interface, a selection of one or more agents of a set of multiple agents to perform a first task, the set of multiple agents being associated with one or more AI/ML models to execute natural language queries, where each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models, generate, via the LLM, a set of multiple scenarios for testing the one or more agents, the set of multiple scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based on the set of parameters of each agent, generate, via the LLM and based on generation of the set of multiple scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based on the set of parameters of the one or more agents and the set of multiple scenarios, train the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM, and execute the one or more agents to perform the first task based on training the one or more AI/ML models associated with the one or more agents.

Another apparatus for agent testing generation via an LLM is described. The apparatus may include means for receiving, from a first user via a first user interface, a selection of one or more agents of a set of multiple agents to perform a first task, the set of multiple agents being associated with one or more AI/ML models to execute natural language queries, where each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models, means for generating, via the LLM, a set of multiple scenarios for testing the one or more agents, the set of multiple scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based on the set of parameters of each agent, means for generating, via the LLM and based on generation of the set of multiple scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based on the set of parameters of the one or more agents and the set of multiple scenarios, means for training the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM, and means for executing the one or more agents to perform the first task based on training the one or more AI/ML models associated with the one or more agents.

A non-transitory computer-readable medium storing code for agent testing generation via an LLM is described. The code may include instructions executable by one or more processors to receive, from a first user via a first user interface, a selection of one or more agents of a set of multiple agents to perform a first task, the set of multiple agents being associated with one or more AI/ML models to execute natural language queries, where each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models, generate, via the LLM, a set of multiple scenarios for testing the one or more agents, the set of multiple scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based on the set of parameters of each agent, generate, via the LLM and based on generation of the set of multiple scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based on the set of parameters of the one or more agents and the set of multiple scenarios, train the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM, and execute the one or more agents to perform the first task based on training the one or more AI/ML models associated with the one or more agents.

Some examples of the method, apparatus, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for obtaining, in response to receiving the selection of the one or more agents, the set of parameters for each agent of the one or more agents, where generating the set of multiple scenarios may be based on obtaining the set of parameters for the one or more agents.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, receiving the selection of the one or more agents may include operations, features, means, or instructions for receiving, via the selection of the one or more agents, an indication of the first task for the one or more agents to perform, where executing the one or more agents to perform the first task may be based on receiving the indication of the first task.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, training the one or more agents may include operations, features, means, or instructions for obtaining, from the one or more AI/ML models associated with the one or more agents, a set of output data in response to the set of training data being provided as input to the one or more AI/ML models, the set of output data being associated with the first task and evaluating the set of output data to obtain a performance metric indication of the one or more agents.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, the set of output data may be evaluated via code-based testing, via the LLM, via user feedback, or any combination thereof.
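By way of a hedged example, a combination of code-based testing, LLM-based evaluation, and user feedback may be reduced to a single performance metric as sketched below. The function names, the keyword check, the score range of 0 to 1, and the simple averaging are all assumptions made for illustration; the disclosure does not prescribe how the evaluators are combined.

```python
from statistics import mean


def code_based_check(output, expected_keyword):
    # Code-based testing: a deterministic assertion on the agent output.
    return 1.0 if expected_keyword in output else 0.0


def llm_check(output, llm_judge):
    # LLM-based evaluation: llm_judge is a hypothetical callable returning a score in [0, 1].
    return float(llm_judge(output))


def evaluate(outputs, expected_keyword, llm_judge=None, user_ratings=None):
    """Average the available evaluators per output, then average across outputs."""
    per_output = []
    for index, output in enumerate(outputs):
        scores = [code_based_check(output, expected_keyword)]
        if llm_judge is not None:
            scores.append(llm_check(output, llm_judge))
        if user_ratings is not None:
            scores.append(user_ratings[index])     # user feedback, one rating per output
        per_output.append(mean(scores))
    return mean(per_output)


if __name__ == "__main__":
    outputs = ["refund issued for order 42", "unable to find order"]
    print(evaluate(outputs, "refund"))
```

Any subset of the three evaluators may be supplied, which mirrors the "or any combination thereof" language above.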

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, the one or more agents may be executed sequentially to perform the first task.

A method for data evaluation via large language models (LLMs) by an apparatus is described. The method may include obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data, receiving a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions including an indication of one or more fields of the set of multiple fields and an output format, evaluating, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions, and obtaining, from the LLM and based on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation.
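The data evaluation operations described above may, as one possible sketch, be expressed in Python as follows. The table is modeled as a list of dictionaries, and the instruction keys `text`, `fields`, and `output_format`, the format names `number` and `multi_tag`, and the helper names are hypothetical choices made for this illustration only.

```python
import json


def coerce(raw, output_format):
    """Convert raw LLM text into the requested output format (format names are illustrative)."""
    if output_format == "number":
        return float(raw)
    if output_format == "multi_tag":
        return [tag.strip() for tag in raw.split(",") if tag.strip()]
    return raw.strip()  # default: textual data


def evaluate_table(records, instructions, llm):
    """Evaluate only the indicated fields of each record and return formatted outputs."""
    outputs = []
    for record in records:
        # Only the subset of data tied to the indicated fields is sent to the LLM.
        subset = {name: record[name] for name in instructions["fields"]}
        prompt = (
            f"{instructions['text']}\n"
            f"Data: {json.dumps(subset)}\n"
            f"Respond as: {instructions['output_format']}"
        )
        outputs.append(coerce(llm(prompt), instructions["output_format"]))
    return outputs


if __name__ == "__main__":
    table = [
        {"id": 1, "review": "Arrived quickly, works well."},
        {"id": 2, "review": "Stopped working after a week."},
    ]
    instructions = {
        "text": "Rate the sentiment from 1 to 5.",
        "fields": ["review"],
        "output_format": "number",
    }
    # Stub LLM so the sketch runs offline.
    stub_llm = lambda prompt: "5" if "works well" in prompt else "1"
    print(evaluate_table(table, instructions, stub_llm))
```

The returned list could then be written into an output field (for example, a new column) of the table, one value per data record.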

An apparatus for data evaluation via large language models (LLMs) is described. The apparatus may include one or more memories storing processor executable code, and one or more processors coupled with the one or more memories. The one or more processors may individually or collectively be operable to execute the code to cause the apparatus to obtain, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data, receive a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions including an indication of one or more fields of the set of multiple fields and an output format, evaluate, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions, and obtain, from the LLM and based on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation.

Another apparatus for data evaluation via large language models (LLMs) is described. The apparatus may include means for obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data, means for receiving a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions including an indication of one or more fields of the set of multiple fields and an output format, means for evaluating, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions, and means for obtaining, from the LLM and based on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation.

A non-transitory computer-readable medium storing code for data evaluation via large language models (LLMs) is described. The code may include instructions executable by one or more processors to obtain, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data, receive a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions including an indication of one or more fields of the set of multiple fields and an output format, evaluate, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions, and obtain, from the LLM and based on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation.

Some examples of the method, apparatus, and non-transitory computer-readable medium described herein may further include operations, features, means, or instructions for obtaining, from one or more user inputs or from the LLM, an evaluation of the set of output data based on the set of output data from the LLM.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, obtaining the set of output data may include operations, features, means, or instructions for displaying, via the first user interface, the set of output data within an output field of the set of multiple fields of the interactive data table.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, obtaining the set of data may include operations, features, means, or instructions for obtaining, via the first user interface, the set of data from a first user, from the LLM, or both.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, obtaining the set of data from the first user may include operations, features, means, or instructions for obtaining, from the first user and via the first user interface, the set of data via one or more user inputs including the set of data, via an indication of a file including the set of data, or a combination thereof.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, obtaining the set of data from the LLM may include operations, features, means, or instructions for transmitting, to the LLM, a query to generate the set of data, the query including one or more LLM prompt parameters for the LLM to utilize to generate the set of data.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, the set of data includes text data, single-select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, the set of output data includes textual data, numerical data, single tag data, multiple tag data, currency data, percentage data, or any combination thereof based on the output format indicated via the set of instructions for the evaluation of the set of data.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, the set of multiple fields of the interactive data table may be associated with a set of columns of the interactive data table.

A method for LLM prompt evaluation by an apparatus is described. The method may include obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data, receiving a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template including a set of instructions for the LLM to evaluate the set of data, where the set of instructions includes an indication of one or more fields of the set of multiple fields of the interactive data table, executing, via the LLM and based on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data, and evaluating, via the LLM and in response to executing the prompt template, the set of output data to obtain one or more indications of a performance of the prompt template.

An apparatus for LLM prompt evaluation is described. The apparatus may include one or more memories storing processor executable code, and one or more processors coupled with the one or more memories. The one or more processors may individually or collectively be operable to execute the code to cause the apparatus to obtain, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data, receive a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template including a set of instructions for the LLM to evaluate the set of data, where the set of instructions includes an indication of one or more fields of the set of multiple fields of the interactive data table, execute, via the LLM and based on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data, and evaluate, via the LLM and in response to executing the prompt template, the set of output data to obtain one or more indications of a performance of the prompt template.

Another apparatus for LLM prompt evaluation is described. The apparatus may include means for obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data, means for receiving a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template including a set of instructions for the LLM to evaluate the set of data, where the set of instructions includes an indication of one or more fields of the set of multiple fields of the interactive data table, means for executing, via the LLM and based on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data, and means for evaluating, via the LLM and in response to executing the prompt template, the set of output data to obtain one or more indications of a performance of the prompt template.

A non-transitory computer-readable medium storing code for LLM prompt evaluation is described. The code may include instructions executable by one or more processors to obtain, via a first user interface, a set of data within an interactive data table, the interactive data table including a set of multiple fields and the set of multiple fields including a set of multiple data records from the set of data, receive a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template including a set of instructions for the LLM to evaluate the set of data, where the set of instructions includes an indication of one or more fields of the set of multiple fields of the interactive data table, execute, via the LLM and based on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data, and evaluate, via the LLM and in response to executing the prompt template, the set of output data to obtain one or more indications of a performance of the prompt template.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, the one or more indications of the performance of the prompt template include a coherency indication, a relevance indication, an accuracy indication, or any combination thereof.
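As a minimal sketch of how such indications might be gathered, the following Python example asks a judge LLM to score each output and averages the scores per dimension. The 1-to-5 scale, the JSON reply format, the `PromptPerformance` type, and the function `judge_template` are assumptions introduced here for illustration and are not specified by the disclosure.

```python
import json
from dataclasses import dataclass
from statistics import mean


@dataclass
class PromptPerformance:
    coherency: float
    relevance: float
    accuracy: float


# Hypothetical judging instructions; a real system may word these differently.
JUDGE_INSTRUCTIONS = (
    "Score the following output from 1 to 5 for coherency, relevance, and accuracy. "
    'Reply with JSON such as {"coherency": 3, "relevance": 3, "accuracy": 3}.\nOutput: '
)


def judge_template(outputs, judge_llm):
    """Average per-output judge scores into one set of performance indications.

    Raises statistics.StatisticsError if `outputs` is empty.
    """
    scores = [json.loads(judge_llm(JUDGE_INSTRUCTIONS + text)) for text in outputs]
    return PromptPerformance(
        coherency=mean(s["coherency"] for s in scores),
        relevance=mean(s["relevance"] for s in scores),
        accuracy=mean(s["accuracy"] for s in scores),
    )


if __name__ == "__main__":
    # Stub judge so the sketch runs offline; it returns the same scores for every output.
    stub_judge = lambda prompt: '{"coherency": 4, "relevance": 5, "accuracy": 3}'
    print(judge_template(["summary one", "summary two"], stub_judge))
```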

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, executing the set of instructions of the prompt template may include operations, features, means, or instructions for obtaining, from the set of data within the interactive data table, a subset of data associated with the one or more fields indicated via the set of instructions and transmitting, to the LLM, the set of instructions of the prompt template with the subset of data included within the set of instructions in place of the indication of the one or more fields, where the set of output data may be obtained based on the subset of data being included within the set of instructions.
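The substitution described above, in which the subset of data takes the place of the field indications within the instructions, may be illustrated with the following Python sketch. The double-brace placeholder syntax (`{{field}}`) and the function names are assumptions made for this example; the disclosure does not specify how a field is indicated within a prompt template.

```python
import re

# Assumed placeholder syntax: {{field_name}}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def referenced_fields(template):
    # Field names indicated by the template, de-duplicated in order of appearance.
    return list(dict.fromkeys(PLACEHOLDER.findall(template)))


def render(template, record):
    # Replace each field indication with the corresponding value from the record.
    return PLACEHOLDER.sub(lambda match: str(record[match.group(1)]), template)


def execute_template(template, records, llm):
    fields = referenced_fields(template)
    outputs = []
    for record in records:
        subset = {name: record[name] for name in fields}   # only the indicated fields
        outputs.append(llm(render(template, subset)))
    return outputs


if __name__ == "__main__":
    template = "Summarize this review of {{ product }}: {{review}}"
    rows = [{"id": 1, "product": "kettle", "review": "Boils fast, lid is loose."}]
    # Stub LLM that echoes the rendered prompt, making the substitution visible.
    print(execute_template(template, rows, lambda prompt: prompt))
```

A record that lacks an indicated field raises `KeyError` in this sketch; an implementation may instead choose to skip the record or substitute an empty value.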

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, executing the set of instructions of the prompt template may include operations, features, means, or instructions for displaying, via the first user interface, the set of output data within an output field of the set of multiple fields of the interactive data table.

In some examples of the method, apparatus, and non-transitory computer-readable medium described herein, the set of data includes text data, single-select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof.

The following provides an overview of aspects of the present disclosure:

Aspect 1: A method for agent testing generation via an LLM, comprising: receiving, from a first user via a first user interface, a selection of one or more agents of a plurality of agents to perform a first task, the plurality of agents being associated with one or more AI/ML models to execute natural language queries, wherein each agent is associated with a set of parameters for performing one or more actions via the one or more AI/ML models; generating, via the LLM, a plurality of scenarios for testing the one or more agents, the plurality of scenarios corresponding to the one or more actions each agent of the one or more agents is instructed to perform based at least in part on the set of parameters of each agent; generating, via the LLM and based at least in part on generation of the plurality of scenarios, a set of training data for training the one or more agents to perform the first task, the set of training data being based at least in part on the set of parameters of the one or more agents and the plurality of scenarios; training the one or more AI/ML models associated with the one or more agents using the set of training data generated via the LLM; and executing the one or more agents to perform the first task based at least in part on training the one or more AI/ML models associated with the one or more agents.

Aspect 2: The method of aspect 1, further comprising: obtaining, in response to receiving the selection of the one or more agents, the set of parameters for each agent of the one or more agents, wherein generating the plurality of scenarios is based at least in part on obtaining the set of parameters for the one or more agents.

Aspect 3: The method of any of aspects 1 through 2, wherein receiving the selection of the one or more agents comprises: receiving, via the selection of the one or more agents, an indication of the first task for the one or more agents to perform, wherein executing the one or more agents to perform the first task is based at least in part on receiving the indication of the first task.

Aspect 4: The method of any of aspects 1 through 3, wherein training the one or more agents comprises: obtaining, from the one or more AI/ML models associated with the one or more agents, a set of output data in response to the set of training data being provided as input to the one or more AI/ML models, the set of output data being associated with the first task; and evaluating the set of output data to obtain a performance metric indication of the one or more agents.

Aspect 5: The method of aspect 4, wherein the set of output data is evaluated via code-based testing, via the LLM, via user feedback, or any combination thereof.

Aspect 6: The method of any of aspects 1 through 5, wherein the one or more agents are executed sequentially to perform the first task.

Aspect 7: A method for data evaluation via large language models (LLMs), comprising: obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table comprising a plurality of fields and the plurality of fields comprising a plurality of data records from the set of data; receiving a set of instructions for evaluation of the set of data within the interactive data table by an LLM, the set of instructions comprising an indication of one or more fields of the plurality of fields and an output format; evaluating, via the LLM and in accordance with the set of instructions for the evaluation, a subset of data of the set of data within the interactive data table, the subset of data being associated with the one or more fields of the interactive data table indicated via the set of instructions; and obtaining, from the LLM and based at least in part on the evaluation, a set of output data within the output format indicated via the set of instructions for the evaluation.

Aspect 8: The method of aspect 7, further comprising: obtaining, from one or more user inputs or from the LLM, an evaluation of the set of output data based at least in part on the set of output data from the LLM.

Aspect 9: The method of any of aspects 7 through 8, wherein obtaining the set of output data comprises: displaying, via the first user interface, the set of output data within an output field of the plurality of fields of the interactive data table.

Aspect 10: The method of any of aspects 7 through 9, wherein obtaining the set of data comprises: obtaining, via the first user interface, the set of data from a first user, from the LLM, or both.

Aspect 11: The method of aspect 10, wherein obtaining the set of data from the first user comprises: obtaining, from the first user and via the first user interface, the set of data via one or more user inputs comprising the set of data, via an indication of a file comprising the set of data, or a combination thereof.

Aspect 12: The method of any of aspects 10 through 11, wherein obtaining the set of data from the LLM comprises: transmitting, to the LLM, a query to generate the set of data, the query comprising one or more LLM prompt parameters for the LLM to utilize to generate the set of data.

Aspect 13: The method of any of aspects 7 through 12, wherein the set of data comprises text data, single-select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof.

Aspect 14: The method of any of aspects 7 through 13, wherein the set of output data comprises textual data, numerical data, single tag data, multiple tag data, currency data, percentage data, or any combination thereof based at least in part on the output format indicated via the set of instructions for the evaluation of the set of data.

Aspect 15: The method of any of aspects 7 through 14, wherein the plurality of fields of the interactive data table are associated with a set of columns of the interactive data table.

Aspect 16: A method for LLM prompt evaluation, comprising: obtaining, via a first user interface, a set of data within an interactive data table, the interactive data table comprising a plurality of fields and the plurality of fields comprising a plurality of data records from the set of data; receiving a prompt template for an LLM that is associated with evaluation of the set of data obtained via the first user interface, the prompt template comprising a set of instructions for the LLM to evaluate the set of data, wherein the set of instructions comprises an indication of one or more fields of the plurality of fields of the interactive data table; executing, via the LLM and based at least in part on receiving the prompt template, the set of instructions of the prompt template using the set of data within the interactive data table to obtain a set of output data; and evaluating, via the LLM and in response to executing the prompt template, the set of output data to obtain one or more indications of a performance of the prompt template.

Aspect 17: The method of aspect 16, wherein the one or more indications of the performance of the prompt template comprise a coherency indication, a relevance indication, an accuracy indication, or any combination thereof.

Aspect 18: The method of any of aspects 16 through 17, wherein executing the set of instructions of the prompt template comprises: obtaining, from the set of data within the interactive data table, a subset of data associated with the one or more fields indicated via the set of instructions; and transmitting, to the LLM, the set of instructions of the prompt template with the subset of data included within the set of instructions in place of the indication of the one or more fields, wherein the set of output data is obtained based at least in part on the subset of data being included within the set of instructions.

Aspect 19: The method of any of aspects 16 through 18, wherein executing the set of instructions of the prompt template comprises: displaying, via the first user interface, the set of output data within an output field of the plurality of fields of the interactive data table.

Aspect 20: The method of any of aspects 16 through 19, wherein the set of data comprises text data, single-select data, multi-select data, numerical data, currency data, percentage data, one or more attachments, one or more formulas, record data, one or more images, audio data, or any combination thereof.

Aspect 21: An apparatus for agent testing generation via an LLM, comprising one or more memories storing processor-executable code, and one or more processors coupled with the one or more memories and individually or collectively operable to execute the code to cause the apparatus to perform a method of any of aspects 1 through 6.

Aspect 22: An apparatus for agent testing generation via an LLM, comprising at least one means for performing a method of any of aspects 1 through 6.

Aspect 23: A non-transitory computer-readable medium storing code for agent testing generation via an LLM, the code comprising instructions executable by one or more processors to perform a method of any of aspects 1 through 6.

Aspect 24: An apparatus for data evaluation via large language models (LLMs), comprising one or more memories storing processor-executable code, and one or more processors coupled with the one or more memories and individually or collectively operable to execute the code to cause the apparatus to perform a method of any of aspects 7 through 15.

Aspect 25: An apparatus for data evaluation via large language models (LLMs), comprising at least one means for performing a method of any of aspects 7 through 15.

Aspect 26: A non-transitory computer-readable medium storing code for data evaluation via large language models (LLMs), the code comprising instructions executable by one or more processors to perform a method of any of aspects 7 through 15.

Aspect 27: An apparatus for LLM prompt evaluation, comprising one or more memories storing processor-executable code, and one or more processors coupled with the one or more memories and individually or collectively operable to execute the code to cause the apparatus to perform a method of any of aspects 16 through 20.

Aspect 28: An apparatus for LLM prompt evaluation, comprising at least one means for performing a method of any of aspects 16 through 20.

Aspect 29: A non-transitory computer-readable medium storing code for LLM prompt evaluation, the code comprising instructions executable by one or more processors to perform a method of any of aspects 16 through 20.

It should be noted that the methods described above describe possible implementations, and that the operations and the steps may be rearranged or otherwise modified and that other implementations are possible. Furthermore, aspects from two or more of the methods may be combined.

The description set forth herein, in connection with the appended drawings, describes example configurations and does not represent all the examples that may be implemented or that are within the scope of the claims. The term “exemplary” used herein means “serving as an example, instance, or illustration,” and not “preferred” or “advantageous over other examples.” The detailed description includes specific details for the purpose of providing an understanding of the described techniques. These techniques, however, may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described examples.

In the appended figures, similar components or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label by a dash and a second label that distinguishes among the similar components. If just the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.

Information and signals described herein may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

The various illustrative blocks and modules described in connection with the disclosure herein may be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).

The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Other examples and implementations are within the scope of the disclosure and appended claims. For example, due to the nature of software, functions described above can be implemented using software executed by a processor, hardware, firmware, hardwiring, or combinations of any of these. Features implementing functions may also be physically located at various positions, including being distributed such that portions of functions are implemented at different physical locations. Also, as used herein, including in the claims, “or” as used in a list of items (for example, a list of items prefaced by a phrase such as “at least one of” or “one or more of”) indicates an inclusive list such that, for example, a list of at least one of A, B, or C means A or B or C or AB or AC or BC or ABC (i.e., A and B and C). Also, as used herein, the phrase “based on” shall not be construed as a reference to a closed set of conditions. For example, an exemplary step that is described as “based on condition A” may be based on both a condition A and a condition B without departing from the scope of the present disclosure. In other words, as used herein, the phrase “based on” shall be construed in the same manner as the phrase “based at least in part on.”
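The inclusive reading of "or" described above can be made concrete by enumerating every non-empty combination of the listed items. The following is a minimal sketch for illustration; the function name `inclusive_or` is an assumption and does not appear in the disclosure.

```python
from itertools import combinations


def inclusive_or(items):
    """Return every non-empty combination of items, smallest first."""
    return [
        "".join(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


print(inclusive_or(["A", "B", "C"]))
# → ['A', 'B', 'C', 'AB', 'AC', 'BC', 'ABC']
```

The seven results match the seven alternatives enumerated in the paragraph above for "at least one of A, B, or C".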

Computer-readable media include both non-transitory computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A non-transitory storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, non-transitory computer-readable media can comprise RAM, ROM, electrically erasable programmable ROM (EEPROM), compact disk (CD) ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other non-transitory medium that can be used to carry or store desired program code means in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include CD, laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.

As used herein, including in the claims, the article “a” before a noun is open-ended and understood to refer to “at least one” of those nouns or “one or more” of those nouns. Thus, the terms “a,” “at least one,” “one or more,” and “at least one of one or more” may be interchangeable. For example, if a claim recites “a component” that performs one or more functions, each of the individual functions may be performed by a single component or by any combination of multiple components. Thus, the term “a component” having characteristics or performing functions may refer to “at least one of one or more components” having a particular characteristic or performing a particular function. Subsequent reference to a component introduced with the article “a” using the terms “the” or “said” may refer to any or all of the one or more components. For example, a component introduced with the article “a” may be understood to mean “one or more components,” and referring to “the component” subsequently in the claims may be understood to be equivalent to referring to “at least one of the one or more components.” Similarly, subsequent reference to a component introduced as “one or more components” using the terms “the” or “said” may refer to any or all of the one or more components. For example, referring to “the one or more components” subsequently in the claims may be understood to be equivalent to referring to “at least one of the one or more components.”

The description herein is provided to enable a person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not limited to the examples and designs described herein, but is to be accorded the broadest scope consistent with the principles and novel features disclosed herein.

Patent Metadata

Filing Date

November 19, 2025

Publication Date

May 21, 2026

Inventors

Manjeet Singh
Jonathon Neal Moore
Avi Shah
Deepak Mukunthu
Nabil Naffar
Magic Johnson
Sky Chen
Reddy Yerradoddi

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “AUTOMATED AGENT TESTING AND EVALUATION USING LARGE LANGUAGE MODELS” (US-20260140855-A1). https://patentable.app/patents/US-20260140855-A1
