Using container and model information to select containers for executing models is described. A system receives a request from an application and identifies a version of a machine-learning model associated with the request. The system identifies model information associated with machine learning models corresponding to a cluster of available serving containers associated with the version of the machine-learning model. The system uses the model information to select a serving container from the cluster of available serving containers. If the machine-learning model is not loaded in the serving container, the system loads the machine-learning model in the serving container. If the machine-learning model is loaded in the serving container, the system executes, in the serving container, the machine-learning model on behalf of the request. The system responds to the request based on executing the machine-learning model on behalf of the request.
Legal claims defining the scope of protection, as filed with the USPTO.
. A system for using container and model information to select containers for executing models, the system comprising:
. The system of, comprising further instructions, which when executed, cause the one or more processors to update a data structure comprising at least one of model information associated with machine-learning models corresponding to any serving containers in any cluster of serving containers and container information associated with serving containers in any corresponding cluster of serving containers.
. The system of, comprising further instructions, which when executed, cause the one or more processors to:
. The system of, comprising further instructions, which when executed, cause the one or more processors to:
. The system of, comprising further instructions, which when executed, cause the one or more processors to:
. The system of, wherein the application is associated with a first tenant and the extra application is associated with a second tenant.
. The system of, wherein select the existing serving container from the cluster of available serving containers is based on one of leveraging a bin-packing algorithm and leveraging a consistent hashing algorithm with identifiers of each serving container associated with the version of the machine-learning model and with an identifier of the version of the machine-learning model.
. A non-transitory computer program product comprising computer-readable program code to be executed by one or more processors when retrieved from a non-transitory computer-readable medium, the program code including instructions to:
. The non-transitory computer program product of, wherein the program code comprises further instructions to update a data structure comprising at least one of model information associated with machine-learning models corresponding to any serving containers in any cluster of serving containers and container information associated with serving containers in any corresponding cluster of serving containers.
. The non-transitory computer program product of, wherein the program code comprises further instructions to:
. The non-transitory computer program product of, wherein the program code comprises further instructions to:
. The non-transitory computer program product of, wherein the program code comprises further instructions to:
. The non-transitory computer program product of, wherein select the existing serving container from the cluster of available serving containers is based on one of leveraging a bin-packing algorithm and leveraging a consistent hashing algorithm with identifiers of each serving container associated with the version of the machine-learning model and with an identifier of the version of the machine-learning model.
. A computer-implemented method for using container and model information to select containers for executing models, the computer-implemented method comprising:
. The computer-implemented method of, the computer-implemented method further comprising updating a data structure comprising at least one of model information associated with machine-learning models corresponding to any serving containers in any cluster of serving containers and container information associated with serving containers in any corresponding cluster of serving containers.
. The computer-implemented method of, the computer-implemented method further comprising:
. The computer-implemented method of, the computer-implemented method further comprising:
. The computer-implemented method of, the computer-implemented method further comprising:
. The computer-implemented method of, wherein the application is associated with a first tenant and the extra application is associated with a second tenant.
. The computer-implemented method of, wherein selecting the existing serving container from the cluster of available serving containers is based on one of leveraging a bin-packing algorithm and leveraging a consistent hashing algorithm with identifiers of each serving container associated with the version of the machine-learning model and with an identifier of the version of the machine-learning model.
Complete technical specification and implementation details from the patent document.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The subject matter discussed in the background section should not be assumed to be prior art merely because of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed inventions.
Operating-system-level virtualization, also known as containerization, refers to an operating system feature in which an operating system kernel enables the existence of multiple isolated user-space instances. Each of these instances, also referred to as software containers, partitions, or virtualization engines, can wrap an executing application in a complete environment containing everything that the application needs, such as memory, disk space, network access, and an operating system. Software containers are used by machine-learning serving infrastructures, which are becoming ubiquitous in the emerging machine-learning industry as well as in public cloud computing services. Existing machine-learning serving infrastructures typically provide machine-learning models' services through a one-to-one relationship, by dedicating each individual serving container to hosting only one corresponding machine-learning model and all its required dependencies.
An application can be a computer program or piece of software designed and written to fulfill a particular purpose of a user. A serving container can be an isolated computer program execution environment that is enabled by a computer's operating system, and which executes the main functionality of a machine-learning model. A machine-learning model can be a computer system that scientifically studies algorithms and/or statistical models to perform a specific task effectively by relying on patterns and inference instead of using explicit instructions. A routing container can be an isolated computer program execution environment that is enabled by a computer's operating system, and which executes load-balancing code to direct requests for execution by machine-learning models. A request can be an instruction to a computer to provide information or perform another function.
Automated machine-learning, feature engineering, and training enable a multi-tenancy approach to serving containers hosting machine-learning models, such that a single serving container can host hundreds of machine-learning models for multiple tenants. Within a multitenant architecture, an instance of a software application is designed to provide every tenant with a dedicated share of the instance, including its data, configuration, user management, and tenant individual properties and functionality.
A tenant can be a group of users who share a common access with specific privileges to a software architecture in which a single instance of software serves multiple such groups. A cluster can be a group of similar entities. A cluster of serving containers can be a group of duplicates of an isolated computer program execution environment that is enabled by a computer's operating system, and which executes the main functionality of a machine-learning model for all tenants.
Each routing container, or cluster of routing containers, can authenticate any requesting tenant, and then route any tenant's request for a service by machine-learning models to any serving container in a cluster of serving containers. A machine-learning serving infrastructure can include multiple clusters of serving containers, with each cluster serving a different version of any type of machine-learning model. For example, three clusters of serving containers serve versions 5.7, 5.8, and 5.9 of the type of machine-learning models that generate scores of business opportunities, and one cluster of serving containers serves a new version 1.0 of a different type of machine-learning models that generate recommendations of business opportunities. Each cluster of serving containers can use lazy caching to cache each of its machine-learning models onto all its serving containers. A version can be a form of an entity that differs in certain respects from an earlier form or other forms of the same type of entity.
A cluster of serving containers that can host all machine-learning models of the same version for all tenants is limited by the number of these machine-learning models that a single serving container can hold. Therefore, scaling to accommodate future additions of machine-learning models may become a problem when these machine-learning models exceed the capacity of any individual serving container in the cluster. Since each machine-learning model's size, ranging from hundreds of kilobytes (KB) to hundreds of megabytes (MB), initialization time, and number of requests can vary widely based on each tenant's underlying database, some clusters of serving containers may be limited by a scarcity of supporting resources, while other clusters of serving containers may have a surplus of supporting resources. The failure or the addition of any container in a cluster of serving containers can create the need to rebalance the supporting resources in the clusters of serving containers. When a machine-learning serving infrastructure adds a new cluster of serving containers for a new use case, each routing container may need to update software code to route requests to the new cluster of serving containers.
In accordance with embodiments described herein, there are provided methods and systems for using container and model information to select containers for executing models. A system receives a request from an application and identifies a version of a machine-learning model associated with the request. The system identifies model information associated with machine learning models corresponding to a cluster of available serving containers associated with the version of the machine-learning model. The system uses the model information to select a serving container from the cluster of available serving containers. If the machine-learning model is not loaded in the serving container, the system loads the machine-learning model in the serving container. If the machine-learning model is loaded in the serving container, the system executes, in the serving container, the machine-learning model on behalf of the request. The system responds to the request based on executing the machine-learning model on behalf of the request.
For example, a machine-learning serving infrastructure receives a request for scoring a business opportunity from a Customer Relationship Management (CRM) application and identifies that the request requires executing a version of an opportunity scoring machine-learning model. A routing container identifies model information about the available cache and available CPU capacity used by the scoring machine-learning models in the cluster of scoring serving containers A, B, C, D, E, F, and G. A routing manager applies a multi-dimensional bin-packing algorithm to the model information to select the scoring serving container D, which has the largest combination of available cache and available CPU capacity to execute a copy of the specific opportunity scoring machine-learning model, from the cluster of available scoring serving containers A, B, C, D, E, F, and G.
If a copy of the specific opportunity scoring machine-learning model is not already loaded in the scoring serving container D, then the scoring serving container D loads the specific opportunity scoring machine-learning model. When a copy of the specific opportunity scoring machine-learning model is verified to be loaded in the scoring serving container D, the specific opportunity scoring machine-learning model executes the requested service in the scoring serving container D, and the machine-learning serving infrastructure responds to the CRM application's request. Since not every scoring serving container in the cluster of available scoring serving containers A-G will potentially host each machine-learning model, the cluster of available scoring serving containers can scale to host more machine-learning models than any individual scoring serving container can host.
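The selection step in this example can be sketched as follows. This is a minimal illustration only, assuming that "largest combination of available cache and available CPU capacity" is scored as the sum of the two free fractions (a worst-fit heuristic for two-dimensional bin packing). The class `ContainerState`, the functions `headroom` and `select_container`, and all capacity figures are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ContainerState:
    """Hypothetical snapshot of one serving container's free resources."""
    name: str
    free_cache_mb: float
    total_cache_mb: float
    free_cpu: float
    total_cpu: float


def headroom(c: ContainerState) -> float:
    # Assumption: combine the two dimensions as a sum of normalized free fractions.
    return c.free_cache_mb / c.total_cache_mb + c.free_cpu / c.total_cpu


def select_container(containers: Iterable[ContainerState],
                     model_size_mb: float) -> Optional[ContainerState]:
    """Worst-fit choice: the container with the most combined headroom
    among those whose free cache can hold the model."""
    candidates = [c for c in containers if c.free_cache_mb >= model_size_mb]
    if not candidates:
        return None  # no container can hold the model
    return max(candidates, key=headroom)


# Illustrative numbers only: seven scoring serving containers A-G.
cluster = [
    ContainerState("A", 200, 1000, 2.0, 8.0),
    ContainerState("B", 400, 1000, 1.0, 8.0),
    ContainerState("C", 100, 1000, 4.0, 8.0),
    ContainerState("D", 700, 1000, 6.0, 8.0),
    ContainerState("E", 600, 1000, 3.0, 8.0),
    ContainerState("F", 300, 1000, 5.0, 8.0),
    ContainerState("G", 500, 1000, 2.0, 8.0),
]

chosen = select_container(cluster, model_size_mb=50)
print(chosen.name if chosen else "no container fits")
```

Other bin-packing heuristics (first-fit, best-fit) would change only the `max(...)` line; the worst-fit form is used here because it matches the example's choice of the container with the most spare capacity.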
Methods and systems are provided for using container and model information to select containers for executing models. First, systems, sequence diagrams, and hash rings for using container and model information to select containers for executing models will be described with reference to example embodiments. Then methods for using container and model information to select containers for executing models will be described.
Any of the embodiments described herein may be used alone or together with one another in any combination. The one or more implementations encompassed within this specification may also include embodiments that are only partially mentioned or alluded to or are not mentioned or alluded to at all in this summary or in the abstract. Although various embodiments may have been motivated by various deficiencies with the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies or just one deficiency that may be discussed in the specification, and some embodiments may not address any of these deficiencies.
An example of a system for using container and model information to select containers for executing models is depicted, in an embodiment. As shown, a system may illustrate a cloud computing environment in which data, applications, services, and other resources are stored and delivered through shared data centers and appear as a single point of access for the users. The system may also represent any other type of distributed computer network environment in which servers control the storage and distribution of resources and services for different client users.
In an embodiment, the system represents a cloud computing system that includes a first client, a second client, and a third client; and a machine-learning serving infrastructure that may be provided by a hosting company. Although the figure depicts the first client as a desktop computer, the second client as a laptop computer, and the third client as a mobile phone, each of the clients may be any type of computer. The clients and the machine-learning serving infrastructure communicate via a network.
The machine-learning serving infrastructure includes a gateway and four clusters of software containers, each of which includes its own software containers. One of the software containers includes the machine-learning models. The machine-learning serving infrastructure also includes a service discovery system and a routing manager.
The figure depicts the system with three clients, one machine-learning serving infrastructure, one network, one gateway, four clusters of software containers, sixteen software containers, four machine-learning models, one service discovery system, and one routing manager. However, the system may include any number of clients, any number of machine-learning serving infrastructures, any number of networks, any number of gateways, any number of clusters of software containers, any number of software containers, any number of machine-learning models, any number of service discovery systems, and any number of routing managers. The systems depicted and described below may be substantially like the clients and the components of the machine-learning serving infrastructure.
One cluster of software containers is a cluster of routing containers, another is a cluster of ranking serving containers, and another is a cluster of recommending serving containers. Since the remaining cluster of software containers is a cluster of scoring serving containers, the scoring serving container includes the models that are machine-learning models which learn to score. Each cluster of serving containers may load a version of a type of machine-learning models. For example, the cluster of scoring serving containers loads a version of a type of machine-learning models which share a library for scoring opportunities. In another example, the cluster of recommending serving containers loads a version of a type of machine-learning models which share a library for recommending opportunities to sales representatives. In yet another example, the cluster of ranking serving containers loads a version of a type of machine-learning models that share a library for ranking opportunities for each sales representative. Therefore, if a request from a tenant's application requires the services of these versions of machine-learning models, then any of the routing containers can split the request into separate sub-requests, and then route the sub-requests to their corresponding clusters of serving containers. Although these examples describe the clusters of serving containers that serve one version of the scoring type of machine-learning models, one version of the recommending type of machine-learning models, and one version of the ranking type of machine-learning models, any clusters of any serving containers may serve any number of versions of any number of any types of any machine-learning models.
Upon startup, each of the serving containers registers with the service discovery system by providing the serving container's registration information, such as the host and/or the port. When any of the serving containers becomes unavailable (intentionally or unintentionally), the service discovery system deletes the unavailable serving container's registration information. An available serving container may be referred to as an actual serving container. An available serving container can be an isolated computer program execution environment that is enabled by a computer's operating system, and which is currently able to execute the main functionality of a machine-learning model.
The service discovery system may be implemented by HashiCorp Consul, Apache ZooKeeper, Cloud Native Computing Foundation etcd, Netflix Eureka, or any similar tool that provides a service discovery and/or a service registration system. The service discovery system may not be designed to store a large amount of data, such as container information about each serving container and model information about each serving container's machine-learning models. The following is a hierarchy of data levels (1), (2), and (3) for virtual directories, files, or cnodes that can represent the container and model information for the serving containers A, B, . . . G, which the routing manager uses to determine what to do at any point of time. Container information can be data about an isolated computer program execution environment that is enabled by a computer's operating system, and which executes the main functionality of a machine-learning model. Model information can be data about a computer system that scientifically studies algorithms and/or statistical models to perform a specific task effectively by relying on patterns and inference instead of using explicit instructions.
The routing manager is deployed in a replicated fashion so that it will not become a single point of failure for the machine-learning serving infrastructure. However, only one instance of the routing manager will act as a master, while other instances of the routing manager will be in hot standby mode, ready to take over if the master instance of the routing manager fails, based on the notification coming from the service discovery system if the mlservices/cluster[i]/routing_manager/master cnode is deleted.
The routing manager makes decisions to load, rebalance, delete, distribute, and replicate machine-learning models in the serving containers, based on the following information. The data model's hierarchy level (2) in the service discovery system provides information about which serving containers are expected to host specific machine-learning models and which serving containers actually host the specified machine-learning models. The routing manager will push the list of expected machine-learning models to the model mapping hierarchy in the service discovery system. Each of the serving containers will keep its own list of actual machine-learning models, and if this list does not match the list of expected machine-learning models that a serving container receives, the serving container will load or delete any machine-learning models from the serving container's local cache as needed, and then update its own list of actual machine-learning models accordingly. Each of the routing containers will listen for and maintain each serving container's list of actual machine-learning models to determine where to route requests.
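The expected-versus-actual reconciliation described above can be sketched as a set difference. This is an illustrative sketch only: the names `reconcile` and `ServingContainer`, and the model identifiers, are hypothetical, and real loading and cache eviction are replaced by simple set updates.

```python
from typing import Iterable, Set, Tuple


def reconcile(expected: Set[str], actual: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (models to load, models to delete) so that actual matches expected."""
    return expected - actual, actual - expected


class ServingContainer:
    """Hypothetical serving container that tracks its own list of actual models."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.actual_models: Set[str] = set()

    def apply_expected(self, expected: Iterable[str]) -> Tuple[Set[str], Set[str]]:
        to_load, to_delete = reconcile(set(expected), self.actual_models)
        # Stand-ins for loading into, and evicting from, the local cache.
        self.actual_models |= to_load
        self.actual_models -= to_delete
        return to_load, to_delete


container = ServingContainer("containerA")
container.actual_models = {"model-1", "model-2"}

# The routing manager pushes a new expected list; the container converges to it.
loaded, deleted = container.apply_expected(["model-2", "model-3"])
print(sorted(loaded), sorted(deleted), sorted(container.actual_models))
```

After the call, the container would publish its updated list of actual models so that routing containers can route on it.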
The routing manager analyzes the model information about each machine-learning model to decide whether to replicate frequently used machine-learning models to additional serving containers to prevent overloading the serving containers which are hosting the frequently used machine-learning models. The data model's hierarchy level (3) in the service discovery system stores model information about each machine-learning model. The routing manager uses the data model's hierarchy level (2) to manage lists of available machine-learning models and available serving containers. Every time a machine-learning model is loaded into a serving container's local cache, the serving container registers the machine-learning model in the data model's hierarchy level (2), in a reverse lookup: from the machine-learning model to the serving container. Therefore, the routing containers can route requests for a particular machine-learning model to the serving container(s) that already loaded a copy of the particular machine-learning model into cache. A copy can be an entity that is made to be similar or identical to another entity. Each serving container also reports model information about its machine-learning models that is stored in the data model's hierarchy level (3), such as frequency of use expressed in terms of requests per second, which is model information that can be periodically (such as hourly) updated by the serving containers. The routing manager retrieves and uses those updates of model information to make decisions about replicating, rebalancing, loading, and deleting machine-learning models.
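A minimal sketch of the reverse lookup and of a replication decision based on reported requests per second follows. The class `ModelRegistry`, its method names, and the per-replica threshold are assumptions made for illustration; the cap of 5 replicas is borrowed from the sizing assumptions stated later in this description, and the actual decision logic of the routing manager is not specified here.

```python
from collections import defaultdict
from typing import Dict, List, Set

MAX_REPLICAS = 5  # assumption taken from the sizing example's maximum replication


class ModelRegistry:
    """Hypothetical in-memory stand-in for hierarchy levels (2) and (3)."""

    def __init__(self) -> None:
        # Reverse lookup: model identifier -> containers that have loaded it.
        self.containers_by_model: Dict[str, Set[str]] = defaultdict(set)
        # Model information: model identifier -> reported requests per second.
        self.requests_per_second: Dict[str, float] = {}

    def register(self, model_id: str, container: str) -> None:
        self.containers_by_model[model_id].add(container)

    def report_usage(self, model_id: str, rps: float) -> None:
        self.requests_per_second[model_id] = rps

    def containers_for(self, model_id: str) -> List[str]:
        return sorted(self.containers_by_model.get(model_id, set()))

    def models_to_replicate(self, rps_per_replica_limit: float) -> List[str]:
        """Models whose load per replica exceeds the (assumed) limit."""
        hot = []
        for model_id, rps in self.requests_per_second.items():
            replicas = len(self.containers_by_model.get(model_id, set()))
            if 0 < replicas < MAX_REPLICAS and rps / replicas > rps_per_replica_limit:
                hot.append(model_id)
        return sorted(hot)


registry = ModelRegistry()
registry.register("model-1", "containerA")
registry.register("model-2", "containerB")
registry.register("model-2", "containerC")
registry.report_usage("model-1", 120.0)  # 120 rps on a single replica
registry.report_usage("model-2", 60.0)   # 30 rps per replica

print(registry.containers_for("model-2"))
print(registry.models_to_replicate(rps_per_replica_limit=50.0))
```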
For the following numerical examples, the size of the data that the service discovery system keeps in memory is based on the following assumptions. Each cluster of serving containers includes 100 serving containers, with 5 supported cluster versions per cluster type. Each of the serving containers can hold up to 500 machine-learning models that each have a maximum replication of 5, such that every machine-learning model is replicated 5 times in the worst-case scenario.
/mls/cluster[i]/routing_manager stores 16 bytes of container information for 5 replications: 16 B*5=80 B of container information
/mls/cluster[i]/scoring/version/model_mapping/containerN (on scoring container) stores 16 bytes of container information for 2 states (expected, actual) per model for 500 models per serving container: 16 B*2*500=16 KB of container information per serving container
/mls/cluster[i]/scoring/version/model_mapping/ stores 16 KB of container information per serving container for 100 serving containers per cluster version: 16 KB*100=1.6 MB of container information per cluster version
/mls/cluster[i]/scoring/ stores 1.6 MB of container information per cluster version for 5 cluster versions per cluster type: 1.6 MB*5=8 MB of container information per cluster type
/mls stores 8 MB of container information per cluster type for 3 cluster types (scoring, recommending, and ranking): 8 MB*3=24 MB of container information
/mls/cluster[i]/scoring/version/container_state/containerN (on scoring container) stores 21 bytes (16 bytes plus 4 int counter bytes plus 1 delimiter byte) of model information for 500 models per serving container: 21 B*500=10.5 KB of model information per serving container
/mls/cluster[i]/scoring/version/container_state stores 10.5 KB of model information per serving container for 100 serving containers per cluster version: 10.5 KB*100=1.05 MB of model information per cluster version
/mls/cluster[i]/scoring/ stores 1.05 MB of model information per cluster version for 5 cluster versions per cluster type: 1.05 MB*5=5.25 MB of model information per cluster type
/mls stores 5.25 MB of model information per cluster type for 3 cluster types (scoring, recommending, and ranking): 5.25 MB*3=15.75 MB of model information
The 24 MB of container information plus the 15.75 MB of model information equals a total of 39.75 MB of container information and model information stored by the routing manager. The 39.75 MB of container information and model information stored by the routing manager is stored for 100 serving containers per cluster version, for 5 cluster versions per cluster type, and for 3 cluster types: 39.75 MB/(100*5*3)=26.5 KB, so the routing manager stores 26.5 KB for each serving container.
Each of the routing containers needs to store information for actual models only, instead of information for actual models and information for the expected models. Therefore, the size of storage for each of the routing containers is 8 KB (instead of 16 KB) of container information per serving container for 100 serving containers per cluster version for 5 cluster versions per cluster type for 3 cluster types: 8 KB*100*5*3=12 MB (instead of 24 MB) of container information and 15.75 MB of model information: 12 MB+15.75 MB=27.75 MB of container information and model information.
Internal representation may be further optimized for storing the information in each of the routing containers when storing only actual models and metrics, which is 4 bytes of model metrics for 500 models per serving container for 100 serving containers per cluster version for 5 cluster versions per cluster type for 3 cluster types: 4 B*500*100*5*3=3 MB of model metrics. This further optimization results in 3 MB of actual models and metrics and 12 MB of actual container information: 3 MB+12 MB=15 MB of actual models and metrics and actual container information for each of the routing containers.
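The sizing arithmetic above can be reproduced with a short script. It assumes decimal units (1 KB = 1,000 bytes and 1 MB = 1,000,000 bytes), which is what makes the stated figures come out exactly; the variable names are illustrative.

```python
KB = 1_000        # decimal units, as the stated figures imply
MB = 1_000_000

CONTAINERS_PER_VERSION = 100
VERSIONS_PER_TYPE = 5
CLUSTER_TYPES = 3
MODELS_PER_CONTAINER = 500
TOTAL_CONTAINERS = CONTAINERS_PER_VERSION * VERSIONS_PER_TYPE * CLUSTER_TYPES  # 1,500

# Routing manager: expected + actual mapping entries, plus per-model state.
mapping_per_container = 16 * 2 * MODELS_PER_CONTAINER   # bytes of container information
state_per_container = 21 * MODELS_PER_CONTAINER         # bytes of model information
manager_container_info = mapping_per_container * TOTAL_CONTAINERS
manager_model_info = state_per_container * TOTAL_CONTAINERS
manager_total = manager_container_info + manager_model_info
manager_per_container = manager_total // TOTAL_CONTAINERS

# Routing container: actual mapping entries only.
routing_container_info = 16 * MODELS_PER_CONTAINER * TOTAL_CONTAINERS
routing_total = routing_container_info + manager_model_info

# Optimized routing container: 4 bytes of metrics per model instead of 21.
routing_metrics = 4 * MODELS_PER_CONTAINER * TOTAL_CONTAINERS
routing_optimized_total = routing_container_info + routing_metrics

print(f"routing manager total: {manager_total / MB} MB")
print(f"routing manager per serving container: {manager_per_container / KB} KB")
print(f"routing container total: {routing_total / MB} MB")
print(f"routing container, optimized: {routing_optimized_total / MB} MB")
```

The per-container figure is simply the sum of the two per-container entries, 16 KB of mapping information plus 10.5 KB of model information.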
When the machine-learning serving infrastructure adds any new cluster of serving containers, any of the new serving containers registers with the service discovery system according to the data model (container_state), without any need to update the code in any of the routing containers. Since no machine-learning models are yet loaded on this new cluster, the routing containers will not do any loading or rebalancing of machine-learning models on this new cluster. When the machine-learning serving infrastructure discontinues or terminates any old cluster of serving containers, the metadata for any of the old cluster's serving containers is removed from the service discovery system according to the data model (container_state and model_mapping). This removal of metadata will not initiate a rebalancing of machine-learning models because there are no machine-learning models that need to be moved or loaded again.
When the machine-learning serving infrastructure adds any new serving containers to any of the existing clusters of serving containers, each new serving container updates the service discovery system according to the data model (container_state). The watcher for the routing manager will notify the routing manager of any new serving container. Based on a model loading strategy, the routing manager may respond to the notification by rebalancing the serving containers.
When any of the existing serving containers in any of the existing clusters of serving containers dies unexpectedly, or gracefully, the serving container's heartbeat to the service discovery system fails. Then the machine-learning serving infrastructure removes the ephemeral virtual directory, file, or cnode from the service discovery system, which updates the container_state in the service discovery system. The watcher for the routing manager will notify the routing manager of any unavailable serving container. Based on a model loading strategy, the routing manager may respond to the notification by rebalancing the serving containers.
Each of the routing containers has a watcher that watches the service discovery system for changes in the available serving containers in the clusters of serving containers, and then provides notifications of any changes in the information about the serving containers to its routing container to update a map, or any similar data structure, of the available serving containers asynchronously. The notified routing containers will use their updated maps of the available serving containers to route new requests, which are received via the gateway, to the available serving containers. A data structure can be the organization, management, and storage format of information which enables efficient access and modification.
If a new version of a machine-learning model is available, and a request from a tenant's application identifies the new version of the machine-learning model, then the routing manager may identify any of the available serving containers in the clusters of serving containers that may have loaded a copy of the old version of the machine-learning model. The routing manager will update the expected_models information in the service discovery system, which may be used for loading the new version of the machine-learning model.
If the old version of the machine-learning model is to be discontinued, the routing manager will change the routing metadata to route requests to the new version of the machine-learning model, such that the old version of the machine-learning model expires in cache. The routing manager can use an Application Programming Interface (API) to unload the old model.
If no mapping exists for a requested machine-learning model in the service discovery system, then any of the routing containers can route the request for the specific machine-learning model as the first request for that model, by selecting one of the serving containers in the corresponding cluster for a version of the specific machine-learning model. The serving container may be selected randomly, by leveraging a bin-packing algorithm, or by leveraging a consistent hashing algorithm. If too many requests are received for a specific machine-learning model within the same time period, the service discovery system can implement a lock to ensure that the routing containers route only one request as the first request for the specific machine-learning model.
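The first-request lock described above can be sketched as follows, assuming random selection among the cluster's serving containers. The class `FirstRequestRegistry` and its `route` method are hypothetical names, and a process-local `threading.Lock` stands in for whatever distributed lock the service discovery system would provide.

```python
import random
import threading
from typing import Dict, List, Optional, Tuple


class FirstRequestRegistry:
    """Stand-in for the model-to-container mapping guarded by a lock (hypothetical)."""

    def __init__(self, containers: List[str], rng: Optional[random.Random] = None) -> None:
        if not containers:
            raise ValueError("at least one serving container is required")
        self._containers = list(containers)
        self._mapping: Dict[str, str] = {}
        self._lock = threading.Lock()
        self._rng = rng or random.Random()

    def route(self, model_id: str) -> Tuple[str, bool]:
        """Return (container, is_first_request) for the given model."""
        with self._lock:
            if model_id in self._mapping:
                return self._mapping[model_id], False
            # No mapping yet: this caller's request becomes the first request.
            chosen = self._rng.choice(self._containers)
            self._mapping[model_id] = chosen
            return chosen, True


registry = FirstRequestRegistry(["serving-1", "serving-2", "serving-3"])
container, is_first = registry.route("model-m")       # is_first is True
same_container, is_first_again = registry.route("model-m")  # is_first_again is False
```

Concurrent callers for the same model all receive the same container, and only one of them is told that its request is the first request.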
If a mapping exists for a requested machine-learning model in the service discovery system, then any of the routing containers can use a model loading strategy for loading machine-learning models into serving containers to determine the serving container to which the request will be routed. The model loading strategy may be based on leveraging various bin-packing algorithms or leveraging consistent hashing with bounded loads. After receiving a request for executing a machine-learning model, a serving container loads and executes the machine-learning model, and reports information about the machine-learning model to the service discovery system, which stores this model information in the data model's hierarchy level (3) and updates all the routing containers with this model information.
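One of the strategies named above, consistent hashing with bounded loads, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the described system's strategy: the class `BoundedLoadRing`, the `balance_factor` parameter, the number of virtual nodes, and the use of a per-container model count as the load are all assumptions.

```python
import bisect
import hashlib
import math
from typing import Dict, List, Tuple


def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class BoundedLoadRing:
    """Consistent hashing where no container exceeds balance_factor times the mean load."""

    def __init__(self, containers: List[str], balance_factor: float = 1.25,
                 replicas: int = 64) -> None:
        if not containers:
            raise ValueError("at least one serving container is required")
        # Each container owns several points (virtual nodes) on the ring.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{c}#{i}"), c) for c in containers for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]
        self.loads: Dict[str, int] = {c: 0 for c in containers}
        self.balance_factor = balance_factor

    def assign(self, model_id: str) -> str:
        # Capacity is computed as if the new model were already placed.
        total = sum(self.loads.values()) + 1
        capacity = math.ceil(self.balance_factor * total / len(self.loads))
        start = bisect.bisect_right(self._keys, _hash(model_id))
        # Walk clockwise from the model's position, skipping full containers.
        for step in range(len(self._ring)):
            container = self._ring[(start + step) % len(self._ring)][1]
            if self.loads[container] < capacity:
                self.loads[container] += 1
                return container
        raise RuntimeError("no container below capacity")  # not reachable


ring = BoundedLoadRing(["serving-1", "serving-2", "serving-3", "serving-4"])
for i in range(100):
    ring.assign(f"model-{i}")
```

The bound keeps any single serving container from accumulating a disproportionate share of models, at the cost of occasionally placing a model on a container other than its nearest ring successor.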
If the model cache has expired in a serving container, a cache removal listener notifies the routing manager, which updates expected_models and actual_models in the model information of the serving container in the service discovery system. Based on the model loading strategy, the routing manager may respond to the notification by rebalancing the models in the serving containers.
The following describes a sequence diagram for a load API. Given an identifier of a machine-learning model M, the routing manager executes a load API (PUT /v1.0/model/{modelId}), which is responsible for identifying the serving container S in which to load the machine-learning model M, at event. Because loading machine-learning models has high latency, this is an asynchronous call. The output of this request is the identification of the serving container S in which to load the machine-learning model M.
Then the routing manager provides an updated list of the expected machine-learning models, including a copy of the machine-learning model M to be loaded, for each serving container to the service discovery system, at event. The service discovery system provides the updated list of the expected machine-learning models, including a copy of the machine-learning model M to be loaded, for each serving container, to each serving container, including the identified serving container S, at event. The identified serving container S compares its updated list of the expected machine-learning models, including a copy of the machine-learning model M to be loaded, against its list of actual machine-learning models, and identifies that the difference between the lists is a copy of the machine-learning model M to be loaded, at event. The identified serving container S loads the machine-learning model M, at event.
Next, the identified serving container S provides its updated list of actual machine-learning models, including a copy of the loaded machine-learning model M, to the service discovery system, at event. The service discovery system sends the updated list of actual models loaded in serving containers, including a copy of the loaded machine-learning model M, to the routing manager, at event. The routing manager determines that the machine-learning models need rebalancing, at event. The routing manager provides a rebalanced list of the expected machine-learning models, including a copy of the loaded machine-learning model M, for each serving container to the service discovery system, at event. The service discovery system provides the rebalanced list of the expected machine-learning models, including a copy of the loaded machine-learning model M, for each serving container, to each serving container, including the identified serving container S, at event. The serving containers rebalance their models, at event.
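The comparison a serving container performs between its expected and actual model lists can be sketched as a set difference. This is a minimal sketch: the function `reconcile`, the class `ServingContainer`, and its `on_expected_models` method are hypothetical names, and actual model loading is replaced by simple set updates.

```python
from typing import Iterable, List, Set, Tuple


def reconcile(expected: Iterable[str], actual: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (models_to_load, models_to_unload) from the two lists."""
    expected_set, actual_set = set(expected), set(actual)
    to_load = sorted(expected_set - actual_set)
    to_unload = sorted(actual_set - expected_set)
    return to_load, to_unload


class ServingContainer:
    """Tracks which models are actually loaded (hypothetical stand-in)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.actual: Set[str] = set()

    def on_expected_models(self, expected: Iterable[str]) -> List[str]:
        """Apply an updated expected list and return the new actual list."""
        to_load, to_unload = reconcile(expected, self.actual)
        for model_id in to_load:
            self.actual.add(model_id)      # placeholder for loading the model
        for model_id in to_unload:
            self.actual.discard(model_id)  # placeholder for unloading the model
        # The returned list is what would be reported to service discovery.
        return sorted(self.actual)


container_s = ServingContainer("S")
container_s.on_expected_models(["model-a"])
reported = container_s.on_expected_models(["model-a", "model-m"])
```

Driving both loading and unloading from the same difference means the routing manager only ever has to publish the desired state; each serving container works out its own actions.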
The following describes a sequence diagram for an unload API. Given an identifier of a machine-learning model M, the routing manager executes an unload API (DELETE /v1.0/model/{modelId}), which is responsible for identifying the serving container S in which a copy of the machine-learning model M is currently loaded, at event. The output of this request is the identification of the serving container S that should unload the machine-learning model M. The routing manager provides an updated list of the expected machine-learning models, excluding the copy of the machine-learning model M to be unloaded, for each serving container to the service discovery system, at event.
Then the service discovery system provides the updated list of the expected machine-learning models, excluding the copy of the machine-learning model M to be unloaded, for each serving container, to each serving container, including the identified serving container S, at event. The identified serving container S compares its updated list of the expected machine-learning models, excluding the copy of the machine-learning model M to be unloaded, against its list of actual machine-learning models, and identifies that the difference between the lists is a copy of the machine-learning model M to be unloaded, at event. The identified serving container S unloads the machine-learning model M, at event. The identified serving container S provides its updated list of actual machine-learning models, which reflects the unloaded copy of machine-learning model M, to the service discovery system, at event.
Next, the service discovery system sends the updated list of actual models loaded in serving containers, which reflects the unloaded copy of machine-learning model M, to the routing manager, at event. The routing manager determines that the machine-learning models need rebalancing, at event. The routing manager provides a rebalanced list of the expected machine-learning models, which reflects the unloaded copy of machine-learning model M, for each serving container to the service discovery system, at event. The service discovery system provides the rebalanced list of the expected machine-learning models, which excludes the unloaded copy of machine-learning model M, for each serving container, to each serving container, including the identified serving container S, at event. The serving containers rebalance their machine-learning models, at event.
When a new serving container is added to any of the clusters, or when the machine-learning models are rebalanced, the machine-learning serving infrastructure needs to move some machine-learning models from one serving container to another serving container. Therefore, the machine-learning serving infrastructure needs to load the machine-learning model in the new serving container and only then unload the machine-learning model from the old serving container, so that the machine-learning model remains available throughout the move. The following describes a sequence diagram for moving a machine-learning model between serving containers. In the following example, the machine-learning model M needs to be moved from the source serving container S to the destination serving container T.
The routing manager adds the machine-learning model M to the list of expected machine-learning models for the serving container T, adds a lock under the machine-learning model M/serving container S/serving container T path, removes the machine-learning model M from the list of expected machine-learning models for the serving container S, and provides an updated list of the expected machine-learning models, which reflects the machine-learning model M to be moved, for each serving container to the service discovery system, at event. The service discovery system provides the updated lists of the expected machine-learning models, which reflect the machine-learning model M to be moved, for each serving container, to each serving container, including the source serving container S, at event, and the destination serving container T, at event. The source serving container S compares its updated list of expected machine-learning models, excluding the machine-learning model M to be unloaded, against its list of actual machine-learning models, and identifies that the difference between the lists is a copy of the machine-learning model M to be unloaded, at event. The destination serving container T compares its updated list of expected machine-learning models, including a copy of the machine-learning model M to be loaded, against its list of actual machine-learning models, and identifies that the difference between the lists is a copy of the machine-learning model M to be loaded, at event.
Then the source serving container S checks the lock under the machine-learning model M/serving container S/serving container T path, and waits until the lock is removed before unloading the machine-learning model M, at event. The destination serving container T loads the machine-learning model M, at event. The destination serving container T provides its updated list of actual machine-learning models, including a copy of the loaded machine-learning model M, to the service discovery system, at event. The routing manager receives a notification from its watcher on the service discovery system, which identifies the destination serving container T's updated list of actual machine-learning models that includes a copy of the loaded machine-learning model M, removes the lock under the machine-learning model M/serving container S/serving container T path, and reports removal of the lock to the service discovery system, at event. The service discovery system reports removal of the lock to the source serving container S, at event.
Next, the source serving container S unloads the machine-learning model M, at event. The source serving container S provides its updated list of actual machine-learning models, which reflects the unloaded copy of machine-learning model M, to the service discovery system, at event. The service discovery system sends the updated list of actual models loaded in serving containers, which reflects the moved copy of machine-learning model M, to the routing manager, at event. The routing manager determines that the machine-learning models need rebalancing, at event. The routing manager provides a rebalanced list of the expected machine-learning models, which reflects the moved machine-learning model M, for each serving container to the service discovery system, at event. The service discovery system provides the rebalanced list of the expected machine-learning models, which reflects the moved machine-learning model M, for each serving container, to each serving container, including the source serving container S, at event, and the destination serving container T, at event.
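The lock-guarded move described above can be sketched as a small simulation in which the source keeps serving the model until the destination reports it as loaded. All names here (`Discovery`, `begin_move`, `container_sync`, `release_locks`) are hypothetical, and the service discovery system is reduced to three in-memory collections.

```python
from typing import Dict, Set, Tuple


class Discovery:
    """In-memory stand-in for the state held in service discovery (hypothetical)."""

    def __init__(self) -> None:
        self.expected: Dict[str, Set[str]] = {}
        self.actual: Dict[str, Set[str]] = {}
        self.locks: Set[Tuple[str, str, str]] = set()  # (model, source, destination)


def begin_move(d: Discovery, model: str, src: str, dst: str) -> None:
    """Routing manager step: expect the model on dst, lock, stop expecting it on src."""
    d.expected[dst].add(model)
    d.locks.add((model, src, dst))
    d.expected[src].discard(model)


def container_sync(d: Discovery, name: str) -> None:
    """Serving container step: load missing models, unload extras unless locked."""
    for model in d.expected[name] - d.actual[name]:
        d.actual[name].add(model)  # placeholder for loading the model
    for model in d.actual[name] - d.expected[name]:
        locked = any(m == model and s == name for m, s, _ in d.locks)
        if not locked:
            d.actual[name].discard(model)  # placeholder for unloading the model


def release_locks(d: Discovery) -> None:
    """Routing manager step: drop a lock once the destination reports the model."""
    for model, src, dst in list(d.locks):
        if model in d.actual[dst]:
            d.locks.discard((model, src, dst))


state = Discovery()
state.expected = {"S": {"M"}, "T": set()}
state.actual = {"S": {"M"}, "T": set()}

begin_move(state, "M", "S", "T")
container_sync(state, "S")  # S still holds M because the lock is present
container_sync(state, "T")  # T loads M
release_locks(state)        # T reports M as loaded, so the lock is removed
container_sync(state, "S")  # S now unloads M
```

At every point in this sequence at least one serving container holds the model, which is the property the lock exists to protect.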
The machine-learning serving infrastructure has n different serving containers with resource capacities c_1, c_2, ..., c_n and m different machine-learning models with demands d_1, d_2, ..., d_m on these resource capacities. Resource capacities are values specific to the n serving containers, such as memory, Central Processing Units (CPUs), and responses per second. The routing manager translates these serving containers' resource capacities and these machine-learning models' demands into a consistent mapping of each machine-learning model to a serving container, so that the machine-learning serving infrastructure can supply all these machine-learning models' demands based on the serving containers' resource capacities. The routing manager manages each machine-learning model's lifecycle by considering the demand for each machine-learning model and the resource capacities of the serving containers to which the machine-learning models are mapped.
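The mapping of model demands onto container capacities can be illustrated with a first-fit-decreasing bin-packing heuristic. This is only one possible bin-packing approach, chosen here for brevity, and it assumes a single resource dimension (for example, memory), whereas the text lists several. The function name and the example numbers are hypothetical.

```python
from typing import Dict


def first_fit_decreasing(capacities: Dict[str, float],
                         demands: Dict[str, float]) -> Dict[str, str]:
    """Map each model to a container without exceeding any container's capacity.

    Models are placed largest-demand first, each into the first container
    (in the order given) that still has room.
    """
    remaining = dict(capacities)
    placement: Dict[str, str] = {}
    ordered = sorted(demands.items(), key=lambda item: item[1], reverse=True)
    for model, demand in ordered:
        for container, free in remaining.items():
            if free >= demand:
                remaining[container] = free - demand
                placement[model] = container
                break
        else:
            # No container has room; a real system might add a container here.
            raise ValueError(f"no serving container can fit model {model!r}")
    return placement


# Hypothetical capacities c_1, c_2 and demands d_1..d_4 in the same unit.
placement = first_fit_decreasing(
    {"serving-1": 10, "serving-2": 8},
    {"model-a": 6, "model-b": 5, "model-c": 4, "model-d": 3},
)
```

In this example the two largest models land on different containers and the smaller ones fill the remaining space, so every demand is met within the stated capacities.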
April 21, 2026