
US-20250322269-A1

Systems and Methods for Machine Learning-Based Site-Specific Threat Modeling and Threat Detection

Published: October 16, 2025

Assignee: not available in USPTO data we have

Inventors: not available in USPTO data we have

Technical Abstract

Systems and methods for implementing a threat model that classifies contextual events as threats. The method can include: accessing a threat model; identifying a set of contextual events, wherein each contextual event comprises a set of semantic primitives predicted from a plurality of sensor streams; and determining a threat level for each contextual event based on threat probabilities.

Patent Claims

Legal claims defining the scope of protection, as filed with the USPTO.

. A method comprising: with a surveillance system that is communicatively coupled to a set of sensors:

. The method of, further comprising updating the set of threat probabilities of the threat model based on determined frequencies for the set of contextual events.

. The method of, wherein accessing the threat model comprises accessing a threat model template.

. The method of, wherein each threat probability of the set of threat probabilities is learned based on historical contextual events.

. The method of, wherein the threat model is generated based on normal and anomalous historical contextual events.

. The method of, wherein the contextual event is identified in real-time.

. The method of, further comprising classifying each contextual event of the set as a threat or non-threat based on the threat level.

. The method of, wherein the threat model and sensor data are stored in a local system, wherein identifying the set of contextual events and determining a threat level occurs in a cloud computing system.

. The method of, further comprising determining a threat response action based on the threat level.

. The method of, wherein a threat response action comprises at least one of: sending an alert, sounding an alarm, dispatching security, or triggering a lockdown.

. The method of, wherein identifying a contextual event comprises:

. A system comprising:

. The method of, wherein the processing system is further configured to: update the threat model based on a determined frequency of each contextual event of the set of contextual events.

. The method of, wherein the processing system is further configured to: templatize the updated threat model and store the template for deployment at a new site.

. The method of, wherein the processing system is further configured to: detect occurrence of a threat response from the plurality of sensor streams and update the threat model based on the threat response event.

. The method of, wherein the system is further configured to construct a site activity frequency mapping based on the set of contextual events.

. The method of, wherein the threat model is updated based on the site activity frequency mapping.

. The method of, wherein the site activity frequency mapping comprises a multi-dimensional matrix.

. The method of, wherein each contextual event of the set of contextual events is associated with at least one of: a site location, a time, a known individual, or an unknown individual.

. The method of, wherein the processing system comprises local compute and remote compute; wherein the threat model is stored locally; and wherein the set of contextual events is identified remotely.
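The claims above refer to a site activity frequency mapping that comprises a multi-dimensional matrix, and to threat probabilities that are updated from event frequencies. The following is a minimal sketch of one way such a mapping could look, not the claimed implementation. The axis values (`LOCATIONS`, `IDENTITIES`, 24 hour buckets), the function names, and the rule that rarer cells receive a higher threat probability are all assumptions made for illustration.

```python
# Hypothetical axes; the claims only say events may be associated with a
# site location, a time, and a known or unknown individual.
LOCATIONS = ["lobby", "parking lot", "server room"]
HOUR_BUCKETS = 24
IDENTITIES = ["known", "unknown"]


def build_frequency_matrix(events):
    """Count (location, hour, identity) events into a 3-dimensional matrix."""
    matrix = [
        [[0 for _ in IDENTITIES] for _ in range(HOUR_BUCKETS)]
        for _ in LOCATIONS
    ]
    for location, hour, identity in events:
        matrix[LOCATIONS.index(location)][hour][IDENTITIES.index(identity)] += 1
    return matrix


def threat_probability(matrix, location, hour, identity, smoothing=1.0):
    """Assumed rule: probability shrinks as the observed frequency grows.

    A never-seen combination scores 1.0; a combination seen 9 times scores 0.1.
    """
    count = matrix[LOCATIONS.index(location)][hour][IDENTITIES.index(identity)]
    return smoothing / (count + smoothing)


# Nine routine lobby visits and one unusual server room visit.
events = [("lobby", 9, "known")] * 9 + [("server room", 3, "unknown")]
frequency_matrix = build_frequency_matrix(events)
```

Calling `threat_probability(frequency_matrix, "server room", 3, "unknown")` gives a higher value than the routine lobby cell, which is the behaviour a frequency-driven update would need; a deployed model would presumably learn this relationship rather than hard-code it.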

Detailed Description

Complete technical specification and implementation details from the patent document.

This application is a continuation of U.S. application Ser. No. 17/515,115, filed 29 Oct. 2021, which is a continuation of U.S. application Ser. No. 16/695,538, filed 26 Nov. 2019, which claims the benefit of U.S. Provisional Application No. 62/784,227, filed 21 Dec. 2018, each of which is incorporated in its entirety by this reference.

This invention relates to the sensor fusion field, and more specifically to new and useful machine learning-based site monitoring and security in the access security, machine learning, and sensor fusion fields.

Prevalent security and surveillance systems typically enable the monitoring and capture of video images of an area of interest to the entities implementing the systems. Over the course of some time period, such as a day, a week, or a month, such security and surveillance systems may capture significant amounts of video data, which is typically too great for any individual human, or multiple humans, to meaningfully review for the purposes of detecting events of interest including security events. Often such review is merely reactive to a security event or other event that has occurred in the past. While in some instances this type of review may be useful in resolving or addressing less time-sensitive security events or the like, this type of review is merely reactive and the time lost reviewing the video image data can adversely impact obtaining a desired result for time-sensitive security events by the entity implementing the security system.

However, even with real-time (or near real-time) monitoring and review of video images streaming from several surveillance video cameras of the security and surveillance systems, it can be extremely difficult for humans to detect events of interest. Because in most circumstances defined spaces that are under surveillance via the security and surveillance systems incorporate multiple video cameras, there may be times when there are more video feeds than there are security personnel available to monitor and review them. Thus, in a real-time monitoring and surveilling situation, many events of interest, including security events, may be missed, thereby compromising the security and/or safety of the defined space(s) and/or the subjects (e.g., persons, protected products, etc.) within the defined space.

Thus, there is a need to create a new and useful event detection system. The embodiments of the present application provide such new and useful systems and methods.

The following description of preferred embodiments of the present application is not intended to limit the inventions to these preferred embodiments, but rather to enable any person skilled in the art to make and use these inventions.

A system includes at least one of: sensor data sources (e.g., image data sources), a sensor data comprehension system, sensor data storage, contextual event storage, a contextual data storage, a contextual event detection system, a user interface system, a control system, and a notification system.

In some variations, the sensor data comprehension system is similar to a comprehension system as described in U.S. patent application Ser. No. 16/137,782, filed 21 Sep. 2018, which is incorporated herein in its entirety by this reference. However, the sensor data comprehension system can be any suitable type of comprehension system that functions to perform processes as described herein.

In some variations, the comprehension system includes a context feature extractor that functions to extract, from the sensor data, features associated with contextual factors.

In some variations, contextual factors include factors that identify time of the day, a region (e.g., “stairwell”, “parking lot”, “office”, “kitchen”, etc.) of a site, known individuals, unknown individuals, and the like.

In some variations, the comprehension system includes a threat feature extractor that functions to extract, from the sensor data, threat and non-threat features.

In some variations, the comprehension system includes at least one of: a high-level feature detection model, a multi-feature detection machine learning ensemble, a condenser, and a scene story generator. In some implementations, the condenser includes a mutual feature data exploitation engine. In some implementations, the story generator includes a trained language machine learning model.

In some variations, the comprehension system functions to collect sensor data (in any form) (e.g., image data) from the one or more sensor data sources within the system. In some variations, the comprehension system functions to implement a combined machine learning model core (e.g., a multi-feature detection machine learning ensemble) to detect relevant features within a scene defined by the collected sensor data. In some variations, the comprehension system uses a condenser to form a composite of a plurality of feature outputs (e.g., f_1, f_2, f_3 . . . f_n) of the multiple sub-models of the combined model core. In some variations, from the composite, the system, using a mutual feature data exploitation engine, functions to extract mutual/relationship data from overlapping segments of the composite and derive mutual/relationship vectors as output. In some variations, the comprehension system passes the plurality of feature data outputs and the mutual/relationship vectors to a story generator that functions to use a trained machine learning model to generate one or more event descriptions for the sensor data.
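The data flow described above (sub-model outputs, then a condenser, then mutual/relationship extraction, then a story generator) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the described system: the `Feature` type, the bounding-box representation of a "segment", and the template-based `generate_story` function (standing in for a trained language model) are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Feature:
    source: str   # name of the sub-model that produced the feature
    label: str    # e.g. "person", "door"
    box: tuple    # (x1, y1, x2, y2) image region; an assumed representation


def condense(feature_outputs):
    """Form a composite by flattening the per-sub-model outputs f_1..f_n."""
    return [feature for output in feature_outputs for feature in output]


def boxes_overlap(a, b):
    """True when two axis-aligned boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def mutual_relationships(composite):
    """Pair features whose regions overlap (a stand-in for relationship vectors)."""
    return [
        (a, b)
        for a, b in combinations(composite, 2)
        if boxes_overlap(a.box, b.box)
    ]


def generate_story(composite, relationships):
    """Template placeholder for the trained language model."""
    if not relationships:
        return "Scene contains: " + ", ".join(f.label for f in composite)
    return "; ".join(f"{a.label} near {b.label}" for a, b in relationships)


object_output = [Feature("object", "person", (0, 0, 10, 10))]
scene_output = [
    Feature("scene", "door", (5, 5, 20, 20)),
    Feature("scene", "fence", (50, 50, 60, 60)),
]
composite = condense([object_output, scene_output])
story = generate_story(composite, mutual_relationships(composite))
```

Here only the person and the door overlap, so the toy story is "person near door"; the fence contributes no relationship.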

In some variations, the sensor data processed through the system includes live sensor data relating to events and/or circumstances captured in real-time and/or near real-time (e.g., within 0-5 minutes or the like) by one or more sensor data sources (e.g., live-feed video cameras). Correspondingly, in some variations, the system functions to digest the live sensor data in real-time or near real-time to generate timely event or circumstance intelligence.

In some variations, the one or more sensor data sources function to capture sensor data of one or more areas of interest. In some variations, the system functions to collect image data. In some variations, the sensor data sources include a plurality of types of sensor data sources (e.g., image sensors, heat sensors, temperature sensors, motion sensors, etc.), each functioning to generate a different type of data. In some variations, the system functions to capture any type or kind of observable data of an area or scene of interest (e.g., by using one or more sensor data sources) including, but not limited to, thermal or heat data, acoustical data, motion and/or vibration data, object depth data, and/or any suitable data that can be sensed. The area of interest may be a fixed area in which a field of sensing (e.g., field of vision for an image capturing sensor) of a sensor data source may be fixed. Additionally, or alternatively, the area of interest may be dynamic such that a field of sensing of a sensor data source may change continuously or periodically to capture different areas of interest (e.g., a rotating video camera). Thus, an area of interest may be dependent on a position and corresponding field of sensing of a sensor data source. In some variations, the sensor data sources preferably include an image capturing system comprising one or more image capturing devices. In some variations, the image capturing devices include at least one of: video cameras, still image cameras, satellites, scanners, frame grabbers, and the like that function to capture (in real-time) at least one of analog video signals, digital video signals, analog still image signals, digital still image signals, and the like. In some variations, digital images may be captured or produced by other sensors (in addition to light-sensitive cameras) including, but not limited to, range sensors, tomography devices, radar, ultra-sonic cameras, and the like.

In some variations, the one or more sensor data sources function to capture sensor data and transmit the sensor data via a communication network (e.g., the Internet, LAN, WAN, GAN, short-range communication systems, Bluetooth, etc.) to the system. In some variations, the system functions to access or pull the captured data from the one or more sensor data sources. In some variations, at least one of the sensor data sources is in direct or operable communication with the system, such that live sensor data captured at the one or more sensor data sources are fed directly into the one or more machine learning classifiers and feature detection models of the system. Thus, in such variations, the live sensor data may not be stored (in a permanent or semi-permanent storage device) in advance of transmitting the live sensor data to the one or more processing modules and/or sub-systems of the system. A technical advantage of such an implementation is real-time or near real-time processing of an event or circumstance rather than post-event processing, which may delay a suitable and timely response to an urgent occurrence.

In some embodiments, one or more parts or sub-systems of the system may be implemented via an on-premise system or device, possibly in combination with a cloud computing component of the system. In such embodiments, the one or more sensor data sources may function to both capture live sensor data in real-time and feed the live sensor data to the on-premise system for generating intelligence data from the live sensor data. In such variations, the on-premise system may include one or more hardware computing servers executing one or more software modules for implementing the one or more sub-systems, processes, and methods of the system.

In some variations, the one or more sensor data sources are configured to optimize scene coverage, thereby minimizing blind spots in an observed area or area of interest, and additionally to optimize overlapping coverage areas for potential areas of significant interest (e.g., a highly secure area, etc.). In some variations, the system functions to process together overlapping sensor data from multiple sensor data sources recording sensor data of a substantially same area (e.g., overlapping coverage areas) of interest. The sensor data in these areas of interest having overlapping coverage may enable the system to generate increased quality event description data for a scene because of the multiple vantage points within the overlapping sensor data that may function to enable an increased or improved analysis of an event or circumstance using the additional detail and/or variances in data collected from the multiple sensor data sources.

In some variations, the system functions to access additional event data sources including additional sensor data sources, news feed data sources, communication data sources, mobile communication device data (from users operating in an area of interest, etc.) and the like. The additional event data may be ingested by the system and used to augment the event description data for a scene.

In some variations, the comprehension system functions to analyze and/or process sensor data input preferably originating from the one or more sensor data sources.

In some variations, the high-level feature detection model is a high-level deep learning model (e.g., a convolutional neural network, etc.) that functions to extract high-level features from the sensor data accessed by the comprehension system. In some variations, feature extraction performed by the high-level deep learning model (e.g., a convolutional neural network, etc.) includes at least one of: edge/border detection, and other more abstract features with higher semantic information. In some variations, the high-level deep learning model functions to identify and extract coarse semantic information from the sensor data input from the one or more sensor data sources. In some variations, the high-level deep learning model implements an artificial neural network and functions to extract broad scene level data (and may optionally generate descriptive metadata tags, such as outdoor, street, traffic, raining, and the like for each of the distinctly identified features).

In some variations, the multi-feature detection machine learning ensemble includes a plurality of sub-machine learning models, each functioning to perform a distinct feature detection and/or classification of features. In some variations, the plurality of sub-machine learning models functions to perform distinct feature detection tasks that include, but are not limited to: pose estimation, object detection, facial recognition, scene segmentation, object attribute detection, activity recognition, identification of an object (e.g., person ID, vehicle ID, fingerprint ID, etc.), motion analysis (e.g., tracking, optical flow, etc.), and the like. In some variations, at least one of the sub-models uses the high-level features extracted by the high-level deep learning model to generate a vector in an n-dimensional hyperspace. In some implementations, at least one of the sub-models uses the high-level features extracted by the high-level deep learning model to generate a vector in an n-dimensional hyperspace for a particular computer vision task. In some variations, at least one of the sub-models extracts sensor data features directly from sensor data to generate a vector in an n-dimensional hyperspace. In some implementations, the system functions to identify or classify any features of the accessed sensor data.

In some implementations, training a sub-model of the multi-feature detection machine learning ensemble includes training at least one sub-model by using an output generated by at least one other sub-model of the ensemble.

In some implementations, training a sub-model of the multi-feature detection machine learning ensemble includes training at least one sub-model to use high-level features generated by the high-level feature detection model to generate semantic primitives.

In some implementations, semantic primitives are basic entities in metadata generated by the comprehension system.

In some implementations, each sub-model of the ensemble is trained with a same feature vector (e.g., a feature vector representative of output generated by the high-level feature detection model). By virtue of the foregoing, the machine learning ensemble can generate semantic primitives by processing high-level features extracted from sensor data, without processing the raw sensor data. In this manner, performance may be improved, as compared with systems in which each model of an ensemble processes raw sensor data.
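The performance argument above rests on extracting high-level features once and letting every sub-model consume the same vector. A rough sketch of that arrangement follows; `toy_backbone`, the two lambda sub-models, and their thresholds are hypothetical stand-ins for trained networks and are used only to show the call pattern.

```python
def run_ensemble(frame, backbone, sub_models):
    """Extract high-level features once, then reuse them for every sub-model."""
    features = backbone(frame)  # the only pass over the raw sensor data
    return {name: model(features) for name, model in sub_models.items()}


# Counter used to show that the backbone runs once per frame.
calls = {"backbone": 0}


def toy_backbone(frame):
    """Toy feature extractor returning a 2-dimensional feature vector."""
    calls["backbone"] += 1
    return [sum(frame) / len(frame), max(frame)]


# Hypothetical sub-models that read only the shared feature vector.
toy_sub_models = {
    "object": lambda f: "car" if f[1] > 200 else "person",
    "scene": lambda f: "street" if f[0] > 100 else "office",
}

primitives = run_ensemble([250, 120, 90], toy_backbone, toy_sub_models)
```

With this layout the cost of the backbone is paid once regardless of how many sub-models are attached, which is the saving the passage describes relative to each model processing raw sensor data.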

In some implementations, validating a sub-model of the multi-feature detection machine learning ensemble includes validating at least one sub-model by using an output generated by at least one other sub-model of the ensemble.

In some implementations, training a sub-model of the multi-feature detection machine learning ensemble includes simultaneously training at least two sub-models by using an output generated by at least one of the sub-models being trained. In some implementations, simultaneously training includes tuning the feature vector output by the high-level feature extraction model based on output generated by at least one sub-model of the ensemble. By tuning the high-level feature extraction model based on output generated by at least one sub-model of the ensemble, the high-level feature extraction model can be tuned to reduce the likelihood that the sub-models of the ensemble output invalid results after processing the high-level feature vector output by the high-level feature extraction model. For example, in a case of an ensemble that includes an object detection model and a scene detection model, the high-level feature extraction model can be tuned to reduce the likelihood that the object detection model detects a car and the scene detection model detects a sidewalk (indicating a car driving on the sidewalk) after processing of the high-level feature vector (assuming that a car driving on the sidewalk is most probably an incorrect detection result, rather than an unlikely event).

In some variations, training the high-level feature extraction model includes training the model to minimize invalid results of the ensemble. Such training can include processing sensor data of a training set to generate high-level feature vectors, processing the high-level feature vectors by using each model of the ensemble to generate a combined ensemble output that identifies an output of each sub-model of the ensemble, and validating the trained model by classifying each combined ensemble output as either valid or invalid.
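The validation step above, classifying each combined ensemble output as valid or invalid, can be illustrated with a small sketch. The hand-written `INVALID_COMBINATIONS` set and the `invalid_rate` metric are assumptions; the source does not say how validity is decided or how the invalid results are scored during training.

```python
# Assumed, hand-written list of (object, scene) pairs treated as implausible.
# The car-on-sidewalk pair follows the example in the text.
INVALID_COMBINATIONS = {("car", "sidewalk"), ("boat", "office")}


def is_valid(combined_output):
    """Classify one combined ensemble output as valid (True) or invalid (False)."""
    pair = (combined_output["object"], combined_output["scene"])
    return pair not in INVALID_COMBINATIONS


def invalid_rate(combined_outputs):
    """Fraction of combined outputs classified as invalid.

    A training loop could try to drive this value down when tuning the
    feature extraction model.
    """
    if not combined_outputs:
        return 0.0
    invalid = sum(1 for output in combined_outputs if not is_valid(output))
    return invalid / len(combined_outputs)


validation_outputs = [
    {"object": "car", "scene": "street"},
    {"object": "car", "scene": "sidewalk"},
    {"object": "person", "scene": "sidewalk"},
    {"object": "person", "scene": "office"},
]
rate = invalid_rate(validation_outputs)
```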

In some variations, a subset and/or all of the sub-models of the multi-feature detection machine learning ensemble are operated in parallel. In some variations, the high-level feature vector from the high-level feature extraction model is provided to each of the sub-models at the same or substantially the same time (e.g., within 0-5 seconds, etc.), such that a contemporaneous evaluation, classification, and/or feature detection may be performed simultaneously in each of the sub-models. In some variations, the sensor data from the one or more sensor data sources are sourced to each of the sub-models at the same or substantially the same time (e.g., within 0-5 seconds, etc.), such that a contemporaneous evaluation, classification, and/or feature detection may be performed simultaneously in each of the sub-models.
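One way to hand the same feature vector to several sub-models at substantially the same time is a worker pool. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the function name `run_in_parallel` and the use of `sum` and `max` as stand-in sub-models are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(feature_vector, sub_models):
    """Submit the same feature vector to every sub-model and gather results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(sub_models))) as pool:
        futures = {
            name: pool.submit(model, feature_vector)
            for name, model in sub_models.items()
        }
        # result() blocks until the corresponding sub-model has finished.
        return {name: future.result() for name, future in futures.items()}


# Built-in functions stand in for trained sub-models.
parallel_result = run_in_parallel([1, 2, 3], {"total": sum, "largest": max})
```

Threads suit sub-models that release the interpreter lock (for example, calls into native inference libraries); CPU-bound pure-Python models would more likely need separate processes or dedicated accelerators.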

In some implementations, the comprehension system is implemented by one or more computing servers having one or more computer processors (e.g., graphics processing units (GPUs), tensor processing units (TPUs), central processing units (CPUs, MCUs, etc.), or a combination of web servers and private servers) that may function to implement one or more ensembles of machine learning models. In some implementations, the comprehension system is implemented by at least one hardware device. In some embodiments, a storage medium of the comprehension system includes at least one of machine-executable instructions and corresponding data for at least one of a high-level feature detection model, a multi-feature detection machine learning ensemble, a condenser, a data exploitation engine, a scene story generator, and a trained language machine learning model.

In some variations, the ensemble of machine learning models includes multiple machine learning models that work together to exploit mutual information to provide accurate and useful feature detection and relationship vectors therefor. In some implementations, the comprehension system functions to communicate via one or more wired or wireless communication networks. In some implementations, the comprehension system utilizes input from various other data sources (e.g., outputs of the system, system-derived knowledge data, external entity-maintained data, etc.) to continuously improve or accurately tune weightings associated with features of the one or more of the machine learning models of the comprehension system.

In some implementations, the comprehension system performs any suitable machine learning process, including one or more of: supervised learning (e.g., using logistic regression, back propagation neural networks, random forests, decision trees, etc.), unsupervised learning (e.g., using an Apriori algorithm, k-means clustering, etc.), semi-supervised learning, reinforcement learning (e.g., using a Q-learning algorithm, temporal difference learning, etc.), and any other suitable learning style. Each module of the plurality can implement any one or more of: a regression algorithm (e.g., ordinary least squares, logistic regression, stepwise regression, multivariate adaptive regression splines, locally estimated scatterplot smoothing, etc.), an instance-based method (e.g., k-nearest neighbor, learning vector quantization, self-organizing map, etc.), a regularization method (e.g., ridge regression, least absolute shrinkage and selection operator, elastic net, etc.), a decision tree learning method (e.g., classification and regression tree, iterative dichotomiser 3, C4.5, chi-squared automatic interaction detection, decision stump, random forest, multivariate adaptive regression splines, gradient boosting machines, etc.), a Bayesian method (e.g., naïve Bayes, averaged one-dependence estimators, Bayesian belief network, etc.), a kernel method (e.g., a support vector machine, a radial basis function, a linear discriminant analysis, etc.), a clustering method (e.g., k-means clustering, expectation maximization, etc.), an associated rule learning algorithm (e.g., an Apriori algorithm, an Eclat algorithm, etc.), an artificial neural network model (e.g., a Perceptron method, a back-propagation method, a Hopfield network method, a self-organizing map method, a learning vector quantization method, etc.), a deep learning algorithm (e.g., a restricted Boltzmann machine, a deep belief network method, a convolutional network method, a stacked auto-encoder method, etc.), a dimensionality reduction method (e.g., principal component analysis, partial least squares regression, Sammon mapping, multidimensional scaling, projection pursuit, etc.), an ensemble method (e.g., boosting, bootstrapped aggregation, AdaBoost, stacked generalization, gradient boosting machine method, random forest method, etc.), and any suitable form of machine learning algorithm. Each processing portion of the system can additionally or alternatively leverage: a probabilistic module, heuristic module, deterministic module, or any other suitable module leveraging any other suitable computation method, machine learning method or combination thereof. However, any suitable machine learning approach can otherwise be incorporated in the system. Further, any suitable model (e.g., machine learning, non-machine learning, etc.) can be used in generating scene comprehension data via the system.

In some variations, the comprehension system functions to process accessed sensor data to generate one or more semantic primitives describing the accessed sensor data processed by the comprehension system. In some implementations, the high-level deep learning model processes the accessed sensor data to extract the high-level features from the sensor data accessed by the comprehension system, and the multi-feature detection machine learning ensemble processes the high-level features to generate the one or more semantic primitives describing the accessed sensor data processed by the comprehension system. By virtue of the ensemble processing the high-level features rather than the accessed sensor data, generation of the one or more semantic primitives can be performed in real-time. In some variations, the semantic primitives identify at least one of the following for the accessed sensor data: an activity, an object (e.g., person, car, box, backpack), a handheld object (e.g., knife, firearm, cellphone), a human-object interaction (e.g., holding, riding, opening), a scene element (e.g., fence, door, wall, zone), a human-scene interaction (e.g., loitering, falling, crowding), an object state (e.g., “door open”), and an object attribute (e.g., “red car”).
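The eight kinds of semantic primitive listed above map naturally onto a small tagged record. The following sketch shows one possible representation; the class names and the `confidence` field are assumptions and do not come from the source.

```python
from dataclasses import dataclass
from enum import Enum


class PrimitiveKind(Enum):
    """The primitive categories named in the description."""
    ACTIVITY = "activity"
    OBJECT = "object"
    HANDHELD_OBJECT = "handheld object"
    HUMAN_OBJECT_INTERACTION = "human-object interaction"
    SCENE_ELEMENT = "scene element"
    HUMAN_SCENE_INTERACTION = "human-scene interaction"
    OBJECT_STATE = "object state"
    OBJECT_ATTRIBUTE = "object attribute"


@dataclass(frozen=True)
class SemanticPrimitive:
    kind: PrimitiveKind
    value: str
    confidence: float = 1.0  # assumed field; not mentioned in the source


# Examples taken from the categories in the text.
example_primitives = [
    SemanticPrimitive(PrimitiveKind.OBJECT, "person"),
    SemanticPrimitive(PrimitiveKind.HANDHELD_OBJECT, "firearm", confidence=0.8),
    SemanticPrimitive(PrimitiveKind.OBJECT_STATE, "door open"),
]
```

Keeping the records immutable (`frozen=True`) makes them hashable, so they can be stored in sets or used as dictionary keys when matching against stored patterns.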

In some variations, the comprehension system functions to store sensor data in a sensor data storage. In some variations, the stored sensor data includes at least one of sensor data received by the comprehension system and primitives describing sensor data processed by the comprehension system.

In some variations, the system includes a contextual data storage that stores contextual data. In some variations, the contextual data includes contextual data for at least one region of a site (e.g., a building, campus, etc.).

In some variations, the user interface system functions to receive outputs from the comprehension system as well as from the one or more sensor data sources. In some variations, the user interface system functions to present sensor data from the one or more sensor data sources together with a scene description or scene story of the sensor data. In some variations, a scene description is presented by the user interface system only when an event of interest (e.g., a predetermined event type, etc.) is detected within a scene. Accordingly, based on the detection of the event or circumstance, the system may function to generate a scene description and/or scene story to detail the event or circumstance. In some implementations, the sensor data includes video data and the scene description or scene story may be superimposed over or augmented to the video data via a display of the user interface system, such that the scene description is presented at a same time as a video basis of the scene description. Additionally, or alternatively, the scene description or scene story may be presented in any suitable manner by the user interface system, including visually, audibly, haptically, and the like.

In some variations, the user interface system includes one or more computers having input/output systems including one or more of displays (e.g., video monitors), keyboards, mice, speakers, microphones, and the like. In some variations, the user interface system includes a communication interface that enables the user interface system to communicate over a communication network (e.g., the Internet) with the other components of the system.

In some variations, the contextual event detection system functions to implement a threat model.

In some implementations, a threat model is templatized and immediately deployed to another site of the same type. For example, a threat model generated for a first Corporate Office can be templatized and immediately deployed at a contextual event detection system for a second Corporate Office.
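Templatizing a threat model for reuse at another site of the same type might look like the sketch below. It assumes, purely for illustration, that a threat model is a dictionary with `site_name`, `site_type`, and `signatures` keys and that a template is its JSON form with the site-specific name removed; the source does not specify a storage format.

```python
import json


def templatize(threat_model):
    """Return a JSON template that keeps signatures but drops the site name."""
    template = {
        "site_type": threat_model["site_type"],
        "signatures": threat_model["signatures"],
    }
    return json.dumps(template, sort_keys=True)


def deploy(template_json, site_name, site_type):
    """Instantiate a template at a new site, refusing a mismatched site type."""
    template = json.loads(template_json)
    if template["site_type"] != site_type:
        raise ValueError("template site type does not match the target site")
    return {"site_name": site_name, **template}


# Hypothetical threat model for a first corporate office.
first_office_model = {
    "site_name": "HQ-1",
    "site_type": "corporate office",
    "signatures": [
        {"primitives": ["person", "firearm"], "context": {"region": "lobby"}}
    ],
}
office_template = templatize(first_office_model)
second_office_model = deploy(office_template, "HQ-2", "corporate office")
```

The site-type check reflects the statement that a template is deployed to another site of the same type; whether a real system enforces this, or adapts templates across site types, is not stated.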

In some variations, a threat model is a collection of threat signatures configured to protect a site. In some variations, a threat signature (included in a threat model) is a combination (e.g., an arbitrary combination) of semantic primitives and contextual factors that represents a threat to be detected.

In some implementations, semantic primitives are basic entities in metadata generated by the comprehension system.

In some variations, the contextual event detection system detects the presence of threat signatures (included in a threat model being used by the contextual event detection system) in data received from at least one of the sensor data comprehension system (semantic primitives), a sensor data source (raw sensor data), sensor data storage (raw sensor data), and contextual data storage (contextual factors). In some variations, the contextual event detection system raises alerts in response to detection of a threat signature.
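Since a threat signature is described as a combination of semantic primitives and contextual factors, detection can be sketched as a subset test plus a context match. The `ThreatSignature` type, its field names, and the exact-match rule are assumptions made for this illustration; the source does not define a matching procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreatSignature:
    name: str
    primitives: frozenset  # semantic primitives that must all be present
    context: tuple = ()    # (factor, value) pairs that must all match


def matches(signature, primitives, context):
    """True when every required primitive and contextual factor is observed."""
    has_primitives = signature.primitives <= set(primitives)
    has_context = all(context.get(k) == v for k, v in signature.context)
    return has_primitives and has_context


def detect(threat_model, primitives, context):
    """Return the names of all signatures present; each would raise an alert."""
    return [s.name for s in threat_model if matches(s, primitives, context)]


# Hypothetical threat model: a collection of signatures for one site.
site_threat_model = [
    ThreatSignature(
        "armed person in lobby",
        frozenset({"person", "firearm"}),
        (("region", "lobby"),),
    ),
    ThreatSignature(
        "after-hours loitering",
        frozenset({"loitering"}),
        (("time_of_day", "night"),),
    ),
]
alerts = detect(
    site_threat_model,
    ["person", "firearm", "door"],
    {"region": "lobby", "time_of_day": "day"},
)
```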

In some variations, the contextual event detection system functions to perform at least one of: generating the threat model, accessing contextual data, accessing data generated by the sensor data comprehension system, identifying contextual events, and classifying contextual events as either threats or non-threats by using the threat model.

In some variations, generating the threat model includes generating one or more threat signatures. In some variations, generating the threat model includes adding one or more threat signatures to the threat model.

In some variations, at least one contextual event is an event generated by processing sensor data (e.g., generated by a sensor data source) with contextual data (e.g., stored by the contextual data storage). In some variations, at least one contextual event identifies an interaction between at least two entities (e.g., objects, persons, etc.) (and identifies the at least two entities). In some variations, at least one contextual event identifies an interaction between at least two entities (e.g., objects, persons, etc.) (and identifies the at least two entities) and identifies at least one context identifier.

In some variations, at least one contextual event is identified by transforming at least one semantic primitive generated by the sensor data comprehension system into a contextualized semantic primitive by using the contextual data, such that the contextual event identifies at least one contextualized primitive. For example, the contextual event detection system can replace a semantic primitive “door” (generated by the sensor data comprehension system) with the contextualized semantic primitive “building entrance” by using contextual data that identifies the door as a building entrance. Accordingly, the interaction “person entering door” can be transformed to “person entering building”, which distinguishes the door from a door within the building (e.g., an office door). As another example, the interaction “person entering door” can be transformed to “person entering kitchen”. Other examples can be envisioned in which semantic primitives generated by the sensor data comprehension system are transformed into at least one contextualized semantic primitive by using the contextual data.
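The door example above amounts to a lookup from a generic primitive to a site-specific one. A minimal sketch follows, assuming that contextual data is keyed by the primitive and a hypothetical sensor identifier (`cam-entrance`, `cam-kitchen`) and that an interaction is a (subject, action, object) tuple; none of these representations are given in the source.

```python
# Hypothetical contextual data: (primitive, sensor id) -> site-specific label.
SITE_CONTEXT = {
    ("door", "cam-entrance"): "building",
    ("door", "cam-kitchen"): "kitchen",
}


def contextualize(interaction, sensor_id, context=SITE_CONTEXT):
    """Replace the object of (subject, action, object) when context is known.

    Interactions without matching contextual data are returned unchanged.
    """
    subject, action, obj = interaction
    return (subject, action, context.get((obj, sensor_id), obj))


entering = ("person", "entering", "door")
at_entrance = contextualize(entering, "cam-entrance")
at_kitchen = contextualize(entering, "cam-kitchen")
unmapped = contextualize(entering, "cam-hallway")
```

The same generic interaction therefore yields "person entering building" at one camera and "person entering kitchen" at another, while a door with no contextual entry stays a plain "door".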

In some variations, the contextual event detection system functions to perform at least one action responsive to classification of an identified contextual event as a threat. In some implementations, the data generated by the sensor data comprehension system includes semantic primitives (as described herein). In some implementations, the contextual event detection system accesses the data generated by the sensor data comprehension system from the sensor data comprehension system; alternatively, or additionally, the detection system accesses the data generated by the sensor data comprehension system from a sensor data storage. In some variations, the contextual event detection system accesses the contextual data from a contextual data storage.
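The claims name four possible threat response actions (sending an alert, sounding an alarm, dispatching security, triggering a lockdown) chosen based on the threat level. The sketch below maps a numeric threat level to one of them. The thresholds, the 0-to-1 scale, and the ordering of actions by severity are assumptions; the source gives no numeric policy.

```python
# Assumed thresholds on a 0-to-1 threat level, most severe first.
RESPONSE_POLICY = [
    (0.9, "trigger lockdown"),
    (0.7, "dispatch security"),
    (0.5, "sound alarm"),
    (0.3, "send alert"),
]


def choose_response(threat_level):
    """Return the most severe action whose threshold is met, or None."""
    for threshold, action in RESPONSE_POLICY:
        if threat_level >= threshold:
            return action
    return None


lockdown_action = choose_response(0.95)
alarm_action = choose_response(0.5)
no_action = choose_response(0.1)
```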

Patent Metadata

Filing Date

Unknown

Publication Date

October 16, 2025

Inventors

Unknown
