11176935

System and Method for Controlling Devices Through Voice Interaction

Published: November 16, 2021
Assignee: not available in the USPTO data we have
Inventors: Sibsambhu Kar
Patent Claims
18 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method for controlling devices through voice interaction, the method comprises: receiving, by a controlling device, a voice input and one or more image inputs associated with a target device; identifying, by the controlling device, at least one feature of the target device and an action to be performed on the at least one feature, based on an intent and an object determined from the received voice input and the one or more image inputs; determining, by the controlling device, a correspondence between the at least one feature and the action to be performed using a trained neural network, wherein the trained neural network is pre-trained based on a correspondence between a plurality of prior actions and a plurality of features associated with the target device, and wherein the trained neural network is pre-trained to correctly identify the at least one feature and the action by combining a loss function for recognition of the action and a loss function for recognition of the target device that is to perform the action; comparing, by the controlling device, a current operational state of the at least one feature with an operational threshold of the at least one feature, wherein the current operational state of the at least one feature is indicative of a health state of the target device; when there is a variation in the health state of the target device, updating, by the controlling device, the one or more image inputs associated with the target device based on the variation, identifying, by the controlling device, an updated at least one feature of the target device based on the updated one or more image inputs, and determining, by the controlling device, a correspondence between the updated at least one feature and the action to be performed using a trained neural network; and performing, by the controlling device, the action on the at least one feature or the updated at least one feature based on the determined correspondence, when the current operational state is within limits of the operational threshold.
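
Claim 1 recites pre-training by combining a loss function for recognition of the action with a loss function for recognition of the target device, but it does not say which losses are used or how they are combined. The sketch below is one plausible reading, not the patented method: it assumes a cross-entropy loss for each recognition task and a weighted sum as the combination. The function names and the weights `action_weight` and `device_weight` are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    # Clamp to avoid log(0) when the model assigns zero probability.
    return -math.log(max(probs[target_index], 1e-12))


def combined_loss(action_probs, action_target, device_probs, device_target,
                  action_weight=1.0, device_weight=1.0):
    """Weighted sum of two recognition losses (assumed form of the combination)."""
    action_loss = cross_entropy(action_probs, action_target)
    device_loss = cross_entropy(device_probs, device_target)
    return action_weight * action_loss + device_weight * device_loss


# Illustrative values only: a two-way action prediction and a two-way device prediction.
loss = combined_loss([0.5, 0.5], 0, [0.25, 0.75], 1)
```

Under this assumption, a single scalar penalizes the network whenever it misidentifies either the action or the device that should perform it.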

2. The method of claim 1 further comprises: converting, by the controlling device, the received voice input to text; and determining, by the controlling device, each of the intent and the object based on processing of the text by a Long Short Term Memory (LSTM) model.

3. The method of claim 2, wherein the text is provided to the LSTM model in the form of sequence of words using word embeddings and the LSTM model is trained based on the prior actions, and each of the prior actions is associated with a probability of execution.
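
Claims 2 and 3 describe converting the voice input to text and feeding that text to the LSTM model as a sequence of words using word embeddings. The sketch below illustrates only that input-preparation step, under stated assumptions: whitespace tokenization, a toy vocabulary, and hand-written two-dimensional embedding vectors. The names `text_to_embeddings`, `vocab`, and `embeddings` are hypothetical, and the LSTM itself is not shown.

```python
def text_to_embeddings(text, vocab, embeddings, unk_index=0):
    """Map transcribed text to a sequence of embedding vectors, one per word."""
    tokens = text.lower().split()
    # Words missing from the vocabulary fall back to the unknown-word index.
    indices = [vocab.get(token, unk_index) for token in tokens]
    return [embeddings[index] for index in indices]


# Toy vocabulary and embedding table (illustrative values only).
vocab = {"<unk>": 0, "open": 1, "valve": 2}
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

sequence = text_to_embeddings("Open valve now", vocab, embeddings)
```

The resulting list of vectors is the kind of word sequence a recurrent model would consume one step at a time to estimate the intent and the object.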

4. The method of claim 1 further comprises determining, by the controlling device, from the one or more image inputs associated with the target device, and by a convoluting neural network (CNN), the at least one feature of the target device, wherein the CNN is trained to identify features from the target device using at least one training image associated with the target device.
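
Claim 4 has a CNN determine device features from image inputs such as a drawing or layout. The claims do not describe the network's architecture, so the sketch below shows only the basic operation a convolutional layer applies: sliding a small kernel over an image and summing the element-wise products. The function name `conv2d_valid`, the toy image, and the edge-detecting kernel are assumptions chosen for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over an image without padding (cross-correlation, as in CNN layers)."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            for col in range(out_w)
        ]
        for row in range(out_h)
    ]


# Toy 3x4 "drawing" with a dark-to-light vertical boundary in the middle.
image = [[0, 0, 1, 1]] * 3
# Horizontal-difference kernel: responds where intensity rises from left to right.
kernel = [[-1, 1]]

feature_map = conv2d_valid(image, kernel)
```

In a trained CNN the kernel values are learned from training images rather than written by hand, and many such feature maps are stacked and combined to recognize parts of the target device.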

5. The method of claim 4, wherein the one or more image inputs comprises at least one of a blueprint of the target device, a drawing of the target device, or a layout of the target device.

6. The method of claim 1 further comprises: establishing, by the controlling device, non-performance of the action on the at least one feature or the updated at least one feature, when the current operational state is outside the limits of the operational threshold; and outputting, by the controlling device, an alert regarding the non-performance of the action, wherein the alert comprises details associated with the non-performance of the action.
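
Taken together, the final step of claim 1 and claim 6 describe a gate: the action is performed when the current operational state is within the limits of the operational threshold, and otherwise non-performance is established and an alert with details is output. The sketch below is a minimal illustration, assuming the threshold is a numeric lower and upper bound and the alert is a plain dictionary. `OperationalThreshold`, `attempt_action`, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OperationalThreshold:
    lower: float
    upper: float


def attempt_action(feature, action, current_state, threshold):
    """Perform the action only if the feature's state is within the threshold limits."""
    if threshold.lower <= current_state <= threshold.upper:
        return {"performed": True, "feature": feature, "action": action}
    # Outside the limits: record non-performance and include the details in an alert.
    alert = {
        "feature": feature,
        "action": action,
        "current_state": current_state,
        "limits": (threshold.lower, threshold.upper),
        "reason": "operational state outside threshold limits",
    }
    return {"performed": False, "alert": alert}


# Illustrative values only: a valve whose pressure must stay between 10 and 50.
limits = OperationalThreshold(lower=10.0, upper=50.0)
ok_result = attempt_action("valve", "open", 30.0, limits)
blocked_result = attempt_action("valve", "open", 75.0, limits)
```

Returning the alert alongside the outcome keeps the reason for non-performance attached to the request that triggered it.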

7. A controlling device comprising: a processor; and a memory communicatively coupled to the processor and storing instructions, that, when executed by the processor, causes the processor to: receive a voice input and one or more image inputs associated with a target device; identify at least one feature of the target device and an action to be performed on the at least one feature, based on an intent and an object determined from the received voice input and the one or more image inputs; determine a correspondence between the at least one feature and the action to be performed using a trained neural network, wherein the trained neural network is pre-trained based on a correspondence between a plurality of prior actions and a plurality of features associated with the target device, and wherein the trained neural network is pre-trained to correctly identify the at least one feature and the action by combining a loss function for recognition of the action and a loss function for recognition of the target device that is to perform the action; compare a current operational state of the at least one feature with an operational threshold of the at least one feature, wherein the current operational state of the at least one feature is indicative of a health state of the target device; when there is a variation in the health state of the target device, update the one or more image inputs associated with the target device based on the variation, identify an updated at least one feature of the target device based on the updated one or more image inputs, and determine a correspondence between the updated at least one feature and the action to be performed using a trained neural network; and perform the action on the at least one feature or the updated at least one feature based on the determined correspondence, when the current operational state is within limits of the operational threshold.

8. The controlling device of claim 7, wherein the instructions, when executed by the processor, further cause the processor to: convert the received voice input to text; and determine each of the intent and the object based on processing of the text by a Long Short Term Memory (LSTM) model.

9. The controlling device of claim 8, wherein the text is provided to the LSTM model in the form of sequence of words using word embeddings and the LSTM model is trained based on the prior actions, and each of the prior actions is associated with a probability of execution.

10. The controlling device of claim 7, wherein the instructions, when executed by the processor, further cause the processor to determine, from the one or more image inputs associated with the target device and by a convoluting neural network (CNN), the at least one feature of the target device, wherein the CNN is trained to identify features from the target device using at least one training image associated with the target device.

11. The controlling device of claim 10, wherein the one or more image inputs comprises at least one of a blueprint of the target device, a drawing of the target device, or a layout of the target device.

12. The controlling device of claim 7, wherein the instructions, when executed by the processor, further cause the processor to: establish non-performance of the action on the at least one feature or the updated at least one feature, when the current operational state is outside the limits of the operational threshold; and output an alert regarding the non-performance of the action, wherein the alert comprises details associated with the non-performance of the action.

13. A non-transitory computer-readable storage medium having stored thereon instructions for controlling devices through voice interaction comprising executable code which when executed by one or more processors, causes the one or more processors to: receive a voice input and one or more image inputs associated with a target device; identify at least one feature of the target device and an action to be performed on the at least one feature, based on an intent and an object determined from the received voice input and the one or more image inputs; determine a correspondence between the at least one feature and the action to be performed using a trained neural network, wherein the trained neural network is pre-trained based on a correspondence between a plurality of prior actions and a plurality of features associated with the target device, and wherein the trained neural network is pre-trained to correctly identify the at least one feature and the action by combining a loss function for recognition of the action and a loss function for recognition of the target device that is to perform the action; compare a current operational state of the at least one feature with an operational threshold of the at least one feature, wherein the current operational state of the at least one feature is indicative of a health state of the target device; when there is a variation in the health state of the target device, update the one or more image inputs associated with the target device based on the variation, identify an updated at least one feature of the target device based on the updated one or more image inputs, and determine a correspondence between the updated at least one feature and the action to be performed using a trained neural network; and perform the action on the at least one feature or the updated at least one feature based on the determined correspondence, when the current operational state is within limits of the operational threshold.

14. The non-transitory computer-readable storage medium of claim 13, wherein the executable code, when executed by the one or more processors, further causes the one or more processors to: convert the received voice input to text; and determine each of the intent and the object based on processing of the text by a Long Short Term Memory (LSTM) model.

15. The non-transitory computer-readable storage medium of claim 14, wherein the text is provided to the LSTM model in the form of sequence of words using word embeddings and the LSTM model is trained based on the prior actions, and each of the prior actions is associated with a probability of execution.

16. The non-transitory computer-readable storage medium of claim 13, wherein the executable code, when executed by the one or more processors, further causes the one or more processors to determine, from the one or more image inputs associated with the target device and by a convoluting neural network (CNN), the at least one feature of the target device, wherein the CNN is trained to identify features from the target device using at least one training image associated with the target device.

17. The non-transitory computer-readable storage medium of claim 16, wherein the one or more image inputs comprises at least one of a blueprint of the target device, a drawing of the target device, or a layout of the target device.

18. The non-transitory computer-readable storage medium of claim 13, wherein the executable code, when executed by the one or more processors, further causes the one or more processors to: establish non-performance of the action on the at least one feature or the updated at least one feature, when the current operational state is outside the limits of the operational threshold; and output an alert regarding the non-performance of the action, wherein the alert comprises details associated with the non-performance of the action.

Patent Metadata

Filing Date: Unknown
Publication Date: November 16, 2021
Inventors: Sibsambhu Kar

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “SYSTEM AND METHOD FOR CONTROLLING DEVICES THROUGH VOICE INTERACTION” (11176935). https://patentable.app/patents/11176935
