US-9270941

Smart video conferencing system

Published: February 23, 2016

Assignee: not available in the USPTO data on hand

Inventors: not available in the USPTO data on hand

Technical Abstract

Embodiments provide techniques for facilitating transmission of a video stream from a first video conferencing device to a remote video conferencing device. Embodiments receive, by the first video conferencing endpoint device, first video data captured from a first field of view of a physical environment. The video data includes a plurality of frames. Activity data is determined for portions of the first video data across the plurality of frames. Embodiments generate, by a first video conferencing endpoint device, second video data from a second field of view of the physical environment, based on the determined activity data. Additionally, embodiments facilitate the transmission of the video stream to the remote video conferencing device for display, the video stream comprising the generated second video data and audio data captured within the physical environment.
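The flow in the abstract (measure activity in portions of the wide view, then derive a second, narrower view from it) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes grayscale frames stored as nested lists, assumes the candidate regions (for example, detected face boxes) are supplied by some other component, and uses plain frame differencing as a stand-in for the activity measure. The names `region_motion`, `most_active`, `crop_window`, and `extract` are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]          # grayscale frame: rows of pixel intensities
Box = Tuple[int, int, int, int]  # (x, y, width, height) of a candidate region


def region_motion(frames: Sequence[Frame], box: Box) -> float:
    """Mean absolute intensity change inside `box` between consecutive frames.

    Assumption: frame differencing stands in for the "activity data";
    the source does not say how activity is actually measured.
    """
    x, y, w, h = box
    total = 0
    for prev, cur in zip(frames, frames[1:]):
        for row in range(y, y + h):
            for col in range(x, x + w):
                total += abs(cur[row][col] - prev[row][col])
    pairs = max(len(frames) - 1, 1)
    return total / (pairs * w * h)


def most_active(frames: Sequence[Frame], boxes: Sequence[Box]) -> Box:
    """Pick the candidate region with the highest motion measure."""
    return max(boxes, key=lambda b: region_motion(frames, b))


def crop_window(box: Box, frame_w: int, frame_h: int,
                out_w: int, out_h: int) -> Box:
    """Window of size out_w x out_h centered on `box`, clamped to the frame.

    Assumes out_w <= frame_w and out_h <= frame_h.
    """
    x, y, w, h = box
    cx, cy = x + w // 2, y + h // 2
    left = min(max(cx - out_w // 2, 0), frame_w - out_w)
    top = min(max(cy - out_h // 2, 0), frame_h - out_h)
    return (left, top, out_w, out_h)


def extract(frame: Frame, window: Box) -> Frame:
    """Copy the pixels of `window` out of `frame` (the narrower second view)."""
    left, top, w, h = window
    return [row[left:left + w] for row in frame[top:top + h]]
```

In this sketch the second view contains fewer pixels than the first, which corresponds to the digital-extraction variant; a steerable-camera variant would use the selected region to aim a second camera instead of cropping.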

Patent Claims

29 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A method of generating a video stream for use in a video conference, comprising: receiving, by a first video conferencing endpoint device, first video data captured from a first field of view of a physical environment, the first video data comprising a plurality of frames; determining activity data from portions of the first video data using information provided in the plurality of frames; generating, by the first video conferencing endpoint device, second video data from a second field of view of the physical environment, based on the determined activity data; generating a video stream that comprises the generated second video data and audio data captured within the physical environment; and transmitting the video stream to a video conferencing application executing on a user device, wherein the video conferencing application is configured to process the video stream as an input video stream to facilitate the transmission of the video stream to a remote video conferencing device for display.

2. The method of claim 1, wherein generating the second video data from the second field of view of the physical environment further comprises controlling movement of a controlled camera device to capture the second video data, and wherein the received first video data is captured using a wide angle camera device, which is distinct from the controlled camera device.

3. The method of claim 2, wherein determining the activity data for portions of the first video data using information provided in the plurality of frames further comprises: performing a facial detection analysis to detect a plurality of user faces within the first video data; and determining a measure of motion for each of the detected plurality of user faces using information provided in the plurality of frames of the first video data.

4. The method of claim 3, wherein generating the second video data from the second field of view of the physical environment further comprises: selecting one of the plurality of user faces having a corresponding determined measure of motion that is indicative of user speech; and determining an orientation of the camera device for capturing video data substantially centered on the selected user face, and wherein controlling the movement of the camera device to capture the second video data further comprises controlling the movement of the camera device to match the determined orientation.

5. The method of claim 4, wherein determining an orientation of the camera device for capturing the video data substantially centered on the selected user face further comprises: identifying a physical entity corresponding to the selected user face by accessing a model structure describing an attribute of the physical environment; and determining a direction of the identified physical entity, relative to a physical position of the camera device within the physical environment.

6. The method of claim 4, wherein generating the second video data from the second field of view of the physical environment further comprises: collecting audio data from the physical environment using two or more microphone sensors; identifying user speech within the collected audio data; and determining a direction from which the identified user speech originates, relative to a physical position of the two or more microphone sensors, and wherein determining the orientation of the camera device for capturing the video data substantially centered on the selected user face is further based on the determined direction from which the identified user speech originates.

7. The method of claim 1, wherein the received first video data is captured at a first resolution, and wherein generating the second video data from the second field of view of the physical environment further comprises: extracting a portion of the first video data to create the second video data, wherein the second video data has a second resolution that is less than the first resolution of the first video data.

8. The method of claim 1, wherein the generated video stream further comprises the first video data, and the generated video stream is configured to allow a remote video conferencing device to simultaneously display the first video data and second video data.

9. A system for generating a video stream for use in a video conference, comprising: a first camera sensor configured to capture first video data comprising a plurality of frames from a first field of view of a physical environment; a second camera sensor; a mounting structure capable of adjusting an orientation of the second camera sensor along one or more degrees of freedom; control logic configured to: determine activity data for portions of the first video data across the plurality of frames; and control movement of the mounting structure to adjust the orientation of the second camera along the one or more degrees of freedom, based on the determined activity data; and video processing logic configured to: capture second video data from a second field of view of the physical environment using the second camera sensor; encode the captured second video data; generate a video stream comprising the captured second video data and audio data captured within the physical environment; and transmit the video stream to a video conferencing application executing on a user device, wherein the video conferencing application is configured to process the video stream as an input video stream to facilitate the transmission of the video stream to a remote video conferencing device for display.

10. The system of claim 9, wherein the first camera sensor comprises a wide angle camera sensor.

11. The system of claim 10, wherein the control logic configured to determine the activity data for portions of the video data across the plurality of frames is further configured to: perform a facial detection analysis to detect a plurality of user faces within the first video data; and determine a measure of motion for each of the detected plurality of user faces across the plurality of frames of the first video data.

12. The system of claim 11, wherein the control logic configured to control movement of the mounting structure to adjust the orientation of the second camera along the one or more degrees of freedom, based on the determined plurality of measures of activity is further configured to: select one of the plurality of user faces having a corresponding determined measure of motion that is indicative of user speech; and determine an orientation for capturing video data substantially centered on the selected user face, and wherein the movement of the mounting structure is controlled such that an orientation of the second camera sensor matches the determined orientation.

13. The system of claim 12, wherein the control logic configured to determine the orientation for capturing the video data substantially centered on the selected user face is further configured to: identify a physical entity, within the physical environment, corresponding to the selected user face by accessing a model structure describing an attribute of the physical environment; and determine a direction of the identified physical entity, relative to a physical position of the second camera sensor within the physical environment.

14. The system of claim 12, wherein the system further comprises two or more microphone sensors, and wherein the control logic configured to determine the orientation for capturing the video data substantially centered on the selected user face is further configured to: collect audio data from the physical environment using the two or more microphone sensors; identify user speech within the collected audio data; and determine a direction from which the identified user speech originates, relative to a physical position of the two or more microphone sensors within the physical environment, and wherein the logic configured to determine the orientation for capturing video data substantially centered on the selected user face operates further based on the determined direction from which the identified user speech originates.

15. The system of claim 12, wherein the generated video stream further comprises the first video data, and the generated video stream is configured to allow a remote video conferencing device to simultaneously display the first video data and second video data.

16. A system for generating a video stream for use in a video conference, comprising: a camera sensor configured to capture first video data comprising a plurality of frames from a first field of view of a physical environment at a first resolution; control logic configured to: determine activity data for portions of the first video data across the plurality of frames; define a portion of the captured first video data to extract, based on the determined activity data; and extract the portion of the captured video data to create second video data, the second video data having less than all of a plurality of pixels of the captured video data; and video processing logic configured to: generate a video stream that comprises the second video data and audio data captured within the physical environment; and transmit the video stream to a video conferencing application executing on a user device, wherein the video conferencing application is configured to process the video stream as an input video stream to facilitate the transmission of the video stream to a remote video conferencing device for display.

17. The system of claim 16, wherein the control logic configured to determine the activity data for portions of the first video data across the plurality of frames is further configured to: perform a facial detection analysis to detect a plurality of user faces within the captured first video data; and determine a measure of motion for each of the detected plurality of user faces across the plurality of frames of the first video data.

18. The system of claim 17, wherein the control logic to determine the portion of the first video data to extract, based on the determined activity data, is further configured to: select one of the plurality of user faces having a corresponding determined measure of motion that is indicative of user speech; and determine the portion of the captured video to extract, such that the second video data is substantially centered on the selected user face.

19. The system of claim 16, further comprising two or more microphone sensors, and wherein the control logic configured to determine the portion of the captured first video data to extract, based on the determined activity data, is further configured to: collecting audio data from the physical environment using the two or more microphone sensors; identifying user speech within the collected audio data; and determining a direction from which the identified user speech originates, relative to a physical position of the two or more microphone sensors.

20. The system of claim 19, wherein the control logic configured to determine the portion of the first video data to extract, based on the determined activity data, is further configured to: identify a physical entity, within the physical environment, located in the determined direction from which the user speech originates, by accessing a mapping structure describing an attribute of the physical environment; and determine a visual representation of the identified physical entity within the plurality of frames of the first video data.

21. The system of claim 16, wherein the video stream further comprises the first video data, and the transmitted video stream is configured to allow the remote video conferencing device to simultaneously display the first video data and second video data.

22. The method of claim 1, wherein the received first video data is captured at a first resolution, and wherein generating the second video data from the second field of view of the physical environment further comprises: extracting a portion of the first video data to create the second video data, wherein the second video data has less than all of a plurality of pixels of the first video data.

23. A non-transitory computer-readable medium containing computer program code that, when executed by operation of one or more computer processors, performs an operation for generating a video stream for use in a video conference, the operation comprising: receiving, by a first video conferencing endpoint device, first video data captured from a first field of view of a physical environment, the first video data comprising a plurality of frames; determining activity data from portions of the first video data using information provided in the plurality of frames; generating, by the first video conferencing endpoint device, second video data from a second field of view of the physical environment, based on the determined activity data; generating a video stream that comprises the generated second video data and audio data captured within the physical environment; and transmitting the video stream to a video conferencing application executing on a user device, wherein the video conferencing application is configured to process the video stream as an input video stream to facilitate the transmission of the video stream to a remote video conferencing device for display.

24. The non-transitory computer-readable medium of claim 23, wherein generating the second video data from the second field of view of the physical environment further comprises controlling movement of a controlled camera device to capture the second video data, and wherein the received first video data is captured using a wide angle camera device, which is distinct from the controlled camera device.

25. The non-transitory computer-readable medium of claim 24, wherein determining the activity data for portions of the first video data using information provided in the plurality of frames further comprises: performing a facial detection analysis to detect a plurality of user faces within the first video data; and determining a measure of motion for each of the detected plurality of user faces using information provided in the plurality of frames of the first video data.

26. The non-transitory computer-readable medium of claim 25, wherein generating the second video data from the second field of view of the physical environment further comprises: selecting one of the plurality of user faces having a corresponding determined measure of motion that is indicative of user speech; and determining an orientation of the camera device for capturing video data substantially centered on the selected user face, and wherein controlling the movement of the camera device to capture the second video data further comprises controlling the movement of the camera device to match the determined orientation.

27. The non-transitory computer-readable medium of claim 26, wherein determining an orientation of the camera device for capturing the video data substantially centered on the selected user face further comprises: identifying a physical entity corresponding to the selected user face by accessing a model structure describing an attribute of the physical environment; and determining a direction of the identified physical entity, relative to a physical position of the camera device within the physical environment.

28. The non-transitory computer-readable medium of claim 26, wherein generating the second video data from the second field of view of the physical environment further comprises: collecting audio data from the physical environment using two or more microphone sensors; identifying user speech within the collected audio data; and determining a direction from which the identified user speech originates, relative to a physical position of the two or more microphone sensors, and wherein determining the orientation of the camera device for capturing the video data substantially centered on the selected user face is further based on the determined direction from which the identified user speech originates.

29. The non-transitory computer-readable medium of claim 23, wherein the received first video data is captured at a first resolution, and wherein generating the second video data from the second field of view of the physical environment further comprises: extracting a portion of the first video data to create the second video data, wherein the second video data has a second resolution that is less than the first resolution of the first video data.
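Several claims recite determining the direction of user speech from two or more microphone sensors (claims 6, 14, 19, 28) and determining a camera orientation substantially centered on a selected face (claims 4, 12, 26). The claims do not say how either is computed. The sketch below shows one common approach under explicit assumptions: a time-difference-of-arrival estimate for exactly two microphones with a far-field source, and a pinhole-camera model for the pan angle with the steerable camera located at the wide-angle camera. The function names and the constant are hypothetical.

```python
import math
from typing import Sequence

SPEED_OF_SOUND_M_S = 343.0  # approximate value for air near 20 degrees C


def best_lag(left: Sequence[float], right: Sequence[float], max_lag: int) -> int:
    """Lag in samples that maximizes the cross-correlation of the two signals.

    A positive result means the right microphone heard the sound later,
    so the source is nearer the left microphone.
    """
    def score(lag: int) -> float:
        return sum(left[i] * right[i + lag]
                   for i in range(len(left))
                   if 0 <= i + lag < len(right))
    return max(range(-max_lag, max_lag + 1), key=score)


def arrival_angle_deg(lag_samples: int, sample_rate_hz: float,
                      mic_spacing_m: float) -> float:
    """Angle of the source from the perpendicular of the microphone axis.

    Assumes a distant source (plane wave). Positive angles point toward
    the left microphone, matching the sign convention of best_lag.
    """
    delay_s = lag_samples / sample_rate_hz
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against noise-induced overshoot
    return math.degrees(math.asin(ratio))


def pan_to_center_deg(face_center_x: float, frame_width: int,
                      horizontal_fov_deg: float) -> float:
    """Pan angle that would center a face seen at face_center_x (in pixels).

    Assumes an ideal pinhole camera; a real wide-angle lens would need
    distortion correction first. Positive angles are toward the right
    edge of the image.
    """
    focal_px = (frame_width / 2) / math.tan(math.radians(horizontal_fov_deg) / 2)
    return math.degrees(math.atan((face_center_x - frame_width / 2) / focal_px))
```

An audio-derived angle and an image-derived angle could then be compared or combined to choose the orientation; how the two are weighted is a design decision the claims leave open.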

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

H04N

Patent Metadata

Filing Date: March 16, 2015

Publication Date: February 23, 2016
