US-10257465

Group and conversational framing for speaker tracking in a video conference system

Published April 9, 2019

Assignee not available in USPTO data we have

Inventors not available in USPTO data we have

Technical Abstract

In one embodiment, a method is provided to intelligently frame groups of participants in a meeting. This gives a more pleasing experience with fewer switches, better contextual understanding, and more natural framing, as would be seen in a video production made by a human director. Furthermore, in accordance with another embodiment, conversational framing techniques are provided. During speaker tracking, when two local participants are addressing each other, a method is provided to show a close-up framing showing both participants. By evaluating the direction participants are looking and a speaker history, it is determined if there is a local discussion going on, and an appropriate framing is selected to give far-end participants the most contextually rich experience.
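The abstract mentions evaluating a speaker history to decide whether a local discussion is under way. The following is a minimal sketch of one way such a check could look, not the patented implementation. It assumes the history is an ordered sequence of identifiers for local active speakers, oldest first; the function name `alternating_speakers` and the parameters `window` and `min_turns` are hypothetical and do not appear in the patent.

```python
from typing import Optional, Sequence, Tuple


def alternating_speakers(
    history: Sequence[str],
    window: int = 6,      # hypothetical: how many recent samples to inspect
    min_turns: int = 3,   # hypothetical: turns required to call it a dialogue
) -> Optional[Tuple[str, str]]:
    """Return the two speakers who alternate in the recent history, or None."""
    recent = list(history)[-window:]

    # Collapse consecutive samples from the same speaker into single turns.
    turns = []
    for speaker in recent:
        if not turns or turns[-1] != speaker:
            turns.append(speaker)

    if len(turns) < min_turns:
        return None

    # After collapsing, adjacent turns always differ, so if the last
    # min_turns turns involve exactly two speakers they strictly alternate.
    tail = turns[-min_turns:]
    speakers = sorted(set(tail))
    if len(speakers) != 2:
        return None
    return (speakers[0], speakers[1])


if __name__ == "__main__":
    print(alternating_speakers(["A", "A", "B", "B", "A", "A"]))
    print(alternating_speakers(["A", "B", "C", "A"]))
```

A result other than `None` would suggest that two local participants are in a back-and-forth exchange, which an endpoint could treat as a hint to frame both of them together.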

Patent Claims

20 claims


1. A method comprising: detecting, by a video conference endpoint, a plurality of participants within a field of view of the video conference endpoint; detecting, by the video conference endpoint, a first participant of the plurality of participants of the video conference endpoint as an active speaker; determining, by the video conference endpoint, if the first participant is facing a second participant of the plurality of participants by determining from an output of a video camera of the video conference endpoint if a head of the first participant is rotated a number of degrees from the video camera of the video conference endpoint; determining, by the video conference endpoint, if the second participant is facing the first participant based on the output of the video camera of the video conference endpoint; calculating, by the video conference endpoint, a proximity between the first participant and the second participant; and altering, by the video conference endpoint, a framing of the video output of the video camera of the video conference endpoint to frame the first participant and the second participant when the proximity between the first participant and the second participant is within a predetermined threshold.

2. The method of claim 1, wherein if the second participant is not facing the first participant, further comprising: analyzing a recorded speaker history of the video conference endpoint for an indication of alternating speakers that are local to the video conference endpoint.

3. The method of claim 2, further comprising: determining if the second participant is one of the alternating speakers.

4. The method of claim 1, wherein determining if the first participant is facing the second participant further comprises: determining whether the number of degrees is more than a predetermined number of degrees.

5. The method of claim 4, wherein the video conference endpoint performs face detection or gaze detection techniques on the output of the video camera of the video conference endpoint to determine the number of degrees of rotation of the head of the first participant.

6. The method of claim 1, wherein detecting the first participant as an active speaker is based on an output from one or more microphone arrays of the video conference endpoint.

7. The method of claim 1, wherein detecting the plurality of participants is performed based on the output of the video camera of the video conference endpoint, and using one or more of face detection, gaze detection, upper body detection, or motion detection of the plurality of participants.

8. An apparatus comprising: a network interface unit that enables communication over a network; and a processor coupled to the network interface unit, the processor configured to: detect a plurality of participants within a field of view of a video conference endpoint; detect a first participant of the plurality of participants of the video conference endpoint as an active speaker; determine if the first participant is facing a second participant of the plurality of participants by determining from an output of a video camera of the video conference endpoint if a head of the first participant is rotated a number of degrees from the video camera of the video conference endpoint; determine if the second participant is facing the first participant based on the output of the video camera of the video conference endpoint; calculate a proximity between the first participant and the second participant; and alter a framing of the video output of the video camera of the video conference endpoint to frame the first participant and the second participant when the proximity between the first participant and the second participant is within a predetermined threshold.

9. The apparatus of claim 8, wherein if the second participant is not facing the first participant, the processor is further configured to: analyze a recorded speaker history of the video conference endpoint for an indication of alternating speakers that are local to the video conference endpoint.

10. The apparatus of claim 9, wherein the processor is further configured to: determine if the second participant is one of the alternating speakers.

11. The apparatus of claim 8, wherein, when determining if the first participant is facing the second participant, the processor is further configured to: determine whether the number of degrees is more than a predetermined number of degrees.

12. The apparatus of claim 11, wherein the processor is further configured to: perform face detection or gaze detection techniques on the output of the video camera of the video conference endpoint to determine the number of degrees of rotation of the head of the first participant.

13. The apparatus of claim 8, wherein the processor is configured to detect the first participant as an active speaker based on an output from one or more microphone arrays of the video conference endpoint.

14. The apparatus of claim 8, wherein the processor is configured to detect the plurality of participants based on the output of the video camera of the video conference endpoint, and using one or more of face detection, gaze detection, upper body detection, or motion detection of the plurality of participants.

15. A non-transitory processor readable medium storing instructions that, when executed by a processor, cause the processor to: detect a plurality of participants within a field of view of a video conference endpoint; detect a first participant of the plurality of participants of the video conference endpoint as an active speaker; determine if the first participant is facing a second participant of the plurality of participants by determining from an output of a video camera of the video conference endpoint if a head of the first participant is rotated a number of degrees from the video camera of the video conference endpoint; determine if the second participant is facing the first participant based on the output of the video camera of the video conference endpoint; calculate a proximity between the first participant and the second participant; and alter a framing of the video output of the video camera of the video conference endpoint to frame the first participant and the second participant when the proximity between the first participant and the second participant is within a predetermined threshold.

16. The non-transitory processor readable medium of claim 15, wherein if the second participant is not facing the first participant, the instructions, when executed by the processor, further cause the processor to: analyze a recorded speaker history of the video conference endpoint for an indication of alternating speakers that are local to the video conference endpoint.

17. The non-transitory processor readable medium of claim 16, wherein the instructions, when executed by the processor, further cause the processor to: determine if the second participant is one of the alternating speakers.

18. The non-transitory processor readable medium of claim 15, wherein, when determining if the first participant is facing the second participant, the instructions, when executed by the processor, further cause the processor to: determine whether the number of degrees is more than a predetermined number of degrees.

19. The non-transitory processor readable medium of claim 18, wherein the instructions, when executed by the processor, further cause the processor to: perform face detection or gaze detection techniques on the output of the video camera of the video conference endpoint to determine the number of degrees of rotation of the head of the first participant.

20. The non-transitory processor readable medium of claim 15, wherein the instructions, when executed by the processor, further cause the processor to detect the first participant as an active speaker based on an output from one or more microphone arrays of the video conference endpoint.
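
The independent claims recite a sequence of checks: whether the active speaker's head is rotated away from the camera toward a second participant, whether the second participant faces back, and whether the two are within a proximity threshold, after which the framing is altered to include both. The sketch below illustrates that decision flow under simplifying assumptions and is not the claimed implementation. The `Participant` fields, the sign convention for `yaw_deg`, the pixel-based distance, and the threshold values `MIN_YAW_DEG` and `MAX_DISTANCE` are all hypothetical; the claims only refer to "a predetermined number of degrees" and "a predetermined threshold". The speaker-history fallback of the dependent claims is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels

MIN_YAW_DEG = 20.0    # hypothetical "predetermined number of degrees"
MAX_DISTANCE = 400.0  # hypothetical proximity threshold, in pixels


@dataclass
class Participant:
    name: str
    x: float        # horizontal centre of the head in the camera frame
    yaw_deg: float  # head rotation from the camera axis; positive means
                    # turned toward larger x (an assumed convention)
    box: Box        # bounding box around the participant


def is_facing(a: Participant, b: Participant, min_yaw: float = MIN_YAW_DEG) -> bool:
    """True if a's head is rotated more than min_yaw degrees toward b's side."""
    if abs(a.yaw_deg) <= min_yaw:
        return False  # a is looking roughly at the camera
    return (a.yaw_deg > 0) == (b.x > a.x)


def conversational_frame(
    speaker: Participant,
    other: Participant,
    max_distance: float = MAX_DISTANCE,
) -> Optional[Box]:
    """Return a box framing both participants, or None to keep the current framing."""
    if not (is_facing(speaker, other) and is_facing(other, speaker)):
        return None
    if abs(speaker.x - other.x) > max_distance:
        return None
    # Union of the two bounding boxes gives a close-up containing both people.
    return (
        min(speaker.box[0], other.box[0]),
        min(speaker.box[1], other.box[1]),
        max(speaker.box[2], other.box[2]),
        max(speaker.box[3], other.box[3]),
    )


if __name__ == "__main__":
    alice = Participant("alice", x=100, yaw_deg=35, box=(50, 40, 150, 200))
    bob = Participant("bob", x=300, yaw_deg=-30, box=(250, 60, 350, 220))
    print(conversational_frame(alice, bob))
```

In a real endpoint the head rotation would come from face or gaze detection on the camera output and the active speaker from the microphone arrays, as the dependent claims describe; here those inputs are supplied directly as values.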

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

H04N G06F G06V G10L

Patent Metadata

Filing Date

March 1, 2018
