US-6275258

Voice responsive image tracking system

Published: August 14, 2001
Assignee: not available in USPTO data we have
Inventors: not available in USPTO data we have
Technical Abstract

A camera tracking system that continuously tracks sound emitting objects is provided. A sound activation feature of the system enables a video camera to track speakers in a manner similar to the natural transition that occurs when people turn their eyes toward different sounds. The invented system is well suited for video-phone applications. The invented tracking system comprises a video camera for transmitting an image from its remote location, a screen for receiving images, and microphones for directing the camera. The camera may be coupled to the microphones via an interface for processing information transmitted from the microphones for directing the camera. The system may utilize the translucent properties of LCD screens by disposing a video camera behind such a screen and enabling persons at each remote location to look directly into the screen and at the camera. The interface enables intelligent framing of a speaker without mechanically repositioning the camera. The microphones are positioned using triangulation techniques. Characteristics of audio signals are processed by the interface for determining movement of the speaker for directing the camera. As the characteristics sensed by the microphones change, the interface directs the camera toward the speaker. The interface continuously directs the camera, until the change in the characteristics stabilizes, thus precisely directing the camera toward the speaker.

Patent Claims
41 claims

Legal claims defining the scope of protection, as filed with the USPTO.

1. A camera tracking system that tracks sound emitting objects, the system comprising: image reception means for generating video signals representative of an image; audio sensing means for generating audio signals representative of speech and other sounds; and interface means coupled to the audio sensing means and to the image reception means, the interface means sensing characteristics of the audio signals generated by the audio sensing means for automatically, digitally cropping and scaling the image to allow display of a framed video image including a sound emitting object.

2. The system of claim 1 wherein when the interface means senses a change in the characteristics of the audio signals indicating a new position of the sound emitting object, the interface means adjusts the framed video image by recropping and rescaling the image until the framed video image includes the sound emitting object.

3. The system of claim 1 further comprising image display means for displaying an image.

4. The system of claim 1 wherein the system further includes signal transmission means for processing the video signals generated by the image reception means and audio signals generated by the audio sensing means, the transmission means transmitting the processed video and audio signals to at least one remote location.

5. The camera tracking system of claim 1 wherein said image reception means is stationary.

6. A camera tracking system that tracks sound emitting objects, the system comprising: a plurality of audio sensing means for generating audio signals representative of speech and other sounds, each of the audio sensing means generating an audio indicating signal indicative of sound sensed thereby and emitted by a sound emitting object; image reception means for generating video signals representative of an image; and interface means coupled to the plurality of audio sensing means and to the image reception means, the interface means continuously sensing the indicating signals generated by the audio sensing means for determining any change in the indicating signals for continuously determining a mobile location of a sound emitting object and directing the image reception means toward the sound emitting object, the interface means creating a framed video image including the sound emitting object.

7. The system of claim 6 wherein the indicating signals generated by the audio sensing means indicate a time differential between sound waves sensed by the sensing means, when the interface means determines that the indicating signal generated by at least one of the audio sensing means has changed, such as the time differential between sound waves sensed by the sensing means increasing or decreasing, indicating that the sound emitting object has moved, the interface means adjusts the framed video image until the image contains the sound emitting object.
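
As a rough illustration of how a time differential between two microphones can indicate where a sound source sits, the sketch below converts a measured delay into a bearing angle. It is not taken from the patent: the far-field (plane-wave) approximation, the speed-of-sound constant, and the function name `bearing_from_delay` are all assumptions made for this example.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees C


def bearing_from_delay(delay_s: float, mic_spacing_m: float) -> float:
    """Return the source bearing in degrees, measured from straight ahead.

    delay_s is (arrival time at the left mic) minus (arrival time at the
    right mic), so a positive value means the source is nearer the right mic.
    Hypothetical helper; assumes the source is far away compared with the
    microphone spacing (plane-wave approximation).
    """
    if mic_spacing_m <= 0:
        raise ValueError("mic_spacing_m must be positive")
    # Path-length difference divided by spacing gives sin(bearing).
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    # Clamp so measurement noise cannot push asin() out of its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    # A growing delay moves the bearing further off-centre.
    for delay in (0.0, 0.0002, 0.0004):
        print(delay, round(bearing_from_delay(delay, 0.3), 1))
```

An increasing or decreasing delay maps to a bearing that moves away from or back toward the centre line, which is the kind of change the claim describes as indicating that the sound emitting object has moved.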

8. The system of claim 6 wherein the indicating signals generated by the audio sensing means indicate sound intensity sensed by the sensing means, when the interface means determines that the indicating signal generated by at least one of the audio sensing means has changed indicating that the amplitude of the sound sensed thereby has changed, the interface means adjusts the framed video image until the image contains the sound emitting object.

9. The system of claim 6 wherein the audio sensing means comprises a pair of laterally separated microphones each at a known position with respect to the image reception means, each of the microphones sensing sound emitted by the sound emitting object for transmitting its indicating signal to the interface means for determining any change in the location of the sound emitting object to continuously adjust the framed video image to maintain the sound emitting object within the image.

10. The system of claim 9 wherein the audio sensing means further includes a third microphone vertically positioned at a known position with respect to the pair of laterally separated microphones, each of the microphones sensing sound emitted by the sound emitting object for transmitting its indicating signal to the interface means for determining any change in the location of the sound emitting object to continuously adjust the framed video image to maintain the sound emitting object within the image.

11. The system of claim 10 wherein the microphones are positioned to facilitate three dimensional triangulation techniques.

12. The system of claim 6 wherein the image captured by the image reception means is a wide field of view that includes the sound emitting object, and the interface means further comprising scaling and cropping means for selecting a portion of the image containing the sound emitting object and for framing the image in the selected portion creating the framed video image of the sound emitting object.
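
The scaling and cropping step can be pictured with a small sketch: pick a window inside the wide field-of-view frame around the estimated target position, then resample that window to the output size. This is only an illustration under assumptions; the patent does not specify a resampling method, and the names `crop_window` and `crop_and_scale`, the nearest-neighbour scaling, and the list-of-rows frame representation are hypothetical.

```python
def crop_window(frame_w, frame_h, center_x, center_y, crop_w, crop_h):
    """Return (left, top, right, bottom) for a crop centred on the target.

    The window is shifted as needed so it never leaves the frame.
    """
    crop_w = min(crop_w, frame_w)
    crop_h = min(crop_h, frame_h)
    left = min(max(center_x - crop_w // 2, 0), frame_w - crop_w)
    top = min(max(center_y - crop_h // 2, 0), frame_h - crop_h)
    return left, top, left + crop_w, top + crop_h


def crop_and_scale(frame, window, out_w, out_h):
    """Resample the window of `frame` (a list of pixel rows) to out_w x out_h.

    Uses nearest-neighbour sampling, chosen here only for brevity.
    """
    left, top, right, bottom = window
    src_w, src_h = right - left, bottom - top
    return [
        [frame[top + (y * src_h) // out_h][left + (x * src_w) // out_w]
         for x in range(out_w)]
        for y in range(out_h)
    ]


if __name__ == "__main__":
    # Toy 8x6 frame whose pixels record their own coordinates.
    frame = [[(x, y) for x in range(8)] for y in range(6)]
    window = crop_window(8, 6, center_x=7, center_y=0, crop_w=4, crop_h=2)
    framed = crop_and_scale(frame, window, out_w=8, out_h=4)
    print(window, framed[0][0], framed[3][7])
```

Because the window is clamped to the frame, a target near the edge of the field of view still yields a full-size framed image without moving the camera.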

13. The system of claim 12 wherein the image reception means comprises a desired one of a video camera and a CCD camera.

14. The system of claim 13 further comprising: at least one additional system located at at least one additional remote location, and signal transmission means for transmitting the audio indicative and framed video image signals to the remote locations and receiving audio and framed video signals from the remote locations for enabling communication between the locations.

15. The system of claim 6 wherein the interface means further includes means for sensing and differentiating tone and for tracking a sound emitting object having a selected tone.

16. The system of claim 6 wherein the interface means further includes filter means for sensing and tracking a particular sound emitting object in the presence of ambient noise and for distinguishing a speaker's voice in the presence of the voices of others.

17. The system of claim 6 wherein the interface means comprises a computing means.

18. The camera tracking system of claim 6 further comprising an image display means for displaying an image.

19. The camera tracking system of claim 6 wherein said image reception means is stationary.

20. A camera tracking system that tracks sound emitting objects, the system comprising: image display means for displaying an image; a plurality of audio sensing means for generating audio signals representative of speech and other sounds, each of the audio sensing means generating an audio indicating signal indicative of sound sensed thereby and emitted by a sound emitting object; image reception means for generating video signals representative of an image, the image reception means being a desired one of a video camera and a CCD camera and capturing a wide field of view that includes an image of the sound emitting object; and interface means coupled to the plurality of audio sensing means and to the image reception means, the interface means continuously sensing the indicating signals generated by the audio sensing means for determining any change in the indicating signals for continuously directing the image reception means toward a sound emitting object, wherein when the interface means determines that the indicating signal generated by at least one of the audio sensing means has changed, the interface means redirects the image reception means until the change in the indicating signals stabilizes indicating that the image reception means is directed toward the sound emitting object, the interface means having scaling and cropping means for selecting a portion of the field of view containing the image of the sound emitting means and for framing the image in the selected portion to transmit video signals representative of the image, wherein the camera is disposed behind the image display means and retained in a housing thereof, the image display means comprising a screen configured to enable the camera to capture the image of a user while enabling the user to look directly at the screen and gaze directly into the camera, wherein at least one user at each remote location may look directly into the screen, with the camera capturing their image so that users at each remote location are looking directly at the other users at the other remote locations.

21. The system of claim 20 wherein the image display means comprises a substantially translucent material.

22. The system of claim 21 wherein the image display means comprises a Liquid Crystal Display screen.

23. A camera tracking system that tracks sound emitting objects, the system comprising: image display means for displaying an image; a plurality of audio sensing means for generating audio signals representative of speech and other sounds, each of the audio sensing means generating an audio indicating signal indicative of sound sensed thereby and emitted by a sound emitting object; image reception means for generating video signals representative of an image; interface means coupled to the plurality of audio sensing means and to the image reception means, the interface means continuously sensing the indicating signals generated by the audio sensing means for determining any change in the indicating signals for continuously directing the image reception means toward a sound emitting object, wherein when the interface means determines that the indicating signal generated by at least one of the audio sensing means has changed, the interface means redirects the image reception means until the change in the indicating signals stabilizes indicating that the image reception means is directed toward the sound emitting object; and a digital microphone worn on a deaf user's person for indicating to the user any change in the location of the sound emitting object, the microphone including means for delivering a tapping sensation to the user for indicating to the user the direction of the sound emitting object, the amplitude of the tapping indicating the amplitude of the sound sensed by the digital microphone.

24. A method for digitally framing an image by tracking sound emitting objects, the method comprising the steps of: placing at least two audio sensing means at known positions relative to an image reception means; sensing sound waves emitted by a sound emitting object, the sound waves representative of speech and other sounds; generating audio signals representative of the sound waves sensed; processing the audio signals using triangulation techniques to determine the position of the sound emitting object; capturing a wide field-of-view image including the sound emitting object; generating a framed video image by automatically digitally scaling and cropping the wide field-of-view image; continuing to process the audio signal to continuously determine the location of the sound emitting object.

25. The method of claim 24 wherein the triangulation technique uses a time differential between the sound waves to determine the location of the sound emitting object.

26. The method of claim 24 wherein the image reception means is placed behind at least a portion of an image display to simulate natural eye contact between users at remote locations.

27. The method of claim 24 wherein, when the location of the sound emitting object is abruptly changed to a new location, the system simulates a natural eye transition by iteratively recropping and rescaling the image toward the new location until the new location is within the framed video image, thereby simulating the eyes' naturally panning motion when transitioning to a new sound emitting object, such as a new speaker.
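
One way to picture the iterative recropping in this claim is a crop centre that closes part of the remaining gap to the new location on every frame, so the framed image appears to pan instead of jumping. The sketch below is an assumption-laden illustration, not the patented method: the generator name `pan_steps`, the fixed `fraction` per step, and the `tolerance` stopping rule are all hypothetical choices.

```python
def pan_steps(current, target, fraction=0.25, tolerance=1.0):
    """Yield successive crop centres moving from `current` toward `target`.

    Each step covers `fraction` of the remaining distance. Once both axes
    are within `tolerance` pixels, the target itself is yielded and the
    generator stops.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    x, y = current
    tx, ty = target
    while abs(tx - x) > tolerance or abs(ty - y) > tolerance:
        x += (tx - x) * fraction
        y += (ty - y) * fraction
        yield (x, y)
    yield (float(tx), float(ty))


if __name__ == "__main__":
    # The crop centre eases from the old speaker toward the new one.
    for centre in pan_steps((0, 0), (100, 0), fraction=0.5):
        print(centre)
```

Each yielded centre would be fed to whatever cropping routine produces the framed image; a smaller `fraction` gives a slower, smoother apparent pan.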

28. The method of claim 24 further comprising the steps of: transmitting the audio signals and the framed video image to a remote location; and receiving the audio signals and the framed video image at the remote location.

29. The method of claim 24 further comprising the step of maintaining the sound emitting object within the framed image.

30. The method of claim 29 wherein maintaining the sound emitting object within the framed image is accomplished by rescaling and recropping the wide field-of-view image when the sound emitting object moves.

31. A sound tracking system for automatically reframing a video image, comprising: a camera having a fixed position and generating a video image, at least two microphones having known relative positions with respect to the camera, an interface means for processing input from the at least two microphones to utilize triangulation to determine the position of an audio source and for creating a framed video image of the audio source from the video image generated by the camera, and image processing means for cropping and scaling the video image generated by the camera to create the framed video image.

32. The sound tracking system of claim 31 further comprising a transmission means for transmitting audio signals from the at least two microphones and the framed video image to a remote location.

33. The sound tracking system of claim 31 wherein the camera has a wide field of view and an image sensor.

34. The sound tracking system of claim 33 wherein said camera is a CCD.

35. The sound tracking system of claim 31 further comprising a display device connected to the interface means and wherein the camera is located behind at least a portion of the display device.

36. The sound tracking system of claim 35 wherein the display device is chosen from the group of display devices including a computer monitor and a translucent LCD panel.

37. The sound tracking system of claim 31 wherein the at least two microphones are three microphones used to accurately triangulate the audio source within three dimensions.

38. The sound tracking system of claim 31 wherein the interface means determines the position of the audio source by measuring an offset time between the input from the microphones.
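
The claim does not say how the offset time is measured. A common approach, shown here purely as an assumed technique, is to slide one microphone's samples against the other's and pick the shift with the highest cross-correlation. The function names `estimate_lag` and `lag_to_seconds` and the brute-force search are hypothetical and chosen for readability over speed.

```python
def estimate_lag(reference, delayed, max_lag):
    """Return the lag in samples that best aligns `delayed` with `reference`.

    A positive result means `delayed` trails `reference`. Every lag in
    [-max_lag, max_lag] is scored by a plain cross-correlation sum.
    """
    n = min(len(reference), len(delayed))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += reference[i] * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def lag_to_seconds(lag_samples, sample_rate_hz):
    """Convert a lag in samples to an offset time in seconds."""
    return lag_samples / sample_rate_hz


if __name__ == "__main__":
    pulse = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
    later = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]  # same pulse, 3 samples later
    lag = estimate_lag(pulse, later, max_lag=5)
    print(lag, lag_to_seconds(lag, 48000))
```

The resulting offset time, together with the known microphone positions, is the input a triangulation step would need.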

39. The sound tracking system of claim 31 wherein the interface means is capable of differentiating one audio source from other audio sources by tonality variations between audio sources.

40. The sound tracking system of claim 39 wherein the interface means is capable of tracking a plurality of audio sources simultaneously.

41. The sound tracking system of claim 31 wherein the interface means simulates panning by continuously digitally recropping the image when the audio source location changes.

Patent Metadata

Filing Date: December 17, 1996

Publication Date: August 14, 2001

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “Voice responsive image tracking system” (US-6275258). https://patentable.app/patents/US-6275258