An apparatus includes a chassis and a processor. The chassis may be configured to be mounted to a vehicle and to hold a first ADAS camera and a second ADAS camera. The chassis generally provides a coarse alignment of the first ADAS camera and the second ADAS camera to obtain stereo images of an area outside of the vehicle. The processor may be configured to (i) generate a frame synchronization signal based on a real-time clock signal, (ii) present the frame synchronization signal and one or more control signals to the first ADAS camera and the second ADAS camera, (iii) receive a first pixel datastream corresponding to the area outside of the vehicle from the first ADAS camera, (iv) receive a second pixel datastream corresponding to the area outside of the vehicle from the second ADAS camera, (v) process the first pixel datastream arranged as first video frames and the second pixel datastream arranged as second video frames, (vi) compute warp parameters for the first ADAS camera and the second ADAS camera to finely align pixel data of the first video frames with pixel data of the second video frames, and (vii) generate ground truth data based on the first video frames from the first ADAS camera and the second video frames from the second ADAS camera.
Legal claims defining the scope of protection, as filed with the USPTO.
1. An apparatus comprising: a chassis configured to be mounted to a vehicle and to hold a first ADAS camera and a second ADAS camera, wherein said chassis provides a coarse alignment of said first ADAS camera and said second ADAS camera to obtain stereo images of an area outside of said vehicle; and a processor configured to (i) generate a frame synchronization signal based on a real-time clock signal, (ii) present said frame synchronization signal and one or more control signals to said first ADAS camera and said second ADAS camera, (iii) receive a first pixel datastream corresponding to said area outside of said vehicle from said first ADAS camera, (iv) receive a second pixel datastream corresponding to said area outside of said vehicle from said second ADAS camera, (v) process said first pixel datastream arranged as first video frames and said second pixel datastream arranged as second video frames, (vi) compute warp parameters for said first ADAS camera and said second ADAS camera to finely align pixel data of the first video frames with pixel data of the second video frames, and (vii) generate ground truth data based on said first video frames from said first ADAS camera and said second video frames from said second ADAS camera.
2. The apparatus according to claim 1, wherein said ground truth data comprises disparity data based on said first video frames and said second video frames.
3. The apparatus according to claim 1, wherein each of said first ADAS camera and said second ADAS camera comprise at least one of an RGB image sensor, an RGB-IR image sensor, a monochrome sensor, and an IR image sensor.
4. The apparatus according to claim 1, wherein respective intrinsic parameters of said first ADAS camera and said second ADAS camera match.
5. The apparatus according to claim 1, wherein: said first ADAS camera comprises a first image processing pipeline configured to apply a first homography matrix to said first pixel datastream; and said second ADAS camera comprises a second image processing pipeline configured to apply a second homography matrix to said second pixel datastream, wherein said first homography matrix and said second homography matrix are configured to align pixel data generated by said first ADAS camera and said second ADAS camera to enable disparity calculation.
6. The apparatus according to claim 1, further comprising an inertial measurement unit configured to generate one or more signals representative of motion of said vehicle.
7. The apparatus according to claim 1, further comprising a GNSS/GPS device to generate said real-time clock signal.
8. The apparatus according to claim 7, wherein said ground truth data comprises location information provided by said GNSS/GPS device.
9. The apparatus according to claim 1, wherein said chassis is configured to be mounted behind an inside surface of a windshield of said vehicle.
10. The apparatus according to claim 1, wherein said chassis is configured to be mounted to an exterior surface of said vehicle.
11. A method for obtaining ground truth data for scene reconstruction comprising: mounting a chassis to a vehicle, wherein said chassis is configured to hold a first ADAS camera and a second ADAS camera, and said chassis provides a coarse alignment of said first ADAS camera and said second ADAS camera to obtain stereo images of an area outside of said vehicle; generating a frame synchronization signal based on a real-time clock signal; presenting said frame synchronization signal and one or more control signals to said first ADAS camera and said second ADAS camera; receiving a first pixel datastream corresponding to said area outside of said vehicle from said first ADAS camera; receiving a second pixel datastream corresponding to said area outside of said vehicle from said second ADAS camera; processing said first pixel datastream arranged as first video frames and said second pixel datastream arranged as second video frames; calculating warp parameters for said first ADAS camera and said second ADAS camera to finely align pixel data of the first video frames with pixel data of the second video frames; presenting said warp parameters to said first ADAS camera and said second ADAS camera, wherein said first ADAS camera and said second ADAS camera apply said warp parameters to finely align pixel data of the first video frames with pixel data of the second video frames; and generating ground truth data based on said first video frames from said first ADAS camera and said second video frames from said second ADAS camera.
12. The method according to claim 11, wherein said ground truth data comprises disparity data based on said first video frames and said second video frames.
13. The method according to claim 11, wherein each of said first ADAS camera and said second ADAS camera comprise at least one of an RGB image sensor, an RGB-IR image sensor, a monochrome sensor, and an IR image sensor.
14. The method according to claim 11, wherein respective intrinsic parameters of said first ADAS camera and said second ADAS camera match.
15. The method according to claim 11, further comprising: applying a first homography matrix to said first pixel datastream using a first image processing pipeline of said first ADAS camera; and applying a second homography matrix to said second pixel datastream using a second image processing pipeline of said second ADAS camera, wherein said first homography matrix and said second homography matrix are configured to align pixel data generated by said first ADAS camera and said second ADAS camera to enable disparity calculation.
16. The method according to claim 11, further comprising obtaining one or more signals representative of motion of said vehicle using an inertial measurement unit.
17. The method according to claim 11, further comprising obtaining said real-time clock signal using a GNSS/GPS device.
18. The method according to claim 17, wherein said ground truth data comprises location information obtained using said GNSS/GPS device.
19. The method according to claim 11, further comprising mounting said chassis behind an inside surface of a windshield of said vehicle.
20. The method according to claim 11, further comprising mounting said chassis to an exterior surface of said vehicle.
Complete technical specification and implementation details from the patent document.
This application relates to China Application No. 202410850483.8, filed on June 27, 2024. The mentioned application is hereby incorporated by reference in its entirety.
The invention relates to automated driver assistance systems generally and, more particularly, to a method and/or apparatus for implementing binocular stereo vision for ground truth data collection in monocular advanced driver assistance systems (ADAS) camera scene reconstruction.
Some advanced driver assistance systems (ADAS) do not have a ground truth system. Without the ground truth system, an ADAS algorithm often estimates a height of a target object, then calculates a distance based on the geometric perspective relationship, which depends on static scene measurement and calibration. This method has a larger distance detection error when estimation of target object height is inaccurate, and when a road surface has bumps or slopes.
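The sensitivity of height-based ranging to the height estimate can be illustrated with the pinhole relation Z = f * H / h, where f is the focal length in pixels, H is the assumed real-world target height, and h is the target height in the image. The following Python sketch is illustrative only; the function name and all numeric values are assumptions and are not taken from this specification.

```python
def monocular_distance(focal_px: float, assumed_height_m: float,
                       image_height_px: float) -> float:
    """Pinhole-model range estimate: Z = f * H / h."""
    if image_height_px <= 0:
        raise ValueError("image height must be positive")
    return focal_px * assumed_height_m / image_height_px


# Hypothetical values: 1200 px focal length, target spans 30 px in the image.
true_height_m = 1.5
assumed_height_m = 1.8  # the algorithm over-estimates the target height

true_range = monocular_distance(1200.0, true_height_m, 30.0)
estimated_range = monocular_distance(1200.0, assumed_height_m, 30.0)

# The range error scales linearly with the height error (about 20% here).
relative_error = (estimated_range - true_range) / true_range
```

Under this model, any fractional error in the assumed target height carries over directly as the same fractional error in distance, which is why an independent ground truth source is useful.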
Some advanced driver assistance systems use LiDAR for the ground truth system. However, LiDAR has limitations. The scanning lines of LiDAR are relatively sparse. For example, a typical LiDAR has 128 scan lines, while the latest cutting edge LiDAR has 512 scan lines and is very expensive. Because the scanning lines of LiDAR are relatively sparse, points mapped by LiDAR on small long-distance targets are not dense enough to match an image resolution provided by ADAS cameras. Detecting long-distance targets with low reflectivity using LiDAR is difficult. Also, the field-of-view (FOV) and installation location of LiDAR are generally different from those of ADAS cameras. After correction, there is some FOV loss and point cloud reduction from LiDAR data. In addition, LiDAR generally has a relatively low frame rate (usually 10 FPS) and cannot match each frame of video (video generally uses 30 FPS). Therefore, the number of frames of data from LiDAR is fewer than the number of video frames for a given scene, which is a drawback when the vision algorithm needs consecutive frames of data with ground truth.
The cost of LiDAR-based ground truth systems is high. Customizing LiDAR-based ground truth systems for product needs can take a lot of time. Thus, LiDAR-based ground truth systems are difficult to use widely in ADAS.
It would be desirable to implement binocular stereo vision for ground truth data collection in monocular ADAS camera scene reconstruction.
The invention concerns an apparatus comprising a chassis and a processor. The chassis may be configured to be mounted to a vehicle and to hold a first ADAS camera and a second ADAS camera. The chassis generally provides a coarse alignment of the first ADAS camera and the second ADAS camera to obtain stereo images of an area outside of the vehicle. The processor may be configured to (i) generate a frame synchronization signal based on a real-time clock signal, (ii) present the frame synchronization signal and one or more control signals to the first ADAS camera and the second ADAS camera, (iii) receive a first pixel datastream corresponding to the area outside of the vehicle from the first ADAS camera, (iv) receive a second pixel datastream corresponding to the area outside of the vehicle from the second ADAS camera, (v) process the first pixel datastream arranged as first video frames and the second pixel datastream arranged as second video frames, (vi) compute warp parameters for the first ADAS camera and the second ADAS camera to finely align pixel data of the first video frames with pixel data of the second video frames, and (vii) generate ground truth data based on the first video frames from the first ADAS camera and the second video frames from the second ADAS camera.
Embodiments of the present invention include providing binocular stereo vision for ground truth data collection in monocular ADAS camera scene reconstruction that may (i) generate three-dimensional (3D) point cloud data that may be used as ground truth for monocular ADAS algorithm training, (ii) reduce costs by eliminating the need for LiDAR, (iii) provide a denser point cloud than LiDAR, (iv) utilize unaltered ADAS cameras, (v) improve distance accuracy of ADAS algorithms, (vi) provide binocular stereo detection of road curbs to facilitate improvement of monocular ADAS algorithms (e.g., for bumps and dips in a road), (vii) add precise world time, inertial data, and other information that enable more accurate scene reconstruction in post-processing, and/or (viii) be implemented as one or more integrated circuits.
In various embodiments, a ground truth acquisition device may be provided that creates a binocular stereo vision system with two monocular ADAS cameras. In an example, a ground truth acquisition device in accordance with an embodiment of the invention may perform CMOS sensor exposure synchronization through a frame synchronization signal (e.g., FSYNC), system time synchronization (e.g., via Ethernet, etc.), and acquisition of dual-channel encoded video from two ADAS cameras with timestamps (e.g., via Ethernet, etc.). In an example, the ground truth acquisition device generally comprises a processor (or system-on-chip (SoC)) that communicates with the two monocular ADAS cameras by a communication protocol. In an example, the communication protocol is generally agreed upon in advance, and may be implemented, for example, via an Ethernet interface. However, other interfaces may be implemented to meet design criteria of a particular implementation. In an example, the ground truth acquisition device may be configured to control two monocular ADAS cameras, including recording video, providing precise time synchronization, and obtaining real-time recorded video data (e.g., H.264/H.265, etc.).
In an example, the ground truth acquisition device may also be configured to perform a camera calibration based on the video/pictures collected by the two monocular ADAS cameras, so that the video generated by the two monocular ADAS cameras may be matched well in a stereo vision algorithm. In an example, the ground truth acquisition device may be configured to generate stereo vision disparity by stereo matching by itself. In another example, the ground truth acquisition device may be configured to generate disparity by storing data to memory (e.g., SSD, etc.), then exporting the data, and running the stereo matching algorithm on a remote system (e.g., offline). In an example, a point cloud may be calculated based on intrinsic parameters of the monocular ADAS cameras. The ground truth acquisition device may also be configured to store global positioning system (GPS) data and/or inertial measurement unit (IMU) data with timestamps, which may be helpful for post-processing (e.g., in scene reconstruction).
Referring to FIG. 1, a diagram is shown illustrating an example embodiment of a ground truth acquisition device in accordance with the present invention configured to provide ground truth data for a forward-looking view of a vehicle. An external view 40 for a vehicle 50 is shown. External side view mirrors 52a-52b are shown. The side view mirror 52a may be a side view mirror on the driver side of the vehicle 50. The side view mirror 52b may be a side view mirror on the passenger side of the vehicle 50. The vehicle 50 may comprise devices 54a-54n. The devices 54a-54n may be camera systems. Camera systems 54a-54b are shown integrated as part of the vehicle 50. The camera system 54a is shown on a passenger side of the vehicle 50. The camera system 54a is shown below the passenger side view mirror 52b. The camera system 54b is shown on the front grille of the vehicle 50. In the perspective of the vehicle 50 shown, two of the camera systems 54a-54b may be visible. However, one of the camera systems 54a-54n may be implemented at a level below the driver side view mirror 52a (not visible from the perspective of the external view 40 shown). Other camera systems 54a-54n may be located throughout the exterior of the vehicle 50. The camera systems 54a-54n may be configured to capture an all-around view of the environment 40 near the vehicle 50.
Dashed lines 62a-62d are shown. In the example shown, the dashed line 62a is shown extending from the camera system 54a and the dashed line 62b is shown extending from the camera system 54b. The dashed lines 62c and 62d may similarly extend from respective camera systems 54c and 54d (not visible from the perspective shown). The dashed lines 62a-62d may provide an illustrative representation of fields of view captured by each of the camera systems 54a-54d. The fields of view 62a-62d together may provide an all-around view of the environment near the vehicle 50.
The all-around view 62a-62d is shown. In an example, the all-around view 62a-62d may enable an all-around view (AVM) system. The AVM system may comprise four cameras (e.g., each camera may comprise a combination of one of the camera systems 54a-54n and/or a stereo pair of the lenses implemented by the camera systems 54a-54n). In the perspective shown in the external view 40, the camera system 54a and the camera system 54b may each be one of the four cameras and the other two cameras may not be visible. In an example, the camera system 54b may be a camera located on the front grille of the vehicle 50, one of the cameras 54a-54n may be on the rear (e.g., over the license plate), the camera system 54a may be located below the side view mirror 52b on the passenger side and one of the cameras 54a-54n may be located below the side view mirror 52a on the driver side. The arrangement of the cameras 54a-54n may be varied according to the design criteria of a particular implementation.
In some embodiments, each of the camera systems 54a-54d may be configured to capture pixel data arranged as video frames. In some embodiments, each of the camera systems 54a-54d providing the all-around view 62a-62d may implement a fisheye lens (e.g., may capture a video frame with a 180-degree angular aperture). The all-around view 62a-62d is shown providing a field of view coverage all around the vehicle 50. For example, the portion of the all-around view 62a may provide coverage for a passenger side of the vehicle 50, the portion of the all-around view 62b may provide coverage for a front of the vehicle 50, the portion of the all-around view 62c may provide coverage for a driver side of the vehicle 50 and the portion of the all-around view 62d may provide coverage for a rear of the vehicle 50. Each portion of the all-around view 62a-62d may be one field of view of a camera mounted to the vehicle 50. Each portion of the all-around view 62a-62d may be dewarped and stitched together by video processors to provide an enhanced video frame that represents a top-down view near the vehicle 50. In an example, the all-around view 62a-62d may be used to provide a representation of a bird’s-eye view of the vehicle 50.
The camera systems 54a-54d may provide a representative example of the mechanism for image acquisition. In one example, the camera systems 54a-54d may be implemented as monocular cameras. In another example, the camera systems 54a-54d may be implemented as stereo cameras (e.g., two capture devices implemented in a stereo pair). In some embodiments, the stereo cameras may be horizontally oriented. In some embodiments, the stereo cameras may be vertically oriented. In one example, four stereo cameras (e.g., eight capture devices) may be implemented, with one on each side of the vehicle 50. The locations of the camera systems 54a-54d on the vehicle 50 and/or the orientation of the camera systems 54a-54d may be varied according to the design criteria of a particular implementation.
In various embodiments, the vehicle 50 may be a light duty vehicle, a medium duty vehicle, a heavy duty vehicle, etc. The vehicle 50 may be implemented as an internal combustion engine (ICE) vehicle, a diesel vehicle, a hybrid electric vehicle, a battery electric vehicle, etc. The type of the vehicle 50 implemented may be varied according to the design criteria of a particular implementation.
In various embodiments, an apparatus 100 may be implemented as a ground truth acquisition device in accordance with an example embodiment of the invention. In an example, the ground truth acquisition device 100 may be mounted behind an inside surface of a windshield 70 of the vehicle 50 (e.g., right behind a rear view mirror). The ground truth acquisition device 100 may be installed to enable a field of view (FOV) of the ground truth acquisition device 100 to capture an environment through the windshield 70 toward the front end of the vehicle 50. In another example, a ground truth acquisition device 100’ may be implemented similarly to the ground truth acquisition device 100, except that the ground truth acquisition device 100’ may be configured to be mounted to an exterior surface (e.g., a roof, etc.) of the vehicle 50 (e.g., above a center of the windshield 70 of the vehicle 50). The ground truth acquisition device 100’ may be installed to enable a field of view (FOV) of the camera system 100’ to capture the environment toward the front end of the vehicle 50.
The ground truth acquisition devices 100 and 100’ may be configured to acquire ground truth data about the environment in front of the vehicle 50 (e.g., detect people, objects, and/or animals that may be approaching the vehicle 50 from ahead). The implementation of the ground truth acquisition device 100/100’ and/or where the ground truth acquisition device 100/100’ is installed on the vehicle 50 may be varied according to the design criteria of a particular application. In some applications, multiple instances of the ground truth acquisition devices 100 and/or 100’ may be installed on the vehicle 50 to capture ground truth data for a 360-degree surround view application utilizing a plurality of monocular ADAS cameras.
Referring to FIG. 2, a block diagram is shown illustrating a ground truth acquisition system in accordance with an example embodiment of the invention. In various embodiments, the ground truth acquisition device 100 may comprise a chassis structure (or camera mount) 102, two monocular ADAS cameras (or capture devices) 104a and 104b, a processor (or system-on-chip (SoC)) 106, and a memory 108. In various embodiments, the chassis structure 102 may comprise a metal frame that may be configured to accommodate the two monocular ADAS cameras 104a and 104b and interconnect multiple data interfaces. In an example, the two monocular ADAS cameras 104a and 104b may comprise identical monocular ADAS cameras. In various embodiments, the processor/SoC 106 may perform local data storage and time synchronization with the two monocular ADAS cameras 104a and 104b.
The two monocular ADAS cameras 104a and 104b, when mounted to the chassis structure 102, may be roughly aligned (e.g., parallel optical axes, etc.). In general, physically aligning the two monocular ADAS cameras 104a and 104b to the pixel level, which is very difficult to achieve, is not necessary. In an example, the chassis structure 102 may have some markings, slots, screws, and/or clamps to facilitate easy mounting of the two monocular ADAS cameras 104a and 104b. In an example, registration lines (or grooves) for aligning the two monocular ADAS cameras 104a and 104b may be marked (or etched) on the chassis structure 102. In an example, the chassis structure 102 may have rectangular or dovetail slots for aligning the two monocular ADAS cameras 104a and 104b. However, other types of markings and/or slots may be implemented to meet design criteria of a particular implementation. In an example, the two monocular ADAS cameras 104a and 104b may be placed or slid into the slots and locked in position on the chassis structure 102 (e.g., using screws, clamps, etc.). In various embodiments, the chassis structure 102 is generally configured to ensure that the optical axes of the two monocular ADAS cameras 104a and 104b are roughly aligned (e.g., parallel).
In various embodiments, a calibration process is generally performed after the ground truth acquisition device 100 has been physically mounted on a vehicle. In an example, a pre-calibration process may be performed after the two monocular ADAS cameras 104a and 104b have been physically mounted to the chassis structure 102, and before mounting the ground truth acquisition device 100 to a vehicle, to ensure the ground truth acquisition device 100 is operating. In general, the fine calibration of the cameras 104a and 104b is performed after the ground truth acquisition device 100 is mounted on a vehicle, because the mounting process may bring a subtle shift of the mechanical device, thus calibrating after mounting is more accurate. In an example, the calibration process generally comprises running stereo calibration using a test pattern (e.g., checkerboard, etc.) to determine a relationship between the two monocular ADAS cameras 104a and 104b and determining warp parameters to apply to each of the two monocular ADAS cameras 104a and 104b. In general, when the calibration is done, the warp parameters do not need to change until a big mechanical shift or change occurs. Then, calibration may be performed again.
In various embodiments, the warp parameters determined during the calibration process are generally communicated to image processing stages (or pipelines) within the two monocular ADAS cameras 104a and 104b. The image processing pipelines of the two monocular ADAS cameras 104a and 104b may then apply the warp parameters to respective captured images such that left and right images received by the processor/SoC 106 are fully aligned (e.g., both physically and temporally). After the warp parameters have been applied, the two monocular ADAS cameras 104a and 104b may output rectilinear left and right images, respectively, where the respective optical axes are now parallel and each pixel that appears in the left image has a matching pixel in the right image. In an example, the pixel alignment of the left and right images may be within one pixel or better (e.g., sub-pixel).
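One way to picture how a warp of this kind brings matching pixels onto the same image row is to apply a 3x3 homography to pixel coordinates in homogeneous form. The Python sketch below is illustrative only: the function name, the matrices, and the pixel coordinates are assumptions, and real warp parameters would come from the calibration process rather than being hand-written.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def apply_homography(h: Matrix3, x: float, y: float) -> Tuple[float, float]:
    """Map pixel (x, y) through a 3x3 homography using homogeneous coordinates."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return xh / w, yh / w


# Hypothetical warps: the left image is left unchanged and the right image
# is shifted up by 2 rows to cancel a small vertical mounting offset.
H_LEFT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H_RIGHT = [[1.0, 0.0, 0.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]]

# The same scene point as seen (before warping) by each camera.
left = apply_homography(H_LEFT, 640.0, 362.0)
right = apply_homography(H_RIGHT, 600.0, 364.0)

row_offset = left[1] - right[1]   # zero once the rows are aligned
disparity_px = left[0] - right[0]  # horizontal shift used for depth
```

Once corresponding points share a row, a stereo matcher only needs to search along that row, and the remaining horizontal shift is the disparity.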
In an example, the processor/SoC 106 may be configured to store ground truth data in the memory 108. In an example, the ground truth data may comprise dual-channel encoded video images and disparity maps that are time-synchronized (e.g., using timestamps, etc.). In another example, the ground truth data may further comprise inertial data to enable calculation of a roll/pitch/yaw angle between adjacent frames, determination of whether the road is flat or sloped, and calculation of angle information. The inertial data may also include timestamp information to allow matching with the images and disparity maps. In an example, the ground truth acquisition device 100 may be configured to independently generate stereo disparity by running a stereo matching algorithm. In another example, the ground truth acquisition device 100 may be configured to generate disparity data by storing data to the memory 108 and exporting the data to run the stereo matching algorithm(s) in a remote system (e.g., offline). In an example, a point cloud may be calculated based on intrinsic parameters of the two monocular ADAS cameras 104a and 104b.
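The step from a disparity value to a point cloud entry can be sketched with the standard rectified-stereo relations Z = f * B / d, X = (u - cx) * Z / f, and Y = (v - cy) * Z / f. The Python sketch below assumes a rectified pair with equal focal lengths; the function name, the baseline B, and all numeric values are illustrative assumptions and are not specified in this document.

```python
from typing import Optional, Tuple


def disparity_to_point(u: float, v: float, disparity_px: float,
                       focal_px: float, baseline_m: float,
                       cx: float, cy: float) -> Optional[Tuple[float, float, float]]:
    """Back-project pixel (u, v) with a given disparity into camera coordinates.

    Returns None for non-positive disparity (no valid match or infinite depth).
    """
    if disparity_px <= 0:
        return None
    z = focal_px * baseline_m / disparity_px  # depth along the optical axis
    x = (u - cx) * z / focal_px               # lateral offset
    y = (v - cy) * z / focal_px               # vertical offset
    return (x, y, z)


# Hypothetical intrinsics: 1200 px focal length, principal point at the
# center of a 1920x1080 image, and a 0.3 m baseline between the cameras.
point = disparity_to_point(u=960.0, v=540.0, disparity_px=12.0,
                           focal_px=1200.0, baseline_m=0.3,
                           cx=960.0, cy=540.0)
```

Applying the same function to every valid pixel of a disparity map yields one 3D point per pixel, which is how a stereo pair can produce a denser point cloud than a scanning sensor with a limited number of scan lines.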
In some embodiments, the ground truth acquisition device 100 may further comprise an inertial measurement unit (IMU) 110 and/or a global navigation satellite system/global positioning system (GNSS/GPS) unit 112. In an example, the processor/SoC 106 may be configured to collect IMU data from the IMU 110. In an example, the processor/SoC 106 may be configured to collect accurate time information via a wireless connection (e.g., using a network time protocol (NTP)). In another example, the processor/SoC 106 may be configured to collect accurate time information from the GNSS/GPS unit 112. In another example, the processor/SoC 106 may be configured to determine accurate position information using the IMU data from the IMU 110 and an electronic map. In another example, the processor/SoC 106 may be configured to collect accurate position information from the GNSS/GPS unit 112. In an example, the IMU and GNSS/GPS data with timestamps may be utilized for post-processing in scene reconstruction.
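Timestamped IMU or GNSS/GPS samples can be associated with video frames in post-processing by a nearest-timestamp lookup. The Python sketch below is one simple possibility, not a method stated in this specification; the function name, sample rates, and tolerance are assumptions chosen for illustration.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_sample_index(sample_times: List[float], frame_time: float,
                         tolerance_s: float) -> Optional[int]:
    """Return the index of the sample closest in time to frame_time.

    sample_times must be sorted in ascending order. Returns None when the
    closest sample is further away than tolerance_s.
    """
    if not sample_times:
        return None
    pos = bisect_left(sample_times, frame_time)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sample_times)]
    best = min(candidates, key=lambda i: abs(sample_times[i] - frame_time))
    if abs(sample_times[best] - frame_time) > tolerance_s:
        return None
    return best


imu_times = [0.000, 0.010, 0.020, 0.030, 0.040]  # hypothetical 100 Hz IMU
frame_times = [0.000, 0.033]                     # roughly 30 FPS video
matches = [nearest_sample_index(imu_times, t, tolerance_s=0.005)
           for t in frame_times]
```

A lookup of this kind only works when all streams are stamped against the same time base, which is the role of the accurate time information described above.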
In various embodiments, the processor/SoC 106 is generally configured to output a frame synchronization signal (e.g., FSYNC) to the two monocular ADAS cameras 104a and 104b. The frame synchronization signal FSYNC generally ensures the two monocular ADAS cameras 104a and 104b synchronize CMOS sensor exposure timing of every frame. In an example, the processor/SoC 106 may be configured to generate the frame synchronization signal FSYNC based on a real-time clock signal. In an example, the real-time clock signal may be generated internally by a real-time clock module of the processor/SoC 106. In another example, the real-time clock signal may be obtained from an external source (e.g., using a network time protocol (NTP), using accurate time information collected from the GNSS/GPS unit 112, etc.). The processor/SoC 106 is generally further configured to communicate control and video signals with the two monocular ADAS cameras 104a and 104b.
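The specification does not state how FSYNC edges are derived from the real-time clock. As one plausible scheme, the edges could be placed on a fixed frame grid of the clock, so that every edge time is computed from an integer frame index rather than by repeatedly adding a frame period. The Python sketch below illustrates that idea; the function name, the 30 FPS rate, and the clock value are assumptions.

```python
import math
from typing import List


def next_fsync_edges(now_s: float, fps: int, count: int) -> List[float]:
    """Return the next `count` FSYNC edge times strictly after now_s.

    Edges lie on the grid k / fps of the real-time clock, so each edge is
    computed from an integer frame index and rounding error does not
    accumulate from one frame to the next.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if count < 0:
        raise ValueError("count must be non-negative")
    next_index = math.floor(now_s * fps) + 1
    return [(next_index + k) / fps for k in range(count)]


# Hypothetical clock reading of 100.01 s and a 30 FPS frame rate.
edges = next_fsync_edges(now_s=100.01, fps=30, count=3)
```

Because both cameras receive the same edge, their sensor exposures start together, and each edge time can double as the timestamp for the resulting frame pair.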
In various embodiments, the ground truth acquisition device 100 may be mounted in a housing 114. The housing 114 is generally configured to enclose the chassis structure (or camera mount) 102, the two monocular ADAS cameras 104a and 104b, the processor (or system-on-chip (SoC)) 106, and the memory 108. In embodiments implementing the IMU 110 and the GNSS/GPS unit 112, the IMU 110 and the GNSS/GPS unit 112 may also be enclosed within the housing 114. In an example, the housing 114 may be configured to mount the ground truth acquisition device 100 to the inside surface of the windshield 70 of the vehicle 50. In general, the ground truth acquisition device 100 is mounted at, or close to, the center of a width of the vehicle 50. In another example, the housing 114 may be configured to mount the ground truth acquisition device 100 to an exterior surface of the vehicle 50.
Referring to FIG. 3, a block diagram is shown illustrating an example implementation of a ground truth acquisition device in accordance with an example embodiment of the invention. In an example, the ground truth acquisition device 100 may comprise the camera (or capture device) 104a, the camera (or capture device) 104b, the processor/SoC 106, the memory 108, the IMU 110, and the GNSS/GPS module 112. In an example, the processor/SoC 106 may be implemented as a separate device from the IMU 110. In another example, the processor/SoC 106 and the IMU 110 may be combined in a single integrated device. In an example, the GNSS/GPS module 112 may be implemented using a pre-certified module.
In various embodiments, the ground truth acquisition device 100 may further comprise a block (or circuit) 152, a block (or circuit) 154, a block (or circuit) 156, a block (or circuit) 158, a block (or circuit) 160a, and/or a block (or circuit) 160b. The circuit 152 may implement a battery. The circuit 154 may implement a communication device (or module). The circuit 156 may implement a wireless interface. The circuit 158 may implement a general purpose processor. The blocks 160a and 160b may implement optical lenses. In some embodiments, the ground truth acquisition device 100 may comprise the processor/SoC 106, the capture devices 104a and 104b, the memory 108, the IMU 110, the lenses 160a and 160b, the battery 152, the communication module 154, the wireless interface 156, and the processor 158. In another example, the ground truth acquisition device 100 may comprise the processor/SoC 106, the capture device 104a, the capture device 104b, the IMU 110, the processor 158, the lens 160a, and the lens 160b as one device, and the memory 108, the battery 152, the communication module 154, and the wireless interface 156 may be components of a separate device. The ground truth acquisition device 100 may comprise other components (not shown). The number, type and/or arrangement of the components of the ground truth acquisition device 100 may be varied according to the design criteria of a particular implementation.
In some embodiments, the processor/SoC 106 may be implemented as a video processor. In an example, the processor/SoC 106 may be configured to receive multiple-sensor video input with high-speed SLVS/MIPI-CSI/LVCMOS interfaces. In some embodiments, the processor/SoC 106 may be configured to perform depth sensing in addition to generating video frames. In an example, the depth sensing may be performed in response to depth information captured in the video frames. In some embodiments, the processor/SoC 106 may be implemented as a dataflow vector processor. In an example, the processor/SoC 106 may comprise a highly parallel architecture configured to perform image/video processing.
The memory 108 may store data. The memory 108 may implement various types of memory including, but not limited to, a cache, flash memory, memory card, random access memory (RAM), dynamic RAM (DRAM) memory, etc. The type and/or size of the memory 108 may be varied according to the design criteria of a particular implementation. The data stored in the memory 108 may correspond to video information (e.g., frames, files, etc.), disparity and/or depth information, motion information (e.g., readings from the IMU 110), position information (e.g., data from the GNSS/GPS 112), time information (e.g., timestamps, data from the GNSS/GPS 112, etc.), video fusion parameters, image stabilization parameters, user inputs, computer vision models, feature sets, and/or metadata information. In various embodiments, the ground truth data generated by the ground truth acquisition device 100 may be stored in the memory 108. In some embodiments, the memory 108 may store the ground truth data comprising video image data, disparity data, depth map data, position data, motion data, timestamp data, etc. The video image data, disparity data, depth map data, position data, motion data, timestamp data, etc. may be used for computer vision operations, 3D reconstruction, scene reconstruction, auto-exposure, etc.
The processor/SoC 106 may be configured to execute computer readable code and/or process information. In various embodiments, the computer readable code may be stored within the processor/SoC 106 (e.g., microcode, etc.) and/or in the memory 108. In an example, the processor/SoC 106 may be configured to execute one or more artificial neural network models (e.g., facial recognition CNN, object detection CNN, object classification CNN, 3D reconstruction CNN, liveness detection CNN, etc.) stored in the memory 108. In an example, the memory 108 may store one or more directed acyclic graphs (DAGs) and one or more sets of weights and biases defining the one or more artificial neural network models. In yet another example, the memory 108 may store instructions to perform transformational operations (e.g., Discrete Cosine Transform, Discrete Fourier Transform, Fast Fourier Transform, etc.). The processor/SoC 106 may be configured to receive input from and/or present output to the memory 108. The processor/SoC 106 may be configured to store the ground truth data generated by the ground truth acquisition device 100 in the memory 108. The processor/SoC 106 may be configured to present and/or receive other signals (not shown). The number and/or types of inputs and/or outputs of the processor/SoC 106 may be varied according to the design criteria of a particular implementation. The processor/SoC 106 may be configured for low power (e.g., battery) operation.
The battery 152 may be configured to store and/or supply power for the components of the ground truth acquisition device 100. In some embodiments, the ground truth acquisition device 100 may include a dynamic driver mechanism for a rolling shutter sensor that may be configured to reduce power consumption. Reducing the power consumption may enable the ground truth acquisition device 100 to operate using the battery 152 for extended periods of time without recharging. The battery 152 may be rechargeable. The battery 152 may be built-in (e.g., non-replaceable) or replaceable. The battery 152 may have an input for connection to an external power source (e.g., for charging). In some embodiments, the ground truth acquisition device 100 may be powered by an external power supply (e.g., the battery 152 may not be implemented or may be implemented as a back-up power supply). The battery 152 may be implemented using various battery technologies and/or chemistries. The type of the battery 152 implemented may be varied according to the design criteria of a particular implementation.
The communications module 154 may be configured to implement one or more communications protocols. For example, the communications module 154 and the wireless interface 156 may be configured to implement one or more of IEEE 802.11, IEEE 802.15, IEEE 802.15.1, IEEE 802.15.2, IEEE 802.15.3, IEEE 802.15.4, IEEE 802.15.5, IEEE 802.20, Bluetooth®, and/or ZigBee®. In some embodiments, the communication module 154 may be a hard-wired data port (e.g., a USB port, a mini-USB port, a USB-C connector, an HDMI port, an Ethernet port, a DisplayPort interface, a Lightning port, etc.). In some embodiments, the wireless interface 156 may also implement one or more protocols (e.g., GSM, CDMA, GPRS, UMTS, CDMA2000, 3GPP LTE, 4G/HSPA/WiMAX, SMS, etc.) associated with cellular communication networks. In embodiments where the ground truth acquisition device 100 is implemented as a wireless camera, the protocol implemented by the communications module 154 and the wireless interface 156 may be a wireless communications protocol. The type of communications protocols implemented by the communications module 154 may be varied according to the design criteria of a particular implementation.
The communications module 154 and/or the wireless interface 156 may be configured to generate a broadcast signal as an output from the ground truth acquisition device 100. The broadcast signal may send video data, disparity data, ground truth data, and/or control signal(s) to external devices. For example, the broadcast signal may be sent to a cloud storage service (e.g., a storage service capable of scaling on demand). In some embodiments, the communications module 154 may not transmit data until the processor/SoC 106 has performed video analytics to determine that an object is in the field of view of the ground truth acquisition device 100.
In some embodiments, the communications module 154 may be configured to generate a manual control signal. The manual control signal may be generated in response to a signal from a user received by the communications module 154. The manual control signal may be configured to activate the processor/SoC 106. The processor/SoC 106 may be activated in response to the manual control signal regardless of the power state of the ground truth acquisition device 100.
In some embodiments, the communications module 154 and/or the wireless interface 156 may be configured to receive a feature set. The feature set received may be used to detect events and/or objects. For example, the feature set may be used to perform the computer vision operations. The feature set information may comprise instructions for the processor/SoC 106 for determining which types of objects correspond to an object and/or event of interest.
In some embodiments, the communications module 154 and/or the wireless interface 156 may be configured to receive user input. The user input may enable a user to adjust operating parameters for various features implemented by the processor/SoC 106. In some embodiments, the communications module 154 and/or the wireless interface 156 may be configured to interface (e.g., using an application programming interface (API)) with an application (e.g., an app). For example, the app may be implemented on a smartphone to enable an end user to adjust various settings and/or parameters for the various features implemented by the processor/SoC 106 (e.g., set video resolution, select frame rate, select output format, set tolerance parameters for 3D reconstruction, etc.).
The processor 158 may be implemented using a general purpose processor circuit. The processor 158 may be operational to interact with the processor/SoC 106 and the memory 108 to perform various processing tasks. The processor 158 may be configured to execute computer readable instructions. In one example, the computer readable instructions may be stored by the memory 108. In some embodiments, the processor 158 may send data to and/or receive data from other components of the ground truth acquisition device 100 (e.g., the battery 152, the communication module 154 and/or the wireless interface 156). In some embodiments, the processor 158 may implement an integrated digital signal processor (IDSP). For example, the IDSP 158 may be configured to implement a warp engine. The division of the functionality of the ground truth acquisition device 100 between the processor/SoC 106 and the general purpose processor 158 may be varied according to the design criteria of a particular implementation.
The lenses 160a and 160b may be attached to the capture devices 104a and 104b, respectively. The capture devices 104a and 104b may be configured to receive an input signal (e.g., LIN) via the lenses 160a and 160b. The signal LIN may be a light input (e.g., an analog image). The lenses 160a and 160b may be implemented as optical lenses. The lenses 160a and 160b may provide a zooming feature and/or a focusing feature. The capture device 104a and/or the lens 160a may be implemented, in one example, as a single lens assembly. In another example, the lens 160a may be a separate implementation from the capture device 104a. The capture device 104b and/or the lens 160b may be implemented, in one example, as a single lens assembly. In another example, the lens 160b may be a separate implementation from the capture device 104b.
The capture devices 104a and 104b may be configured to convert the input light LIN into computer readable data. The capture devices 104a and 104b may capture data received through the lenses 160a and 160b to generate raw pixel data. In some embodiments, the capture devices 104a and 104b may capture data received through the lenses 160a and 160b to generate bitstreams. In an example, the bitstreams may comprise pixel data arranged as video frames. For example, the capture devices 104a and 104b may receive focused light from the lenses 160a and 160b. The lenses 160a and 160b may be directed, tilted, panned, zoomed and/or rotated to provide a targeted view from the ground truth acquisition device 100 (e.g., a view for a video image, etc.). The capture device 104a may generate a signal (e.g., VIDEO_a). The capture device 104b may generate a signal (e.g., VIDEO_b). The signals VIDEO_a and VIDEO_b may comprise pixel data (e.g., a sequence of pixels that may be used to generate video frames). In some embodiments, the signals VIDEO_a and VIDEO_b may comprise video data (e.g., a sequence of video frames). The signals VIDEO_a and VIDEO_b may be presented to one or more of the inputs of the processor/SoC 106. In some embodiments, the pixel data generated by the capture devices 104a and 104b may be uncompressed and/or raw data generated in response to the focused light from the lenses 160a and 160b. In some embodiments, the output of the capture devices 104a and 104b may be digital video signals.
In an example, the capture device 104a may comprise a block (or circuit) 180a, a block (or circuit) 182a, and a block (or circuit) 184a, and the capture device 104b may comprise a block (or circuit) 180b, a block (or circuit) 182b, and a block (or circuit) 184b. The circuits 180a and 180b may be image sensors. The circuits 182a and 182b may be a processor and/or logic. The circuits 184a and 184b may be a memory circuit (e.g., a frame buffer). The lenses 160a and 160b (e.g., camera lenses) may be directed to provide a view of an external environment of the ground truth acquisition device 100. The lenses 160a and 160b may be aimed to capture environmental data (e.g., the light input LIN). The lenses 160a and 160b may be wide-angle lenses and/or fish-eye lenses (e.g., lenses capable of capturing a wide field of view). The lenses 160a and 160b may be configured to capture and/or focus the light for the capture devices 104a and 104b. Generally, the image sensors 180a and 180b are located behind the lenses 160a and 160b. Based on the captured light from the lenses 160a and 160b, the capture devices 104a and 104b may generate a bitstream and/or video data (e.g., the signals VIDEO_a and VIDEO_b).
The capture devices 104a and 104b may be configured to capture video image data (e.g., light collected and focused by the lenses 160a and 160b). The capture devices 104a and 104b may capture data received through the lenses 160a and 160b to generate a video bitstream (e.g., pixel data for a sequence of video frames). In various embodiments, the lenses 160a and 160b may be implemented as fixed focus lenses. A fixed focus lens generally facilitates smaller size and low power. In an example, a fixed focus lens may be used in battery powered and other low power camera applications. In some embodiments, the lenses 160a and 160b may be directed, tilted, panned, zoomed and/or rotated to capture the environment surrounding the ground truth acquisition device 100 (e.g., capture data from the field of view). In an example, professional camera models may be implemented with an active lens system for enhanced functionality, remote control, etc.
The capture devices 104a and 104b may transform the received light into a digital data stream. In some embodiments, the capture devices 104a and 104b may perform an analog to digital conversion. For example, the image sensors 180a and 180b may perform a photoelectric conversion of the light received by the lenses 160a and 160b. The processor/logic circuits 182a and 182b may transform the digital data stream into a video data stream (or bitstream), a video file, and/or a number of video frames. In an example, the capture devices 104a and 104b may present the video data as a digital video signal (e.g., the signals VIDEO_a and VIDEO_b). The digital video signals may comprise the video frames (e.g., sequential digital images and/or audio). In some embodiments, the capture devices 104a and 104b may comprise a microphone for capturing audio.
The video data captured by the capture devices 104a and 104b may be represented as signals/bitstreams/data VIDEO_a and VIDEO_b (e.g., digital video signals). The capture devices 104a and 104b may present the signals VIDEO_a and VIDEO_b to the processor/SoC 106. The signals VIDEO_a and VIDEO_b may represent the video frames/video data. The signals VIDEO_a and VIDEO_b may be video streams captured by the capture devices 104a and 104b. In some embodiments, the signals VIDEO_a and VIDEO_b may comprise pixel data that may be operated on by the processor/SoC 106 (e.g., in a video processing pipeline, an image signal processor (ISP), etc.). The processor/SoC 106 may generate video frames in response to the pixel data in the signals VIDEO_a and VIDEO_b.
The signals VIDEO_a and VIDEO_b may comprise pixel data arranged as video frames. In some embodiments, the signals VIDEO_a and VIDEO_b may be images comprising a background (e.g., objects and/or the environment captured) and the speckle pattern generated by a structured light projector. The signals VIDEO_a and VIDEO_b may comprise single-channel source images. The single-channel source images may be generated in response to capturing the pixel data using the monocular lenses 160a and 160b.
The image sensors 180a and 180b may receive the input light LIN from the lenses 160a and 160b. The image sensors 180a and 180b may transform the light LIN into digital data (e.g., the bitstreams). For example, the image sensors 180a and 180b may perform a photoelectric conversion of the light from the lenses 160a and 160b. In an example, the sensors 180a and 180b may be complementary metal oxide semiconductor (CMOS) sensors. In some embodiments, the image sensors 180a and 180b may have extra margins that are not used as part of the image output. In some embodiments, the image sensors 180a and 180b may not have extra margins. In various embodiments, the image sensors 180a and 180b may be implemented as an RGB sensor, an RGB-IR sensor, an RGGB sensor, a monochrome image sensor, a thermal sensor, an event-based sensor, etc. However, other color pattern sensors may be implemented accordingly. For example, the image sensors 180a and 180b may be any type of sensor configured to provide sufficient output for computer vision operations to be performed on the output data (e.g., neural network-based detection, etc.). In an example, the image sensors 180a and 180b may be configured to generate an RGB-IR video signal. In a field of view illuminated only by infrared light, the image sensors 180a and 180b may generate a monochrome (B/W) video signal. In a field of view illuminated by both IR light and visible light, the image sensors 180a and 180b may be configured to generate color information in addition to the monochrome video signal. In various embodiments, the image sensors 180a and 180b may be configured to generate a video signal in response to visible and/or infrared (IR) light.
In various embodiments, the camera sensors 180a and 180b may comprise a rolling shutter sensor or a global shutter sensor. In various embodiments, a pair of matched (e.g., identical) monocular ADAS cameras is mounted on the chassis structure 102 for enabling stereo matching. In an example, rolling shutter sensors generally do not create a problem for stereo vision when the shutter timing of the left and right cameras is synchronized, because any rolling shutter artifacts caused by motion have a similar effect in the vertical direction for both the left and right cameras.
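The role of synchronized shutter timing may be illustrated with a minimal Python sketch. The sketch assumes a simple model in which each sensor row begins exposure a fixed line time after the previous row. The names `row_exposure_start`, `row_skew_us`, and `LINE_TIME_US`, and all numeric values, are hypothetical and are not taken from the described embodiments.

```python
# Minimal sketch (assumed model): each row of a rolling shutter sensor starts
# exposure a fixed line time after the previous row.
LINE_TIME_US = 15  # hypothetical line time in microseconds


def row_exposure_start(frame_start_us, row, line_time_us=LINE_TIME_US):
    """Return the time (in microseconds) at which the given row starts exposure."""
    return frame_start_us + row * line_time_us


def row_skew_us(left_start_us, right_start_us, row):
    """Exposure-start difference between the same row of the left and right sensors."""
    return row_exposure_start(right_start_us, row) - row_exposure_start(left_start_us, row)


# With a shared frame synchronization edge, every row pair starts exposure together,
# so motion shifts the same rows of both images in the same way.
synced = [row_skew_us(1_000, 1_000, r) for r in range(1080)]

# Without synchronization, the frame start offset appears on every row pair.
unsynced = [row_skew_us(1_000, 1_400, r) for r in range(1080)]
```

In this model, a zero skew for every row means that matching rows of the left and right images sample the scene at the same instant, which is the property that stereo matching along rows relies on.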
In an example, the rolling shutter sensors 180a and 180b may implement RGB-IR sensors. In an example, the rolling shutter sensors 180a and 180b may be implemented as an RGB-IR rolling shutter complementary metal oxide semiconductor (CMOS) image sensor. In some embodiments, the sensors 180a and 180b of the capture devices 104a and 104b may be implemented as separate components. In an example, the capture devices 104a and 104b may comprise a rolling shutter IR sensor and an RGB sensor.
In one example, the rolling shutter sensors 180a and 180b may be configured to assert a signal that indicates a first line exposure time. In one example, the rolling shutter sensors 180a and 180b may apply a mask to a monochrome sensor. In an example, the mask may comprise a plurality of units containing one red pixel, one green pixel, one blue pixel, and one IR pixel. The IR pixel may contain red, green, and blue filter materials that effectively absorb all of the light in the visible spectrum, while allowing the longer infrared wavelengths to pass through with minimal loss. With a rolling shutter, as each line (or row) of the sensor starts exposure, all pixels in the line (or row) may start exposure simultaneously.
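The mask of repeating units may be illustrated with a minimal Python sketch that separates a raw mosaic into one plane per filter type. The 2x2 arrangement in `UNIT` is a hypothetical layout chosen only for illustration, since the arrangement of the four pixels within a unit is not specified above, and the function name `split_rgbir` is likewise an assumption.

```python
# Hypothetical 2x2 unit layout (row-major); an actual RGB-IR sensor may order
# the four filters differently.
UNIT = (("R", "G"), ("IR", "B"))


def split_rgbir(raw):
    """Split a raw mosaic (list of rows) into one quarter-resolution plane per channel."""
    planes = {"R": [], "G": [], "B": [], "IR": []}
    for y in range(0, len(raw) - 1, 2):
        rows = {name: [] for name in planes}
        for x in range(0, len(raw[y]) - 1, 2):
            # Visit the four pixels of the unit whose top-left corner is (x, y).
            for dy in range(2):
                for dx in range(2):
                    rows[UNIT[dy][dx]].append(raw[y + dy][x + dx])
        for name in planes:
            planes[name].append(rows[name])
    return planes


raw = [
    [10, 20, 11, 21],
    [30, 40, 31, 41],
    [12, 22, 13, 23],
    [32, 42, 33, 43],
]
planes = split_rgbir(raw)
```

Each resulting plane has half the width and half the height of the raw mosaic, since each unit contributes exactly one sample per channel.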
The processor/logic circuits 182a and 182b may transform the bitstream into human viewable content (e.g., video data that may be understandable to an average person regardless of image quality, such as the video frames and/or pixel data that may be converted into video frames by the processor/SoC 106). For example, the processor/logic circuits 182a and 182b may receive pure (e.g., raw) data from the image sensors 180a and 180b and generate (e.g., encode) video data (e.g., the bitstream) based on the raw data. The capture devices 104a and 104b may have the memories 184a and 184b to store the raw data and/or the processed bitstream. For example, the capture devices 104a and 104b may implement the frame memories and/or buffers 184a and 184b to store (e.g., provide temporary storage and/or cache) one or more of the video frames (e.g., the digital video signal).
In some embodiments, the processor/logic circuits 182a and 182b may perform analysis and/or correction on the video frames stored in the memories/buffers 184a and 184b of the capture devices 104a and 104b. In an example, the processor/logic circuits 182a and 182b may implement an image digital signal processing pipeline. In an example, the processor/logic circuits 182a and 182b may apply warp parameters received from the processor/SoC 106 to the video frames stored in the memories/buffers 184a and 184b of the capture devices 104a and 104b. After the processor/logic circuits 182a and 182b have applied the warp parameters to the video frames stored in the memories/buffers 184a and 184b of the capture devices 104a and 104b, the processor/logic circuits 182a and 182b may communicate the warped video frames to the processor/SoC 106 (e.g., via the signals VIDEO_a and VIDEO_b). The processor/logic circuits 182a and 182b may provide status information about the captured video frames.
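Applying warp parameters to a buffered frame may be illustrated with a minimal Python sketch. The sketch assumes that the warp parameters take the form of a per-pixel lookup table giving, for each output pixel, the source coordinate to sample, and it uses nearest-neighbour sampling for brevity. The table format and the names `apply_warp_table` and `shift_table` are hypothetical and are not taken from the described embodiments.

```python
def apply_warp_table(frame, warp_table, fill=0):
    """Build an output frame by looking up, for each output pixel, its source pixel.

    frame:      list of rows of pixel values.
    warp_table: list of rows of (src_x, src_y) pairs, one pair per output pixel.
    fill:       value used when the source coordinate falls outside the frame.
    """
    height, width = len(frame), len(frame[0])
    out = []
    for table_row in warp_table:
        out_row = []
        for src_x, src_y in table_row:
            x, y = round(src_x), round(src_y)  # nearest-neighbour sampling
            inside = 0 <= x < width and 0 <= y < height
            out_row.append(frame[y][x] if inside else fill)
        out.append(out_row)
    return out


def shift_table(width, height, dx, dy):
    """Hypothetical warp table describing a pure translation by (dx, dy) pixels."""
    return [[(x + dx, y + dy) for x in range(width)] for y in range(height)]


frame = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
# Sample each output pixel from one row lower in the source frame, which is
# the kind of small vertical correction used to line up rows of a stereo pair.
warped = apply_warp_table(frame, shift_table(3, 3, 0, 1))
```

A practical warp engine would typically interpolate between neighbouring source pixels and use a sparser table; the lookup structure is the part the sketch is meant to show.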
The IMU 110 may be configured to detect motion and/or movement of the ground truth acquisition device 100. The IMU 110 is shown receiving a signal (e.g., MTN). The signal MTN may comprise a combination of forces acting on the camera system 100. The signal MTN may comprise movement, vibrations, shakiness, a panning direction, jerkiness, etc. The signal MTN may represent movement (e.g., pitch/yaw/roll) in three dimensional space (e.g., movement in an X direction, a Y direction and a Z direction). In an example, the IMU 110 may be synchronized by the processor/SoC 106 with the capture devices 104a and 104b. In an example, the sensor data captured by the image sensors 180a and 180b and the IMU data captured by the IMU 110 may each have accurate timestamps that allow subsequent matching of the data. In an example, the IMU 110 and the capture devices 104a and 104b are tightly coupled by their mounting in the ground truth acquisition device 100. Tightly coupling the IMU 110 and the capture devices 104a and 104b generally allows the IMU 110 to assist in calculating rotation angles around the X/Y/Z axes and to help in scene reconstruction (e.g., using a Simultaneous Localization and Mapping (SLAM) algorithm). In an example, the IMU data may also help in determining road information (e.g., whether the road is flat or sloped, and angle information). The type and/or amount of motion received (detected) by the IMU 110 may be varied according to the design criteria of a particular implementation.
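The subsequent matching of timestamped image data and IMU data may be illustrated with a minimal Python sketch. The sketch assumes that both streams carry timestamps from a common clock, that the IMU timestamps are sorted and non-empty, and that a frame is paired with the nearest IMU sample within a tolerance. The names `nearest_sample` and `match_frames_to_imu`, the tolerance `max_gap`, and the millisecond values are hypothetical.

```python
from bisect import bisect_left


def nearest_sample(timestamps, t):
    """Return the index of the entry in the sorted, non-empty list closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose between the neighbours on either side of t (ties go to the earlier one).
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def match_frames_to_imu(frame_ts, imu_ts, max_gap):
    """Pair each frame timestamp with the nearest IMU sample index, or None if too far."""
    pairs = []
    for t in frame_ts:
        j = nearest_sample(imu_ts, t)
        pairs.append((t, j if abs(imu_ts[j] - t) <= max_gap else None))
    return pairs


imu_ts = [0, 5, 10, 15, 20, 25, 30]  # hypothetical IMU sample times (ms)
frame_ts = [1, 34, 67]               # hypothetical frame capture times (ms)
pairs = match_frames_to_imu(frame_ts, imu_ts, max_gap=5)
```

In this example the last frame has no IMU sample within the tolerance and is left unmatched, rather than being paired with stale motion data.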
In an example, the IMU 110 may comprise a block (or circuit) 186. The circuit 186 may implement a motion sensor. In an example, the motion sensor 186 may comprise an accelerometer. In one example, the motion sensor 186 may comprise a gyroscope. The gyroscope 186 may be configured to measure the amount of movement. For example, the gyroscope 186 may be configured to detect an amount and/or direction of the movement of the ground truth acquisition device 100 (e.g., the signal MTN) and convert the movement into electrical data. The IMU 110 may be configured to determine the amount of movement and/or the direction of movement measured by the gyroscope 186. The IMU 110 may convert the electrical data from the gyroscope 186 into a format readable by the processor/SoC 106. The IMU 110 may be configured to generate a signal (e.g., M_INFO). The signal M_INFO may comprise the measurement information in the format readable by the processor/SoC 106. The IMU 110 may present the signal M_INFO to the processor/SoC 106. The number, type and/or arrangement of the components of the IMU 110 and/or the number, type and/or functionality of the signals communicated by the IMU 110 may be varied according to the design criteria of a particular implementation.
The GNSS/GPS module 112 may be configured to generate accurate time, motion, and/or position information. In an example, the GNSS/GPS module 112 may generate a signal (e.g., TIME) that provides accurate time information. The time information provided by the signal TIME may be used by the processor/SoC 106 to generate timestamps for the data communicated by the signals VIDEO_a, VIDEO_b, and M_INFO. In some embodiments, the GNSS/GPS module 112 may provide data that enriches the data captured by the IMU 110 and the capture devices 104a and 104b, making the captured data more useful when used to create a high-definition (HD) map. The accurate geographical information and accurate time provided by the GNSS/GPS module 112 may be useful in post processing scene reconstruction. In an example, scene reconstruction may include, but is not limited to, calculating building height and size, road width, vehicle distances and speed on the road, the topology of the road, the traffic signs and traffic lights of the road, etc. In an example, the GNSS/GPS module 112 may include, but is not limited to, standard GPS, differential GPS (dGPS), and GNSS with real time kinematics (RTK) corrections.
The processor/SoC 106 may receive the signals VIDEO_a and VIDEO_b, the signal M_INFO, and the signal TIME. The processor/SoC 106 may generate the frame synchronization signal FSYNC, one or more video output signals (e.g., VIDOUT), one or more control signals (e.g., CTRLa, CTRLb, CTRL, etc.), one or more depth data signals (e.g., DIMAGES), and/or one or more warp table data signals (e.g., WT) based on the signals VIDEO_a and VIDEO_b, the signal M_INFO, the signal TIME, and/or other input. In some embodiments, the signals VIDOUT, DIMAGES, WT, and CTRL may be generated based on analysis of the signals VIDEO_a and VIDEO_b and/or objects detected in the signals VIDEO_a and VIDEO_b. In some embodiments, the signals VIDOUT, DIMAGES, WT, and CTRL may be generated based on analysis of the signals VIDEO_a and VIDEO_b, the movement information captured by the IMU 110, the intrinsic properties of the lenses 160a and 160b, and/or the intrinsic properties of the capture devices 104a and 104b. In various embodiments, the processor/SoC 106 communicates the frame synchronization signal FSYNC and the warp table data signals to the capture devices 104a and 104b to enable the capture devices 104a and 104b to align the respective video images contained in the signals VIDEO_a and VIDEO_b.
In various embodiments, the processor/SoC 106 may be configured to perform one or more of feature extraction, object detection, object tracking, electronic image stabilization, 3D reconstruction, liveness detection and object identification. For example, the processor/SoC 106 may determine motion information and/or depth information by analyzing and comparing frames from the signals VIDEO_a and VIDEO_b. The comparison may be used to perform digital motion estimation. In some embodiments, the processor/SoC 106 may be configured to generate the video output signal VIDOUT comprising video data, the warp table data signal WT, and/or the depth data signal DIMAGES comprising disparity maps and depth maps from the signals VIDEO_a and VIDEO_b. The video output signal VIDOUT, the warp table data signal WT, and/or the depth data signal DIMAGES may be presented to the memory 108, the communications module 154, and/or the wireless interface 156. In some embodiments, the video signal VIDOUT, the warp table data signal WT, and/or the depth data signal DIMAGES may be used internally by the processor/SoC 106 (e.g., not presented as output). In one example, the warp table data signal WT may be used by a warp engine implemented by a digital signal processor (e.g., the processor/logic circuits 182a and 182b) in the capture devices 104a and 104b.
The signal VIDOUT may be presented to the communication device 156. In some embodiments, the signal VIDOUT may comprise encoded video frames generated by the processor/SoC 106. In some embodiments, the encoded video frames may comprise a full video stream (e.g., encoded video frames representing all video captured by the capture devices 104a and 104b). The encoded video frames may be encoded, cropped, stitched, stabilized and/or enhanced versions of the pixel data received from the signals VIDEO_a and VIDEO_b. In an example, the encoded video frames may be a high resolution, digital, encoded, de-warped, stabilized, cropped, blended, stitched and/or rolling shutter effect corrected version of the signals VIDEO_a and VIDEO_b.
In some embodiments, the signal VIDOUT may be generated based on video analytics (e.g., computer vision operations) performed by the processor/SoC 106 on the video frames generated. The processor/SoC 106 may be configured to perform the computer vision operations to detect objects and/or events in the video frames and then convert the detected objects and/or events into statistics and/or parameters. In one example, the data determined by the computer vision operations may be converted to a human-readable format by the processor/SoC 106. The data from the computer vision operations may be used to detect objects and/or events. The computer vision operations may be performed by the processor/SoC 106 locally (e.g., without communicating to an external device to offload computing operations). Similarly, other video processing and/or encoding operations (e.g., stabilization, compression, stitching, cropping, rolling shutter effect correction, etc.) may be performed by the processor/SoC 106 locally. For example, the locally performed computer vision operations may enable the computer vision operations to be performed by the processor/SoC 106 and avoid heavy video processing running on back-end servers. Avoiding video processing running on back-end (e.g., remotely located) servers may preserve privacy.
In some embodiments, the signal VIDOUT may be data generated by the processor/SoC 106 (e.g., video analysis results, audio/speech analysis results, stabilized video frames, etc.) that may be communicated to a cloud computing service in order to aggregate information and/or provide training data for machine learning (e.g., to improve object detection, to improve audio detection, to improve liveness detection, etc.). In some embodiments, the signal VIDOUT may be provided to a cloud service for mass storage (e.g., to enable a user to retrieve the encoded video using a smartphone and/or a desktop computer). In some embodiments, the signal VIDOUT may comprise the data extracted from the video frames (e.g., the results of the computer vision), and the results may be communicated to another device (e.g., a remote server, a cloud computing system, etc.) to offload analysis of the results to another device (e.g., offload analysis of the results to a cloud computing service instead of performing all the analysis locally). The type of information communicated by the signal VIDOUT may be varied according to the design criteria of a particular implementation.
The signal CTRL may be configured to provide a control signal. The signal CTRL may be generated in response to decisions made by the processor/SoC 106. In one example, the signal CTRL may be generated in response to objects detected and/or characteristics extracted from the video frames. The signal CTRL may be configured to enable, disable, and/or change a mode of operation of another device. In one example, a door controlled by an electronic lock may be locked/unlocked in response to the signal CTRL. In another example, a device may be set to a sleep mode (e.g., a low-power mode) and/or activated from the sleep mode in response to the signal CTRL. In yet another example, an alarm and/or a notification may be generated in response to the signal CTRL. The type of device controlled by the signal CTRL, and/or a reaction performed by the device in response to the signal CTRL, may be varied according to the design criteria of a particular implementation.
The signal CTRL may be generated based on additional data received by the processor/SoC 106 (e.g., a temperature reading, a motion sensor reading, etc.). The signal CTRL may be generated based on input from a human interface device (HID). The signal CTRL may be generated based on behaviors of objects detected in the video frames by the processor/SoC 106. The signal CTRL may be generated based on a type of object detected (e.g., a person, an animal, a vehicle, etc.). The signal CTRL may be generated in response to particular types of objects being detected in particular locations. The signal CTRL may be generated in response to user input in order to provide various parameters and/or settings to the processor/SoC 106 and/or the memory 108. The processor/SoC 106 may be configured to generate the signal CTRL in response to sensor fusion operations (e.g., aggregating information received from disparate sources). The processor/SoC 106 may be configured to generate the signal CTRL in response to results of liveness detection performed by the processor/SoC 106. The conditions for generating the signal CTRL may be varied according to the design criteria of a particular implementation.
The signal DIMAGES may comprise one or more of depth maps and/or disparity maps generated by the processor/SoC 106. The signal DIMAGES may be generated in response to 3D reconstruction performed on the monocular single-channel images. The signal DIMAGES may be generated in response to analysis of the captured video data and/or a structured light pattern.
A multi-step approach to activating and/or disabling the capture devices 104a and 104b and/or any other power consuming features of the ground truth acquisition device 100 may be implemented to reduce a power consumption of the ground truth acquisition device 100 and extend an operational lifetime of the battery 152. In an example, a motion sensor may have a low drain on the battery 152 (e.g., less than 10 W). In an example, the motion sensor may be configured to remain on (e.g., always active) unless disabled in response to feedback from the processor/SoC 106. The video analytics performed by the processor/SoC 106 may have a relatively large drain on the battery 152 (e.g., greater than the IMU 110). In an example, the processor/SoC 106 may be in a low-power state (or power-down) until some motion is detected.
The ground truth acquisition device 100 may be configured to operate using various power states. For example, in the power-down state (e.g., a sleep state, a low-power state) the motion sensor of the sensors 164 and the processor/SoC 106 may be on and other components of the ground truth acquisition device 100 (e.g., the image capture devices 104a and 104b, the memory 108, the communications module 154, etc.) may be off. In another example, the ground truth acquisition device 100 may operate in an intermediate state. In the intermediate state, the image capture devices 104a and 104b may be on and the memory 108 and/or the communications module 154 may be off. In yet another example, the ground truth acquisition device 100 may operate in a power-on (or high power) state. In the power-on state, the processor/SoC 106, the capture devices 104a and 104b, the memory 108, and/or the communications module 154 may be on. The ground truth acquisition device 100 may consume some power from the battery 152 in the power-down state (e.g., a relatively small and/or minimal amount of power). The ground truth acquisition device 100 may consume more power from the battery 152 in the power-on state. The number of power states and/or the components of the ground truth acquisition device 100 that are on while the ground truth acquisition device 100 operates in each of the power states may be varied according to the design criteria of a particular implementation.
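The power states described above may be illustrated with a minimal Python sketch. The component labels, the `PowerState` enumeration and the `next_state` transition policy (step up one state when motion is detected, otherwise return to power-down) are assumptions made for illustration only and are not taken from the described implementation.

```python
from enum import Enum


class PowerState(Enum):
    POWER_DOWN = 0
    INTERMEDIATE = 1
    POWER_ON = 2


# Components that are on in each state, following the description above.
# The string labels are illustrative names, not identifiers from the device.
ACTIVE_COMPONENTS = {
    PowerState.POWER_DOWN: {"motion_sensor", "processor"},
    PowerState.INTERMEDIATE: {"motion_sensor", "processor", "capture_devices"},
    PowerState.POWER_ON: {
        "motion_sensor", "processor", "capture_devices",
        "memory", "communications",
    },
}


def is_on(state, component):
    """Return True if the component is powered in the given state."""
    return component in ACTIVE_COMPONENTS[state]


def next_state(state, motion_detected):
    """Hypothetical policy: step up one state on motion, else power down."""
    if not motion_detected:
        return PowerState.POWER_DOWN
    return PowerState(min(state.value + 1, PowerState.POWER_ON.value))
```

In this sketch, repeated motion events raise the device from the power-down state through the intermediate state to the power-on state, and the absence of motion returns the device to the power-down state.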
In some embodiments, the ground truth acquisition device 100 may be implemented as a system on chip (SoC). In some embodiments, the ground truth acquisition device 100 may be implemented as a printed circuit board comprising one or more components. The ground truth acquisition device 100 may be configured to perform intelligent video analysis on the video frames of the video. The ground truth acquisition device 100 may be configured to crop and/or enhance the video.
In some embodiments, the video frames may be some view (or derivative of some view) captured by the capture devices 104a and 104b. The pixel data signals may be enhanced by the processor/SoC 106 (e.g., color conversion, noise filtering, auto exposure, auto white balance, auto focus, etc.). In some embodiments, the video frames may provide a series of cropped and/or enhanced video frames that improve upon the view from the perspective of the ground truth acquisition device 100 (e.g., provides night vision, provides High Dynamic Range (HDR) imaging, provides more viewing area, highlights detected objects, provides additional data such as a numerical distance to detected objects, etc.) to enable the processor/SoC 106 to see the location better than a person would be capable of with human vision.
The encoded video frames may be processed locally. In one example, the encoded video may be stored locally by the memory 108 to enable the processor/SoC 106 to facilitate the computer vision analysis internally (e.g., without first uploading video frames to a cloud service). The processor/SoC 106 may be configured to select the video frames to be packetized as a video stream that may be transmitted over a network (e.g., a bandwidth limited network).
In some embodiments, the processor/SoC 106 may be configured to perform sensor fusion operations. The sensor fusion operations performed by the processor/SoC 106 may be configured to analyze information from multiple sources (e.g., the capture device 104a, the capture device 104b, the IMU 110, and the GNSS/GPS module 112). By analyzing various data from disparate sources, the sensor fusion operations may be capable of making inferences about the data that may not be possible from one of the data sources alone. For example, the sensor fusion operations implemented by the processor/SoC 106 may analyze video data (e.g., mouth movements of people) as well as the speech patterns from directional audio. The disparate sources may be used to develop a model of a scenario to support decision making. For example, the processor/SoC 106 may be configured to compare the synchronization of the detected speech patterns with the mouth movements in the video frames to determine which person in a video frame is speaking. The sensor fusion operations may also provide time correlation, spatial correlation and/or reliability among the data being received.
In some embodiments, the processor/SoC 106 may implement convolutional neural network capabilities. The convolutional neural network capabilities may implement computer vision using deep learning techniques. The convolutional neural network capabilities may be configured to implement pattern and/or image recognition using a training process through multiple layers of feature-detection. The computer vision and/or convolutional neural network capabilities may be performed locally by the processor/SoC 106. In some embodiments, the processor/SoC 106 may receive training data and/or feature set information from an external source. For example, an external device (e.g., a cloud service) may have access to various sources of data to use as training data that may be unavailable to the ground truth acquisition device 100. However, the computer vision operations performed using the feature set may be performed using the computational resources of the processor/SoC 106 within the ground truth acquisition device 100.
A video pipeline of the processor/SoC 106 may be configured to locally perform de-warping, cropping, enhancements, rolling shutter corrections, stabilizing, downscaling, packetizing, compression, conversion, blending, synchronizing and/or other video operations. The video pipeline of the processor/SoC 106 may enable multi-stream support (e.g., generate multiple bitstreams in parallel, each comprising a different bitrate). In an example, the video pipeline of the processor/SoC 106 may implement an image signal processor (ISP) with a 320 MPixels/s input pixel rate. The architecture of the video pipeline of the processor/SoC 106 may enable the video operations to be performed on high resolution video and/or high bitrate video data in real-time and/or near real-time. The video pipeline of the processor/SoC 106 may enable computer vision processing on 4K resolution video data, stereo vision processing, object detection, 3D noise reduction, fisheye lens correction (e.g., real time 360-degree dewarping and lens distortion correction), oversampling and/or high dynamic range processing. In one example, the architecture of the video pipeline may enable 4K ultra high resolution with H.264 encoding at double real time speed (e.g., 60 fps), 4K ultra high resolution with H.265/HEVC at 30 fps and/or 4K AVC encoding (e.g., 4KP30 AVC and HEVC encoding with multi-stream support). The type of video operations and/or the type of video data operated on by the processor/SoC 106 may be varied according to the design criteria of a particular implementation.
The camera sensors 180a and 180b may implement a high-resolution sensor. Using the high resolution sensors 180a and 180b, the processor/SoC 106 may combine over-sampling of the image sensors 180a and 180b with digital zooming within a cropped area. The over-sampling and digital zooming may each be one of the video operations performed by the processor/SoC 106. The over-sampling and digital zooming may be implemented to deliver higher resolution images within the total size constraints of a cropped area.
In some embodiments, the lenses 160a and 160b may implement a fisheye lens. One of the video operations implemented by the processor/SoC 106 may be a dewarping operation. The processor/SoC 106 may be configured to dewarp the video frames generated. The dewarping may be configured to reduce and/or remove acute distortion caused by the fisheye lens and/or other lens characteristics. For example, the dewarping may reduce and/or eliminate a bulging effect to provide a rectilinear image.
The processor/SoC 106 may be configured to crop (e.g., trim to) a region of interest from a full video frame (e.g., generate the region of interest video frames). The processor/SoC 106 may generate the video frames and select an area. In an example, cropping the region of interest may generate a second image. The cropped image (e.g., the region of interest video frame) may be smaller than the original video frame (e.g., the cropped image may be a portion of the captured video).
The area of interest may be dynamically adjusted based on the location of an audio source. For example, the detected audio source may be moving, and the location of the detected audio source may move as the video frames are captured. The processor/SoC 106 may update the selected region of interest coordinates and dynamically update the cropped section. The cropped section may correspond to the area of interest selected. As the area of interest changes, the cropped portion may change. For example, the selected coordinates for the area of interest may change from frame to frame, and the processor/SoC 106 may be configured to crop the selected region in each frame.
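The per-frame cropping described above may be sketched in Python as follows. The sketch assumes a video frame is represented as a list of pixel rows and a region of interest as hypothetical `(x, y, width, height)` coordinates. Clamping the region to the frame boundary is an illustrative design choice, not a requirement of the description.

```python
def crop_region(frame, x, y, width, height):
    """Crop a region of interest from a frame stored as a list of rows.

    Coordinates are clamped to the frame so a region that drifts past
    the edge yields a smaller (possibly empty) crop instead of an error.
    """
    rows = len(frame)
    cols = len(frame[0]) if rows else 0
    x0 = max(0, min(x, cols))
    y0 = max(0, min(y, rows))
    x1 = max(x0, min(x + width, cols))
    y1 = max(y0, min(y + height, rows))
    return [row[x0:x1] for row in frame[y0:y1]]


def crop_sequence(frames, regions):
    """Apply a per-frame region of interest to a sequence of frames."""
    return [crop_region(f, *roi) for f, roi in zip(frames, regions)]


# Example: a 4x6 frame whose pixel value encodes (row, column).
frame = [[r * 10 + c for c in range(6)] for r in range(4)]
moving_roi = [(1, 1, 2, 2), (2, 1, 2, 2)]  # region shifts right by one pixel
crops = crop_sequence([frame, frame], moving_roi)
```

Each entry of `crops` is smaller than the original frame and follows the region of interest as the coordinates change from frame to frame.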
The processor/SoC 106 may be configured to over-sample the image sensors 180a and 180b. The over-sampling of the image sensors 180a and 180b may result in a higher resolution image. The processor/SoC 106 may be configured to digitally zoom into an area of a video frame. For example, the processor/SoC 106 may digitally zoom into the cropped area of interest. For example, the processor/SoC 106 may establish the area of interest based on the directional audio, crop the area of interest, and then digitally zoom into the cropped region of interest video frame.
The dewarping operations performed by the processor/SoC 106 may adjust the visual content of the video data. The adjustments performed by the processor/SoC 106 may cause the visual content to appear natural (e.g., appear as seen by a person viewing the location corresponding to the field of view of the capture devices 104a and 104b). In an example, the dewarping may alter the video data to generate a rectilinear video frame (e.g., correct artifacts caused by the lens characteristics of the lenses 160a and 160b). The dewarping operations may be implemented to correct the distortion caused by the lenses 160a and 160b. The adjusted visual content may be generated to enable more accurate and/or reliable object detection.
Various features (e.g., dewarping, digitally zooming, cropping, etc.) may be implemented in the processor/SoC 106 as hardware modules. Implementing hardware modules may increase the video processing speed of the processor/SoC 106 (e.g., faster than a software implementation). The hardware implementation may enable the video to be processed while reducing an amount of delay. The hardware components used may be varied according to the design criteria of a particular implementation.
In some embodiments, the processor/SoC 106 may implement one or more coprocessors, cores and/or chiplets. For example, the processor/SoC 106 may implement one coprocessor configured as a general purpose processor and another coprocessor configured as a video processor. In some embodiments, the processor/SoC 106 may be a dedicated hardware module designed to perform particular tasks. In an example, the processor/SoC 106 may implement an AI accelerator. In another example, the processor/SoC 106 may implement a radar processor. In yet another example, the processor/SoC 106 may implement a dataflow vector processor. In some embodiments, other processors implemented by the ground truth acquisition device 100 may be generic processors and/or video processors (e.g., a coprocessor that is physically a different chipset and/or silicon from the processor/SoC 106). In one example, the processor/SoC 106 may implement an x86-64 instruction set. In another example, the processor/SoC 106 may implement an ARM instruction set. In yet another example, the processor/SoC 106 may implement a RISC-V instruction set. The number of cores, coprocessors, the design optimization and/or the instruction set implemented by the processor/SoC 106 may be varied according to the design criteria of a particular implementation.
The processor/SoC 106 is shown comprising a number of blocks (or circuits) 190a-190n. The blocks 190a-190n may implement various hardware modules implemented by the processor/SoC 106. The hardware modules 190a-190n may be configured to provide various hardware components to implement a video processing pipeline, a radar signal processing pipeline, and/or an AI processing pipeline. The circuits 190a-190n may be configured to receive the pixel data from the signals VIDEO_a and VIDEO_b, generate the video frames from the pixel data, perform various operations on the video frames (e.g., de-warping, rolling shutter correction, cropping, upscaling, image stabilization, 3D reconstruction, liveness detection, auto-exposure, etc.), prepare the video frames for communication to external hardware (e.g., encoding, packetizing, color correcting, etc.), parse feature sets, implement various operations for computer vision (e.g., object detection, segmentation, classification, etc.), etc. The hardware modules 190a-190n may be configured to implement various security features (e.g., secure boot, I/O virtualization, etc.). Various implementations of the processor/SoC 106 may not necessarily utilize all the features of the hardware modules 190a-190n. The features and/or functionality of the hardware modules 190a-190n may be varied according to the design criteria of a particular implementation. Details of the hardware modules 190a-190n may be described in association with U.S. Patent Application No. 16/831,549, filed on April 16, 2020, U.S. Patent Application No. 16/288,922, filed on February 28, 2019, U.S. Patent Application No. 15/593,493 (now U.S. Patent No. 10,437,600), filed on May 12, 2017, U.S. Patent Application No. 15/931,942, filed on May 14, 2020, U.S. Patent Application No. 16/991,344, filed on August 12, 2020, U.S. Patent Application No. 17/479,034, filed on September 20, 2021, appropriate portions of which are hereby incorporated by reference in their entirety.
The hardware modules 190a-190n may be implemented as dedicated hardware modules. Implementing various functionality of the processor/SoC 106 using the dedicated hardware modules 190a-190n may enable the processor/SoC 106 to be highly optimized and/or customized to limit power consumption, reduce heat generation and/or increase processing speed compared to software implementations. The hardware modules 190a-190n may be customizable and/or programmable to implement multiple types of operations. Implementing the dedicated hardware modules 190a-190n may enable the hardware used to perform each type of calculation to be optimized for speed and/or efficiency. For example, the hardware modules 190a-190n may implement a number of relatively simple operations that are used frequently in computer vision operations that, together, may enable the computer vision operations to be performed in real-time. The video pipeline may be configured to recognize objects. Objects may be recognized by interpreting numerical and/or symbolic information to determine that the visual data represents a particular type of object and/or feature. For example, the number of pixels and/or the colors of the pixels of the video data may be used to recognize portions of the video data as objects. The hardware modules 190a-190n may enable computationally intensive operations (e.g., computer vision operations, video encoding, video transcoding, 3D reconstruction, depth map generation, liveness detection, etc.) to be performed locally by the ground truth acquisition device 100.
One of the hardware modules 190a-190n (e.g., 190a) may implement a scheduler circuit. The scheduler circuit 190a may be configured to store a directed acyclic graph (DAG). In an example, the scheduler circuit 190a may be configured to generate and store the directed acyclic graph in response to the feature set information received (e.g., loaded). The directed acyclic graph may define the video operations to perform for extracting the data from the video frames. For example, the directed acyclic graph may define various mathematical weighting (e.g., neural network weights and/or biases) to apply when performing computer vision operations to classify various groups of pixels as particular objects.
The scheduler circuit 190a may be configured to parse the acyclic graph to generate various operators. The operators may be scheduled by the scheduler circuit 190a in one or more of the other hardware modules 190a-190n. For example, one or more of the hardware modules 190a-190n may implement hardware engines configured to perform specific tasks (e.g., hardware engines designed to perform particular mathematical operations that are repeatedly used to perform computer vision operations). The scheduler circuit 190a may schedule the operators based on when the operators may be ready to be processed by the hardware engines 190a-190n.
The scheduler circuit 190a may time multiplex the tasks to the hardware modules 190a-190n based on the availability of the hardware modules 190a-190n to perform the work. The scheduler circuit 190a may parse the directed acyclic graph into one or more data flows. Each data flow may include one or more operators. Once the directed acyclic graph is parsed, the scheduler circuit 190a may allocate the data flows/operators to the hardware engines 190a-190n and send the relevant operator configuration information to start the operators.
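The scheduling behavior described above may be approximated in software by a minimal Python sketch. The sketch assumes the directed acyclic graph is given as a dictionary mapping each operator name to the operators it depends on, and that every engine can run any operator in one time step. The function name `schedule`, the operator names and the one-step timing model are hypothetical. The actual scheduler circuit is implemented in hardware and its internal policy is not specified here.

```python
from collections import deque


def schedule(dag, num_engines):
    """Time multiplex ready operators of a DAG onto a fixed set of engines.

    dag maps each operator to the list of operators it depends on.
    Returns a list of time steps, each a list of (engine_index, operator).
    """
    if num_engines < 1:
        raise ValueError("at least one engine is required")
    remaining = {op: set(deps) for op, deps in dag.items()}
    dependents = {op: [] for op in dag}
    for op, deps in dag.items():
        for dep in deps:
            dependents[dep].append(op)
    # Operators with no unmet dependencies are ready to be processed.
    ready = deque(sorted(op for op, deps in remaining.items() if not deps))
    timeline, done = [], 0
    while ready:
        count = min(num_engines, len(ready))
        batch = [ready.popleft() for _ in range(count)]
        timeline.append(list(enumerate(batch)))
        for op in batch:
            done += 1
            for nxt in sorted(dependents[op]):
                remaining[nxt].discard(op)
                if not remaining[nxt]:
                    ready.append(nxt)
    if done != len(dag):
        raise ValueError("graph contains a cycle")
    return timeline


example_dag = {
    "load": [],
    "conv": ["load"],
    "resize": ["load"],
    "pool": ["conv"],
    "classify": ["pool", "resize"],
}
plan = schedule(example_dag, num_engines=2)
```

With two engines, `conv` and `resize` run in the same time step because both depend only on `load`, while `classify` waits until both `pool` and `resize` have completed.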
Each directed acyclic graph binary representation may be an ordered traversal of a directed acyclic graph with descriptors and operators interleaved based on data dependencies. The descriptors generally provide registers that link data buffers to specific operands in dependent operators. In various embodiments, an operator may not appear in the directed acyclic graph representation until all dependent descriptors are declared for the operands.
One of the hardware modules 190a-190n (e.g., 190b) may implement an artificial neural network (ANN) module. The artificial neural network module may be implemented as a fully connected neural network or a convolutional neural network (CNN). In an example, fully connected networks are “structure agnostic” in that there are no special assumptions that need to be made about an input. A fully-connected neural network comprises a series of fully-connected layers that connect every neuron in one layer to every neuron in the other layer. In a fully-connected layer, for n inputs and m outputs, there are n*m weights. There is also a bias value for each output node, resulting in a total of (n+1)*m parameters. In an already-trained neural network, the (n+1)*m parameters have already been determined during a training process. An already-trained neural network generally comprises an architecture specification and the set of parameters (weights and biases) determined during the training process. In another example, CNN architectures may make explicit assumptions that the inputs are images to enable encoding particular properties into a model architecture. The CNN architecture may comprise a sequence of layers with each layer transforming one volume of activations to another through a differentiable function.
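The (n+1)*m parameter count for a fully-connected layer may be checked with a short Python sketch. The function names and the example layer sizes (784, 128 and 10) are illustrative assumptions and do not describe any particular network used by the device.

```python
def dense_layer_parameters(n_inputs, m_outputs):
    """n*m weights plus one bias per output node gives (n+1)*m parameters."""
    return (n_inputs + 1) * m_outputs


def network_parameters(layer_sizes):
    """Total parameters of a stack of fully-connected layers.

    layer_sizes lists the width of the input followed by each layer output,
    e.g. [784, 128, 10] describes two fully-connected layers.
    """
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(dense_layer_parameters(n, m) for n, m in pairs)


total = network_parameters([784, 128, 10])
```

For the example sizes, the first layer contributes 785*128 parameters and the second contributes 129*10, which illustrates how quickly the parameter count of a fully connected network grows with the input width.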
In the example shown, the artificial neural network 190b may implement a convolutional neural network (CNN) module. The CNN module 190b may be configured to perform the computer vision operations on the video frames. The CNN module 190b may be configured to implement recognition of objects through multiple layers of feature detection. The CNN module 190b may be configured to calculate descriptors based on the feature detection performed. The descriptors may enable the processor/SoC 106 to determine a likelihood that pixels of the video frames correspond to particular objects (e.g., a particular make/model/year of a vehicle, identifying a person as a particular individual, detecting a type of animal, detecting characteristics of a face, etc.).
The CNN module 190b may be configured to implement convolutional neural network capabilities. The CNN module 190b may be configured to implement computer vision using deep learning techniques. The CNN module 190b may be configured to implement pattern and/or image recognition using a training process through multiple layers of feature-detection. The CNN module 190b may be configured to conduct inferences against a machine learning model.
The CNN module 190b may be configured to perform feature extraction and/or matching solely in hardware. Feature points typically represent interesting areas in the video frames (e.g., corners, edges, etc.). By tracking the feature points temporally, an estimate of ego-motion of the capturing platform or a motion model of observed objects in the scene may be generated. In order to track the feature points, a matching operation is generally incorporated by hardware in the CNN module 190b to find the most probable correspondences between feature points in a reference video frame and a target video frame. In a process to match pairs of reference and target feature points, each feature point may be represented by a descriptor (e.g., image patch, SIFT, BRIEF, ORB, FREAK, etc.). Implementing the CNN module 190b using dedicated hardware circuitry may enable calculating descriptor matching distances in real time.
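The descriptor matching step may be sketched in Python for the case of binary descriptors (e.g., BRIEF or ORB style bit strings), where the matching distance is commonly the Hamming distance. The sketch assumes each descriptor is stored as an integer bit pattern. The brute-force search and the `max_distance` threshold are illustrative choices and do not represent the hardware matching logic.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_descriptors(reference, target, max_distance):
    """Find the closest target descriptor for each reference descriptor.

    Returns (reference_index, target_index, distance) tuples for pairs
    whose best distance does not exceed max_distance.
    """
    matches = []
    for i, ref in enumerate(reference):
        best_j, best_d = None, None
        for j, tgt in enumerate(target):
            d = hamming(ref, tgt)
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        if best_d is not None and best_d <= max_distance:
            matches.append((i, best_j, best_d))
    return matches


# Hypothetical 8-bit descriptors from a reference and a target frame.
reference = [0b10110010, 0b00001111]
target = [0b00001110, 0b10110011, 0b11111111]
pairs = match_descriptors(reference, target, max_distance=2)
```

Each reference feature point is paired with the target feature point whose descriptor differs in the fewest bits, which corresponds to selecting the most probable correspondence under this distance measure.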
The CNN module 190b may be configured to perform face detection, face recognition and/or liveness judgment. For example, face detection, face recognition and/or liveness judgment may be performed based on a trained neural network implemented by the CNN module 190b. In some embodiments, the CNN module 190b may be configured to generate the depth image from the structured light pattern. The CNN module 190b may be configured to perform various detection and/or recognition operations and/or perform 3D recognition operations.
The CNN module 190b may be a dedicated hardware module configured to perform feature detection of the video frames. The features detected by the CNN module 190b may be used to calculate descriptors. The CNN module 190b may determine a likelihood that pixels in the video frames belong to a particular object and/or objects in response to the descriptors. For example, using the descriptors, the CNN module 190b may determine a likelihood that pixels correspond to a particular object (e.g., a person, an item of furniture, a pet, a vehicle, etc.) and/or characteristics of the object (e.g., shape of eyes, distance between facial features, a hood of a vehicle, a body part, a license plate of a vehicle, a face of a person, clothing worn by a person, etc.). Implementing the CNN module 190b as a dedicated hardware module of the processor/SoC 106 may enable the apparatus 100 to perform the computer vision operations locally (e.g., on-chip) without relying on processing capabilities of a remote device (e.g., communicating data to a cloud computing service).
The computer vision operations performed by the CNN module 190b may be configured to perform the feature detection on the video frames in order to generate the descriptors. The CNN module 190b may perform the object detection to determine regions of the video frame that have a high likelihood of matching the particular object. In one example, the types of object(s) to match against (e.g., reference objects) may be customized using an open operand stack (enabling programmability of the processor/SoC 106 to implement various artificial neural networks defined by directed acyclic graphs each providing instructions for performing various types of object detection). The CNN module 190b may be configured to perform local masking to the region with the high likelihood of matching the particular object(s) to detect the object.
In some embodiments, the CNN module 190b may determine the position (e.g., 3D coordinates and/or location coordinates) of various features (e.g., the characteristics) of the detected objects. In one example, the location of the arms, legs, chest and/or eyes of a person may be determined using 3D coordinates. One location coordinate on a first axis for a vertical location of the body part in 3D space and another coordinate on a second axis for a horizontal location of the body part in 3D space may be stored. In some embodiments, the distance from the lenses 160a and 160b may represent one coordinate (e.g., a location coordinate on a third axis) for a depth location of the body part in 3D space. Using the location of various body parts in 3D space, the processor/SoC 106 may determine body position, and/or body characteristics of detected people.
The CNN module 190b may be pre-trained (e.g., configured to perform computer vision to detect objects based on the training data received to train the CNN module 190b). For example, the results of training data (e.g., a machine learning model) may be pre-programmed and/or loaded into the processor/SoC 106. The CNN module 190b may conduct inferences against the machine learning model (e.g., to perform object detection). The training may comprise determining weight values for each layer of the neural network model. For example, weight values may be determined for each of the layers for feature extraction (e.g., a convolutional layer) and/or for classification (e.g., a fully connected layer). The weight values learned by the CNN module 190b may be varied according to the design criteria of a particular implementation.
The CNN module 190b may implement the feature extraction and/or object detection by performing convolution operations. The convolution operations may be hardware accelerated for fast (e.g., real-time) calculations that may be performed while consuming low power. In some embodiments, the convolution operations performed by the CNN module 190b may be utilized for performing the computer vision operations. In some embodiments, the convolution operations performed by the CNN module 190b may be utilized for any functions performed by the processor/SoC 106 that may involve calculating convolution operations (e.g., 3D reconstruction).
The convolution operation may comprise sliding a feature detection window along the layers while performing calculations (e.g., matrix operations). The feature detection window may apply a filter to pixels and/or extract features associated with each layer. The feature detection window may be applied to a pixel and a number of surrounding pixels. In an example, the layers may be represented as a matrix of values representing pixels and/or features of one of the layers and the filter applied by the feature detection window may be represented as a matrix. The convolution operation may apply a matrix multiplication between the filter matrix and the region of the current layer covered by the feature detection window. The convolution operation may slide the feature detection window along regions of the layers to generate a result representing each region. The size of the region, the type of operations applied by the filters and/or the number of layers may be varied according to the design criteria of a particular implementation.
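The sliding-window operation may be sketched in Python as follows. The sketch assumes a single-channel layer stored as a list of rows, a stride of one and no padding, and it computes each result as the sum of element-wise products between the filter and the covered region, as is common in CNN implementations. The function name `convolve2d` and the example filter are illustrative, and the hardware accelerated operator may differ.

```python
def convolve2d(layer, kernel):
    """Slide a filter window over a 2D layer (stride 1, no padding).

    Each output value is the sum of element-wise products between the
    filter and the region of the layer currently covered by the window.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(layer) - kh + 1
    out_w = len(layer[0]) - kw + 1
    result = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += layer[r + i][c + j] * kernel[i][j]
            row.append(acc)
        result.append(row)
    return result


# A dark-to-bright vertical edge and a simple horizontal-difference filter.
layer = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_filter = [[-1, 1]]
response = convolve2d(layer, edge_filter)
```

The response is non-zero only where the window straddles the edge, which illustrates how a filter may extract an elementary visual feature from a region of the layer.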
Using the convolution operations, the CNN module 190b may compute multiple features for pixels of an input image in each extraction step. For example, each of the layers may receive inputs from a set of features located in a small neighborhood (e.g., region) of the previous layer (e.g., a local receptive field). The convolution operations may extract elementary visual features (e.g., such as oriented edges, end-points, corners, etc.), which are then combined by higher layers. Since the feature extraction window operates on a pixel and nearby pixels (or sub-pixels), the results of the operation may have location invariance. The layers may comprise convolution layers, pooling layers, non-linear layers and/or fully connected layers. In an example, the convolution operations may learn to detect edges from raw pixels (e.g., a first layer), then use the feature from the previous layer (e.g., the detected edges) to detect shapes in a next layer and then use the shapes to detect higher-level features (e.g., facial features, pets, vehicles, components of a vehicle, furniture, etc.) in higher layers and the last layer may be a classifier that uses the higher level features.
The CNN module 190b may execute a data flow directed to feature extraction and matching, including two-stage detection, a warping operator, component operators that manipulate lists of components (e.g., components may be regions of a vector that share a common attribute and may be grouped together with a bounding box), a matrix inversion operator, a dot product operator, a convolution operator, conditional operators (e.g., multiplex and demultiplex), a remapping operator, a minimum-maximum-reduction operator, a pooling operator, a non-minimum, non-maximum suppression operator, a scanning-window based non-maximum suppression operator, a gather operator, a scatter operator, a statistics operator, a classifier operator, an integral image operator, comparison operators, indexing operators, a pattern matching operator, a feature extraction operator, a feature detection operator, a two-stage object detection operator, a score generating operator, a block reduction operator, and an upsample operator. The types of operations performed by the CNN module 190b to extract features from the training data may be varied according to the design criteria of a particular implementation.
One or more of the hardware modules 190a-190n may be configured to implement other types of AI models. In one example, the hardware modules 190a-190n may be configured to implement an image-to-text AI model and/or a video-to-text AI model. In another example, the hardware modules 190a-190n may be configured to implement a Large Language Model (LLM). Implementing the AI model(s) using the hardware modules 190a-190n may provide AI acceleration that may enable complex AI tasks to be performed on an edge device such as the edge devices 100a-100n.
One of the hardware modules 190a-190n may be configured to perform the virtual aperture imaging. One of the hardware modules 190a-190n may be configured to perform transformation operations (e.g., FFT, DCT, DFT, etc.). The number, type and/or operations performed by the hardware modules 190a-190n may be varied according to the design criteria of a particular implementation.
Each of the hardware modules 190a-190n may implement a processing resource (or hardware resource or hardware engine). The hardware engines 190a-190n may be operational to perform specific processing tasks. In some configurations, the hardware engines 190a-190n may operate in parallel and independent of each other. In other configurations, the hardware engines 190a-190n may operate collectively among each other to perform allocated tasks. One or more of the hardware engines 190a-190n may be homogeneous processing resources (all circuits 190a-190n may have the same capabilities) or heterogeneous processing resources (two or more circuits 190a-190n may have different capabilities).
Referring to FIG. 4, a diagram illustrating ground truth acquisition device calibration is shown. A scenario 200 is shown. The scenario 200 may comprise the camera (or capture device) 104a, the camera (or capture device) 104b, and an object 202. In an example, the object 202 may comprise a checkerboard or other calibration pattern (e.g., corners, circles, etc.). The chassis structure (or camera mount) 102, the camera 104a with the lens 160a, and the camera 104b with the lens 160b are shown. Other components of the ground truth acquisition device 100 have been omitted for clarity.
A location DC is shown at the ground truth acquisition device 100. The location DC may represent a baseline location of the lens 160a and the lens 160b. A location DO is shown. The location DO may represent a distance of the object 202 from the baseline location DC of the ground truth acquisition device 100. In an example, the object 202 may be a distance of DO from the ground truth acquisition device 100.
The object 202 is shown at the distance DO from the baseline location DC. The object 202 is shown at some location in-between the lens 160a and the lens 160b. For example, the object 202 is shown offset from both the lens 160a and the lens 160b. In an example, the object 202 may comprise a checkerboard or other calibration pattern (e.g., corners, circles, etc.) that allows the object 202 to have a slight appearance difference in the images captured by the cameras 104a and 104b. The type, size, shape, distance from the ground truth acquisition device 100 and/or distance from the object 202 may be varied according to the design criteria of a particular implementation.
A line 206a is shown. The line 206a may represent the optical axis of the camera 104a. A line 206b is shown. The line 206b may represent the optical axis of the camera 104b. A line 210 is shown. The line 210 may represent a baseline depth of the ground truth acquisition device 100 from the object 202. The line 210 may illustrate a depth direction. A line 212 is shown. The line 212 may represent an image of a point 204 on the object 202 captured by the image sensor 180a with respect to the object 202 and a depth direction of the object 202. A line 214 is shown. The line 214 may represent an image of a point 204 on the object 202 captured by the image sensor 180b with respect to the object 202 and a depth direction of the object 202.
The chassis structure 102 is generally configured such that when the cameras 104a and 104b are mounted to the chassis structure 102, the optical axis 206a of the camera 104a and the optical axis 206b of the camera 104b are substantially parallel. Furthermore, when the cameras 104a and 104b are mounted on the chassis structure 102, the image sensor 180a of the camera 104a and the image sensor 180b of the camera 104b are generally coplanar. The cameras 104a and 104b are mounted on a common horizontal (X) axis with a predetermined separation distance. The cameras 104a and 104b are generally mounted having a minimal vertical offset from each other. Any vertical offset between the cameras 104a and 104b is generally compensated by a calibration process in accordance with an embodiment of the invention.
In various embodiments, the chassis structure 102 provides a mechanical mount that ensures the differences in the three rotation directions (yaw, pitch, and roll) are very small before the stereo calibration process is performed. The stereo calibration process generally determines the remaining small differences and computes the warp parameters for compensating for the small differences to substantially align the captured images generated by the cameras 104a and 104b. In an example, the images may be aligned to within one pixel or better.
In an example, a point 204 on the object 202 may appear at a different point in the images captured by the cameras 104a and 104b. A disparity map may be created based on the difference between the position of the point 204 in an image captured by the camera 104a and the position of the point 204 in an image captured by the camera 104b. In an example, parallax error is generally more pronounced when the object 202 is closer to the ground truth acquisition device 100 than when the object 202 is farther from the ground truth acquisition device 100. In general, the disparity values obtained from the captured images are directly proportional to the distance between the cameras 104a and 104b, and inversely proportional to the distance DO from the baseline location DC.
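The proportionality described above corresponds to the standard relation for an ideal rectified stereo pair, d = f * B / Z, where B is the camera separation and Z is the distance from the baseline. The following is a minimal sketch of that relation; the function names and the focal length expressed in pixels are illustrative assumptions and do not appear in the disclosure.

```python
def disparity_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Ideal rectified-stereo disparity in pixels: d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def depth_from_disparity(focal_px: float, baseline_m: float, disparity: float) -> float:
    """Inverse relation: Z = f * B / d (undefined for zero disparity)."""
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity
```

Doubling the hypothetical baseline doubles the disparity at a given distance, and doubling the distance halves it, which is why nearby objects show the largest parallax.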
Referring to FIG. 5, a diagram is shown illustrating disparity determination for a ground truth acquisition device in accordance with an example embodiment of the invention. A scenario 250 is shown. In an example, a rectangular plane 252 of the camera sensor 180a and a rectangular plane 254 of the camera sensor 180b may be substantially coplanar and parallel to an XY plane 256 containing the lens 160a of the camera 104a and the lens 160b of the camera 104b. A pixel row of the sensor 180a and a pixel row of the sensor 180b (e.g., represented by a dashed line) may be aligned with the X-axis direction. A point A may be used to represent the optical center of the lens 160a and a point B may be used to represent the optical center of the lens 160b. A point E may be used to represent a feature on the object 202 at different distances to the ground truth acquisition device 100. The point E may be captured by the camera sensor 180a at a point C and by the camera sensor 180b at a point D. A point where the optical axis of the camera 104a intersects the plane 252 may be captured by the camera sensor 180a at a point O_L. A point where the optical axis of the camera 104b intersects the plane 254 may be captured by the camera sensor 180b at a point O_R. In an example, the disparity value is generally defined as the difference in length between the line CO_L and the line DO_R.
A plane ABE is generally defined by the three points A, B, and E. The plane ABE intersects the plane 252 of the image sensor 180a along the pixel row containing the disparity line CO_L, intersects the plane 254 of the image sensor 180b along the pixel row containing the disparity line DO_R, and intersects the plane 256 at the X-axis. Because the plane ABE intersects both sensor planes along those pixel rows and intersects the plane 256 at the X-axis, the disparity lines CO_L and DO_R are always parallel to the X-axis. The relative pose T=[R t] of the camera coordinate system of the camera 104b relative to the camera coordinate system of the camera 104a may be determined using extrinsic parameter calibration.
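As an illustration of the relative pose T=[R t], the sketch below composes it from two per-camera extrinsics. It assumes each camera has a world-to-camera pose of the form X_cam = R * X_world + t; that convention and all function names are assumptions made for the example, not details taken from the disclosure.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def relative_pose(R_a, t_a, R_b, t_b):
    """Pose of camera b relative to camera a, so that X_b = R * X_a + t.

    From X_a = R_a X + t_a and X_b = R_b X + t_b it follows that
    R = R_b * R_a^T and t = t_b - R * t_a.
    """
    R = matmul(R_b, transpose(R_a))
    Rt_a = matvec(R, t_a)
    t = [t_b[i] - Rt_a[i] for i in range(3)]
    return R, t
```

For two cameras with parallel optical axes mounted on a common X axis, R is close to the identity and t is dominated by its X component, whose magnitude is the baseline.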
Referring to FIG. 6, a diagram is shown illustrating an object being imaged by cameras of a ground truth acquisition device in accordance with embodiments of the invention. A scenario 300 is shown. The scenario 300 may comprise a frame 302 captured by the camera (or capture device) 104a, a frame 304 captured by the camera (or capture device) 104b, and the object 202. In an example, the object 202 may appear at different spots in the frames 302 and 304. In various embodiments, aligning adjacent frames together may be accomplished using a method or combination of methods. In an example, a spatial transformation (warping) of quadrilateral regions may be used. First, a number of image registration points may be determined. For example, fixed points may be imaged at known locations in each of the frames 302 and 304. In either case a calibration process involves pointing the cameras 104a and 104b at a known, structured scene and finding corresponding points.
For example, FIG. 6 illustrates views of two cameras trained on a scene that includes the rectangular object (or target) 202. In an example, the object 202 may be implemented as a chessboard, checkerboard, or circle calibration board. The object 202 is generally placed in an area of overlap between a field-of-view of the camera 104a and a field-of-view of the camera 104b. In an example, the corners of the rectangular object 202 may constitute image registration points (e.g., points in common to views of each of the cameras 104a and 104b). In an embodiment where the object 202 comprises a chessboard or circle calibration board, chessboard or circle detectors may be used to obtain the image registration points.
Referring to FIG. 7, a diagram is shown illustrating the object 202 as captured in frames obtained from the cameras 104a and 104b of the ground truth acquisition device 100. A scenario 400 is shown. In the scenario 400, frames 402 and 404 of the two cameras 104a and 104b are shown. The object 202 may be captured as a quadrilateral area 406 having corners F, G, H, and I in the frame 402, and as a quadrilateral area 408 having corners F', G', H', and I' in the frame 404. Because of the slightly different camera angles, the quadrilateral areas 406 and 408 are not consistent in angular construction and are generally captured in different locations with respect to other objects in the image. In an example, the frames may be matched (aligned) by warping each of the quadrilateral regions 406 and 408 into a common coordinate system. Note that the sides of quadrilateral areas 406 and 408 are shown as straight, but may actually be subject to some barrel or pincushion distortion, which may also be approximately corrected via warping operations. In an example, barrel/pincushion distortion may be corrected using radial (rather than piecewise linear) transforms. Piecewise linear transforms may fix an approximation of the curve.
In another example, only one of the images may be warped to match a coordinate system of the other image. For example, warping of quadrilateral area 406 may be performed via a perspective transformation. Thus quadrilateral 406 in frame 402 may be transformed to quadrilateral 408 in the coordinate system of frame 404.
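A perspective transformation between two quadrilaterals can be represented by a 3x3 homography determined from the four corner correspondences (e.g., F, G, H, I to F', G', H', I'). The sketch below solves the resulting 8x8 linear system in plain Python with the last matrix entry fixed to 1. The function names and the corner coordinates in the usage note are hypothetical, and the disclosure does not state how the transformation is computed.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def homography_from_quads(src, dst):
    """3x3 homography mapping four src corners onto four dst corners."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(H, point):
    """Apply the homography to one (x, y) point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

For instance, with hypothetical corners `[(0, 0), (1, 0), (1, 1), (0, 1)]` in one frame and `[(10, 20), (30, 22), (28, 45), (8, 40)]` in the other, `warp_point` maps each source corner onto its counterpart, and interior pixels are carried along by the same matrix.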
Camera arrays generally have a small but significant baseline separation. This may be a problem when combining images of objects at different distances from the baseline, as a single warping function will only work perfectly for one particular distance. Images of objects not at that distance may be warped into different places and may appear doubled ("ghosted") or truncated when the images are merged.
In various embodiments, a camera array may be calibrated such that objects at a particular distance, or images of smooth backgrounds, may be combined with no visible disparity. A minimum disparity may be found by determining how much to shift one image to match the other. Because images are warped into corresponding squares, all that is necessary is to find a particular shift that matches the corresponding squares.
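The shift search described above can be sketched on a single pixel row. The example tries each candidate horizontal shift and keeps the one with the lowest mean absolute difference over the overlapping samples. The cost function, the function name, and the convention that a feature at column x in the left row appears at column x minus the shift in the right row are assumptions made for illustration.

```python
def best_shift(left_row, right_row, max_shift):
    """Return the integer shift s minimizing the mean absolute difference
    between left_row[i + s] and right_row[i] over the overlap."""
    best, best_cost = 0, float("inf")
    for s in range(0, max_shift + 1):
        n = len(left_row) - s
        if n <= 0:
            break
        cost = sum(abs(left_row[i + s] - right_row[i]) for i in range(n)) / n
        if cost < best_cost:
            best, best_cost = s, cost
    return best
```

A practical implementation would typically compare two-dimensional blocks rather than one row, but the selection rule is the same.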
A ground truth acquisition device in accordance with embodiments of the invention generally comprises two video cameras arranged in a spaced apart array, so as to collectively capture a particular field of view. The ground truth acquisition device may also include a processor circuit configured to receive each stream of digital or analog output from the two cameras simultaneously. The processor circuit may be configured to synchronize the exposure times of the two cameras (e.g., using a frame synchronization signal) and process the collection of signals, so as to remove any distortion created by the image capture process, to accurately overlay the two images of the two adjacent cameras 104a and 104b.
Referring to FIG. 8, a diagram is shown illustrating a calibration process 500 in accordance with embodiments of the invention. In various embodiments, the calibration process 500 receives input image frames from the cameras 104a and 104b (e.g., via the pixel data streams). The process 500 may also receive the intrinsic parameters and distortion parameters for each of the cameras 104a and 104b. In an example embodiment, the calibration process (or method) 500 comprises a step (or state) 502, a step (or state) 504, a step (or state) 506, a step (or state) 508, a step (or state) 510, a step (or state) 512, a step (or state) 514, a step (or state) 516, and a step (or state) 518. The calibration process 500 generally begins in the step 502 and moves to the step 504.
In the step 504, the process 500 may obtain image frames from the cameras 104a and 104b. Exposure times of the cameras 104a and 104b are generally synchronized by a frame synchronization signal presented to both of the cameras 104a and 104b. In an example, the image frames may comprise images of a flat and rigid calibration board placed at a distance from the ground truth acquisition device 100. In an example, the calibration board may be placed in a position to appear in the respective fields-of-view (FOVs) of the (left) camera 104a and the (right) camera 104b.
In the step 506, the process 500 may perform feature extraction to detect a plurality of features in each of the left image frame and the right image frame. In an example, a circle or chessboard detector may be used to detect a circle center or a corner on the calibration board. In the step 508, the process 500 may identify matching features in each of the left image frame and the right image frame. In general, a detected point in one view should be matched in the other view.
In the step 510, the process 500 may perform extrinsic calibration for each of the cameras 104a and 104b, using intrinsic parameters (e.g., fx, fy, cx, cy, etc.) and distortion parameters (e.g., k1, k2, k3, p1, p2, etc.) for the cameras 104a and 104b. In an example, the intrinsic parameters and the distortion parameters may be obtained from separate lens calibration procedures performed independently on the cameras 104a and 104b. In an example, the extrinsic calibration for the cameras 104a and 104b generally calculates warp information (e.g., extrinsic parameters, homography matrices, etc.) that may be used by the cameras 104a and 104b to align the matching features. In an example, the warp information may include, but is not limited to, a rotation matrix (e.g., R_3x3) and a translation vector (e.g., T_3x1). In general, the warp information may be determined using common stereo calibration techniques.
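To show how the intrinsic parameters (fx, fy, cx, cy) and distortion parameters (k1, k2, k3, p1, p2) typically enter the projection of a 3D point onto the sensor, the following sketch assumes the widely used pinhole model with Brown-Conrady radial and tangential distortion. The disclosure names the parameters but not the model, so the model choice and the function name are assumptions.

```python
def project_point(point_cam, fx, fy, cx, cy,
                  k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    X, Y, Z = point_cam
    if Z <= 0:
        raise ValueError("point must be in front of the camera")
    # Normalized image coordinates.
    x, y = X / Z, Y / Z
    r2 = x * x + y * y
    # Radial term (k1, k2, k3) and tangential terms (p1, p2).
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    # Focal lengths and principal point map to pixels.
    return fx * x_d + cx, fy * y_d + cy
```

With all distortion coefficients at zero the function reduces to the ideal pinhole projection, which is the form the images approximate after the distortion has been removed.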
In the step 512, the process 500 generally communicates respective warp information to the left camera 104a and the right camera 104b. The cameras 104a and 104b then apply the respective warp information to the raw image data prior to communicating the image data to the ground truth acquisition system 100. In an example, the respective warp information may be applied to the raw image data using image processing pipelines within the cameras 104a and 104b. By applying the respective warp information to the raw image data using image processing pipelines within the cameras 104a and 104b, the image frames obtained by the ground truth acquisition system 100 are generally aligned and ready for determining disparity. The corresponding image may be warped with the homography matrix.
In the step 514, an iterative process may be performed to optimize the warp information to obtain a fine alignment of the cameras 104a and 104b. In the step 516, the process 500 may check whether the alignment of the image frames meets a predetermined threshold. If the alignment of the image frames does not meet the predetermined threshold, the process 500 may return to the step 514. When the alignment of the image frames meets the predetermined threshold, the process 500 may move to the step 518 and terminate.
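The optimize-then-check loop can be illustrated with a deliberately reduced example that refines a single vertical offset between matched features until the mean residual falls below a threshold. The disclosure does not specify the optimization method or the alignment metric, so the update rule, the threshold value, and all names below are hypothetical.

```python
def refine_vertical_offset(left_ys, right_ys, threshold_px=0.01,
                           gain=0.5, max_iters=100):
    """Iteratively estimate the vertical offset between matched features.

    Each pass measures the mean residual row error (the check) and, if it
    is still above the threshold, moves the offset a fraction of the way
    toward it (the refinement).
    """
    offset = 0.0
    for iteration in range(1, max_iters + 1):
        residual = sum(l - (r + offset)
                       for l, r in zip(left_ys, right_ys)) / len(left_ys)
        if abs(residual) < threshold_px:
            return offset, iteration  # alignment meets the threshold
        offset += gain * residual     # refine and check again
    raise RuntimeError("alignment did not converge")
```

A full implementation would refine all of the warp parameters rather than one offset, but the control flow of refining, checking against the threshold, and either repeating or terminating is the same.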
Referring to FIG. 9, a diagram is shown illustrating a ground truth data acquisition process 600 in accordance with embodiments of the invention. In various embodiments, the ground truth acquisition device 100 may provide ground truth data that may be used for scene reconstruction. In an example, the ground truth data may be utilized by a Simultaneous Localization and Mapping (SLAM) algorithm. In an example, the SLAM algorithm may be implemented locally or remotely (e.g., on a remote computer/server). In an example, the ground truth data acquisition process 600 may be implemented (executed) on the processor/SoC 106. In an example embodiment, the ground truth data acquisition process (or method) 600 comprises a step (or state) 602, a step (or state) 604, a step (or state) 606, a step (or state) 608, a step (or state) 610, a step (or state) 612, a step (or state) 614, a step (or state) 616, and a step (or state) 618. The ground truth data acquisition process 600 generally begins in the step 602 and moves to the step 604.
In the step 604, the ground truth data acquisition process 600 may obtain image frames from a first monocular ADAS camera and a second monocular ADAS camera. The first monocular ADAS camera and the second monocular ADAS camera are generally mounted on a chassis structure configured to roughly align the optical axes of the first monocular ADAS camera and the second monocular ADAS camera. The first monocular ADAS camera and the second monocular ADAS camera are generally configured to apply respective warp information to the raw image data collected by the first monocular ADAS camera and the second monocular ADAS camera. In the step 606, the ground truth data acquisition process 600 may perform feature extraction on the image frames obtained from the first monocular ADAS camera and the second monocular ADAS camera. In the step 608, the ground truth data acquisition process 600 may identify matching features in the image frames obtained from the first monocular ADAS camera and the second monocular ADAS camera. In the step 610, the ground truth data acquisition process 600 may calculate disparity and/or depth information for the image frames using the features extracted.
In the step 612, the ground truth data acquisition process 600 may determine whether a three-dimensional (3D) point cloud is to be generated locally or on a remote device/server. When the ground truth data acquisition process 600 is to generate the 3D point cloud locally, the ground truth data acquisition process 600 moves to the step 614. When the ground truth data acquisition process 600 is to generate the 3D point cloud remotely, the ground truth data acquisition process 600 moves to the step 616. In the step 614, the ground truth data acquisition process 600 generates the 3D point cloud locally using the disparity and/or depth information calculated for the image frames and stores the 3D point cloud in the memory 108. The ground truth data acquisition process 600 then moves to the step 618 and terminates. In the step 616, the ground truth data acquisition process 600 may communicate the image frames, the disparity and/or depth information calculated for the image frames, and other ground truth information and/or metadata to the remote device or server. The ground truth data acquisition process 600 then moves to the step 618 and terminates. The remote device or server may generate the 3D point cloud using the disparity and/or depth information calculated for the image frames and/or perform other post-processing using the information received from the ground truth data acquisition device 100.
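Generating a 3D point cloud from disparity can be sketched by back-projecting each pixel with a valid disparity through an ideal pinhole model, using Z = fx * B / d for depth. The example assumes rectified images, a disparity map stored as a list of rows, and intrinsics given in pixels; these assumptions and the function name are illustrative rather than taken from the disclosure.

```python
def disparity_to_points(disparity, fx, fy, cx, cy, baseline_m):
    """Back-project a disparity map into a list of (X, Y, Z) points.

    Pixels with non-positive disparity are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(disparity):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            z = fx * baseline_m / d          # depth along the optical axis
            x = (u - cx) * z / fx            # lateral position
            y = (v - cy) * z / fy            # vertical position
            points.append((x, y, z))
    return points
```

The same routine could run locally or on a remote server, since it needs only the disparity values and the calibration parameters.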
In various embodiments, a ground truth acquisition device (or system) may be implemented that provides various cost advantages. In an example, the ground truth data collection process in accordance with an embodiment of the invention may avoid using LiDAR, which significantly reduces costs. In an example, the ground truth data collection process in accordance with an embodiment of the invention may provide a denser point cloud than LiDAR and may therefore largely replace LiDAR. In an example, the finally generated three-dimensional (3D) point cloud data is generally sufficient to be used as ground truth for training the monocular ADAS algorithm. In various embodiments, a ground truth acquisition device may be implemented that eliminates a need to make major modifications to the monocular ADAS equipment. In general, the ADAS cameras may be left physically unchanged, with only slight modifications to the software.
In various embodiments, a ground truth acquisition device may be implemented that also provides various performance advantages. Compared with ADAS that does not use the ground truth system, the data provided may have a significant effect on improving the distance accuracy of the ADAS algorithm. Binocular stereo vision plus post-processing may detect untrained targets and establish an occupancy grid, which may be of value for monocular ADAS algorithm design and testing the detection and classification of untrained targets on the road. Binocular stereo vision detection of road curbs in the scene also may be very useful. In an example, detection of road curbs in the scene may facilitate the improvement of monocular vision ADAS algorithms, especially for bumps and dips in the road. In addition, the ground truth acquisition device in accordance with an embodiment of the invention may facilitate the data collection of ADAS devices. In an example, the ground truth acquisition device in accordance with an embodiment of the invention may add precise world time, IMU information, and other information, which may enable more accurate scene reconstruction in post processing.
The functions performed by the diagrams of FIGS. 1-9 may be implemented using one or more of a conventional general purpose processor, digital computer, microprocessor, microcontroller, RISC (reduced instruction set computer) processor, CISC (complex instruction set computer) processor, SIMD (single instruction multiple data) processor, signal processor, central processing unit (CPU), arithmetic logic unit (ALU), video digital signal processor (VDSP) and/or similar computational machines, programmed according to the teachings of the specification, as will be apparent to those skilled in the relevant art(s). Appropriate software, firmware, coding, routines, instructions, opcodes, microcode, and/or program modules may readily be prepared by skilled programmers based on the teachings of the disclosure, as will also be apparent to those skilled in the relevant art(s). The software is generally executed from a medium or several media by one or more of the processors of the machine implementation.
The invention may also be implemented by the preparation of ASICs (application specific integrated circuits), Platform ASICs, FPGAs (field programmable gate arrays), PLDs (programmable logic devices), CPLDs (complex programmable logic devices), sea-of-gates, RFICs (radio frequency integrated circuits), ASSPs (application specific standard products), one or more monolithic integrated circuits, one or more chips or die arranged as flip-chip modules and/or multi-chip modules or by interconnecting an appropriate network of conventional component circuits, as is described herein, modifications of which will be readily apparent to those skilled in the art(s).
The invention thus may also include a computer product which may be a storage medium or media and/or a transmission medium or media including instructions which may be used to program a machine to perform one or more processes or methods in accordance with the invention. Execution of instructions contained in the computer product by the machine, along with operations of surrounding circuitry, may transform input data into one or more files on the storage medium and/or one or more output signals representative of a physical object or substance, such as an audio and/or visual depiction. Execution of instructions contained in the computer product by the machine, may be executed on data stored on a storage medium and/or user input and/or in combination with a value generated using a random number generator implemented by the computer product. The storage medium may include, but is not limited to, any type of disk including floppy disk, hard drive, magnetic disk, optical disk, CD-ROM, DVD and magneto-optical disks and circuits such as ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable ROMs), EEPROMs (electrically erasable programmable ROMs), UVPROMs (ultra-violet erasable programmable ROMs), Flash memory, magnetic cards, optical cards, and/or any type of media suitable for storing electronic instructions.
The elements of the invention may form part or all of one or more devices, units, components, systems, machines and/or apparatuses. The devices may include, but are not limited to, servers, workstations, storage array controllers, storage systems, personal computers, laptop computers, notebook computers, palm computers, cloud servers, personal digital assistants, portable electronic devices, battery powered devices, set-top boxes, encoders, decoders, transcoders, compressors, decompressors, pre-processors, post-processors, transmitters, receivers, transceivers, cipher circuits, cellular telephones, digital cameras, positioning and/or navigation systems, medical equipment, heads-up displays, wireless devices, audio recording, audio storage and/or audio playback devices, video recording, video storage and/or video playback devices, game platforms, peripherals and/or multi-chip modules. Those skilled in the relevant art(s) would understand that the elements of the invention may be implemented in other types of devices to meet the criteria of a particular application.
The terms “may” and “generally” when used herein in conjunction with “is(are)” and verbs are meant to communicate the intention that the description is exemplary and believed to be broad enough to encompass both the specific examples presented in the disclosure as well as alternative examples that could be derived based on the disclosure. The terms “may” and “generally” as used herein should not be construed to necessarily imply the desirability or possibility of omitting a corresponding element.
The designations of various components, modules and/or circuits as “a”-“n”, when used herein, disclose either a singular component, module and/or circuit or a plurality of such components, modules and/or circuits, with the “n” designation applied to mean any particular integer number. Different components, modules and/or circuits that each have instances (or occurrences) with designations of “a”-“n” may indicate that the different components, modules and/or circuits may have a matching number of instances or a different number of instances. The instance designated “a” may represent a first of a plurality of instances and the instance “n” may refer to a last of a plurality of instances, while not implying a particular number of instances.
While the invention has been particularly shown and described with reference to embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made without departing from the scope of the invention.
July 4, 2024
January 1, 2026