Video compression and data flow for video surveillance
( 01 May 2007 )
By Zhengting He, Texas Instruments
The desire for increased security has catapulted the popularity of video surveillance systems, which are now widely deployed in places such as airports, banks, public transportation centers and even private homes. The many problems associated with traditional analog systems are driving the push toward digital systems. Furthermore, with the growing popularity of computer networking, semiconductor and video compression technologies, next-generation video surveillance systems will undoubtedly be digital and based on standard technologies and IP networking.
In a video surveillance system over Internet Protocol (VSIP), hardware handling the network traffic is an integral part of the camera, because video signals are digitized by the camera and compressed before being transmitted to the video server to overcome the bandwidth limitation of the network. A heterogeneous processor architecture, such as a DSP paired with a general-purpose processor (GPP), is desirable for achieving maximum system performance. Interrupt-intensive tasks such as video capture, storage and streaming can be assigned to the GPP, while the MIPS-intensive video compression is implemented on the DSP. After the data is transferred to the video server, the server stores the compressed video streams as files on a hard disk, avoiding the quality degradation traditionally associated with analog storage devices. Various standards have been developed for compression of digital video signals; they can be classified into two categories:
• Motion-estimation (ME) based approaches: Every N frames are defined as a group of pictures (GOP), in which the first frame is encoded independently. For each of the other (N-1) frames, only the difference between that frame and the previously encoded frame(s) (the reference frame(s)) is encoded. Typical standards are MPEG-2, MPEG-4, H.263 and H.264.
• Still image compression: Each video frame is encoded independently as a still image. The best-known standard is JPEG. The Motion JPEG (MJPEG) approach encodes each frame using the JPEG algorithm.
ME VS. STILL IMAGE COMPRESSION
Figure 1 shows the block diagram of an H.264 encoder. Like other ME-based video coding standards, it processes each frame one macroblock (MB) at a time, where an MB is 16 × 16 pixels. It has a forward path and a reconstruction path. The forward path encodes a frame into bits. The reconstruction path generates a reference frame from the encoded bits. Here (I)DCT stands for (inverse) discrete cosine transform and (I)Q stands for (inverse) quantization. ME and MC stand for motion estimation and motion compensation, respectively.
In the forward path (DCT to Q), each MB can be encoded in either intra mode or inter mode. In inter mode, the reference MB is found in previously encoded frame(s) by the motion estimation (ME) module. In intra mode, the reference MB is formed from samples in the current frame.
The purpose of the reconstruction path (IQ to IDCT) is to ensure that the encoder and decoder will use the identical reference frame to create the image. Otherwise, the error between the encoder and decoder will accumulate.
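The accumulation of error can be illustrated with a deliberately tiny model. The sketch below is not an H.264 implementation: each "frame" is a single number, prediction is simply the previous value, and the names `quantize_residual` and `simulate` are hypothetical. It compares an encoder that predicts from the reconstructed reference (closed loop, as in Figure 1) with one that predicts from the original previous frame (open loop).

```python
import math

def quantize_residual(residual, step):
    """Quantize a residual to an integer level (round half up)."""
    return math.floor(residual / step + 0.5)

def simulate(samples, step, closed_loop):
    """Return the values the decoder reconstructs for a sequence of samples.

    closed_loop=True : the encoder predicts from the reconstructed value,
                       which is exactly what the decoder also holds.
    closed_loop=False: the encoder predicts from the original previous
                       sample, which the decoder never sees.
    """
    enc_ref = 0   # reference held by the encoder
    dec_ref = 0   # reference held by the decoder
    decoded = []
    for x in samples:
        level = quantize_residual(x - enc_ref, step)  # forward path
        dec_ref = dec_ref + level * step              # decoder reconstruction
        decoded.append(dec_ref)
        enc_ref = dec_ref if closed_loop else x
    return decoded

samples = [3 * t for t in range(10)]   # slowly rising signal (illustrative)
step = 8                               # quantization step (illustrative)

closed = simulate(samples, step, closed_loop=True)
opened = simulate(samples, step, closed_loop=False)

closed_err = [abs(x - d) for x, d in zip(samples, closed)]
open_err = [abs(x - d) for x, d in zip(samples, opened)]
print("closed-loop error:", closed_err)
print("open-loop error:  ", open_err)
```

In this toy model the closed-loop error never exceeds half the quantization step, while the open-loop error grows with every frame because each small residual is quantized to zero and the loss is never fed back.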
Figure 2 is a JPEG encoder block diagram. The encoder divides the input image into multiple 8×8 pixel blocks and processes them one by one. Each block passes through the DCT module first. Then the quantizer rounds off the DCT coefficients according to the quantization matrix. The encoding quality and compression ratio are adjustable through the quantization step. The output from the quantizer is encoded by the entropy encoder to generate the JPEG image.
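The first two stages of this pipeline can be sketched in a few lines. The code below is an unoptimized illustration, not a conforming JPEG encoder: it omits level shifting, zig-zag ordering and entropy coding, and `uniform_qmatrix` is a hypothetical helper that fills the quantization matrix with a single step rather than using a standard JPEG table.

```python
import math

N = 8  # JPEG block size

def dct_2d(block):
    """Direct 2D DCT-II of an 8x8 block (no fast algorithm)."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

def uniform_qmatrix(step):
    """Hypothetical quantization matrix with the same step everywhere."""
    return [[step] * N for _ in range(N)]

def quantize(coeffs, qmatrix):
    """Divide each coefficient by its matrix entry and round to an integer."""
    return [[int(round(coeffs[u][v] / qmatrix[u][v])) for v in range(N)]
            for u in range(N)]

def count_nonzero(levels):
    return sum(1 for row in levels for value in row if value != 0)

# A flat block has only a DC coefficient.
flat = [[16] * N for _ in range(N)]
flat_levels = quantize(dct_2d(flat), uniform_qmatrix(16))

# A gradient block: a coarser step leaves fewer nonzero coefficients.
gradient = [[8 * x + y for y in range(N)] for x in range(N)]
fine = count_nonzero(quantize(dct_2d(gradient), uniform_qmatrix(4)))
coarse = count_nonzero(quantize(dct_2d(gradient), uniform_qmatrix(32)))
print("nonzero coefficients, step 4:", fine, " step 32:", coarse)
```

A larger quantization step zeroes more coefficients, which is what lets the entropy encoder produce fewer bits at the cost of image quality.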
Since sequential video frames often contain a lot of correlated information, ME-based approaches can achieve a higher compression ratio. For example, for NTSC standard resolution at 30 f/s, an H.264 encoder can encode video at 2 Mbps to achieve average image quality with a compression ratio of about 60:1. To achieve similar quality, MJPEG's compression ratio is about 10:1 to 15:1.
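These ratios can be sanity-checked with simple arithmetic. The article does not state the frame size or pixel format, so the sketch below assumes 720×480 pixels with 4:2:0 chroma subsampling (12 bits per pixel on average); under those assumptions the raw bit rate is about 124 Mbps, and 2 Mbps corresponds to roughly 62:1.

```python
# Assumed NTSC standard-definition parameters (not given in the article).
WIDTH, HEIGHT = 720, 480
BITS_PER_PIXEL = 12          # assumes 4:2:0 sampling, 8 bits per sample
FRAMES_PER_SECOND = 30

raw_bps = WIDTH * HEIGHT * BITS_PER_PIXEL * FRAMES_PER_SECOND

h264_bps = 2_000_000         # 2 Mbps, from the text
h264_ratio = raw_bps / h264_bps

# Bit rates implied by the MJPEG compression ratios quoted in the text.
mjpeg_bps_at_10 = raw_bps / 10
mjpeg_bps_at_15 = raw_bps / 15

print(f"raw video:     {raw_bps / 1e6:.1f} Mbps")
print(f"H.264 ratio:   {h264_ratio:.1f}:1")
print(f"MJPEG at 10:1: {mjpeg_bps_at_10 / 1e6:.1f} Mbps")
print(f"MJPEG at 15:1: {mjpeg_bps_at_15 / 1e6:.1f} Mbps")
```

Under the same assumptions, MJPEG at 10:1 to 15:1 needs roughly 8 to 12 Mbps per stream, four to six times the H.264 bit rate.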
MJPEG has several advantages over the ME-based approach. Foremost, JPEG requires significantly less computation and power consumption. Also, most PCs have the software to decode and display JPEG images. MJPEG is also more effective when a single image or a few images record a specific event, such as a person walking across a door entrance. If the network bandwidth cannot be guaranteed, MJPEG is preferred since the loss or delay of one frame will not affect other frames. With the ME-based method, the delay or loss of one frame will cause the delay or loss of the entire GOP, since the next frame cannot be decoded until the previous reference frame is available.
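The difference in loss behavior can be modeled with a short simulation. The sketch below assumes the simplest possible GOP structure, in which every non-initial frame references only the frame immediately before it; the function name `decodable_frames` and the frame counts are hypothetical. In this model, a lost frame makes the remainder of its GOP undecodable, while MJPEG loses only the missing frame.

```python
def decodable_frames(num_frames, lost, gop_size=None):
    """Return the indices of frames a receiver can decode.

    gop_size=None models MJPEG: every frame is independent.
    Otherwise the first frame of each GOP is independent and every other
    frame depends on the frame immediately before it.
    """
    decodable = []
    reference_ok = False
    for i in range(num_frames):
        is_independent = gop_size is None or i % gop_size == 0
        received = i not in lost
        if is_independent:
            reference_ok = received
        else:
            # Needs both this frame and an intact chain back to the GOP start.
            reference_ok = reference_ok and received
        if reference_ok:
            decodable.append(i)
    return decodable

lost = {1}  # a single frame lost in transit
mjpeg = decodable_frames(10, lost)
gop = decodable_frames(10, lost, gop_size=5)
print("MJPEG decodable:", mjpeg)
print("GOP=5 decodable:", gop)
```

With one lost frame out of ten, the MJPEG receiver still shows nine frames, whereas the GOP-based receiver recovers only at the start of the next GOP.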
Since many VSIP cameras have multiple video encoders, users can choose to run the most appropriate one based on the specific application requirement. Some cameras even have the ability to execute multiple codecs simultaneously with various combinations. MJPEG is typically considered to be the minimal requirement, and almost all VSIP cameras have a JPEG encoder installed.
MOTION JPEG IMPLEMENTATION
In a typical digital surveillance system, video is captured from a sensor, compressed and then streamed to the video server. It is undesirable to interrupt the video encoder task on a modern DSP architecture, since each context switch may involve saving a large number of registers and can cause cache thrashing. Thus, a heterogeneous architecture is desirable so that video capturing and streaming tasks can be offloaded from the DSP. The block diagram below illustrates an example of DSP/GPP processor architecture used in video surveillance applications.
When implementing Motion JPEG on a DSP/GPP SoC-based system, developers should first partition the function modules appropriately to achieve better system performance. The EMAC driver, TCP/IP network stack and HTTP server, which work together to stream compressed images to the outside, as well as the video capture driver and ATA driver, should all be implemented on the ARM to offload the DSP. The JPEG encoder should be implemented on the DSP core, since its VLIW architecture is particularly good at this type of computation-intensive task. Once the video frames are captured from the camera via the video input port on the processor, the raw image is compressed by the JPEG encoder, and the compressed JPEG image files are then saved to the hard disk on the board.
Typically, PCs are used to monitor a video scene in real time by retrieving the streams from the video server, then decoding and displaying them on the monitor. Encoded JPEG image files can be retrieved from the board via the Internet. Multiple streams can be monitored on a single PC. The streams can also be watched simultaneously from multiple points in the network. As a huge benefit over traditional analog systems, the VSIP central office can contact the video server through the TCP/IP network, and it can be physically located anywhere in the network. The single point of failure becomes the digital camera, not the central office. The quality of the JPEG images can also be dynamically configured to meet varying video quality specifications.
OPTIMIZING THE JPEG ENCODER
Of the three main function modules in a JPEG encoder, the DCT and the quantizer are the computationally intensive ones. The performance difference between highly optimized assembly code and unoptimized C code for these two modules can also be dramatic. Thus, optimizing these two modules is necessary.
Optimizing the 2-dimensional (2D) 8×8 DCT function means reducing the number of additions/subtractions and multiplications by removing the redundant computations in the original equation. Many fast DCT algorithms have been published, among which Chen’s algorithm is widely accepted by the industry. For the 2D 8×8 DCT, Chen’s algorithm requires 448 additions/subtractions and 224 multiplications.
These additions/subtractions and multiplications can further be distributed across multiple function units in the DSP core so that instructions execute in parallel, yielding better performance. Highly optimized DSP assembly code can finish a 2D DCT within 100 cycles, excluding overhead. Some other fast DCT algorithms require even fewer computations. However, they often require more buffer space to hold intermediate results. On a modern DSP with a pipelined VLIW architecture, loading data from or storing data to memory takes more cycles than a multiplication. Thus, it is important for developers to balance computation against memory access when optimizing the algorithm.
Quantizing each pixel requires a multiplication and an addition. The computation typically requires only 16-bit precision, while the DSP registers are 32 bits wide. The first objective in optimizing the quantizer module is to pack two pixels into one register and perform additions and multiplications on a pair of pixels. The second is to use multiple DSP function units in parallel. Since the DSP core in the TMS320DM6446 has two multipliers and two adders, up to four pixels can be quantized simultaneously. The last, but not least, goal is to take advantage of the pipelined DSP architecture. While the DSP core is quantizing the current four pixels, the next four can be loaded from memory so that data is fed to the multipliers and adders in every cycle. The first two objectives usually have to be realized by developers themselves, by writing optimized C code or assembly code. Pipelining the code can be left to the DSP compiler.
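The multiply-and-add form of quantization and the two-values-per-register packing can be modeled in ordinary integer arithmetic. The sketch below is an illustration only, not DSP code: it replaces division by the step with multiplication by a fixed-point reciprocal, and it emulates a 32-bit register holding two signed 16-bit lanes. The scaling constant `SHIFT`, the helper names and the requirement that the step be at least 2 (so the reciprocal fits in 16 bits) are all assumptions; no actual DSP instruction or intrinsic is implied.

```python
SHIFT = 15                 # assumed fixed-point scaling
HALF = 1 << (SHIFT - 1)    # rounding offset: the "addition"

def reciprocal(step):
    """Fixed-point reciprocal of the quantization step (step >= 2)."""
    return (1 << SHIFT) // step

def quantize_mul(value, recip):
    """One multiplication, one addition, then a shift. No division."""
    return (value * recip + HALF) >> SHIFT

def pack2(lo, hi):
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def unpack2(word):
    """Split a 32-bit word back into two signed 16-bit values."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16(word & 0xFFFF), s16((word >> 16) & 0xFFFF)

def quantize_packed(value_word, recip_word):
    """Quantize both 16-bit lanes of a packed word.

    On a DSP the two lanes would be handled by one packed instruction;
    here they are simply processed one after the other.
    """
    v_lo, v_hi = unpack2(value_word)
    r_lo, r_hi = unpack2(recip_word)
    return pack2(quantize_mul(v_lo, r_lo), quantize_mul(v_hi, r_hi))

recip = reciprocal(16)                       # step of 16
values = pack2(100, -100)
result = unpack2(quantize_packed(values, pack2(recip, recip)))
print(result)
```

When the step is a power of two the reciprocal is exact; for other steps the result is an approximation whose accuracy depends on the chosen `SHIFT`.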
Besides optimizing each function module, a PING-PONG buffering scheme needs to be deployed to optimize the JPEG encoder at the system level. The DSP core accesses data residing in internal RAM (IRAM) much faster than data in external DDR2 memory. However, IRAM is very limited in size and cannot hold the whole input frame. Thus, only a portion of the blocks is processed at a time in IRAM. While the PING (PONG) set of blocks is being processed, the PONG (PING) set is transferred by DMA from DDR2 to IRAM, so that the DSP core can start processing the next set immediately after completing the current one.
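The control flow of such a scheme can be sketched as follows. This is a sequential model only: a Python list stands in for the frame in DDR2, two small lists stand in for the PING and PONG buffers in IRAM, and a plain copy stands in for the DMA transfer, which on real hardware would run concurrently with the processing step. The function name and parameters are hypothetical.

```python
def encode_frame_ping_pong(frame_blocks, set_size, process):
    """Process a frame's blocks in sets using two alternating buffers.

    frame_blocks : all blocks of the frame (stands in for DDR2)
    set_size     : number of blocks that fit in one IRAM buffer
    process      : function applied to each block (e.g. a JPEG block encoder)
    """
    sets = [frame_blocks[i:i + set_size]
            for i in range(0, len(frame_blocks), set_size)]
    results = []
    if not sets:
        return results

    buffers = [None, None]        # buffers[0] = PING, buffers[1] = PONG
    buffers[0] = list(sets[0])    # initial transfer into PING

    for n in range(len(sets)):
        current = n % 2
        other = 1 - current
        if n + 1 < len(sets):
            # On hardware, a DMA transfer would be started here and would
            # proceed in the background while the current set is processed.
            buffers[other] = list(sets[n + 1])
        results.extend(process(block) for block in buffers[current])
    return results

# Ten dummy "blocks", four per buffer, with a stand-in processing function.
encoded = encode_frame_ping_pong(list(range(10)), 4, lambda b: b * 2)
print(encoded)
```

The point of the alternation is that the buffer being filled is never the one being read, so the processing loop does not have to wait on external memory once the first set has arrived.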
It is clear that the move to digital video surveillance systems is well on its way. Understanding video compression, system partitioning and codec optimization is key to developing next-generation video surveillance systems that meet the escalating demand.