Now that we have familiarized ourselves with the PIFS algorithm, we can dissect the Dimension patents 8351509 and 8639053, which handle, respectively, the main compression/decoding process and the upscaling process. In this post we will focus mostly on the former.
Although Dimension and TMM are different companies, their respective technologies have the same origin (which also explains why they are embroiled in a legal fight).
Although the compressor in the Iterated VDK was distributed as a closed application, its features (such as the lambda parameter, minimum and maximum block sizes, zoom flag, domain extent search quality, etc.) correlate strongly with those in the patents. Given that the patents also describe an antiquated algorithm, it is almost certain that they describe the VDK.
The essence of the encoder is to describe blocks in terms of other blocks, storing the location of the domain block along with a linear RGB color shift. If upscaling is chosen, the process simply occurs on an internally upscaled frame, with all the blocks and their locations enlarged accordingly. There is no fractal interpolation, just block copying. The decoder never performs any fractal iterations; it just copies blocks and adjusts their colors. If quality is a problem, blocks are subdivided. From this alone, the codec can never match the bandwidth of H.264 at a comparable quality level, let alone HEVC.
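In code, the decoder's core operation might look like the following sketch. The patents describe the operation, not an implementation, so all names here are hypothetical: copy a domain block onto a range block and apply a per-channel color shift, clamping to the 8-bit range.

```python
def clamp(v):
    """Clamp a channel value to the 8-bit range."""
    return max(0, min(255, v))

def copy_block(frame, rx, ry, dx, dy, size, shift):
    """Copy the size x size domain block at (dx, dy) onto the range block
    at (rx, ry), adding a per-channel RGB shift.  `frame` is a list of rows,
    each row a list of (r, g, b) tuples.  A minimal sketch of the decoder's
    core operation as described in the patent; not the actual implementation."""
    # Read the source pixels first so overlapping blocks behave predictably.
    src = [[frame[dy + j][dx + i] for i in range(size)] for j in range(size)]
    for j in range(size):
        for i in range(size):
            r, g, b = src[j][i]
            frame[ry + j][rx + i] = (clamp(r + shift[0]),
                                     clamp(g + shift[1]),
                                     clamp(b + shift[2]))
```

Subdividing a block for quality simply means emitting four such copies, one per quadrant, each with its own domain location and shift.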
Block matching takes place in YUV color space instead of RGB, with the Y channel favored, which makes matches easier to find. The compressed stream itself is written out in RGB, so the decoder can work entirely in RGB, which is faster.
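The exact conversion and weighting are not published; a sketch assuming BT.601-style luma coefficients and a hypothetical `y_weight` parameter for favoring the Y channel:

```python
def rgb_to_yuv(r, g, b):
    """BT.601-style RGB-to-YUV conversion.  The VDK's exact coefficients
    are not published, so these are an assumption."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v

def match_cost(px_a, px_b, y_weight=2.0):
    """Per-pixel difference cost that weights the Y channel more heavily.
    `y_weight` is a hypothetical knob; the real weighting is unknown."""
    ya, ua, va = rgb_to_yuv(*px_a)
    yb, ub, vb = rgb_to_yuv(*px_b)
    return y_weight * abs(ya - yb) + abs(ua - ub) + abs(va - vb)
```

Summing this cost over a candidate block gives the error that the search tries to minimize; the encoder performs the conversion, while the decoder never needs it.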
Like all modern codecs, the VDK distinguishes between I-frames and P-frames. An I-frame stores the entire frame, acting as a key frame that subsequent P-frames modify. These modifications are often much smaller than the whole frame, so P-frames tend to be stored more compactly. However, since each P-frame depends on the preceding frame, errors can accumulate if too many P-frames are used, so an I-frame is encoded every so often. Ideally, I-frames would be generated whenever a significant scene change occurs in the video, but that must be determined in advance; the VDK offers only a fixed P-frame count between I-frames. The input data is a set of Windows BMP images, one image per frame.
Similar to PIFS, the current frame is divided into nonoverlapping blocks. A quadtree, however, is used to try larger blocks first, normally 16 x 16 or 8 x 8. If a suitable domain block cannot be found, the large block is split into four smaller blocks of equal size, and a domain search is tried with each of them. If matches still cannot be found, the split-and-search process repeats. This continues until the minimum specified block size is reached, at which point the closest available match is used. If the minimum block size is 1 x 1, then a copy of the raw pixel itself is effectively encoded. This is very inefficient, however, so a minimum block size of 2 x 2 is preferred.
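The quadtree descent could be sketched as follows, with `find_domain` and `good_enough` standing in for the (unspecified) domain search and lambda acceptance test:

```python
def encode_block(frame, x, y, size, min_size, find_domain, good_enough):
    """Quadtree encoding sketch: try to match the whole block; if no
    acceptable domain block is found, split into four quadrants and recurse.
    `find_domain` returns (code, error); `good_enough` is the acceptance
    test (e.g. error below lambda).  All names here are illustrative."""
    code, error = find_domain(frame, x, y, size)
    if good_enough(error, size) or size <= min_size:
        return code            # at min_size, the closest match is kept
    half = size // 2
    return [encode_block(frame, x + dx, y + dy, half, min_size,
                         find_domain, good_enough)
            for dy in (0, half) for dx in (0, half)]
```

A leaf is a single code; a split produces a list of four children, mirroring the quadtree structure the decoder will walk.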
The domain block search proceeds in a spiral pattern around the range block, since analysis shows that the best match tends to lie close to the range block. A user-definable parameter controls how many blocks to test: the lower this number, the faster the search, but the poorer the quality. Another parameter, called “lambda”, controls how close a match must be before the search can end.
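One plausible ordering for the spiral (the exact traversal is not published) visits candidate offsets ring by ring, nearest first, so the search can stop as soon as a match beats the lambda threshold:

```python
def spiral_offsets(max_tests):
    """Return up to max_tests (dx, dy) offsets around the range block,
    ordered by ring (Chebyshev distance), nearest first.  A plausible
    reconstruction of the search order, not the VDK's actual table."""
    offsets = [(0, 0)]
    ring = 1
    while len(offsets) < max_tests:
        for dy in range(-ring, ring + 1):
            for dx in range(-ring, ring + 1):
                if max(abs(dx), abs(dy)) == ring:   # perimeter cells only
                    offsets.append((dx, dy))
        ring += 1
    return offsets[:max_tests]
```

Because the list is fixed, the same table can double as the lookup that turns a small index into a relative domain location in the stream.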
Spatial transforms such as flips and rotations are not used. If a match cannot be found, the block is simply split and its quadrants are then matched. The reasoning is probably that decoding would take too long if the inverse spatial transforms had to be performed.
There are four main codes the encoder writes to the compressed stream, which the patent mistakenly calls fractal codes:
A “do nothing” code, used to leave child blocks of split blocks alone. This also acts as a carry forward of the exact same block from the previous frame.
A 1:1 block copy, where a block is copied from the current frame at the same size.
A 2:1 copy, where the source block is scaled down 50% from the current frame and then copied. For efficiency, the entire frame is likely predownscaled into an auxiliary buffer before any blocks are searched.
A carry forward, where a block is copied from the previous frame. This is called a “recursive motion vector” in the patent, which is needlessly confusing.
Each of these codes has four “split” versions, where after the operation is performed (in the decoder) the block is split into four equal squares and a set of four codes follow, one for each child block.
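Putting the codes together, a decoder walk over one block and its optional children might look like the following sketch. The tuple layout and opcode values are illustrative, not the real bitstream:

```python
# Hypothetical opcode constants; the actual bitstream values are unknown.
DO_NOTHING, COPY_1TO1, COPY_2TO1, CARRY_FORWARD = range(4)

def decode_block(code, blit, x, y, size):
    """Walk one code and its optional four children.  `code` is a tuple
    (op, payload, children), where children is None or a list of four child
    codes (the "split" versions); `blit` performs the actual block copy and
    color shift.  Illustrative structure only."""
    op, payload, children = code
    if op != DO_NOTHING:
        blit(op, payload, x, y, size)      # copy source block, apply shift
    if children:                           # split: recurse into quadrants
        half = size // 2
        for (dx, dy), child in zip(((0, 0), (half, 0), (0, half), (half, half)),
                                   children):
            decode_block(child, blit, x + dx, y + dy, half)
```

Note how cheap this is for the decoder: every operation is a block copy or a no-op, with no transforms and no iteration.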
The encoder prioritizes the above codes. For each block being encoded, it starts with a “do nothing”, then tries to find a same-size block in the current frame. If a double-sized block in the current frame would be better, it uses that. If a carry-forward from the previous frame is better still, it uses that instead. Finally, if any of the four child blocks have an even better match, the code is given the split attribute.
Like PIFS, the VDK stores the domain block’s location and color shift. The color shift is actually three values, one for each RGB channel, each ranging from -128 to +127 (in an 8-bit RGB system). Domain block locations are relative to the range block and are expressed as 8-bit indices into a spiral search pattern lookup table.
Lastly, the values in the codes are Huffman encoded for additional space savings.
Upscaling (normally 200%) is done by resampling the current frame into a higher-resolution buffer and then simply compressing that buffer. The block sizes and block distances are scaled up to match, so that if the minimum specified block size was 1 x 1, it is now actually 2 x 2 (or if the minimum was 2 x 2, it is now 4 x 4). This defeats the purpose of zooming, but since some 2 x 2 blocks carry more detail than they otherwise would, some extra sharpness is gained. The overall result, however, is noisier and blockier than traditional bicubic upscaling. Zoomed and unzoomed file sizes tend to match closely.
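The scaling relationship amounts to multiplying every coordinate in a code by the zoom factor, which can be sketched as follows (the field names are hypothetical):

```python
def scale_for_zoom(codes, zoom=2):
    """Enlarge every block's position, size, and domain offset by the zoom
    factor, mirroring how the VDK's block coordinates grow when the frame
    is internally upscaled.  The dict fields are illustrative, not the
    real bitstream layout."""
    return [{**c,
             "x": c["x"] * zoom, "y": c["y"] * zoom,
             "size": c["size"] * zoom,
             "dx": c["dx"] * zoom, "dy": c["dy"] * zoom}
            for c in codes]
```

Since every quantity scales uniformly, no new information is created: the zoomed stream describes the same blocks, just larger.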
Specifying a zoom flag (or a zoom scale) is unnecessary when encoding with PIFS, because fractal interpolation occurs solely during decoding. The fact that the zoom settings must be specified with the VDK encoder is another clue that the upscaling is not done with true fractal interpolation — the encoder needs the zoom level because it internally performs a static upscaling prior to compression. The VDK encoder also takes longer to encode a zoomed video, while in PIFS, any extra processing time for upscaling occurs only when decoding. But since PIFS-style decoding takes too long, the VDK replaces it with merely magnified block copying.
An earlier 1999 patent, 5,982,441, discusses a precursor to the VDK. It contains this interesting paragraph:
“Finally, the method of the present invention supports a property known as zooming. Specifically, decompressor may use extrapolation or interpolation techniques to generate an image which is larger than the image being stored in frame buffer. No matter how decompressor generates such an image, it may thereafter generate a current frame by simply multiplying the motion vectors transmitted from compressor by the corresponding ratio factor so that decompressor is simply moving larger blocks of data from frame buffer to generate the larger current frame at the receiver site. Thus, the only factor affecting image fidelity is the process for generating the larger image in the first place.”
So to summarize, the encoder/decoder is a fancy block copier, but nothing more. Decoding can be done in software if implemented in assembly language, although “rich” frames or I-frames might take dangerously long. “Fractal zooming” is an abused term: the “2:1 copy” code does exploit self-similarity at different scales, but this does not do anything remotely similar to fractal interpolation.
The rationale was clearly to reduce computational costs. A true PIFS codec would never have worked in the 1990s, and still would not work today. But by cutting so many corners, too much was sacrificed. An entirely different approach was needed, and that was where DCT-based algorithms came in. Even Iterated abandoned SoftVideo for DCT approaches.
Given the enormous computational costs of block searching (in the upscaler as well), it is unlikely that a realtime implementation will be available anytime soon. As of this writing, TMM has, with a 20-node cluster of 12-core servers, reduced an encoding session of a 52-minute 4K video from 169 hours to 87 minutes. For true realtime, however (to support live broadcasting), they would need to parallelize not just GOPs (groups of P-frames) but the tasks occurring within the encoding of each individual frame. Meanwhile, one can, with a single PC, encode an HEVC movie at rates faster than realtime, in software or with only a GPU.
To be fair, such costs would be bearable if the quality and compression ratios were an order of magnitude better than HEVC, but ironically, the reverse is true.