Well, it looks like I have run out of things to talk about. But I will keep an eye on the marketplace and check back in maybe a few months.
Is upscaling worth it? If one has no other options, then certainly. But otherwise, it is worth comparing to actual high-resolution content.
Below are a color upscaling produced by our improved upscaler and the same part of the test image taken from the original high-resolution photo. Aside from sharp edges, the original content is clearly more detailed. This is not surprising, since even the best upscaler has only one-fourth the number of pixels to work with.
One of the fundamental problems with the VDK is that it performs two image processing passes, one to upscale and the other to compress. Both passes use block copying, which introduces errors, and the errors from the upscaler are compounded by the compressor.
This might be why Dimension went for upscaling only: they figured that dropping the compression feature would be an easy way to eliminate a large error source. In TMM’s case, they need to do both passes together with a single algorithm (or use an upscaler of better quality), greatly improve the bandwidth utilization, and still have it work fast enough for realtime decoding.
It is a hard job. Block copying is decode-friendly, but quality is poor unless blocks are split. But block splitting hurts the compression ratio. It is damned if you do and damned if you do not. There is just not enough compression at an acceptable quality level. Whatever the work with Raytheon accomplishes, I suspect it will include a sizeable departure from the block copying methodology, so we could be waiting a while.
Although I mentioned at the start of this blog that all image tests would be done in grayscale, I thought it would be nice to show the simple upscaler working in the full RGB color space. Each red/green/blue channel was processed as a grayscale image and then reintegrated for final output.
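The channel split-process-merge step can be sketched in a few lines. The function names here are my own, and the stand-in upscale_gray is plain 2x pixel replication rather than the block-matching upscaler discussed in these posts; the image is assumed to be stored as rows of (r, g, b) tuples.

```python
# Per-channel color processing: split RGB into three grayscale planes,
# upscale each plane independently, then reintegrate for final output.

def upscale_gray(plane):
    # Stand-in upscaler: 2x pixel replication.  Swap in any grayscale
    # upscaler that doubles both dimensions.
    out = []
    for row in plane:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def upscale_rgb(img):
    planes = []
    for ch in range(3):  # extract the R, G, B planes as grayscale images
        plane = [[px[ch] for px in row] for row in img]
        planes.append(upscale_gray(plane))
    r, g, b = planes
    # zip the three upscaled planes back into one RGB image
    return [[(r[y][x], g[y][x], b[y][x]) for x in range(len(r[0]))]
            for y in range(len(r))]
```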
Here is the upscaled color Lenna, from 512 x 512 to 1024 x 1024:
Here is a section showing a comparison with bicubic upsampling:
I modified my simple block-copying upscaler to enlarge pixels instead of blocks. Like the Dimension upscaler, it does matching by using a 3 x 3 block centered around the target pixel to act as a context, and then searches for matching 6 x 6 blocks. The center 2 x 2 block is then used from the best matching 6 x 6 block. The earlier problem of edges having “nicks” is gone, but the image tends to be more posterized (or the “oil painting” effect is stronger). Some small details such as lower eyelashes are washed out. Still, it is noticeably sharper than a bicubic upsampling.
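The pixel-enlarging variant can be sketched as follows. The function names, the averaging downsample, and the sum-of-squared-differences matching metric are my own assumptions, not the actual code; the image is a grayscale list of rows of 0-255 ints.

```python
# Pixel-based upscaler: for each source pixel, use its 3x3 neighbourhood as a
# context, find the nearby 6x6 block whose 2:1 downsample best matches that
# context, and copy the center 2x2 of that block into the output.

def downsample_2x(img, y, x, size):
    """Average 2x2 cells of a (size*2 x size*2) region into a size x size block."""
    return [[(img[y + 2*r][x + 2*c] + img[y + 2*r][x + 2*c + 1] +
              img[y + 2*r + 1][x + 2*c] + img[y + 2*r + 1][x + 2*c + 1]) // 4
             for c in range(size)] for r in range(size)]

def pixel_upscale_2x(img):
    h, w = len(img), len(img[0])
    out = [[0] * (2 * w) for _ in range(2 * h)]
    for y in range(h):
        for x in range(w):
            # 3x3 context centered on the target pixel (clamped at edges)
            ctx = [[img[min(max(y + r, 0), h - 1)][min(max(x + c, 0), w - 1)]
                    for c in (-1, 0, 1)] for r in (-1, 0, 1)]
            best, best_err = None, None
            for dy in range(-5, 6):        # search the surrounding area
                for dx in range(-5, 6):
                    sy, sx = y + dy, x + dx
                    if sy < 0 or sx < 0 or sy + 6 > h or sx + 6 > w:
                        continue
                    shrunk = downsample_2x(img, sy, sx, 3)
                    err = sum((shrunk[r][c] - ctx[r][c]) ** 2
                              for r in range(3) for c in range(3))
                    if best_err is None or err < best_err:
                        best_err, best = err, (sy, sx)
            sy, sx = best
            # copy the center 2x2 of the best 6x6 block into the output
            for r in range(2):
                for c in range(2):
                    out[2*y + r][2*x + c] = img[sy + 2 + r][sx + 2 + c]
    return out
```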
I have included upscalings (from 512 x 512 to 1024 x 1024) of both my test image and the classic Lenna.
TMM’s involvement of Raytheon implies that TMM could not figure out how to make their own technology work. This is understandable. Many smart people have tried and failed, and mathematics has not evolved to reveal any new directions.
An interesting trend these days is a resurgence in AI (artificial intelligence). Companies with vast databases like Google and IBM use large scale data mining and statistical techniques to make intelligent — or at least practical — guesses about things, like voice recognition or playing Jeopardy.
What we really want an image or video codec to do is upscale creatively. If we gave a person a photo of a landscape, he could enlarge it and make it genuinely higher in resolution. Maybe he would start by using bicubic resampling or even Perfect Resize, but then he would make sensible adjustments to add detail. Or maybe he would surmise the gist of the original image, and create a whole new one using it as a guide.
Obviously a human would take a long time to do so. So too would a computer. And to what end? We could shoot photos and movies with low-resolution cameras and then upscale them after, or we could just use high-resolution cameras in the first place.
A computer would need to extract geometry and texture from a video and then rerender. This is still well beyond the capabilities of machine vision. The computer would actually need to know what it was looking at, and for small features, it would need to make logical guesses based on context.
How much could we upscale? At two to four times, the details we need to add comprise three to fifteen times extra information. The higher we go, the more creative we need to be. If an astronomer gave us a picture of a blue dot and wanted it to be upscaled to a poster, we could draw a near-infinite set of richly detailed blue planets. Since each one has an equal probability of being right (they can all be downsampled to match the blue dot), upscaling beyond a certain level creates a total breakdown in certainty and consistency.
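The arithmetic behind “three to fifteen times extra information” can be checked directly. The helper name below is my own:

```python
# An s-times linear upscale multiplies the pixel count by s*s, so the
# upscaler must invent s*s - 1 new pixels for every original pixel.

def extra_info(linear_factor):
    return linear_factor ** 2 - 1

assert extra_info(2) == 3    # 2x upscale: 3 pixels invented per original
assert extra_info(4) == 15   # 4x upscale: 15 pixels invented per original
```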
If creative upscaling could be automated, it is unlikely to be deployed for realtime decoding due to the computational demands. Instead, it would be a tool with which to create footage prior to distribution. We could take legacy film, enlarge it to fit the highest-resolution TVs and cinemas, and then just encode it with ordinary codecs.
In fact, you can do this today: extract frames from a raw format video as individual bitmap files, batch process them with Perfect Resize, and then stitch the resulting larger bitmaps into a new, higher-resolution movie.
I wrote a simple upscaler that replaces each 3 x 3 block in an image with the best matching nearby 6 x 6 block. No color shifts or thresholding are performed, and block transforms could probably be skipped too. Searching is limited to the 121 blocks in the 11 x 11 neighbourhood around the block to be upscaled.
Blockiness aside, it definitely sharpens high-contrast edges. As with Perfect Resize, one can see the “oil painting” effect start to come in. The image below was upscaled from 510 x 510 pixels to 1020 x 1020 and then cropped slightly.
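For concreteness, here is a minimal sketch of that block-copying upscaler. The function names, the averaging downsample, and the sum-of-squared-differences matching metric are my own choices; the image is a grayscale list of rows of 0-255 ints, with dimensions assumed divisible by 3 for brevity.

```python
# Block-copying upscaler: each 3x3 block is replaced by the nearby 6x6 block
# whose 2:1 downsample best matches it, doubling the image in each dimension.

def downsample_2x(img, y, x, size):
    """Average 2x2 cells of a (size*2 x size*2) region into a size x size block."""
    return [[(img[y + 2*r][x + 2*c] + img[y + 2*r][x + 2*c + 1] +
              img[y + 2*r + 1][x + 2*c] + img[y + 2*r + 1][x + 2*c + 1]) // 4
             for c in range(size)] for r in range(size)]

def block_upscale_2x(img):
    h, w = len(img), len(img[0])
    out = [[0] * (2 * w) for _ in range(2 * h)]
    for by in range(0, h - 2, 3):
        for bx in range(0, w - 2, 3):
            target = [row[bx:bx + 3] for row in img[by:by + 3]]
            best, best_err = None, None
            # search the 11 x 11 = 121 candidate positions around the block
            for dy in range(-5, 6):
                for dx in range(-5, 6):
                    sy, sx = by + dy, bx + dx
                    if sy < 0 or sx < 0 or sy + 6 > h or sx + 6 > w:
                        continue
                    shrunk = downsample_2x(img, sy, sx, 3)
                    err = sum((shrunk[r][c] - target[r][c]) ** 2
                              for r in range(3) for c in range(3))
                    if best_err is None or err < best_err:
                        best_err = err
                        best = [row[sx:sx + 6] for row in img[sy:sy + 6]]
            # the winning 6x6 block becomes the upscaled 3x3 block
            for r in range(6):
                for c in range(6):
                    out[2 * by + r][2 * bx + c] = best[r][c]
    return out
```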
It has been suggested that Dimension and TMM are trying to improve the old Iterated SoftVideo technology. This is a reasonable guess, given that the technology was considered obsolete shortly after being introduced.
Since neither company will divulge their research, we can only speculate as to the efforts they may be trying. TMM’s recent enlistment of help from Raytheon, while positive, is also disappointing in that TMM could not succeed on their own. Clearly, forging ahead is a difficult undertaking.
Dimension appears to be employing a different strategy, by dropping compression to focus on realtime upscaling. They hope to find a solution by narrowing down the problem domain.
It is worth noting what others have tried. Iterated had millions of dollars and a bright team of engineers, and they switched to DCT and wavelet techniques. Google has billions of dollars and even more bright engineers, and their VP8 and VP9 codecs are DCT based. The HEVC consortium has similar resources, and they stayed with DCT also. Finally, there are countless fractal imaging research papers written by very clever people at places like the University of Waterloo, and in all this time not a single one has led to a commercial product.
I cannot hope to best such efforts, so I only offer some basic ideas.
SoftVideo gutted the PIFS algorithm to achieve realtime decoding, but at the expense of compression ratio and quality. Could restoring PIFS be an option?
PIFS unfortunately does not offer enough compression for typical imagery, and takes too long to decode. As a result, it has been used only for non-realtime upscaling, in products like Genuine Fractals and Perfect Resize. Part of the problem is that PIFS requires serious filtering to look good, which increases the decoding time.
Gaining better compression is hard. Using larger blocks in the quadtree only works for images that have correspondingly large areas of similar color. DCT systems can do this because they can expand the waveform tables to cover the extra patterning possibilities of larger blocks, but in fractal systems it makes finding block matches exponentially harder.
Using nonsquare (rectangular) blocks might help, as blocks can be better fitted to the imagery. However, the block shape must then also be encoded in the file. Depending on how blocks are allowed to split, the extra data need not be excessive, so there is some potential here. The block shape variability must also be taken into account in the decoder, although this should not be a problem.
If we limit upscaling to blocks which happen to lie on region edges larger than the block, then a tiny search of the immediate area can work. This is essentially the Dimension approach. An optimization would be to examine the block for sufficient contrast and skip block searching if the contrast is too low. This will cause the upscaling quality to vary, but hopefully the absence of upscaling in low-contrast blocks will not be noticeable. The assumption here is that the time spent determining contrast is significantly less than the time spent searching.

On a GPU or FPGA, however, the savings may be moot, because the output frame buffer cannot be released for display until all the shader units have finished executing; even if only one shader unit needs to do a block search, all the others will be waiting for it to finish. On the other hand, dividing the work into two shader passes might work: one to analyze for contrast and develop a mask, and the other to do block searches for the blocks indicated by the mask.
The only downside is that many small high-contrast edges would be left unscaled. A review of the quality will need to wait until I can prototype the necessary software.
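The mask-building first pass can be prototyped in a few lines. The threshold value here is an arbitrary placeholder, and the max-minus-min contrast measure is my own assumption:

```python
# Pass one of the two-pass idea: measure per-block contrast and build a mask.
# Pass two would run the expensive block search only where the mask is set.

CONTRAST_THRESHOLD = 16  # hypothetical value; would need tuning

def contrast_mask(img, block=3):
    h, w = len(img), len(img[0])
    mask = []
    for by in range(0, h - block + 1, block):
        row = []
        for bx in range(0, w - block + 1, block):
            vals = [img[by + r][bx + c]
                    for r in range(block) for c in range(block)]
            # contrast here is simply the value range within the block
            row.append(max(vals) - min(vals) >= CONTRAST_THRESHOLD)
        mask.append(row)
    return mask
```

A flat region produces an all-False mask, so every shader unit in the second pass can skip searching and finish together.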
TMM recently announced that it was working with Raytheon to conduct research. This got me thinking about the applications Raytheon might deploy.
Because fractal compression exploits similarities between different parts of an image, it is unsuitable for surveillance applications. The basic problem is that it is based on the collage theorem (in the case of PIFS and SoftVideo, block copying), and a collage is not something you want when doing surveillance.
When an image is zoomed using traditional methods, each pixel being enlarged is considered alongside its immediate neighbours. Data from other parts of the image is not considered and can therefore not “pollute” the area being enlarged.
In a collage system however, the pixels used to enlarge an area may come from anywhere else in the image. In video, they can even come from a previous frame. I will illustrate the danger with a simple example.
Imagine a picture of two men, both standing in the same way and both facing the camera, and both with a similar head shape. Their facial features are also similar but they are not identical. One of the men is standing twice as close to the camera and so he appears twice as large.
When this image is fractally compressed, a block containing the large man’s face will be shrunk to half size and compared to the block containing the small man’s face. The blocks will be similar enough that a match will be declared found, and a fractal code relating the two blocks will be written to the file.
During decompression, the face of the first man will be constructed by sampling the pixels from the face of the second man. If fractal zooming is used, the first man will look more like the second man.
Now, anyone using this decompressed image for surveillance will face three problems:
1. The identity of the first man has become lost, since his face is determined solely from the pixels in the face of the second man.
2. If the first man was a suspect, the second man will be incorrectly suspected instead.
3. The two men may be believed to be twins when they are not even brothers, or related to each other at all.
This example is a little extreme, but you see the point. All sorts of other erroneous situations can occur. A bunch of car keys held in someone’s hands may be substituted by a gun held by another. The color of a person’s eyes may be shifted. An identifying birthmark or tattoo may be altered.
Long story short, all the details that appear when zooming can come from anywhere, and they may be significantly different from what would have been seen if the original scene had simply been viewed twice as close. If an investigator is biased towards interpreting a particular cluster of pixels, his bias may be erroneously strengthened when seeing them upscaled, as he is now effectively looking at a different part of the image.
I wanted to talk a bit more about a subject that — despite good efforts — keeps getting misunderstood, especially by newcomers to digital graphics.
The term “resolution independence” does not mean that one can zoom into a picture indefinitely and see ever more detail. This is mathematically impossible. Instead, what is meant is that a picture is described in a way that it can be rendered out to any desired resolution. For example, video game objects are described as polygons, whose edges are always crisp no matter how large they appear.
For pixel-based images such as photographs, the best we can do when upscaling is to keep the apparent edges of differently-colored regions sharp. This is done by finding some geometrical way to describe the pixel regions, and then render to the desired size using said geometry.
In the case of fractals, and the PIFS algorithm in particular, the image is described as a set of recursively iterated blocks, which produces a sharp but quasi-random blocky look. Other algorithms try to convert groups of similar pixels into spline paths or polygons, whose edges can be scaled without pixellation.
None of these approaches can increase actual detail, e.g. showing pores as a person’s skin is enlarged, or numbers on a distant licence plate becoming readable. Despite what popular crime-fighting TV shows would have us believe, one cannot enhance video this way. Where detail is added, it has far more chance of being random than meaningful.
Consider an analogy to compression, because we are actually trying to describe a picture with many pixels using fewer pixels. A file cannot be infinitely compressed down to a single bit or byte, because at some point the number of different files that can be represented is too few. A single byte has eight bits, giving 256 different possible meanings, which means it can only expand, at best, into 256 different possible files.
If any file, no matter how little redundant data it contained, could be compressed, then we could simply compress huge files over and over until they shrank to a single byte. But we already know that a byte cannot come anywhere close to representing the billions of files people have. Once a file has no redundant data, it cannot be compressed further.
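The counting argument can be checked in a few lines. The helper name is my own:

```python
# Pigeonhole argument: there are more distinct N-bit files than files of
# N-1 bits or shorter, so no lossless compressor can shrink every file.

def num_files(bits):
    """Number of distinct files a given number of bits can represent."""
    return 2 ** bits

# A single byte can stand for at most 256 distinct files.
assert num_files(8) == 256

# Files of at most N-1 bits number 2^0 + 2^1 + ... + 2^(N-1) = 2^N - 1,
# one fewer than the 2^N files of exactly N bits, so at least one N-bit
# file has no shorter encoding.
N = 20
assert sum(num_files(b) for b in range(N)) == num_files(N) - 1
```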
For images, imagine trying to zoom into a single gray pixel. What could it possibly resolve into when enlarged to, say, 100 x 100 pixels? A 2 x 2 block of black pixels could be enlarged to a black square or to a circle, but there is not enough information to say which shape is the right one. The ambiguity results from information that was irretrievably lost when the image was first created, and computers are not artists able to creatively decide how to fill in the gaps.
If such magic were possible we would hardly need, for example, telescopes. Astronomers could just snap pictures of the night sky with their smartphones and enhance them until the surfaces of distant planets were visible.