Posts Tagged ‘codecs’

SVT Provides Great Coverage, if Not Top Quality

Posted by Stefan Holmer
on March 2nd, 2010 in Technology

Sweden has an exceptional public service television broadcaster called SVT. During the Winter Olympics, SVT broadcast the most popular events, sometimes two or three at a time on different channels. As if that were not enough, they also have a free web service called SVT Play, where they streamed almost all of the Olympic competitions in good quality. Finally, for those really devoted to sports, SVT even has an iPhone app for watching the broadcasts on the go.

At first glance, the quality of the online stream appears to be really good. The stream is encoded at a bit rate of 810 kbps and defaults to Flash, probably with On2 VP6 as the codec. For Windows users, Windows Media is also available.

However, most people probably prefer to watch the stream in full screen mode. This is where I think SVT Play fails to deliver. If you look at the image below, which is part of a full screen video sequence on SVT Play, you can see severe aliasing at the edges. It is most apparent at the edge between the man’s neck and his shirt.

[Image: tommy_fs, detail from a full screen SVT Play video]

The aliasing appears due to bad, or nonexistent, interpolation when upsampling the images. Whether the problem lies with Flash or with SVT Play I cannot tell for sure, but we can at least assume it is solvable, since a YouTube video (which also uses Flash) looks good in full screen, as demonstrated in the screen capture of a section of a full screen YouTube clip below.

[Image: zombieland, detail from a full screen YouTube clip]
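To illustrate the difference interpolation makes, here is a minimal Python sketch that upsamples a single row of pixel values across a dark-to-bright edge, once by plain sample repetition and once with linear interpolation. It is only an illustration of the general idea; the function names are made up for this example, and it says nothing about what Flash or SVT Play actually do internally.

```python
def upsample_nearest(row, factor):
    """Repeat every sample `factor` times: upsampling with no interpolation."""
    return [value for value in row for _ in range(factor)]


def upsample_linear(row, factor):
    """Insert linearly interpolated samples between neighbouring pixels."""
    out = []
    for left, right in zip(row, row[1:]):
        for k in range(factor):
            out.append(left + (right - left) * k / factor)
    out.append(row[-1])  # keep the final original sample
    return out


def max_step(row):
    """Largest jump between adjacent samples; big jumps show up as hard, blocky edges."""
    return max(abs(b - a) for a, b in zip(row, row[1:]))


# Hypothetical pixel row crossing a dark-to-bright edge (e.g. neck against shirt).
edge = [0, 0, 255, 255]
print(max_step(upsample_nearest(edge, 4)))  # the full jump survives the upsampling
print(max_step(upsample_linear(edge, 4)))   # the jump is spread over several pixels
```

With repetition, the edge stays a single hard step that is now several pixels wide, which is what makes diagonal edges look like staircases in two dimensions. With interpolation the transition is spread out and the edge looks smoother.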

There has recently been a lot of discussion about the video tag in HTML5, and what codecs to use with it. Some prefer license-free codecs, while others prefer the best possible performance. But one thing is for sure: regardless of how good your codec is, the experience is what matters most. Bad post-processing will always have the last say, no matter how many bits and CPU cycles you spend on encoding your video source.

The “Secret” HP Sauce

Posted by Mats Perjons
on September 23rd, 2009 in Industry News

Yesterday, HP launched a new video conferencing tool called SkyRoom that will be available for $149 per client and free on select HP business desktop and mobile workstations. From their marketing videos, it definitely looks like a nice collaboration tool, so I am looking forward to installing it and giving it a try.

Despite the excitement, I did raise my eyebrows a bit when I read that Jeff Wood, director of worldwide marketing for workstations at HP, stated that “SkyRoom is basically a codec” and that “the secret sauce is a HD codec developed over the years that can take info from host system. It’s very good. It’s been used by NASA on the Mars Rover program.”      

The setup documents and specifications state that H.264 and MPEG2 are used for video and MPEG for audio. Wow, so that is where they found the codecs. I didn’t realize that a widely deployed industry standard could be considered a “secret”, but I guess this proves that the codec can deliver high quality video to a variety of applications. 

Unfortunately, to use SkyRoom, each participant needs to be on the same VPN because the application does not traverse firewalls. In addition, echo cancellation does not work on laptops, making desktop conferencing less desirable. Nevertheless, I cheer HP’s innovation and look forward to the proliferation of high quality video throughout the collaboration market.

What Needs to Happen for Video Conferencing Interoperability?

Posted by John Hermansen
on July 13th, 2009 in Market Trends, Technology

Carol Wilson wrote a good piece today about the need for telepresence interoperability. She claims that, although Cisco appears to be dragging its feet in offering a solution that interoperates with other major players, it is really up to service providers and end users to demand open solutions.

I found the article especially interesting in light of a webinar that Cisco sponsored on the topic of desktop video conferencing a few weeks back. Mike Sonnier, the presenter from Cisco, spent quite a bit of time discussing the need for interoperability in the video conferencing market. In a way, it makes total sense that Cisco would be more concerned with interoperability on the desktop than in telepresence. Telepresence requires a captive audience in an expensively appointed room used only for video conferencing. The sheer overhead cost of the solution dictates that Cisco would want to sell as many rooms as possible while making sure that all participants in a given call are in Cisco-enabled rooms. However, the ubiquity and mobility offered by desktop solutions, combined with users’ expectations and the relatively low cost per unit, mean that interoperability is likely to come sooner to the desktop.

Whether it is on the desktop or in high-end telepresence solutions, most of the technical hurdles to interoperability have been overcome, even if the business case isn’t quite there yet. However, the different video conferencing players still need to agree on which standards to support. As I see it, for disparate systems to interoperate, three technical components need to be shared by each side of a call.

First, each side needs to be able to establish a connection with the other by using the same call setup protocol. This had been an issue for quite some time, but once Cisco launched Call Manager 5.0 in 2006, it became apparent that SIP was truly the standard that everyone needed to support.

The second component that video conferencing solutions need to share is a video codec. In order for each side to see the other, they need to be able to encode and decode images using the same technology. While the field is still evolving, and there are certainly plenty of proprietary options available, it seems like H.264 is the codec of choice for the foreseeable future.

Finally, just as with video, vendors need to supply solutions with the same voice codec on each end of the call in order to interoperate. Even though telephony, and VoIP in particular, is much more mature and has been around longer than videoconferencing, there is a slew of voice codecs available to vendors, making the decision of which to integrate a difficult one. Videoconferencing providers can certainly just integrate every codec available in the hope that the other side will support one of the options, but it is much more cost effective to agree on a small number of codecs and avoid unnecessary licensing costs. The telephony world decided on two standards long ago (G.711 and G.722), but the market for the HD (or wideband) voice codecs that are necessary for video conferencing is like the Wild West right now.
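The three shared components can be pictured as a simple negotiation: a call only works if the two endpoints have at least one call setup protocol, one video codec and one voice codec in common. The Python sketch below is a deliberately simplified illustration of that idea. The endpoint dictionaries and function names are hypothetical, and real systems negotiate through the signaling protocol itself rather than with code like this.

```python
def first_common(preferred, supported):
    """Return the caller's most preferred option that the callee also supports."""
    for option in preferred:
        if option in supported:
            return option
    return None


def negotiate_call(caller, callee):
    """Agree on one option per component, or return None if any component has no match."""
    agreed = {}
    for component in ("signaling", "video", "audio"):
        choice = first_common(caller[component], callee[component])
        if choice is None:
            return None  # a single missing component is enough to break interoperability
        agreed[component] = choice
    return agreed


# Hypothetical endpoints; each list is in order of preference.
endpoint_a = {"signaling": ["SIP"], "video": ["H.264", "H.263"], "audio": ["G.722", "G.711"]}
endpoint_b = {"signaling": ["H.323", "SIP"], "video": ["H.263", "H.264"], "audio": ["G.711"]}
endpoint_c = {"signaling": ["SIP"], "video": ["H.264"], "audio": ["iSAC"]}

print(negotiate_call(endpoint_a, endpoint_b))
print(negotiate_call(endpoint_a, endpoint_c))
```

In this example the first pair can talk, although only by falling back to narrowband G.711 for audio, while the second pair fails purely because the two sides share no voice codec. That is the situation agreement on a small set of HD voice codecs would avoid.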

In light of this need for agreement on which HD voice codec to support, GIPS is sponsoring a webinar Tuesday, July 14 titled, “It’s All Gone HD: Overcoming the Hurdles to Supplying HD Voice and Resolving the Codec War”. To register for the webinar, visit the USTelecom site here.

Traffic Shaping for HD Video

Posted by Stefan Holmer
on April 23rd, 2009 in Technology

To me, real-time video communication is essentially about three things:

  1. Estimating the available resources, such as computation power, channel capacity and quality.
  2. Making the best use of those resources.
  3. Protecting against network impairments such as jitter and packet losses.

The best way to achieve the second goal, in my opinion, is to utilize traffic shaping, i.e. shaping the video traffic so that we optimize the quality experienced by the user. Typically we have a bandwidth limit that we must stay below. The most common way to do this is by changing the quantization step size and/or the frame rate until the bit rate fits within the limit.

As network capacity increases, and consumers demand more bandwidth-intensive applications, we approach a situation where even people making their trans-Atlantic call to mom want to use HD video. Unfortunately, the varying quality of many internet routes, and the huge variance in available bandwidth due to cross traffic, also mean that the quantization step size and the frame rate will vary a lot. For instance, consider the case of a real-time HD video call over a channel with 2 Mbit/s of available bandwidth. The image is sharp and we have a smooth flow. Suddenly we get a lot of cross traffic, reducing our available bandwidth to about 500 kbit/s. Now the application must decide how to combine increased quantization step size and decreased frame rate. In other words, what is the lowest image quality and most jerkiness the user is willing to tolerate?

If we look at it from another perspective, in the past we had lower bandwidth and were using lower video resolutions. Now we have more available bandwidth and are thus using higher resolutions, up to HD. If the bandwidth along the route over which we’re making our call varies a lot, should we use HD even though at times we must conform to bit rates as low as 500 kbit/s? Wouldn’t it be better to lower the resolution?

At GIPS we solve this problem by automatically spatially sub-sampling each frame before encoding if we notice that the available bandwidth has dropped too low for the preferred frame size. At the decoder side we then up-sample the image again after decoding. In this way we have added an additional parameter to tune when shaping traffic, allowing for the best possible end-user experience.
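As a rough illustration of the kind of decision involved, the Python sketch below picks a downscale factor from the available bit rate by looking at how many bits are available per pixel. It is not the GIPS algorithm: the bits-per-pixel threshold, the power-of-two scale steps and the function name are all assumptions made for this example.

```python
def choose_downscale(bitrate_bps, width, height, fps, min_bits_per_pixel=0.05, max_scale=8):
    """Return the smallest power-of-two downscale factor (1, 2, 4, ...) for which the
    bit budget per encoded pixel reaches `min_bits_per_pixel` (an assumed threshold)."""
    scale = 1
    while scale < max_scale:
        pixels_per_second = (width // scale) * (height // scale) * fps
        if bitrate_bps / pixels_per_second >= min_bits_per_pixel:
            break  # enough bits per pixel at this frame size
        scale *= 2  # halve width and height and try again
    return scale


# The scenario from the text: a 720p call at 30 frames per second.
print(choose_downscale(2_000_000, 1280, 720, 30))  # plenty of bandwidth
print(choose_downscale(500_000, 1280, 720, 30))    # cross traffic cuts the bandwidth
```

With these assumed numbers, the 2 Mbit/s case keeps the full frame size, while the 500 kbit/s case halves the width and height so that each remaining pixel gets roughly the same bit budget as before. The receiver would then up-sample the decoded frame back to the display size.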

Below are two screenshots of video frames produced by different coding methods. The first video frame has been JPEG encoded with a high quantization step size. The second frame has been sub-sampled and then JPEG encoded to the same file size (which allows a lower quantization step size), then decoded and up-sampled. Notice the pixelation of the first image, especially around still and uniform objects such as the woman’s pants and the floor. Now look at how much clearer the second image appears. By taking into account network limitations, and shaping traffic accordingly, we are able to produce a much more life-like experience for the end-user.

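[Image: video_frame_1, frame encoded at full size with a high quantization step size]

[Image: video_frame_2, frame sub-sampled before encoding and up-sampled after decoding]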

Answering the Call of HD Voice

Posted by John Hermansen
on April 13th, 2009 in Company News, Market Trends, Technology

Today, Daniel Berninger wrote a great piece as a guest blogger for the Jeff Pulver Blog about the importance of HD voice in the face of increasing adoption of text communication. He writes, “High Definition (HD) voice can do for the voice industry what it did for the video industry in triggering the replacement cycle that follows format changes.”

There is a growing buzz around HD voice, which involves extending a call’s bandwidth from the 3.5 kHz found in traditional telephony to anywhere from 7 kHz to around 22 kHz for VoIP communications. This bandwidth extension increases intelligibility and improves overall comfort. By delivering a more true-to-life experience, HD voice makes it easier to understand other participants on a call. For instance, fricative sounds like “s” and “f” are easier to distinguish, and foreign accents become easier to follow. HD voice also sounds much closer to in-person speech, which gives participants the feeling that they are in a more natural conversation setting. It can also be incredibly helpful in identifying who is speaking in a conference call with several participants. Finally, because users do not need to strain to understand each other, and feel like they are in the same room, participants become less fatigued, especially in long conference calls.

In an effort to address the most common questions about HD voice, GIPS will be publishing a whitepaper on the topic next week. Make sure to check the GIPS in Action page next Tuesday to download the whitepaper, and to hear the difference between traditional speech and HD voice.

Silverlight Video Quality is a Slam Dunk

Posted by John Hermansen
on March 23rd, 2009 in Technology

Ahhh, March Madness. It’s one of my favorite times of the year, and I don’t even gamble (office pools are not gambling, just like Joe Lieberman is not a Democrat). Despite witnessing my alma mater, Wisconsin, lose perhaps the ugliest “basketball” game I have ever seen, and the fact that my bracket looks like Iggy Pop after flopping around onstage (picking WVU to go to the Elite 8 was not the smartest thing I have ever done), I have thoroughly enjoyed watching this year’s action. Perhaps the most pleasant surprise has been CBS Sports’ online coverage. While the hectic schedule of multiple games ending at the same time can make the TV broadcast maddening, CBS’ streaming coverage makes it easy to follow multiple games simultaneously. To top it off, the video quality is the most impressive I have seen from any streaming technology to date.

This last feature is courtesy of the new Microsoft Silverlight plugin, which, if what I have seen is any indication, could be a serious competitor to Flash.  It is difficult for me to say how CBS was able to achieve such high quality, but according to the Silverlight Wikipedia page, the Media Stream Source API supports a number of video codecs and can dynamically adjust bit rate to accommodate available bandwidth and CPU. This concept is also incredibly important to real-time audio and video coding, as it allows for maximum quality in the face of changing network conditions.

Another probable reason Silverlight’s video looks so good is that it does not need to keep delay very low, which is a big difference between streaming and real-time applications. For instance, I noticed about a one-minute delay between the online coverage of yesterday’s Marquette-Missouri game and CBS’ TV broadcast. This is perfectly acceptable for most people, especially if they aren’t near a television. However, a one-minute delay would render a real-time video conversation completely useless. Thus, a streaming solution is able to overcome packet loss and jitter by simply waiting for all packets to arrive at the receiving end, or even having the sending side resend any lost information, while real-time solutions must employ more clever techniques to maintain video quality while keeping delay as low as possible.
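The trade-off can be made concrete with a toy model: every packet arrives after some network delay, and the receiver plays it out a fixed time after it was sent. Packets that arrive later than that playout deadline are useless to a real-time application. The Python sketch below counts them for different playout delays; the delay values are invented for illustration and the function name is hypothetical.

```python
def late_fraction(arrival_delays_ms, playout_delay_ms):
    """Fraction of packets that arrive after their playout deadline."""
    late = sum(1 for delay in arrival_delays_ms if delay > playout_delay_ms)
    return late / len(arrival_delays_ms)


# Invented one-way delays in milliseconds, including two badly delayed packets
# (e.g. retransmissions or a burst of cross traffic).
delays_ms = [40, 55, 48, 300, 62, 45, 1200, 51, 47, 90]

print(late_fraction(delays_ms, 100))     # a conversational delay budget
print(late_fraction(delays_ms, 60_000))  # a streaming-style one-minute buffer
```

With a one-minute buffer, every packet in this example arrives in time, so a streaming player never has to conceal anything. With a budget of around a tenth of a second, the two slow packets are effectively lost, and the real-time application has to cope with that through concealment or error protection rather than by waiting.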

Counterpoint: The Promise of HD Voice

Posted by John Hermansen
on March 16th, 2009 in Market Trends, Technology

A couple of months ago, my colleague Mats blogged about the dangers of overhyping “super wideband” speech. Since then, there have been a few posts on the GIPS blog discussing the technological significance, as well as the market implications, of super wideband. While these posts have certainly been accurate and well-informed, they have tended to downplay the importance of this emerging technology. Thus, I would like to highlight a few of the potential benefits of wideband technology, or HD voice, in general (as if we needed another opinion on the matter), and in turn hopefully broaden the overall discussion.

Perhaps the most promising element of the debate over HD voice is the growing awareness of the perception of voice quality. Skype’s announcement of their SILK codec, and Jeff Pulver’s plans for an HD VoIP Summit have raised the profile of the topic to the point where it is getting significant discussion. Since the inception of the company, GIPS has been arguing that voice quality matters. While there may be questions about the useful application of super wideband codecs, it only means that there is uncertainty in the market about the degree to which emerging technology should be implemented. The very fact that the discussion is taking place means that people recognize legacy solutions (e.g. those designed for the PSTN) are inadequate for the next generation of communication, and that new technology needs to be adopted to overcome these limitations.

Which leads me to my next point: an awareness of the importance of voice quality is good for the overall VoIP market. If VoIP can be recognized as not only a low cost alternative to the PSTN, but also a mode of communication that can deliver even better quality than what people have become accustomed to, then it will truly gain mass adoption. I can’t tell you how many times I have dialed into a conference call from a landline or cell phone and struggled to keep up with the conversation due to poor quality. When people realize a better world is possible, it will be too difficult to go back to the inadequate solutions of the past. At that point, as my colleague Larry likes to say, the term “VoIP” will go away, and people will just be using “voice”.

eComm – A Great Place

Posted by Jan Linden
on March 6th, 2009 in Market Trends, Technology

This week I attended the eComm conference. What a great conference it was! Thanks, Lee, for putting this together. I think practical details such as keeping the presenters on a short leash and diligently keeping to the time schedule make for a very good experience. The 15-minute presentation format and the absence of parallel sessions are also, in my mind, right for this type of conference.

There were many great presentations, ranging from very technical and geeky to refreshing high-level thoughts on communications. Even though there were many more really good ones, I would like to single out a few that I found especially interesting.

Ge Wang of Smule/Stanford had an exciting keynote on “Creating New Expressive Social Mediums on the iPhone” where he presented a number of really cool applications for the iPhone including an application called Ocarina that turns the iPhone into a flute (you blow into the microphone).

 


Ge Wang playing the Ocarina on the iPhone at eComm2009. Copyright 2009 by James Duncan Davidson

In terms of new applications/services, I really liked Matt Ranney’s presentation on RebelVox’s technology, which neatly combines live and asynchronous voice communications. This can be viewed as an integration of voice SMS/IM, text IM, and live voice calls. This is definitely a type of service I would be prepared to pay for.

A nice perspective on today’s communication style was presented by Stefan Agamanolis of Distance Lab. He likened today’s mobile communication to fast food and proposed “Slow Communication” as a counterpart to the current trend of Slow Food. Very rarely do we pay full attention to a phone conversation anymore. Either we are on the computer at the same time or, because we are no longer tethered to a fixed phone, we are easily distracted by things around us.

The trend towards enabling web developers (rather than just voice developers) with simple enough tools to allow them to build voice applications into their web offerings is continuing to evolve. A recent example is Voxeo’s launch of Tropo.com.

As a speech coding person, it would be surprising if I didn’t comment on Skype’s SILK codec announcement. The codec, which can be run in narrowband, wideband, or even “superwideband” mode, seems to be very well designed, with good quality at many bitrates. Binaries can be obtained without any licensing fees and there is no obvious restriction on usage, i.e. it can be used for applications that do not involve Skype at all. Like practically all free codecs, and most standard codecs for that matter, it doesn’t come with indemnification against patent infringement. That is only to be expected, and quite natural, since there is no licensing fee associated with the usage of the codec. Indemnification is of course one of the benefits you get from buying a solution from a vendor like GIPS. In addition to making binaries available to everybody, Skype announced that it is planning to release source code to select partners for optimization on certain platforms.

Regarding the technical specifications of SILK, my only concern is complexity and memory usage. Not that any of those numbers are worse than those of comparable codecs; they are actually in the same ballpark as most, and the complexity is lower than that of, e.g., AMR-WB. However, this level of complexity is high for many mobile and embedded solutions, and there is a need for lower complexity wideband codecs.

A very nice gesture by Lee was to donate 10% of the proceeds to a local charity. The money went to Shelter Network, which “…is committed to providing housing and support services that create opportunities for homeless families and individuals to re-establish self-sufficiency and to return to permanent homes of their own”.

 
Myself talking about VoIP on the iPhone at eComm 2009. Copyright 2009 by James Duncan Davidson


How super-wideband is super-wideband enough?

Posted by Andrew MacDonald
on February 26th, 2009 in Technology

With the recent discussions of super-wideband codecs (cf. Mats’ post), I had the notion to find, at least for me, at what bandwidth a speech signal would be transparent to its source. Or, more accurately, at what bandwidth I could no longer declare the signal to be non-transparent.

Transparency, in the context of subjective evaluation, means being indistinguishable from a reference. In a general sense it is ultimately the goal of all lossy compression schemes (of which most audio codecs are examples). The encoded and decoded output of audio codecs such as AAC can be transparent to the reference at practical bitrates.

I prepared a small ABX test, in which the listener is presented with a series of unknown signals which they must correctly identify as a reference or alternative. I used a female source sampled at 48 kHz and the source resampled to a series of different bandwidths for the alternatives. For each bandwidth, I used 10 trials, striking a balance between statistical significance and time required for the test.

On to the results! A correct score here means I was able to correctly identify the unknown signal as reference or alternative. (The frequency listed is the signal bandwidth, i.e. half the sampling frequency.)

8 kHz — 10 correct
10 kHz — 10 correct
12 kHz — 10 correct
14 kHz — 10 correct
16 kHz — 8 correct (still ~95% confidence)
18 kHz — 3 correct

I stopped there, since I was clearly no longer able to make a distinction. Furthermore, I probably can’t hear much higher than 18 kHz (yet another interesting test!). To be honest, I was very surprised by the 16 kHz result, so surprised in fact that I repeated the test, only to arrive at the same result (which merely serves to improve the confidence…). I was, however, using a high quality USB sound device and accurate headphones. To make the score somewhat more relevant to a practical VoIP scenario, I retook the 16 kHz test using my laptop sound card and a cheap headset. This time I scored 5 correct, which is no better than flipping a coin.
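For readers who want to check the confidence figure, the usual way to score an ABX test is a one-sided binomial test: how likely is it to get at least this many trials right by pure guessing? The short Python sketch below computes that probability. The function name is made up for this example, and it assumes each trial is an independent 50/50 guess under the null hypothesis.

```python
from math import comb


def abx_p_value(correct, trials):
    """Probability of scoring at least `correct` out of `trials` by guessing (p = 0.5 per trial)."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


for score in (10, 8, 5, 3):
    p = abx_p_value(score, 10)
    print(f"{score}/10 correct: p = {p:.4f}, confidence = {1 - p:.1%}")
```

For 8 correct out of 10 this gives p = 56/1024, or about 0.055, which matches the roughly 95% confidence quoted in the results. Scores of 5 or 3 out of 10 give large p-values, meaning they are entirely consistent with guessing.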

All of this should be taken with a grain of salt, but it suggests that i) it’s possible to distinguish between the reference and 16 kHz bandlimited speech given the best equipment, and ii) a typical VoIP user might consider something lower, such as 14 kHz bandlimited speech, to be transparent.

Super Wideband or Super “Hype-band”?

Posted by Mats Perjons
on January 8th, 2009 in Technology

When VoIP applications like softphones started to use wideband codecs in 2003, they gave a major boost to the VoIP market.  The improvement from narrowband codecs that use 3.4 or 4 kHz, to wideband codecs that use 7 or 8 kHz, was a giant step in terms of perceived voice quality, and totally changed people’s views on VoIP’s legitimacy.

 

Today there is a lot of talk about HD audio (usually referred to as wideband) and super wideband codecs that use 14 or 16 kHz bandwidth. I have been listening to different audio and music samples at bandwidths of 3.4, 7, 8, 14, 16 and 22 kHz to get a better understanding of the quality differences. As anyone who has tried Skype or Google Talk can attest, there is obviously a big difference when going from narrowband (3.4 kHz sampled at 8 kHz) to wideband speech (7 or 8 kHz sampled at 16 kHz). The bigger question is, can people hear the difference when using 7, 8, 14, 16, or 22 kHz in a normal voice conversation?
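The relationship between the bandwidths and sampling rates mentioned here follows from the sampling theorem: a signal sampled at a given rate can carry audio content only up to half that rate. The small Python sketch below just does that arithmetic; the function names are made up for illustration.

```python
def max_bandwidth_khz(sample_rate_khz):
    """Highest audio frequency representable at a given sampling rate (the Nyquist limit)."""
    return sample_rate_khz / 2


def min_sample_rate_khz(bandwidth_khz):
    """Lowest sampling rate that can represent a signal of the given bandwidth."""
    return 2 * bandwidth_khz


for rate in (8, 16, 32, 48):
    print(f"{rate} kHz sampling -> up to {max_bandwidth_khz(rate)} kHz of audio bandwidth")

print(min_sample_rate_khz(14), "kHz is the minimum sampling rate for 14 kHz of bandwidth")
```

In practice the usable bandwidth is a little below the theoretical limit, because real filters need some room to roll off. That is why narrowband speech sampled at 8 kHz is typically quoted as 3.4 kHz rather than 4 kHz.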

 

In my opinion, there is an audible difference when moving from a 7 kHz sample to a codec that supports 8 kHz, such as iSAC and iPCM-wb. However, there is a much less obvious difference between 8 kHz and 14 kHz, which I can only detect after listening to a speech sample several times. I experimented with different headsets and speakers, and found that studio grade equipment can accentuate the quality of the super wideband samples, but not to an extent that the average user would be able to regularly appreciate. Furthermore, super wideband codecs are more susceptible to background noise, to the point that my experience was actually much worse using 14 kHz than 8 kHz when I added even low levels of ambient noise. Once I went beyond 14 kHz, I was unable to hear any difference in quality at any range for speech or even music.

 

The basic conclusion of my simple tests is that quality differences between wideband and super wideband are not obvious above 8 kHz, but can be detected by using the right equipment.

 

There is obviously still more work to be done to provide the most robust speech quality for IP communications. Super wideband may end up pushing the market even further, but the jury is still out. Regardless of what transpires, GIPS will continue to support a wide range of codecs to provide the best user experience possible.