Posts Tagged ‘audio processing’

The challenges of server-based AEC

Posted by Andrew MacDonald
on September 10th, 2009 in Technology

Mr. Brunberg, my sales engineering colleague who suggested the topic of this post, informs me that server-based acoustic echo control (AEC) is a frequent source of inquiry. It’s a technically feasible task, but not without its share of difficulties. This post will provide an overview of the associated challenges.

First, allow me to better qualify what I mean when I say server-based AEC. During a voice call, acoustic echo is generated at client endpoints when the playout and capture devices are acoustically coupled (i.e. can “hear” each other). It would typically be the role of the client to process the captured stream to remove the echo. Server-based AEC proposes to do this processing somewhere in the network at a mediating server.

There are several reasons it’s preferable to handle AEC at the endpoints:

Complexity

Echo control is a computationally intensive task, and adopting the philosophy of the internet, we would like to distribute complexity by pushing it to the endpoints of the network whenever possible. Significant load would be added to a server required to perform AEC on every conference channel.
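To get a feel for how the load scales, here is a back-of-the-envelope sketch. It assumes a plain time-domain NLMS canceller costing roughly two multiply-accumulates (MACs) per filter tap per sample, and the sample rate, tail length and channel count are hypothetical. Production cancellers are often structured quite differently, so treat this as an order-of-magnitude illustration only.

```python
def nlms_macs_per_second(sample_rate_hz, tail_ms):
    """Rough multiply-accumulate rate for one time-domain NLMS canceller."""
    # Number of filter taps needed to span the echo tail.
    taps = sample_rate_hz * tail_ms // 1000
    # Assumption: about one MAC per tap to filter and one per tap to adapt,
    # for every input sample.
    return 2 * taps * sample_rate_hz

# Hypothetical figures: 16 kHz audio, 64 ms echo tail, 200 conference channels.
per_channel = nlms_macs_per_second(16000, 64)
server_total = per_channel * 200

print(per_channel)   # MACs per second for a single channel
print(server_total)  # MACs per second if the server cancels echo on every channel
```

The cost grows linearly with the number of channels on the server, whereas each endpoint only ever pays for its own channel.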

Intermediary processing stages

A client-side AEC sees the capture signal early in the processing chain, usually immediately after the system delivers it. Assuming that saturation has been avoided in the analog-to-digital conversion stage, the only degradation to the signal should be due to the actual acoustic channel. By the time the signal reaches the server, however, it has passed through several additional processing stages. The degradation these stages cause limits the quality of echo control that is possible. Refer to the diagram below to follow the signal path.

[Figure: block diagram of the signal path]

First, assume we have access to a signal which has been received from one client and is destined to be played out at another. This is the farend signal. It’s encoded at the server, transmitted over the network, and decoded at the client. The client jitter buffer might perform some time-stretching as it adjusts its buffer size. The signal is then played out and the echo is captured (the unavoidable degradation also seen by a client-based AEC). The captured signal might pass through various speech enhancement stages, such as noise suppression. It then goes through another round of encoding, decoding and jitter buffer processing at the server before we finally receive the nearend signal to provide to the AEC.

Delay estimation

Perhaps the most crucial problem for server-side AEC is delay estimation. As described more fully in an earlier post, the farend and nearend signals must be synchronized in time. This is necessary for the filter to adapt to the channel and provide an estimate of the echo. An AEC operating at an endpoint must compensate for the system render and capture buffers to provide this synchrony. A server-based AEC must go further yet. The network, jitter buffer and additional processing stages such as encoding/decoding add latency between farend and nearend signals which must be accounted for.
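The need for synchrony can be illustrated with a toy adaptive filter. The sketch below uses a textbook NLMS update, which is an assumption made for illustration and not necessarily the algorithm discussed in this post. The echo path is a hypothetical pure delay with attenuation, and all names and parameter values are made up.

```python
import random

def nlms_echo_cancel(farend, nearend, taps=16, mu=0.5, eps=1e-6):
    """Return the residual left after subtracting an NLMS echo estimate."""
    w = [0.0] * taps   # adaptive filter coefficients (echo path estimate)
    x = [0.0] * taps   # most recent farend samples, newest first
    residual = []
    for f, d in zip(farend, nearend):
        x = [f] + x[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x))   # echo estimate
        e = d - y                                  # nearend minus echo estimate
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual

def simulate_echo(farend, delay, gain=0.5):
    """Toy echo path (hypothetical): a pure delay plus attenuation."""
    return [gain * farend[n - delay] if n >= delay else 0.0
            for n in range(len(farend))]

def tail_power(signal, n=500):
    """Mean power of the last n samples, i.e. after the filter has settled."""
    tail = signal[-n:]
    return sum(s * s for s in tail) / len(tail)

random.seed(0)
farend = [random.uniform(-1.0, 1.0) for _ in range(4000)]

# Echo delay of 4 samples: falls inside the 16-tap filter window.
aligned = nlms_echo_cancel(farend, simulate_echo(farend, delay=4))
# Echo delay of 40 samples: falls outside the window, so the filter cannot model it.
misaligned = nlms_echo_cancel(farend, simulate_echo(farend, delay=40))

print(tail_power(aligned), tail_power(misaligned))
```

When the delay fits inside the filter the residual decays to almost nothing. When it does not, the filter has nothing correlated to adapt to and the echo passes through essentially unattenuated.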

The network delay can be estimated from the round-trip time derived from RTCP sender and receiver reports. The jitter buffer at the server can report its delay to the AEC. However, unless they are somehow signaled to the server, the latencies of the client jitter buffer, the render and capture buffers, and other algorithmic stages are unknown. A standard AEC can handle a latency offset up to some proportion of the length of its adaptive filter. So there is some play here, and by spending enough complexity on a longer filter it might be possible to operate without knowledge of these latencies. Another approach is to use a dedicated delay estimator operating on the signals themselves. This stage should be of lower complexity than the AEC filter (which itself provides implicit “delay estimation”), allowing us to avoid an excessively complex solution.
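One simple way to estimate delay from the signals themselves is to search for the lag that maximises the cross-correlation between farend and nearend. The sketch below is a brute-force version chosen for readability. It is an illustration and not the estimator referred to in this post; a practical estimator would use a much cheaper search. The signal, delay and noise level are hypothetical.

```python
import random

def estimate_delay(farend, nearend, max_lag):
    """Return the lag (in samples) that maximises the farend/nearend cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate nearend[n] against farend[n - lag] over the overlapping region.
        score = sum(nearend[n] * farend[n - lag] for n in range(lag, len(nearend)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

random.seed(1)
farend = [random.uniform(-1.0, 1.0) for _ in range(2000)]

true_delay = 137  # hypothetical unknown latency, in samples
nearend = [0.4 * farend[n - true_delay] if n >= true_delay else 0.0
           for n in range(len(farend))]
# Add a little capture noise so the echo is not a perfect copy.
nearend = [s + random.gauss(0.0, 0.05) for s in nearend]

estimated = estimate_delay(farend, nearend, max_lag=300)
print(estimated, true_delay)
```

The estimated lag can then be used to shift the farend signal before it reaches the adaptive filter, so the filter only needs to cover the acoustic echo path and not the whole network round trip.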

Relaying

What if the server is simply relaying packets and has no access to the decoded data? I have read about some attempts to perform echo control in the coded domain, but I think the best you can hope for is a very crude half-duplex suppressor. More likely, the relaying server would be forced to decode the signal in order to remove the echo before re-encoding and sending the packet on its way.

As we’ve seen, there are a number of challenges to server-based AEC. Although these are not insurmountable, it’s certainly clear that we should perform echo control at the endpoint whenever possible.

Practical concerns in acoustic echo control for PC

Posted by Andrew MacDonald
on January 29th, 2009 in Technology

Echo control differs from many other audio processing tasks in that it depends on two streams: the farend (audio to be played on the speaker) and the nearend (audio recorded from the microphone). In a hands-free call, the nearend typically contains an echoed version of the farend. In order to identify and remove this echo component from the stream, the signals must be time-aligned in some fashion. This need for time-alignment is at the root of many of the practical difficulties in acoustic echo control (AEC). These difficulties do not arise for tasks such as noise suppression and coding, which operate on only a single stream.

There are two crucial and sometimes unmentioned factors which contribute to this time-alignment problem for AEC on a PC platform:

1. The AEC will be running on a non-real-time operating system. This means that processing is not guaranteed to take place at any particular time, but will instead be performed in some kind of best-effort manner. The effect is that the delay between the farend and nearend signals is unknown a priori and will probably change over time. When the CPU is heavily loaded this is aggravated; it’s even possible that buffers will overflow and we’ll lose some of the stream data. It’s necessary to compensate for this delay to achieve our desired time-alignment.

2. There is a wide array of available hardware devices which can be used in combination. Recording and playout devices (often soundcards, but alternatively webcams and other USB devices) run on hardware clocks just as the CPU does. These clocks control the rate at which data is recorded or played out. If the clocks differ, and data is recorded at a different rate than it is played out, the farend and nearend streams will drift away from each other. This phenomenon is aptly labeled clock drift. Again, it’s necessary to compensate for this effect to achieve time-alignment.
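To see how quickly clock drift adds up, the following sketch computes the accumulated offset between two streams whose clocks differ by a given number of parts per million (ppm). The sample rate, mismatch and call length are hypothetical values chosen for illustration.

```python
def drift_samples(nominal_rate_hz, ppm_mismatch, seconds):
    """Accumulated offset, in samples, between two streams after `seconds`."""
    return nominal_rate_hz * (ppm_mismatch * 1e-6) * seconds

def seconds_until_drift(nominal_rate_hz, ppm_mismatch, samples):
    """Time taken for the streams to drift apart by `samples` samples."""
    return samples / (nominal_rate_hz * ppm_mismatch * 1e-6)

# Hypothetical: 16 kHz audio, capture clock 100 ppm faster than the render clock.
offset = drift_samples(16000, 100, 600)           # over a ten-minute call
offset_ms = 1000.0 * offset / 16000               # same offset in milliseconds
one_frame = seconds_until_drift(16000, 100, 160)  # time to slip one 10 ms frame

print(offset, offset_ms, one_frame)
```

With these numbers the streams slip by about 960 samples (roughly 60 ms) over ten minutes, and by a full 10 ms frame after about 100 seconds. That is enough to matter if nothing compensates for it.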

The wide variety of hardware leads to another issue. In any practical scenario there will be some amount of non-linear distortion in the echo path. Poor or overdriven speakers and microphones are usually the cause. This type of distortion can be heard for instance if a user speaks very loudly into the microphone, causing signal saturation. The traditional echo canceller uses a linear filter to remove echo, which by definition cannot model this distortion. An effective algorithm must be prepared for this eventuality.
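The limitation of a linear model can be shown with a deliberately simplified example. Here the "echo path" is a single memoryless gain, and the distortion is hard clipping at a hypothetical limit. A real echo path has memory and real distortion is more complicated, so this is only a sketch of the principle.

```python
import random

def clip(x, limit):
    """Hard-clip x to the range [-limit, limit]."""
    return max(-limit, min(limit, x))

def best_linear_gain(x, d):
    """Least-squares scalar gain g minimising sum((d - g * x) ** 2)."""
    return sum(di * xi for di, xi in zip(d, x)) / sum(xi * xi for xi in x)

def residual_power(x, d):
    """Mean power left in d after removing the best linear estimate g * x."""
    g = best_linear_gain(x, d)
    return sum((di - g * xi) ** 2 for di, xi in zip(d, x)) / len(x)

random.seed(2)
farend = [random.uniform(-1.0, 1.0) for _ in range(5000)]

# Undistorted echo: a pure gain, which the linear model captures exactly.
linear_echo = [0.8 * s for s in farend]
# Overdriven speaker (hypothetical): the same gain followed by hard clipping.
clipped_echo = [clip(0.8 * s, 0.4) for s in farend]

print(residual_power(farend, linear_echo))
print(residual_power(farend, clipped_echo))
```

Even the optimal linear fit leaves a clearly non-zero residual once the echo is clipped, which is why a practical algorithm needs some further means of handling what the linear filter leaves behind.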

In the literature these practical considerations are sometimes not given their due weight. An AEC algorithm that performs well “in the lab” can be surprisingly underwhelming in a real scenario. It is therefore important that we use actual field recordings including time-alignment mismatches to test performance.

GIPS relies on an integration between our AEC algorithm and VoiceEngine’s cross-platform sound device handling to effectively contend with these practical considerations.