Mr. Brunberg, my sales engineering colleague who suggested the topic of this post, informs me that server-based acoustic echo control (AEC) is a frequent source of inquiry. It’s a technically feasible task, but not without its share of difficulties. This post will provide an overview of the associated challenges.
First, allow me to better qualify what I mean when I say server-based AEC. During a voice call, acoustic echo is generated at client endpoints when the playout and capture devices are acoustically coupled (i.e. can “hear” each other). It would typically be the role of the client to process the captured stream to remove the echo. Server-based AEC proposes to do this processing somewhere in the network at a mediating server.
There are several reasons it’s preferable to handle AEC at the endpoints:
Complexity
Echo control is a computationally intensive task, and adopting the philosophy of the internet, we would like to distribute complexity by pushing it to the endpoints of the network whenever possible. Significant load would be added to a server required to perform AEC on every conference channel.
Intermediary processing stages
A client-side AEC sees the capture signal early in the processing chain, usually immediately after the system delivers it. Assuming that saturation has been avoided in the analog-to-digital conversion stage, the only degradation to the signal should be due to the actual acoustic channel. At the server, however, several additional processing stages will have been performed. These stages distort the signal in ways an adaptive filter cannot easily model, which limits the quality of echo control possible. Refer to the diagram below to follow the signal path.

First, assume we have access to a signal which has been received from one client and is destined to be played out at another. This is the farend signal. It’s encoded at the server, transmitted over the network, and decoded at the client. The client jitter buffer might perform some time-stretching as it adjusts its buffer size. The signal is then played out and the echo is captured (the unavoidable degradation also seen by a client-based AEC). The captured signal might undergo various speech enhancement stages, such as noise suppression. It then passes through another round of encoding and decoding, plus jitter buffer processing at the server, before we finally receive the nearend signal to provide to the AEC.
Delay estimation
Perhaps the most crucial problem for server-side AEC is delay estimation. As described more fully in an earlier post, the farend and nearend signals must be synchronized in time. This is necessary for the filter to adapt to the channel and provide an estimate of the echo. An AEC operating at an endpoint must compensate for the system render and capture buffers to provide this synchrony. A server-based AEC must go further yet. The network, jitter buffer and additional processing stages such as encoding/decoding add latency between farend and nearend signals which must be accounted for.
The network delay can be estimated from the round-trip time that the server computes from RTCP sender and receiver reports. The jitter buffer at the server can report its delay to the AEC. However, unless signaled somehow to the server, the latency of the client jitter buffer, render and capture buffers, and other algorithmic latencies are unknown. A standard AEC can handle a latency offset up to some proportion of the length of its adaptive filter. So there is some play here, and by lengthening the filter, at the cost of added complexity, it might be possible to operate without knowledge of these latencies. Another approach is to use a dedicated delay estimator operating on the signals themselves. This stage should be of lower complexity than the AEC filter (which provides implicit “delay estimation” itself), allowing us to avoid an excessively complex solution.
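To make the idea of a dedicated delay estimator concrete, here is a minimal sketch in Python. It is not the estimator used in any particular product: it simply picks the lag that maximizes the raw cross-correlation between farend and nearend. The function name `estimate_delay`, the `max_lag` search range, and the synthetic test signal are all assumptions made for illustration.

```python
import random


def estimate_delay(farend, nearend, max_lag):
    """Return the lag in samples, within [0, max_lag], that maximizes
    the (unnormalized) cross-correlation of nearend against farend."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Number of overlapping samples at this lag.
        n = min(len(farend), len(nearend) - lag)
        if n <= 0:
            break
        score = sum(farend[i] * nearend[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic check: the "echo" is an attenuated copy of the farend,
# delayed by a known number of samples, with a little added noise.
rng = random.Random(0)
farend = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_delay = 137
nearend = [0.0] * true_delay + [0.5 * x for x in farend]
nearend = [y + 0.05 * rng.uniform(-1.0, 1.0) for y in nearend]

estimated = estimate_delay(farend, nearend, max_lag=400)
print(estimated)
```

A real estimator would have to cope with speech rather than white noise, with codec distortion, and with a delay that drifts as the jitter buffers adapt, so it would likely work on coarser features and track the delay over time. The point here is only that a coarse search of this kind is far cheaper than running a very long adaptive filter.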
Relaying
What if the server is simply relaying packets and has no access to the decoded data? I have read about some attempts to perform echo control in the coded domain, but I think the best you can hope for is a very crude half-duplex suppressor. More likely, the relaying server would be forced to decode the signal in order to remove the echo before re-encoding and sending the packet on its way.
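To illustrate just how crude a half-duplex suppressor is, here is a hedged sketch in Python. It assumes the relay has some per-packet indication of whether the farend stream is active (a hypothetical input here; the sketch does not say where it comes from) and treats nearend packets as opaque bytes. The function name `half_duplex_gate` and the `hangover` parameter are made up for this example.

```python
def half_duplex_gate(farend_active, nearend_packets, hangover=2):
    """Drop nearend packets while the farend is active, and for
    `hangover` packets afterwards to cover the echo tail.

    Dropped packets are represented by None; forwarded packets are
    passed through untouched, since the relay never decodes them.
    """
    forwarded = []
    countdown = 0
    for active, packet in zip(farend_active, nearend_packets):
        if active:
            # Re-arm the gate: this packet plus `hangover` more.
            countdown = hangover + 1
        if countdown > 0:
            forwarded.append(None)
            countdown -= 1
        else:
            forwarded.append(packet)
    return forwarded


farend_active = [False, True, True, False, False, False, False]
nearend_packets = [b"p0", b"p1", b"p2", b"p3", b"p4", b"p5", b"p6"]
result = half_duplex_gate(farend_active, nearend_packets, hangover=2)
print(result)
```

Note what is lost: any nearend speech that overlaps with farend activity is dropped along with the echo, so double-talk is impossible. That is the price of never touching the decoded signal.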
As we’ve seen, there are a number of challenges to server-based AEC. Although these are not insurmountable, it’s clear that we should perform echo control at the endpoint whenever possible.