Over the past decade, researchers in industry and academia have developed several algorithms that attempt to achieve high throughput and control latency. Some of these, such as the BBR algorithm developed by Google, are now widely used by many websites and applications.
But a team of MIT researchers has discovered that these algorithms can be deeply unfair. In a new study, they demonstrate that there will always be a network scenario in which at least one sender receives almost no bandwidth compared to other senders; in other words, a problem known as bandwidth starvation is unavoidable.
"What's really surprising about this work and the results is that, when you consider the complexity of real-world network paths and all they can do to data packets, it's basically impossible for delay-based congestion control algorithms to avoid starvation with current methods," says Mohammad Alizadeh, associate professor of electrical engineering and computer science (EECS).
Although Alizadeh and his co-authors were unable to find a traditional congestion control algorithm that could prevent starvation, it is possible that algorithms of a different class might be able to avoid this problem. Their analysis also suggests that modifying the operation of these algorithms to allow for greater delay variations could help prevent starvation in some network situations.
Alizadeh co-authored the paper with first author and EECS graduate student Venkat Arun, and with senior author Hari Balakrishnan, a professor of Computer Science and Artificial Intelligence at Fujitsu. The research will be presented at the ACM Special Interest Group on Data Communications (SIGCOMM) conference.
Congestion control
Congestion control is a fundamental problem in communication networks that researchers have been trying to solve since the 1980s.
A user's computer doesn't know how fast to send data packets across the network because it lacks information such as the quality of the network connection or how many other senders are using the network. Sending packets too slowly inefficiently uses available bandwidth. But sending them too quickly can overload the network, causing packets to be lost. These packets must be retransmitted, resulting in further delays. Delays can also occur because packets wait in queues for extended periods.
Congestion control algorithms use packet loss and delays as signals to infer congestion and decide how fast to send data. But the internet is complex, and packets can be delayed and lost for reasons unrelated to network congestion. For example, data might get stuck in a queue along the way and then be released with a burst of other packets, or the receiver's acknowledgment might be delayed. The authors call these non-congestion-related delays "jitter.".
Even if a congestion control algorithm perfectly measures delay, it can't distinguish between delay caused by congestion and delay caused by jitter. Delay caused by jitter is unpredictable and confuses the sender. Because of this ambiguity, users begin to estimate delay differently, causing them to send packets at uneven rates. In the long run, this leads to a situation where starvation occurs and someone is completely left out, explains Arun.
"We started the project because we lacked a theoretical understanding of how congestion control behaves in the presence of jitter. To put it on a firmer theoretical foundation, we built a mathematical model simple enough to think about, but capable of capturing some of the complexities of the internet. It has been very rewarding that mathematics has told us things we didn't know and that have practical relevance," he says.
Studying starvation
The researchers fed their mathematical model into a computer, gave it a series of commonly used congestion control algorithms, and asked it to find an algorithm that could avoid starvation, using their model.
"We couldn't do it. We tried every algorithm we know and some new ones we invented. Nothing worked. The computer always found a situation where some people received all the bandwidth and at least one person received basically nothing," says Arun.
The researchers were surprised by this result, especially since these algorithms are generally considered reasonably fair. They began to suspect that starvation, an extreme form of injustice, might not be possible. This motivated them to define a class of algorithms they call "delay-convergent algorithms," which they demonstrated will always suffer starvation under their network model. All the congestion control algorithms that manage delay (that the researchers are aware of) are delay-convergent.
The fact that such simple failure modes of these widely used algorithms have remained unknown for so long illustrates how difficult it is to understand algorithms solely through empirical testing, Arun adds. This underscores the importance of a solid theoretical foundation.
But all hope is not lost. Although all the algorithms they tested failed, there may be other algorithms that are not convergent with respect to delay and that can avoid starvation. This suggests that one way to solve the problem could be to design congestion control algorithms that vary the delay range more widely, so that the range is greater than any delay that might occur due to jitter in the network.
"To control delays, algorithms have also tried to limit delay variations around a desired equilibrium, but there's nothing wrong with potentially creating greater delay variation to obtain better measurements of congestion delays. It's just a new design philosophy that should be adopted," Balakrishnan adds.
Now, the researchers want to keep pushing to see if they can find or build an algorithm that eliminates starvation. They also want to apply this approach of mathematical modeling and computational testing to other thorny, unsolved problems in networked systems.
"We are increasingly dependent on computer systems for very critical things, and we need their reliability to have a firmer conceptual foundation. We have demonstrated the surprising things that can be discovered when time is spent developing these formal specifications of what the problem really is," says Alizadeh.
###
Written by Adam Zewe, MIT News Office
