Data centers operate as distributed networks, with numerous web and mobile applications hosted on individual servers. When a user sends a request to an application, fragments of stored data are pulled from hundreds or thousands of services across as many servers. Before sending a response, the application must wait for the slowest service to process its data. This lag is known as tail latency.

Current methods for reducing tail latency leave many of a server's CPU cores open so incoming requests can be handled quickly. But those cores then sit idle much of the time, while the servers continue to draw power just to stay on. Data centers can contain hundreds of thousands of servers, so even small improvements in per-server efficiency can save millions of dollars.

Alternatively, some systems reallocate cores across applications based on workload. But this happens on millisecond timescales, roughly a thousand times slower than today's fast-paced requests demand. Waiting too long can also degrade an application's performance, since any information not processed by a predetermined deadline is never sent to the user.

In a paper presented at the USENIX Symposium on Networked Systems Design and Implementation (NSDI), researchers describe a faster core-allocation system, called Shenango, that reduces tail latency while achieving high efficiency. First, a novel algorithm detects which applications are struggling to keep up with incoming data. Then, a software component allocates idle cores to handle those applications' workloads.

"In data centers, there's a trade-off between efficiency and latency, and you really need to reallocate cores at a much more precise granularity than every millisecond," says first author Amy Ousterhout, a PhD student at the Computer Science and Artificial Intelligence Laboratory (CSAIL). Shenango allows servers to "manage operations that occur on really short timescales and do so efficiently."

Energy and cost savings will vary by data center, depending on its size and workloads. But the overall goal is to improve CPU utilization in the data center, so that every core is put to good use. Current best CPU utilization rates are around 60 percent, but the researchers say their system could raise that figure to 100 percent.

"Data center utilization today is quite low," says co-author Adam Belay, assistant professor of electrical engineering and computer science and a CSAIL researcher. "This is a very serious problem that can't be solved in just one place in the data center. But this system is a critical piece toward increasing utilization."

Efficient congestion detection
In a real-world data center, Shenango (algorithm and software) would run on every server. All servers would be able to communicate with each other.

The system's first innovation is a new congestion detection algorithm. Every five microseconds, the algorithm checks the data packets queued for processing by each application. If a packet is still waiting from the last check, the algorithm notes a delay of at least five microseconds. It also checks whether any computational process, known as a thread, is waiting to be executed. If either is true, the system flags the application as "congested."
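The check described above can be sketched as a small simulation. This is illustrative Python, not Shenango's actual implementation; the `packet_queue` and `thread_queue` names are assumptions made for the sketch:

```python
# Illustrative simulation of the congestion check run every five microseconds.
# An application is flagged "congested" if a packet seen at the previous
# check is still queued, or if a thread is waiting for a core.

def is_congested(app, last_seen_head):
    """Return True if `app` shows at least one check interval of delay."""
    # A packet that was at the head of the queue during the last check
    # and is still queued has waited at least five microseconds.
    packet_delayed = (app["packet_queue"] and
                      app["packet_queue"][0] == last_seen_head)
    # Any runnable thread with no core to run on also signals congestion.
    thread_waiting = len(app["thread_queue"]) > 0
    return bool(packet_delayed or thread_waiting)

# Example: packet "pkt7" was queued at the last check and is still there.
app = {"packet_queue": ["pkt7", "pkt8"], "thread_queue": []}
congested = is_congested(app, last_seen_head="pkt7")
```

In a real system this predicate would be evaluated per application on every five-microsecond tick, which is why it must be cheap to compute.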

It seems simple enough, but the structure of the queue is important for achieving microsecond-scale congestion detection. Conventional wisdom held that the software would have to check the timestamp of every data packet in the queue, which would take far too long.

The researchers implement the queues in efficient structures known as "ring buffers." These can be visualized as slots arranged around a ring. The first data packet received enters an initial slot; as new data arrives, it is placed in subsequent slots around the ring. Typically, these structures are used for first-in, first-out (FIFO) data processing, removing data from the initial slot and working toward the final slot.
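A minimal ring buffer of the kind described can be sketched as follows. This is a generic FIFO ring buffer for illustration, not Shenango's lock-free C version:

```python
class RingBuffer:
    """Minimal fixed-capacity FIFO ring buffer (illustrative sketch)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0    # index of the oldest packet (next to remove)
        self.tail = 0    # index where the next packet will be placed
        self.count = 0

    def push(self, packet):
        """Place a packet in the next slot around the ring."""
        if self.count == len(self.slots):
            raise OverflowError("ring buffer full")
        self.slots[self.tail] = packet
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def pop(self):
        """Remove the oldest packet, FIFO order."""
        if self.count == 0:
            raise IndexError("ring buffer empty")
        packet = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return packet
```

The modular arithmetic is what makes the structure a ring: when the tail index reaches the last slot, the next packet wraps around to slot zero.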

The researchers' system, however, only stores data packets briefly in the structures, until an application can process them. In the meantime, the stored packets can be used for congestion checks. The algorithm only needs to compare two points in the queue (the location of the first packet and the location of the last packet five microseconds ago) to determine if the packets are experiencing a delay.
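The two-point comparison can be expressed very compactly. In the sketch below, positions are monotonically increasing packet counts rather than modular ring indices, which keeps the comparison simple; this is an assumption of the sketch, since a real ring buffer would compare wrapped indices:

```python
# Illustrative two-point delay check. `head_now` counts packets processed
# so far (the position of the oldest unprocessed packet); `tail_at_last_check`
# counts packets that had already arrived as of the previous check,
# five microseconds ago.

def queue_delayed(head_now, tail_at_last_check):
    """True if a packet enqueued by the last check is still unprocessed."""
    # If the head has not yet advanced past where the tail stood five
    # microseconds ago, some packet has waited at least that long.
    return head_now < tail_at_last_check

# Example: 10 packets had arrived by the last check, but only 7 have been
# processed since, so at least one packet has waited five microseconds.
delayed = queue_delayed(head_now=7, tail_at_last_check=10)
```

Because this is a single comparison per application queue, it takes constant time regardless of how many packets are waiting.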

"You can look at these two points and track their progress every five microseconds to see how much data has been processed," says co-author Joshua Fried. Because the structures are simple, "you only have to do this once per core. If you're looking at 24 cores, you do 24 checks in five microseconds, which scales very well."

Intelligent allocation
The second innovation is the IOKernel, a central software hub that steers data packets to the appropriate applications. The IOKernel also uses the congestion detection algorithm to rapidly allocate cores to congested applications, orders of magnitude faster than traditional approaches.

For example, the IOKernel might see an incoming data packet destined for an application that requires microsecond processing speeds. If that application is congested for lack of cores, the IOKernel immediately dedicates an idle core to it. If it also sees another application running cores on less time-sensitive data, it takes some of those cores and reallocates them to the congested application. The applications help, too: if an application isn't processing data, it alerts the IOKernel that its cores are free to be reallocated. Processed data then goes back to the IOKernel, which sends the response.
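The allocation policy just described can be sketched as a simple priority rule: grant idle cores first, then take cores from less time-sensitive applications. The dictionary layout and function name below are hypothetical, chosen only to illustrate the policy, and do not reflect Shenango's real interface:

```python
# Illustrative sketch of the core-allocation policy: congested applications
# receive idle cores first; if none remain, cores are taken from
# applications running less time-sensitive work.

def reallocate(apps, idle_cores):
    """apps: list of dicts with 'name', 'congested', 'latency_sensitive',
    and 'cores' (a list of core ids). Mutates and returns `apps`."""
    for app in apps:
        if not app["congested"]:
            continue
        # Prefer a core that is currently idle.
        if idle_cores:
            app["cores"].append(idle_cores.pop())
            continue
        # Otherwise take a core from a less time-sensitive application.
        for victim in apps:
            if (victim is not app and not victim["latency_sensitive"]
                    and victim["cores"]):
                app["cores"].append(victim["cores"].pop())
                break
    return apps

apps = [
    {"name": "web", "congested": True,
     "latency_sensitive": True, "cores": [0]},
    {"name": "batch", "congested": False,
     "latency_sensitive": False, "cores": [1, 2]},
]
reallocate(apps, idle_cores=[3])   # "web" is granted the idle core 3
```

A second pass with no idle cores left would instead pull a core away from the batch application, mirroring the fallback behavior described above.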

"The IOKernel focuses on applications that need cores but don't have them," explains co-author Jonathan Behrens. "It's about figuring out who's overloaded and needs more cores, and giving them cores as quickly as possible, so they don't fall behind and experience huge latencies."

The close communication between the IOKernel, the algorithm, the applications, and the server hardware is "unique in data centers" and allows Shenango to function seamlessly, according to Belay: "The system has global visibility into what's happening on every server. It sees the hardware providing the packets, what's running on each core, and how busy each application is. And it does this on a microsecond scale."

The next step is for the researchers to refine Shenango for real-world data center deployment. To do this, they are ensuring the software can handle very high data throughput and has the appropriate security features.

###

Written by Rob Matheson, MIT News Office