A Data Center End-host Stack for Predictable Low Latency and Dynamic Network Topologies
General Material Designation
[Thesis]
First Statement of Responsibility
Kapoor, Rishi
PUBLICATION, DISTRIBUTION, ETC.
Date of Publication, Distribution, etc.
2015
DISSERTATION (THESIS) NOTE
Text preceding or following the note
2015
SUMMARY OR ABSTRACT
Text of Note
The scale of modern data centers enables developers to deploy applications across thousands of servers. The variety of applications and the scale of operations impose the onerous challenge of meeting application performance requirements while maintaining efficiency. Today, data center operators typically over-provision the network and run services at low utilization to rein in latency outliers, thus decreasing efficiency. This large-scale inefficiency results in high monetary, energy, and management expenses. This dissertation focuses on redesigning the end-host network stack to improve network efficiency and achieve low latency and low latency variation at high utilization. We begin by studying traffic emanating from modern servers across a variety of data center applications. We find that traffic is highly bursty, which contradicts the network flow model in which traffic is uncorrelated. We use this observation to design networks that can benefit from bursty behavior. Second, in data center applications, predictability in service time and controlled latency, especially tail latency, are essential for building performant applications. Current practice has been to run such services at low utilization to rein in latency outliers, which decreases efficiency. To combat this, we present Chronos, a framework that reduces end-host latency and latency variation. Chronos reduces Memcached latency by a factor of 20 compared to typical deployments. Third, a range of new data center switch designs incorporating wireless or optical circuits depend on fast reconfiguration of the underlying topology. These hybrid designs assume a perfect, closed-loop control plane, which end-host network stacks cannot provide today. We present the design and implementation of a closed-loop control plane that uses only software changes to the end-host operating system and enables these topologies to support unmodified applications running over TCP.
Taken together, these contributions demonstrate that we can meet the performance requirements of data center applications while running data centers at high levels of efficiency.