Matthew Luckie and Robert Beverly
Proceedings of the ACM SIGCOMM 2017 Conference, Los Angeles, CA, August 2017.
We propose and evaluate a new metric for understanding the dependence of the AS-level Internet on individual routers. Whereas prior work uses large volumes of reachability probes to infer outages, we design an efficient active probing technique that directly and unambiguously reveals router restarts. We use our technique to survey 149,560 routers across the Internet for 2.5 years. 59,175 of the surveyed routers (40%) experience at least one reboot, and we quantify the resulting impact of each router outage on global IPv4 and IPv6 BGP reachability.
Our technique complements existing data and control plane outage analysis methods by providing a causal link from BGP reachability failures to the responsible router(s) and multi-homing configurations. While we found the Internet core to be largely robust, we identified specific routers that were single points of failure for the prefixes they advertised. In total, 2,385 routers (4.0% of the routers that restarted over the course of 2.5 years of probing) were single points of failure for 3,396 IPv6 prefixes announced by 1,708 ASes. We inferred that 59% of these routers were customer-edge border routers. 2,374 (70%) of the withdrawn prefixes were not covered by a less specific prefix, so 1,726 routers (2.9% of those that restarted) were single points of failure for at least one network. However, a covering route did not imply reachability during a router outage, as no previously-responsive address in a withdrawn more specific prefix responded during a one-week sample. We validated our reboot and single point of failure inference techniques with four networks and found no false positive or false negative reboots, but did find some false negatives in our single point of failure inferences.
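The percentages quoted in the abstract can be recomputed from the reported counts. The short Python sketch below does only that arithmetic; the variable names and the `pct` helper are illustrative assumptions and are not taken from the paper or its tooling.

```python
# Counts as reported in the abstract; variable names are illustrative only.
surveyed_routers = 149_560        # routers surveyed over 2.5 years
rebooted_routers = 59_175         # routers with at least one reboot
spof_routers = 2_385              # single points of failure (any prefix)
withdrawn_prefixes = 3_396        # prefixes withdrawn during outages
uncovered_prefixes = 2_374        # withdrawn prefixes with no covering route
spof_uncovered_routers = 1_726    # routers behind the uncovered prefixes


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


shares = {
    "rebooted / surveyed": pct(rebooted_routers, surveyed_routers),
    "single points of failure / rebooted": pct(spof_routers, rebooted_routers),
    "uncovered / withdrawn prefixes": pct(uncovered_prefixes, withdrawn_prefixes),
    "uncovered-prefix routers / rebooted": pct(spof_uncovered_routers, rebooted_routers),
}

for label, value in shares.items():
    print(f"{label}: {value}%")
```

Note that the 4.0% and 2.9% figures use the 59,175 rebooted routers as the denominator, not the 149,560 surveyed routers.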