SREcon23 Americas – Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn't DNS




Hemanth Malla and Elijah Andrews, Datadog

It all started with a team reaching out because they had DNS issues during rolling updates. Business as usual… Four weeks later, we are reading kernel code to understand the corner cases of dropping martian packets. Could this be the connection between gRPC client reconnect algorithms and the overflowing conntrack table we can feel but not see? In time, we solved the issue. And for once… it wasn’t DNS!

In this talk, we will focus on one of the most complex incidents we have faced in our Kubernetes environment. We will go through the debugging steps in detail, dive deep into the mysterious behaviors we discovered, and explain how we finally resolved the incident by simply removing three lines of code.

View the full SREcon23 Americas Technical Sessions at https://www.usenix.org/conference/srecon23americas
