This is the story of how the tiniest things can sometimes be the biggest culprits. Because first and foremost, this is a detective story. So come and follow me on a little crime scene investigation that uncovered an incredibly counterintuitive and almost criminal default setting in Ubuntu 14.04 crippling the virtual network performance and especially network latency.
And just in case you don’t like detective stories – and this goes out to all you nerds out there – I promise it will be worth your while, because this is also an introduction to keepalived, tcpdump, and sysdig. So sit back and enjoy!
The problems came out of the blue. We are running an OpenStack installation for the next-generation CenterDevice cloud. Two virtual machines, loadbalancer01 and loadbalancer02, run HAProxy and share a highly available virtual IP address managed by keepalived.

This system worked very well, until both load balancers became the victims of an evil crime. The two virtual machines started flapping their highly available virtual IP address back and forth, and we were receiving alerting e-mails by the dozen, yet there was no obvious change to the system that could have explained this behavior.
And this is how the story begins.
When I was called in to investigate this weird behavior and answer the question of what was happening to the poor little load balancers, I started by taking a good, close look at the crime scene: keepalived. The following listing shows our master configuration of keepalived on virtual host loadbalancer01 [1].
```
% sudo cat /etc/keepalived/keepalived.conf
global_defs {
    notification_email {
        some-e-mail-address@veryimportant.people
    }
    notification_email_from loadbalancer01@protecting.the.innocent.local
    smtp_server 10.10.9.8
    smtp_connect_timeout 30
    router_id loadbalancer01
}

vrrp_script chk_haproxy {
    script "killall -0 haproxy"
    interval 2
    weight 2
}

vrrp_instance loadbalancer {
    state MASTER
    interface eth0
    virtual_router_id 205
    priority 101
    smtp_alert
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass totally_secret
    }
    virtual_ipaddress {
        192.168.205.7
    }
    track_script {
        chk_haproxy
    }
}
```
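A quick note on the chk_haproxy track script: `killall -0` does not kill anything. Signal 0 only performs the existence (and permission) check, so the script exits with 0 exactly when a haproxy process is running, and keepalived adds the configured weight of 2 to the priority while the check succeeds. You can try the same probe by hand:

```shell
# Signal 0 performs no action; killall merely reports, via its exit
# code, whether any process named "haproxy" exists.
if killall -0 haproxy 2>/dev/null; then
    echo "haproxy is running"
else
    echo "haproxy is not running"
fi
```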
The configuration on our second load balancer, loadbalancer02, looks exactly the same, with the notification e-mail and router id changed accordingly and a lower priority. This all looked fine to me, and it was immediately clear that keepalived.conf was not the one to blame. I needed to find a different reason why the two keepalived instances were continuously flapping the virtual IP address.
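As a sketch, the parts that differ on loadbalancer02 might look like this (the concrete values below are assumptions; the source only states that the notification e-mail, the router id, and the priority differ):

```
router_id loadbalancer02
notification_email_from loadbalancer02@protecting.the.innocent.local

vrrp_instance loadbalancer {
    state BACKUP     # assumed; a lower priority alone also decides the election
    priority 100     # assumed value, merely lower than the master's 101
    ...              # remaining settings identical to loadbalancer01
}
```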
Since we were observing this exchange back and forth, I suspected the virtual network of unlawfully disregarding its responsibility. I immediately rushed to the terminal and started to monitor the traffic. This might seem easy, but it is far more complex than you think. The figure below, taken from openstack.redhat.com, shows an overview of how Neutron creates virtual networks.
OpenStack Neutron Architecture [https://openstack.redhat.com/Networking_in_too_much_detail]
Since both load balancers were running on different nodes, the VRRP packets traveled from A to J and back to Q. So where to look for the missing packets?
I decided to tail the packets at A, E, and J. Maybe that was where the culprit was hiding. So I started tcpdump on loadbalancer01, the compute node node01, and the network controller control01 looking for missing packets. And there were packets, I tell you. Lots of packets. But I could not see missing packets until I stopped tcpdump:
```
$ tcpdump -l host vrrp.mcast.net
...
45389 packets captured
699 packets received by filter
127 packets dropped by kernel
```
Dropped packets by the kernel? Are these our victims? Unfortunately not. tcpdump uses a buffer to pass captured packets from kernel space to user space; when packets arrive faster than tcpdump can read them, the buffer overflows and the kernel drops the surplus, which is what this counter reports. Not very helpful when you are trying to find where packets get, well, dropped. But there is the handy parameter -B for tcpdump that increases this buffer size (in units of KiB). I tried again:
```
$ tcpdump -l -B 10000 host vrrp.mcast.net
```
Much better: no more packets dropped by the kernel. And now I saw something. While the VRRP packets were dumped on my screen, I noticed that sometimes there was a lag of more than one second between VRRP keepalives. This struck me as odd. It should not happen, because the interval for VRRP packets was set to one second in the configurations. I felt that I was on to something, but just needed to dig deeper.
I let tcpdump print the time difference between successive packets (-ttt) and filtered out the expected one-second intervals to spot anything bigger:
```
$ tcpdump -i any -l -B 10000 -ttt host vrrp.mcast.net | grep -v '^00:00:01'
```
Oh my god. Look at the screenshot above [2]. There is a delay of more than 3.5 seconds. Of course loadbalancer02 assumes its partner went MIA.
But wait: the delay already starts on loadbalancer01? I have been fooled! It is not the virtual network. It has to be the master keepalived host! But why? Why would the virtual machine hold packets back? There must be some evil hiding in this machine, and I will find and face it…
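To see why a gap like that is fatal, recall VRRP's failover arithmetic (RFC 3768): the backup declares the master dead after Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time, where Skew_Time = (256 - Priority) / 256 seconds. A quick sketch, assuming a backup priority of 100 (the source only says it is lower than the master's 101):

```shell
# Master_Down_Interval per RFC 3768: how long the backup waits
# without advertisements before promoting itself to master.
advert_int=1    # advert_int from keepalived.conf
priority=100    # assumed backup priority; only "lower than 101" is known
awk -v a="$advert_int" -v p="$priority" 'BEGIN {
    skew = (256 - p) / 256
    printf "master_down_interval = %.3f s\n", 3 * a + skew
}'
# prints: master_down_interval = 3.609 s
```

With these values the backup gives up after roughly 3.6 seconds, so an advertisement gap of more than 3.5 seconds puts the cluster right at the edge of a takeover, which matches the flapping we observed.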
Stay tuned for our all new next episode of OpenStack Crime Investigation tomorrow on the codecentric Blog.
Footnotes
1. All personal details like IPs, e-mail addresses, etc. have been changed to protect the innocent.
2. Yes, you see correctly: I am running screen inside tmux. This allows me to keep my screen tabs alive even across logouts.
Blog author: Lukas Pustina