To Ping Or Not To Ping, That Is The Question

Over the last few months, I have been having trouble with the stability of my home internet connection. This became evident to me while live streaming to Twitch, as my stream would begin dropping a lot of frames in a really short amount of time. For a time, power cycling the cable modem would make a positive difference, but after a few months that stopped working reliably. This post is about my subsequent experience trying to write a scheduled cron job that would let me know whenever my internet connection was having issues.

A few years ago I had a similar issue. That took six months to resolve and it ended up requiring that some cable run underneath the sidewalk in front of my house be replaced. So because of the PTSD associated with that hellish experience (which effectively rendered my connection unusable for hours on end at random times, making working remotely from home quite difficult), I took immediate action when I was presented with a similar issue once again.

Once power cycling the modem stopped working, I called Charter / Spectrum. They sent a tech out, and he replaced the line that runs from the street, through my yard, and into my crawlspace, where it comes up through the floor into my office where the cable modem resides. The tech also replaced my cable modem (which incidentally was not only an older model, but also “survived” our house getting struck by lightning in 2019).

However, that wasn’t enough. I wanted to be able to monitor this situation so I wouldn’t be caught with my pants down again. I decided to write a simple cron job which would notify me when my internet connection was not working. So how did I do that? Well, I decided that pinging some known IP addresses was the way to go. Initially the script was written to simply ping Google’s public DNS server at 8.8.8.8 one hundred times and only produce output in the event that at least one packet was dropped.

Eventually I updated the script so that it would randomly select an IP from a list of IPs, all of which are IPs of publicly accessible DNS servers. Below is the final version of the script in its entirety:

#!/bin/bash
IPS[0]=9.9.9.9
IPS[1]=208.67.222.222
IPS[2]=208.67.220.220
IPS[3]=1.1.1.1
IPS[4]=1.0.0.1
IPS[5]=205.210.42.205
IPS[6]=64.68.200.200
IPS_SIZE=${#IPS[@]}
IPS_INDEX=$(($RANDOM % $IPS_SIZE))
IP=${IPS[$IPS_INDEX]}

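# Number of echo requests to send per run
COUNT=100

# Start from a clean output file, then capture the full ping transcript
rm -f /tmp/ping_test_output.txt
ping -c "$COUNT" -D -O "$IP" > /tmp/ping_test_output.txt 2>&1
# Stay silent on a clean run so cron only sends mail when something was lost
if grep -qi " 0% packet loss" /tmp/ping_test_output.txt; then
 exit 0
else
 echo "ALERT: Ping Test to $IP lost at least one packet!"
 # The second-to-last line is normally ping's summary (sent, received, loss)
 tail -n 2 /tmp/ping_test_output.txt | head -n 1
 echo -----------------------------------------------------
 cat /tmp/ping_test_output.txt
 echo -----------------------------------------------------
 exit 1
fi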

This script ran every half an hour via the magic of cron and, in the event any packets were dropped, produced an email with the pertinent details. So now I had a great way to track when my internet connection was acting up, right?

Wrong.

As I soon came to realize, even though this script seemed to do its job by sending emails when packets were lost, those emails didn’t actually mean my internet connection was having a problem at all. Nevertheless, since I didn’t realize this, I started calling Spectrum each and every time I got more than one or two of these emails within a period of a few hours (as I am willing to tolerate maybe one or two a day, because this is the internet after all). At some point it became clear that unlike before, I wasn’t seeing other things fail. My streams were not being negatively affected despite receiving these emails and, most importantly, Spectrum’s phone support wasn’t able to replicate the packet loss on their end (which they were able to do before).

At this point, I thought maybe that some other piece of network equipment on my side was responsible. My network is largely powered by Ubiquiti equipment and ever since the Coronavirus became an issue, their firmware updates have been less than stellar. I had already rolled back one upgrade on my Wireless Access Point to resolve a separate issue, so I decided to roll back the most recent update on my EdgeRouter X-1 as well.

Initially this seemed to resolve my issues. I went a few weeks and only got an email here and there and all seemed well. Then sometime last week I started getting A LOT of these emails. Every half an hour in fact. Not only that but the packet loss started creeping up from 1% to as high as 3% and 4%. But everything else appeared to be working fine. I was still live streaming without dropping any frames. Internet video services were working fine (excepting of course the occasional glitch from Philo which is just standard operating procedure sadly).
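Those percentages come straight from the summary line that ping prints at the end of a run. As a minimal sketch (not part of the script above), the number can be pulled out and compared against a tolerance instead of alerting on any loss at all. The helper name loss_percent, the THRESHOLD value and the sample line are all assumptions made for illustration, and the sed pattern assumes the Linux iputils summary format.

```bash
#!/bin/bash
# Sketch only: THRESHOLD and loss_percent are hypothetical names, not from the original script.
THRESHOLD=2   # assumed tolerance, in percent

# Reads ping output on stdin and prints the integer part of the
# "N% packet loss" figure from the summary line.
loss_percent() {
  sed -n 's/.* \([0-9][0-9]*\)\(\.[0-9]*\)\{0,1\}% packet loss.*/\1/p'
}

# Illustrative summary line in the iputils format (made up for this example)
sample='100 packets transmitted, 97 received, 3% packet loss, time 99143ms'

loss=$(printf '%s\n' "$sample" | loss_percent)
if [ "$loss" -gt "$THRESHOLD" ]; then
  echo "ALERT: ${loss}% packet loss exceeds ${THRESHOLD}%"
fi
```

In a real run the function would read /tmp/ping_test_output.txt rather than a canned string. A threshold only reduces the noise; it does nothing about where along the path the packets were lost.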

So what the hell was going on? Sadly my entire approach was absolutely idiotic and had never had a real chance of working. However because networking isn’t my primary forte and is probably not yours, you are probably scratching your head just like I was wondering what the flaw was. Remember the point of this script is to let me know that my internet connection is having issues, but so far the output produced by this script didn’t seem to correlate with any actual issues. So I started to do some research. Eventually I stumbled on this conversation which helped to shed some light on the problem. The subject of the thread really says it all:

DNS Rate Limiting ICMP (8.8.8.8 and 8.8.4.4)

Holy shit. Wow. It never even occurred to me that a public DNS server like 8.8.8.8 would put rate limits on things like ICMP. But when you think about it, it makes perfect sense. These servers exist primarily to service DNS clients, not to allow half-assed network admins like me to “test” the quality of their internet connection. So of course you want to prioritize servicing DNS clients over anything else.

Keep in mind at this point, the script was only pinging 8.8.8.8. This is when I went back and made it randomly pick an IP out of a list of IPs. There are a variety of IPs there, including the IPs for Cloudflare, Quad9 and OpenDNS servers. But of course after making this mod, I was still getting negative results that didn’t correlate with any other symptoms.

Well of course the problem now should be obvious. I’m pinging DNS servers. If Google isn’t prioritizing responding to ICMP requests, then it would make sense that other providers are doing the same. That realization hit me like a bag of bricks. But it doesn’t end there. As I began to try and figure out what IPs I could actually ping to test my connection, I suddenly realized that this entire idea was an epic fail.

Why is that? Well it’s because when you ping some external IP, that packet is being routed through any number of routers, only a portion of which will be controlled by your ISP. The internet is a collection of connected yet independently operated network devices. Let’s check out this traceroute from one of my servers to 8.8.8.8:

traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  _gateway (192.168.128.253)  0.551 ms  0.448 ms  0.352 ms
 2  * * *
 3  dtr01wavsnc-gbe-0-0-1-2.wavs.nc.charter.com (96.34.65.244)  14.287 ms  14.168 ms  14.090 ms
 4  crr01gnvlsc-bue-100.gnvl.sc.charter.com (96.34.67.12)  14.214 ms  14.146 ms  15.439 ms
 5  crr12gnvlsc-tge-0-1-0-1.gnvl.sc.charter.com (96.34.92.62)  15.372 ms bbr01gnvlsc-bue-3.gnvl.sc.charter.com (96.34.2.112)  16.865 ms crr12gnvlsc-tge-0-1-0-1.gnvl.sc.charter.com (96.34.92.62)  16.786 ms
 6  bbr01chcgil-tge-0-1-0-6.chcg.il.charter.com (96.34.0.135)  22.548 ms bbr01spbgsc-bue-4.spbg.sc.charter.com (96.34.2.50)  25.352 ms bbr01chcgil-tge-0-1-0-6.chcg.il.charter.com (96.34.0.135)  20.630 ms
 7  bbr02slidla-tge-0-1-0-4.slid.la.charter.com (96.34.0.133)  29.150 ms  28.740 ms bbr01chcgil-tge-0-3-0-8.chcg.il.charter.com (96.34.0.184)  26.993 ms
 8  prr01snjsca-tge-0-0-0-1.snjs.ca.charter.com (96.34.3.35)  28.525 ms bbr02atlnga-tge-0-2-0-0.atln.ga.charter.com (96.34.3.111)  28.416 ms  28.341 ms
 9  74.125.51.142 (74.125.51.142)  19.665 ms  24.198 ms  22.476 ms
10  108.170.249.161 (108.170.249.161)  22.400 ms *  15.815 ms
11  dns.google (8.8.8.8)  22.684 ms 108.170.225.117 (108.170.225.117)  16.751 ms 209.85.241.153 (209.85.241.153)  16.602 ms

Now that should be enlightening. However for those of you who don’t know how to read that output, let me explain. It basically tells me that my request to 8.8.8.8 goes through ten different devices before reaching its ultimate destination. Looking closely we can see that only points 2 through 8 appear to be controlled by Charter / Spectrum, whereas everything before and after that is outside of their control. Presumably point 2 is the immediate internal gateway accessed by my router (which is 192.168.128.253), and it is not directly pingable.
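That reading can also be done mechanically. The sketch below is an illustration rather than anything I actually ran: it scans traceroute output for hops whose hostname falls under a given domain and reports the first and last one. The helper name isp_hops is hypothetical, the sample is an abbreviated copy of a few lines from the output above, and the approach assumes the ISP publishes reverse DNS names for its routers (a hop that answers with * * *, like point 2, will simply not be counted).

```bash
#!/bin/bash
# Sketch only: isp_hops is a hypothetical helper, not part of the original scripts.
# Reads traceroute output (with hostnames, so run without -n) on stdin and
# prints the hop number of every line that mentions a host under the given domain.
isp_hops() {
  local domain=$1
  awk -v d="$domain" '$1 ~ /^[0-9]+$/ && index($0, "." d " ") { print $1 }'
}

# Abbreviated lines taken from the traceroute output shown above
sample=' 1  _gateway (192.168.128.253)  0.551 ms  0.448 ms  0.352 ms
 2  * * *
 3  dtr01wavsnc-gbe-0-0-1-2.wavs.nc.charter.com (96.34.65.244)  14.287 ms  14.168 ms  14.090 ms
 8  prr01snjsca-tge-0-0-0-1.snjs.ca.charter.com (96.34.3.35)  28.525 ms
 9  74.125.51.142 (74.125.51.142)  19.665 ms  24.198 ms  22.476 ms'

hops=$(printf '%s\n' "$sample" | isp_hops charter.com)
FIRST=$(printf '%s\n' "$hops" | head -n 1)
LAST=$(printf '%s\n' "$hops" | tail -n 1)
echo "first ISP hop: $FIRST, last ISP hop: $LAST"
```

Working from hop numbers rather than addresses means the result does not depend on any particular router IP staying the same.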

In any event, it’s really not fair to hold Charter / Spectrum accountable for things that happen outside of their network, is it? It’s really not. I guess I could start pinging some of those Spectrum IPs directly using the previous script, but what if Spectrum changes something and modifies their network routing somehow? Well then I have to manually update the script with new IPs anytime that happens. I really didn’t want to have to commit to doing that, so I decided that it was time to take a different approach to solving this problem. Below is the current version of the script:

#!/bin/bash
IP=8.8.8.8
MIN=3
MAX=8

rm /tmp/traceroute_test_output.txt > /dev/null 2>&1
traceroute -n -f $MIN -m $MAX $IP > /tmp/traceroute_test_output.txt 2>&1
# -F matches the asterisk literally; traceroute prints * for any probe that got no reply
if grep -qF '*' /tmp/traceroute_test_output.txt; then
 echo ALERT: Traceroute Test to $IP lost at least one packet!
 echo -----------------------------------------------------
 cat /tmp/traceroute_test_output.txt
 echo -----------------------------------------------------
 exit 1
else
 exit 0
fi

So instead of pinging 8.8.8.8 we are now attempting to route a packet to 8.8.8.8 and only testing the 3rd through 8th hops of the transaction. As I noted before, those are all Spectrum devices, so it’s reasonable to hold them accountable for those devices being able to service requests regardless of what 8.8.8.8 ultimately decides to do with my probe packets.

So far this script has worked far better. I’m still getting two emails each day since I implemented this, and each and every time the point of failure has been the 8th hop. From my perspective, my connection has been functioning perfectly, so these emails are likely false positives and may require me to modify the script to stop at the 7th hop rather than the 8th hop. I’m going to let it run for the next few days and modify the script appropriately then.
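Spotting the failing hop by eye in each email works, but it could also be extracted automatically. Here is a rough sketch of that idea: it reads traceroute -n output and prints the number of every hop that has at least one unanswered probe. The helper name failed_hops is hypothetical, and the sample output is made up for illustration (the addresses are borrowed from the traceroute earlier in this post).

```bash
#!/bin/bash
# Sketch only: failed_hops is a hypothetical helper, not part of the original script.
# Reads traceroute -n output on stdin and prints the hop number of every
# line that contains at least one "*" (a probe that got no reply).
failed_hops() {
  awk '$1 ~ /^[0-9]+$/ {
    for (i = 2; i <= NF; i++) if ($i == "*") { print $1; break }
  }'
}

# Made-up sample in the shape traceroute -n -f 3 -m 8 would produce
sample='traceroute to 8.8.8.8 (8.8.8.8), 8 hops max, 60 byte packets
 3  96.34.65.244  14.287 ms  14.168 ms  14.090 ms
 4  96.34.67.12  14.214 ms  14.146 ms  15.439 ms
 8  96.34.3.35  28.525 ms *  28.341 ms'

printf '%s\n' "$sample" | failed_hops
```

Appending that output to a log on every run would show at a glance whether the failures really are always at the same hop.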

You have to be careful when it comes to testing network connectivity because in a lot of cases you may be biting off more than you can chew and not even realize it. Simply pinging an IP requires a lot of independent devices and networks to work in concert with one another, and it’s not fair to hold the operator of one network accountable for the actions of the operator of a different network. It is sometimes easy to forget how wonderfully complicated all of this is and that’s okay.

Needless to say this is a forgivable sin just so long as you course correct appropriately. Speaking of that, the next time I talk to Spectrum (which should be in a few days, as they put a monitor on my cable modem that expires on Friday) I plan on apologizing profusely to them. While there initially was an actual issue, it appears that their initial work resolved the problem. The subsequent four times I have called were in error and I sincerely regret wasting their time on the issue.

So what’s the moral of the story? You live, you learn. But don’t forget that last part as it is quite important.