Recently a good client of mine contacted me because his organization was having issues with its Meraki wireless after installing Cisco 500X switches connected to a Cisco 550X core. When I asked him for details, it turned out that Cisco had already replaced his switches in an unsuccessful attempt to fix the problem and had actually issued him an RMA for all the devices. I had never heard of anything like that before, so I knew this was going to be hard, and I thought: no pain, no gain. (I’m sure many of you think otherwise or have smart comments about this, which I wouldn’t mind reading in the comment section.)
Details: This network had multiple VLANs and was a completely virtualized environment with a SAN. The user switches were a stack of four 500X units, connected by two 10 Gig interfaces to the core (a stack of two 550X switches). When he removed one of the links from the 500X stack, everything worked great, but as soon as he put the second link back, some of his VMs were no longer accessible and his Meraki wireless started dropping packets.
I’ve been configuring these switches for many years, so that made no sense, and I decided to physically inspect the site to see if I could find any issues. Lo and behold, the first thing I found was that they were using aftermarket 10G stacking cables. I knew from previous experience that this doesn’t work, and I told him that we could not work on this unless he replaced the cables. After we replaced the cables, some of the symptoms disappeared (access to the VMs remained stable after both links to the core were plugged back in); however, the Meraki access points continued to drop packets.
Whenever things are hard like this, I always like to take a sniffer capture, so we spanned the port going to one of the wireless access points and noticed 30-50 CDP packets per second originating from the Meraki AP, which seemed odd to me. We called their support and they said, “You must have an L2 loop somewhere, and that’s why we are generating too many CDP packets and have issues with connectivity.” Although it was clear from the capture that the source was the Meraki and that these were not duplicated packets, I pulled one of the switches out of the stack, placed my laptop on the same VLAN, and plugged the wireless access point into the same switch. Guess what: same results.
We called the same engineer, and he said there must be a Layer 2 loop in this configuration. But it was clear to me that a single switch plus a single laptop plus a single Meraki with no other cables could not possibly form an L2 loop, so I asked to speak to another engineer. The second engineer was not any better; however, both agreed with me that we should simply disable CDP on the Meraki, which solved the problem completely.
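For anyone who wants to run the same check on their own capture, here is a minimal sketch of the idea: count CDP frames per source MAC per second. It assumes the frames have already been pulled out of the capture as (timestamp, raw bytes) pairs and that they are untagged (no 802.1Q header), which may not match your SPAN setup. The function names are my own and purely illustrative. For reference, Cisco devices advertise CDP every 60 seconds by default, so dozens of frames per second from one source stands out immediately.

```python
from collections import Counter, defaultdict

# CDP is sent to a Cisco multicast MAC and identified by an LLC/SNAP header
# with OUI 00:00:0c and protocol ID 0x2000.
CDP_DST_MAC = bytes.fromhex("01000ccccccc")
CDP_SNAP = bytes.fromhex("aaaa0300000c2000")


def is_cdp(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame carries CDP."""
    # Bytes 0-5: destination MAC, 6-11: source MAC, 12-13: 802.3 length,
    # 14-21: LLC/SNAP header.
    return (
        len(frame) >= 22
        and frame[0:6] == CDP_DST_MAC
        and frame[14:22] == CDP_SNAP
    )


def peak_cdp_rate_by_source(frames):
    """Return {source_mac: peak CDP frames seen in any one-second bucket}.

    frames is an iterable of (timestamp_in_seconds, raw_frame_bytes).
    """
    per_second = defaultdict(Counter)  # source MAC -> {whole second -> count}
    for timestamp, frame in frames:
        if not is_cdp(frame):
            continue
        source = frame[6:12].hex(":")
        per_second[source][int(timestamp)] += 1
    return {src: max(buckets.values()) for src, buckets in per_second.items()}
```

If a single source MAC shows a peak in the tens per second while everything else on the port looks normal, the traffic is being generated by that device rather than replicated by the network.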
I wanted to share this with everyone because although on the surface it seems like a simple issue, a lot of time and effort was spent trying to get to the bottom of it before it was brought to our attention. The customer actually went through 15+ change management and outage notifications and more than two months of replacing switches and testing the stacking and cross-switch links before contacting me.
So the bottom line is:
- Buy only genuine Cisco stacking cables (they are cheap)
- If the design is correct, then isolate the problem as much as possible
- Spanning tree was not of interest to me because I never saw duplicated frames or traffic circulating endlessly
- Sometimes the engineer on the other side doesn’t have a strong grasp of the subject matter, and then you are better off telling them what to do on the equipment rather than leaving it open for them to try things out
President and Enterprise Architect