Wireless network loop, point-to-point down

Hey all, today one of my wireless networks threw me a new one. I had basically created a loop in the network, but it took a while to figure this out. I noticed in the virtual SmartZone (vSZ) that the network around one of our APs had started to mesh differently from our design, so much so that even management of some remote APs became a bit sluggish. The network is set up with several Ruckus P300 point-to-point APs to cover a large area, with a couple of meshing APs in between. Below is a basic diagram where the red lines represent the point-to-point links. SmartMesh is used to have the access points mesh with each other automatically.

Meshing topology with SmartMesh

Note: this project’s network design was constrained by a lack of existing network infrastructure, geographical challenges and costs. We therefore ended up having to set up a network where the backbone is based on two point-to-point uplinks. The APs use SmartMesh to connect to each other. The image above represents the approximate meshing opportunities. From left to right, the APs are supposed to connect to each other in numbered order.

Note 2: This is by no means a best-practice design when taking redundancy into account. However, costs and reliability have been carefully weighed, and it was decided that, given the purpose of this public WiFi network, it does not require a high level of redundancy.

When I noticed the change in meshing, I started by looking at the site survey on the P300s on both ends of the link. This feature produces a list of available networks. From that list, I concluded that both P300s were still facing the same direction as when we installed them: I could see all of our networks and the associated MAC addresses of our APs. Nevertheless, since I was on-site anyway, I decided to check the physical setup, because the first autumn storm had passed recently and the network had only been set up a few weeks prior. The point-to-point APs turned out to be perfectly aligned, but according to their respective control panels they were still not talking to each other. So the physical setup was not the problem.

Next, I turned my attention to the configuration, even though I didn’t spot anything wrong the first time around. I noticed that the root-bridge P300 was on a different channel than the non-root-bridge P300. This was caused by an auto-channel and DRS setting. The non-root bridge was still on the old channel and was not aware of the updated channel, because the two P300s were not talking to each other. However, changing the channel on one side to match the channel on the other side did not fix the problem either.

Logs save the day

So now I was stuck. I decided to take another, closer look at the logs on both ends. The root-bridge P300’s log mainly contained entries like these:

Sep 20 09:35:22 PP03 daemon.err channel-wifi0: on channel 60 (expected 157)
Sep 20 09:35:44 PP03 daemon.notice meshd[729]: Scan returned 0 entries.
Sep 20 09:36:15 PP03 daemon.notice meshd[729]: Scan returned 0 entries.
Sep 20 09:36:19 PP03 user.warn kernel: findchannel:925 Found matching channel for chan(36) chanflag 0x10100 flags(0X10100)
Sep 20 09:36:22 PP03 daemon.err channel-wifi0: on channel 36 (expected 60)
Sep 20 09:36:46 PP03 daemon.notice meshd[729]: Scan returned 0 entries.
Sep 20 09:37:17 PP03 daemon.notice meshd[729]: Scan returned 0 entries.
Sep 20 09:37:20 PP03 user.warn kernel: findchannel:925 Found matching channel for chan(149) chanflag 0x10100 flags(0X10100)
Sep 20 09:37:22 PP03 daemon.err channel-wifi0: on channel 149 (expected 36)
Sep 20 09:37:48 PP03 daemon.notice meshd[729]: Scan returned 0 entries.
Sep 20 09:38:20 PP03 daemon.notice meshd[729]: Scan returned 0 entries.
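
For longer logs, the channel hopping visible above can be summarised with a small script. This is only a sketch: the regular expression is derived from the few lines shown here, and the function name `channel_changes` is made up for illustration, so other firmware versions may log a different format.

```python
import re

# Pattern derived from lines such as:
#   Sep 20 09:35:22 PP03 daemon.err channel-wifi0: on channel 60 (expected 157)
# Assumption: the format matches the excerpt above; it is not taken from any spec.
CHANNEL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s.*"
    r"on channel (?P<actual>\d+) \(expected (?P<expected>\d+)\)"
)


def channel_changes(lines):
    """Return (timestamp, host, actual, expected) for every channel mismatch line."""
    changes = []
    for line in lines:
        m = CHANNEL_RE.search(line)
        if m:
            changes.append(
                (m["ts"], m["host"], int(m["actual"]), int(m["expected"]))
            )
    return changes


sample = [
    "Sep 20 09:35:22 PP03 daemon.err channel-wifi0: on channel 60 (expected 157)",
    "Sep 20 09:35:44 PP03 daemon.notice meshd[729]: Scan returned 0 entries.",
    "Sep 20 09:36:22 PP03 daemon.err channel-wifi0: on channel 36 (expected 60)",
    "Sep 20 09:37:22 PP03 daemon.err channel-wifi0: on channel 149 (expected 36)",
]

for ts, host, actual, expected in channel_changes(sample):
    print(f"{ts} {host}: on channel {actual}, expected {expected}")
```

Run against the excerpt, this lists the radio moving through channels 60, 36 and 149 roughly once a minute while every scan in between returns 0 entries.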

So nothing exciting here. The other P300 had somewhat more informative logging:

Sep 20 10:45:41 PP04 daemon.notice meshd[730]: Err 16 Failed to start scan
Sep 20 10:45:43 PP04 daemon.notice meshd[730]: wiredm_pick_wired_vs_wireless: Disqualified proposed_uplink b/c gw_detected=1
Sep 20 10:45:43 PP04 daemon.notice meshd[730]: uplink(wired=0)=null(0.0) best_wired=null(0.0) proposed=null( ,0.0,0.0)
Sep 20 10:45:43 PP04 daemon.notice meshd[730]: Advertise IF_code=- depth=1 downlinks=0 reason=
Sep 20 10:45:44 PP04 daemon.notice meshd[730]: Advertise IF_code=s depth=1 downlinks=0 reason=
Sep 20 10:45:49 PP04 daemon.notice meshd[730]: Err 16 Failed to start scan
Sep 20 10:45:50 PP04 daemon.notice meshd[730]: wiredm_pick_wired_vs_wireless: Disqualified proposed_uplink b/c gw_detected=1
Sep 20 10:45:50 PP04 daemon.notice meshd[730]: uplink(wired=0)=null(0.0) best_wired=null(0.0) proposed=null( ,0.0,0.0)
Sep 20 10:45:50 PP04 daemon.notice meshd[730]: Advertise IF_code=- depth=1 downlinks=0 reason=
Sep 20 10:45:51 PP04 daemon.notice meshd[730]: Advertise IF_code=s depth=1 downlinks=0 reason=

This log shows that the P300 had detected both a wireless and a wired uplink, and chose the wired one. That is why the wireless link was down: PP04 (see figure) found a gateway over the wire and decided the wired option was the better choice. However, the P300 doesn’t measure bandwidth on that wire and was unaware of the consequences for the network topology. The wire was connected to other access points that meshed all the way back to our wired infrastructure. Instead, it should have used the point-to-point link, which was now down. This resulted in a mesh depth of 4 in the network, so the wired connection wasn’t faster at all!
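
To spot the same symptom on other APs, the logs can be searched for the `gw_detected=1` disqualification message. The sketch below counts those lines per hostname. The pattern is copied from the PP04 excerpt above, the helper name `wired_gateway_hits` is hypothetical, and the meaning of the message is my reading of the log rather than documented behaviour.

```python
import re
from collections import Counter

# Pattern taken from the PP04 log excerpt; other firmware may word this differently.
DISQUALIFIED_RE = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s(?P<host>\S+)\s.*"
    r"wiredm_pick_wired_vs_wireless: Disqualified proposed_uplink b/c gw_detected=1"
)


def wired_gateway_hits(lines):
    """Count, per hostname, how often meshd disqualified its proposed uplink."""
    hits = Counter()
    for line in lines:
        m = DISQUALIFIED_RE.search(line)
        if m:
            hits[m["host"]] += 1
    return hits


sample = [
    "Sep 20 10:45:41 PP04 daemon.notice meshd[730]: Err 16 Failed to start scan",
    "Sep 20 10:45:43 PP04 daemon.notice meshd[730]: wiredm_pick_wired_vs_wireless: Disqualified proposed_uplink b/c gw_detected=1",
    "Sep 20 10:45:50 PP04 daemon.notice meshd[730]: wiredm_pick_wired_vs_wireless: Disqualified proposed_uplink b/c gw_detected=1",
]

for host, count in wired_gateway_hits(sample).items():
    print(f"{host}: proposed uplink disqualified {count} times (gateway seen on wire)")
```

Any point-to-point AP that shows up in this count is a candidate for the manual uplink restriction described in the next section.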

Solution

In the P300 config, there is no setting to override this default behaviour. There is another way to solve this, however: in the vSZ you can change the meshing strategy of the APs. Make sure you have written down the MAC addresses of all APs before proceeding. To find the MAC addresses, go to Access Points in the main menu.

To change the meshing strategy, go to Access Points in the main menu, select the access point whose meshing you want to restrict and click Configure. Scroll all the way down to Mesh Options and expand it. Under Uplink Selection, select Manual (Only selected APs can be used for uplink). Under Uplink, select the APs with which this AP is allowed to mesh.

Manual Meshing

By doing the same thing on the other APs, you can prevent a loop in the network that causes the point-to-point link to disable itself.

This is what the network looks like after the change in the vSZ. Note that this does take away some of the meshing flexibility, but it’s better than the alternative.
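
Before entering manual uplinks on many APs, it can help to check the planned uplink list for loops on paper. The sketch below does that with a depth-first search. It is purely illustrative: the AP names are hypothetical, the dictionary is not a vSZ export format, and it does not model how SmartMesh actually selects an uplink.

```python
def find_uplink_loop(allowed_uplinks):
    """Return a list of AP names forming a loop, or None if there is none.

    allowed_uplinks maps each AP name to the APs it may use as uplink.
    An AP with an empty list is treated as a root on the wired infrastructure.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {ap: WHITE for ap in allowed_uplinks}
    path = []

    def visit(ap):
        state[ap] = GREY
        path.append(ap)
        for up in allowed_uplinks.get(ap, []):
            if state.get(up, WHITE) == GREY:
                # Uplink is already on the current path: that is a loop.
                return path[path.index(up):] + [up]
            if state.get(up, WHITE) == WHITE:
                loop = visit(up)
                if loop:
                    return loop
        path.pop()
        state[ap] = BLACK
        return None

    for ap in list(allowed_uplinks):
        if state[ap] == WHITE:
            loop = visit(ap)
            if loop:
                return loop
    return None


# Hypothetical plan in which a point-to-point AP may also uplink via a neighbour.
unrestricted = {
    "root": [],
    "ap1": ["root"],
    "p2p-a": ["ap1"],
    "p2p-b": ["p2p-a", "ap2"],
    "ap2": ["p2p-b"],
}
# Same plan with p2p-b restricted to its point-to-point partner only.
manual = dict(unrestricted, **{"p2p-b": ["p2p-a"]})

print(find_uplink_loop(unrestricted))
print(find_uplink_loop(manual))
```

With the unrestricted plan the function reports the loop between `p2p-b` and `ap2`; with the manual restriction it returns `None`.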

Altered meshing topology


Edit - 21 sept 2017:

I’ve been thinking about alternative solutions to this, but I haven’t found one yet that would work. Setting routes on the P300 wouldn’t solve it either, since the next hop cannot be defined as an interface, only as a gateway address. Until this is updated in the firmware, I don’t think there is another way other than disabling SmartMesh.

rkscli: set route
Commands starting with 'set route' :
set route6 : set route6 {wan|video|mgmt|l2tp} {options}
-> add <target> <prefixlen> <gateway>
-> del <target> <prefixlen> <gateway>
-------------------------------------------------------------
** Target is host IP or prefix IP
** Ex. Target: 3000:0:0:88::2 -> prefixlen: 128
** Ex. Target: 4000:0:0:88::  -> prefixlen: 64
-- Set static ipv6 route(s)/default route
set route  : set route {wan|video|mgmt|l2tp} {options}
-> default, assign the CPE default route through this network
-> add <target> <netmask> <gateway>
-> del <target> <netmask> <gateway>
-------------------------------------------------------------
** Target is host IP or subnet IP
** Ex. Target: 192.168.5.15 -> Netmask: 255.255.255.255
** Ex. Target: 192.168.5.0  -> Netmask: 255.255.255.0
-- Set static route(s)/default route