WX5540H: Troubleshooting a large number of APs intermittently going offline

2020-03-18 Published

Network Topology

A site uses our WX5540H wireless controller (AC) with WA5320 and other AP models to provide on-site wireless coverage. The APs register with the AC over Layer 3, which is the most common registration method.

Problem Description

The on-site wireless deployment was completed successfully. Afterwards, however, the wireless signal of some APs proved unstable in use. Checking the AP connection status showed some APs in the I (Idle) state; a while later, the same APs were back in the R/M state.

  ===============display wlan ap all=============== 

Total number of APs: 1108

Total number of connected APs: 1030

Total number of connected manual APs: 1030

Total number of connected auto APs: 0

Total number of connected common APs: 1030

Total number of connected WTUs: 0

Total number of inside APs: 0

Maximum supported APs: 3072

Remaining APs: 2042

Total AP licenses: 1200

Remaining AP licenses: 170

 

                                 AP information

 State : I = Idle,      J  = Join,       JA = JoinAck,    IL = ImageLoad

         C = Config,    DC = DataCheck,  R  = Run,   M = Master,  B = Backup

 

AP name                    APID  State Model           Serial ID

st1f-1                     287   R/M   WA5320          219801A0YD8186E007GN

tsg1f-1                    296   I     WA5320          219801A0YD8186E009CP

tsg1f-2                    297   R/M   WA5320

  ……….
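When the AP list runs to more than a thousand entries, it can help to pull the Idle APs and the offline count out of a saved copy of this output with a script. The following is a minimal Python sketch, not an H3C tool. The function names `offline_count` and `idle_aps` are hypothetical, and the parser assumes that AP names contain no spaces and that data rows follow the column order shown above.

```python
import re

def offline_count(output):
    """Total APs minus connected APs, read from the summary counters."""
    total = int(re.search(r"Total number of APs:\s*(\d+)", output).group(1))
    connected = int(
        re.search(r"Total number of connected APs:\s*(\d+)", output).group(1)
    )
    return total - connected

def idle_aps(output):
    """Return the names of APs whose State column is I (Idle)."""
    idle = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows: name, numeric APID, state, model, optional serial ID.
        if len(fields) >= 4 and fields[1].isdigit():
            name, state = fields[0], fields[2]
            if state == "I":
                idle.append(name)
    return idle

# Sample taken from the output above (serial ID of the last row is not shown).
sample = """\
Total number of APs: 1108
Total number of connected APs: 1030
AP name                    APID  State Model           Serial ID
st1f-1                     287   R/M   WA5320          219801A0YD8186E007GN
tsg1f-1                    296   I     WA5320          219801A0YD8186E009CP
tsg1f-2                    297   R/M   WA5320
"""

print(offline_count(sample))  # 78 APs not connected (1108 - 1030)
print(idle_aps(sample))       # ['tsg1f-1']
```

Running the same check a few minutes apart shows which APs are flapping between I and R/M rather than staying down.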

Process Analysis

1. We first checked the AC logs and found many entries like the following:

%Aug 31 17:51:03:242 2018 YZYJKQ-WLAN-AC CWS/4/CWS_AP_DOWN: CAPWAP tunnel to AP 4ssl4f-404 went down. Reason: Failed to retransmit message.

// Failed to retransmit message: the AP did not respond in time to a key message sent by the AC (usually a configuration delivery), so the AC actively tore down the tunnel

%Aug 31 17:49:46:891 2018 YZYJKQ-WLAN-AC CWS/4/CWS_AP_DOWN: CAPWAP tunnel to AP 2ssl1f-110 went down. Reason: Neighbor dead timer expired.

// Neighbor dead timer expired: the keepalive timer of the control tunnel expired, so the AC actively tore down the tunnel

%Aug 31 17:51:03:245 2018 YZYJKQ-WLAN-AC APMGR/6/APMGR_AP_OFFLINE: AP 4ssl4f-404 went offline. State changed to Idle.

// The AP state changes to Idle

The logs show that packets between the AC and the APs timed out, which caused the AC to actively disconnect the APs. Our first suspicion was therefore an unstable AP link or power supply causing keepalive packets to be dropped.
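With this many log entries, tallying the disconnect reasons and the affected APs gives a quicker picture than reading them one by one. The sketch below is one possible way to do that in Python against an exported AC log. The function name `count_down_reasons` is hypothetical, and the pattern is based only on the two `CWS_AP_DOWN` lines quoted above, so other message variants may need adjustments.

```python
import re
from collections import Counter

# Matches CWS_AP_DOWN records like the ones quoted above. Optional whitespace
# around "/" is tolerated because copied logs are sometimes respaced.
AP_DOWN = re.compile(
    r"CWS\s*/\s*\d+\s*/\s*CWS_AP_DOWN:\s*CAPWAP tunnel to AP (\S+) went down\."
    r"\s*Reason:\s*(.+?)\.?\s*$"
)

def count_down_reasons(lines):
    """Return (reason -> count, AP name -> count) for CWS_AP_DOWN records."""
    reasons, aps = Counter(), Counter()
    for line in lines:
        match = AP_DOWN.search(line)
        if match:
            ap_name, reason = match.groups()
            aps[ap_name] += 1
            reasons[reason] += 1
    return reasons, aps

sample = [
    "%Aug 31 17:51:03:242 2018 YZYJKQ-WLAN-AC CWS/4/CWS_AP_DOWN: "
    "CAPWAP tunnel to AP 4ssl4f-404 went down. "
    "Reason: Failed to retransmit message.",
    "%Aug 31 17:49:46:891 2018 YZYJKQ-WLAN-AC CWS/4/CWS_AP_DOWN: "
    "CAPWAP tunnel to AP 2ssl1f-110 went down. "
    "Reason: Neighbor dead timer expired.",
]

reasons, aps = count_down_reasons(sample)
for reason, count in reasons.most_common():
    print(count, reason)
```

If the disconnects are spread across many APs and switches rather than concentrated on a few, a single bad cable or port becomes a less likely explanation.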

2. Following up on that suspicion, we collected diagnostic information from the PoE switch. Its log contained a large number of interface up/down events:



Frequent up/down changes on a switch interface usually point to a problem with the physical interface. We therefore checked the interface pins at both ends, replaced the cable, replaced the PoE switch, and swapped in other APs of the same model for separate tests. In every test the interface kept going up and down and the AP kept going offline.

3. These tests essentially ruled out the physical line and the AP hardware. We then turned our attention back to the AC to find out why so many APs kept going online and offline. The diagnostic information collected from an AP showed a system restart record in the AP log, followed by an interface up event:

%May 2 16:07:08:659 2019 4ssl3f-301 SYSLOG/6/SYSLOG_RESTART: System restarted -- H3C Comware Software. 

%May 2 16:07:49:832 2019 4ssl3f-301 IFNET/3/PHY_UPDOWN: Physical state on the interface Ethernet1/0/1 changed to up.
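To confirm that interface up events follow AP restarts rather than occurring on their own, the two record types can be paired by timestamp. The following Python sketch does this for log lines in the format shown above. The helper names `parse_event` and `restart_to_link_up` are hypothetical, and the timestamp parsing assumes English month abbreviations.

```python
import re
from datetime import datetime

# Timestamp, host name, module/severity/mnemonic, as in the two lines above.
EVENT = re.compile(
    r"^%\s*(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}:\d{3} \d{4}) (\S+) (\w+)/(\d)/(\w+):"
)

def parse_event(line):
    """Return (timestamp, host, mnemonic), or None if the line does not match."""
    match = EVENT.match(line.strip())
    if not match:
        return None
    stamp, host, _module, _severity, mnemonic = match.groups()
    # The last time field is milliseconds; %f accepts one to six digits.
    when = datetime.strptime(" ".join(stamp.split()), "%b %d %H:%M:%S:%f %Y")
    return when, host, mnemonic

def restart_to_link_up(lines):
    """Yield (host, seconds) from each restart to the next interface up event."""
    last_restart = {}
    for line in lines:
        event = parse_event(line)
        if event is None:
            continue
        when, host, mnemonic = event
        if mnemonic == "SYSLOG_RESTART":
            last_restart[host] = when
        elif (mnemonic == "PHY_UPDOWN" and "changed to up" in line
              and host in last_restart):
            yield host, (when - last_restart.pop(host)).total_seconds()

sample = [
    "%May 2 16:07:08:659 2019 4ssl3f-301 SYSLOG/6/SYSLOG_RESTART: "
    "System restarted -- H3C Comware Software.",
    "%May 2 16:07:49:832 2019 4ssl3f-301 IFNET/3/PHY_UPDOWN: "
    "Physical state on the interface Ethernet1/0/1 changed to up.",
]

for host, seconds in restart_to_link_up(sample):
    print(host, seconds)
```

An interface up event that consistently appears shortly after a restart record suggests the flapping seen on the switch is a side effect of the AP rebooting.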

Since the system restart caused the interface to go up and down, we checked the AP's software version. The AP under test was a WA2610H running Version 7.1.064, alpha 2104sp21.

We then checked the software version of the WX5540H wireless controller, which was Version 7.1.064, Release 5208P03. The version documentation for that controller release lists the matching AP versions, and the version adapted for the WA2610H is CMW710-R2208P03.

The comparison showed that the APs that kept going online and offline were not running the adapted version.
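The same comparison can be scripted when many AP models have to be checked. The Python sketch below is illustrative only. The function name `is_adapted` is hypothetical, and it assumes that a package name such as `CMW710-R2208P03` corresponds to a running version string ending in `Release 2208P03`, by analogy with the controller's `Release 5208P03`.

```python
def is_adapted(running_version, expected_package):
    """Check whether a running version string matches the expected package.

    Assumption: the part after "-R" in the package name appears after the
    word "Release" in the version string reported by the device.
    """
    token = expected_package.upper().rsplit("-R", 1)[-1]
    compact = running_version.upper().replace(" ", "")
    return ("RELEASE" + token) in compact

# Version found on the WA2610H in this case.
print(is_adapted("Version 7.1.064, alpha 2104sp21", "CMW710-R2208P03"))

# Hypothetical string an AP on the adapted version might report.
print(is_adapted("Version 7.1.064, Release 2208P03", "CMW710-R2208P03"))
```

The first call reports a mismatch and the second a match, which mirrors the manual comparison against the version documentation.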

A further check of the wireless controller configuration showed that the field engineer had mistakenly disabled the global AP software upgrade function:

 # wlan global-configuration 

   firmware-upgrade disable // Disable the AP version upgrade function

4. After the on-site engineer enabled the global AP software upgrade function, the APs came online and stayed stable.

Solution

Because the global AP software upgrade function had been disabled by mistake, the APs could not upgrade to the version adapted to the AC. This version mismatch caused a large number of APs to register, go offline after a while, and then come online again. After the function was enabled, the APs stayed online.

 # wlan global-configuration   

   firmware-upgrade enable // Enable the AP software upgrade function