Holding your WLAN accountable with service-level expectations

Wi-Fi may not be the only problem when it comes to connectivity issues. Here’s how to accurately and easily address this problem.

In my last post, “#WirelessSucks: Where do we go from here?” I talked about the need for better insight into the root cause of network problems. All too often, the Wi-Fi infrastructure is blamed for bad network connectivity when, in fact, wired network services (e.g., DNS and DHCP) and/or the mobile devices may be equally at fault.

I identified four components that are required to accurately and easily address this problem:

Monitoring networks at a service level
Real-time visibility into the state of every wireless user
A cloud infrastructure to store and analyze real-time state information and aggregate it to the highest level of commonality
Machine learning to automate key operational tasks, such as event correlation and packet captures

Let’s go into more detail on the first of these requirements: service-level monitoring and enforcement.

If you asked a wireless administrator what type of experience he is delivering to his users, odds are he will have no idea. That isn’t because he doesn’t know what he is doing, but because traditionally there have been no tools for setting, monitoring and enforcing real-time service levels for mobile users; legacy wireless systems monitored only access point and controller uptime.

In the modern smart device era, this is no longer acceptable. As wireless becomes more prevalent and business critical, it needs to be delivered more like a service. That means IT administrators need to know the state of experience delivered to every device at any given time so that they can determine if service-level expectations (SLEs) are being met for those users.

Useful SLE metrics to track

This can be a daunting task because there are over 100 possible states to track for each device, and hundreds or thousands of devices on a network at any given time. But before I talk about how it can be achieved using the cloud, machine learning and other technologies (in future blogs), let’s first talk about what kinds of SLEs are useful:

Time to connect/Failed to connect: This tracks the number of connections that took longer than the specified threshold to connect to the internet (e.g., 2 seconds) or the number of connection attempts that failed to connect. The time to connect to the internet is calculated as the time from the “first” association from the mobile client to the point where the client is able to successfully move data, accounting for all the state transitions along the way. In addition, with the “failed-to-connect” metric, we should be able to identify if/when the full connection process failed, as well as where in the process it failed. In other words, if this SLE metric works only when the connection is successful, a big piece of the SLE equation is missing!
Coverage: This tracks how often a client’s Received Signal Strength Indicator (RSSI) is below a threshold configurable by the IT administrator (e.g., -70 dBm). However, the most important element here is being able to track every user, every minute. That gives a true depiction of every user’s coverage experience.
Capacity: This measures per-user available channel capacity and fires off alerts when the available capacity drops below a specified SLE threshold (e.g., 20 percent of total capacity is available per user).
Access Point Health: This SLE metric tracks if access points are unreachable, are disconnected, or have been rebooted to identify if problems exist with specific devices or in specific sections of a building or network.
Throughput: This SLE metric tracks the amount of time a client’s estimated throughput is below the threshold configured by IT. A client’s estimated throughput is the probabilistic throughput given the client’s current wireless conditions. This considers many effects, such as AP bandwidth, load, interference events, the type of wireless device (protocol, number of streams), signal strength, and wired bandwidth.
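
To make the first two metrics concrete, here is a minimal Python sketch of how time-to-connect/failed-to-connect and coverage might be scored. The record types, field names and stage labels ("dhcp" and so on) are my own assumptions for illustration, not any product's data model; only the 2-second and -70 dBm thresholds come from the examples above.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical record shapes; a real system would derive these from telemetry.
@dataclass
class ConnectionAttempt:
    client: str
    seconds_to_connect: Optional[float]  # None means the client never moved data
    failed_stage: Optional[str] = None   # illustrative labels, e.g. "association", "dhcp"


@dataclass
class RssiSample:
    client: str
    minute: int
    rssi_dbm: float


def time_to_connect_sle(attempts, threshold_s=2.0):
    """Count slow and failed attempts, and record where failures occurred."""
    slow = 0
    failed = 0
    failures_by_stage = {}
    for a in attempts:
        if a.seconds_to_connect is None:
            failed += 1
            stage = a.failed_stage or "unknown"
            failures_by_stage[stage] = failures_by_stage.get(stage, 0) + 1
        elif a.seconds_to_connect > threshold_s:
            slow += 1
    total = len(attempts)
    met = total - slow - failed
    return {
        "total": total,
        "slow": slow,
        "failed": failed,
        "failures_by_stage": failures_by_stage,
        # With no attempts there is nothing to violate the SLE.
        "met_fraction": met / total if total else 1.0,
    }


def coverage_sle(samples, threshold_dbm=-70.0):
    """Per client, the fraction of sampled minutes with RSSI below the threshold."""
    totals = {}
    below = {}
    for s in samples:
        totals[s.client] = totals.get(s.client, 0) + 1
        if s.rssi_dbm < threshold_dbm:
            below[s.client] = below.get(s.client, 0) + 1
    return {client: below.get(client, 0) / n for client, n in totals.items()}
```

Note that failed attempts are counted separately from slow ones and keep their failure stage, which is the point made above: a metric that only sees successful connections hides a big piece of the picture.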

Additionally, with SLE metrics, it is now possible to baseline, measure and compare the impact of changes made to the network. This is especially critical when changes are performed automatically, such as when access points change channels to avoid noise or interference. In the past, you did not know if these Radio Resource Management (RRM) changes helped or hurt the user experience. Now, it is crystal clear.
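
As a rough illustration of that before-and-after comparison, the sketch below assumes each user-minute has already been marked as meeting or missing an SLE, and simply compares the met fraction in a window before a change with a window after it. The function names and the boolean-per-user-minute representation are assumptions I made for the example.

```python
def sle_met_fraction(user_minutes):
    """Fraction of user-minutes in which the SLE was met (True = met)."""
    if not user_minutes:
        raise ValueError("need at least one user-minute to score")
    return sum(user_minutes) / len(user_minutes)


def compare_change(before, after):
    """Compare SLE attainment before and after a network change.

    A positive delta means a larger share of user-minutes met the SLE
    after the change; a negative delta means the change hurt.
    """
    b = sle_met_fraction(before)
    a = sle_met_fraction(after)
    return {"before": b, "after": a, "delta": a - b}


# Example: 80 of 100 user-minutes met the SLE before a channel change, 90 after.
result = compare_change([True] * 80 + [False] * 20, [True] * 90 + [False] * 10)
```

In practice you would want the two windows to cover comparable times of day and client populations; otherwise the delta reflects the load difference as much as the change itself.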

With SLE metrics (and dashboards), you can accurately understand the quality of service being offered to your wireless users. If SLE thresholds are violated, you can proactively receive insight into the reasons why and know exactly which mobile devices are affected. In other words, if wireless truly does suck, you will know it before your users do. If it is functioning properly, you will know that, too. That’s smart wireless for the modern smart device era!

In upcoming blog posts, I’ll discuss how to better address the root cause of network problems using the cloud, machine learning and other technologies.

Originally posted at NetworkWorld