By Dr. Stefan Vallin, Director of Product Strategy, Netrounds
A cat is a dog is a mouse is a monkey. As we all know, one species does not define all animals, and the same is true for monitoring systems. Network monitoring is a broad term, and there is no accepted taxonomy of the different kinds of monitoring systems. For most people, a monitoring system is a central piece of software that collects whatever data devices publish and then does something with that data. But that specific type does not define all monitoring systems, just as not every animal is a cat.
In this monitoring cheat sheet blog post, we will describe and compare the four main categories of monitoring systems:
- Classical monitoring systems: what most people think of
- Telemetry and analytics-based systems: what most people hope will come to the rescue
- Passive probes: what some people have heard of
- Active test and monitoring systems: the missing piece for most people
Classical Monitoring Systems
This category covers what most people think of when they hear the term “monitoring system”. The system is based on centrally installed software that connects to devices over the management plane. Various protocols like SNMP or syslog, or even retrieval of logfiles, are used to get the management data from the devices. The focus of the monitoring software may differ: fault (alarm) management systems try to deduce if there is a fault somewhere that needs to be fixed, while performance monitoring systems compute long-term statistics and perform trending to spot utilization problems, like an overloaded link.
These systems are 100% dependent on the quality of the data that the devices present on the management plane. In most cases the devices are polled by the system, and the time resolution of the data is on the order of minutes (typically 5- or 15-minute intervals). Alarms and threshold crossings are commonly reported as they occur, for instance via SNMP notifications.
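As a rough sketch of what such a system does with polled counter data, the snippet below computes link utilization from two successive readings of an interface's in-octets counter. The fetch function and its parameters are hypothetical stand-ins; a real system would issue SNMP GETs via an SNMP library.

```python
# Hypothetical stand-in for an SNMP GET of an interface byte counter;
# a real poller would use an SNMP library, the device address, and
# credentials. Shown only to indicate where the data comes from.
def fetch_if_in_octets(device, ifindex):
    ...

def utilization_percent(prev_octets, curr_octets, interval_s, speed_bps):
    """Link utilization over one polling interval.

    Octet counters count bytes, so multiply the delta by 8 for bits,
    then divide by the link capacity over the same interval.
    """
    delta_bits = (curr_octets - prev_octets) * 8
    return 100.0 * delta_bits / (interval_s * speed_bps)

# Example: two polls 300 s (5 min) apart on a 1 Gbit/s link.
u = utilization_percent(prev_octets=1_000_000_000,
                        curr_octets=8_500_000_000,
                        interval_s=300,
                        speed_bps=1_000_000_000)
# u is 20.0 (percent)
```

Note how the 5-minute interval averages away short bursts: a link that is saturated for 30 seconds and idle for the rest still shows as lightly loaded, which is one reason classical performance monitoring is better at trending than at fault detection.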
The data itself is limited to the context of a single device and is largely device-centric. Some of the data can be related to a link as seen from that device.
As the central software is heavily centered on device assurance, inferring end-to-end service health is extremely difficult. Attempts in this direction have been made by adding inventory/topology knowledge on top and seeking mappings from device-centered data to service health. However, these attempts have had limited success. The same is true for attempts at alarm correlation.
Classical assurance systems need to be integrated with each management plane: MIBs, syslog payloads, etc. Therefore, these systems are heavily dependent on adaptors for each device type.
What can you measure? As stated above, the data is typically focused on, and limited to, the status of a device and its interfaces.
Telemetry and Analytics-Based Systems
The assurance industry is now shifting its terminology, and to some degree its technology, compared to the scenario described under “Classical Monitoring”. Increasingly, devices publish telemetry data over telemetry protocols. In basic terms this means that the devices push or stream the management data, rather than the management system polling for it. Furthermore, the management systems can specify more precisely which data and events they are interested in.
Such an arrangement enables finer granularity: data points can be seconds rather than minutes apart. On the other hand, the management system must now handle much more data, which requires a shift to big data technologies for storage and processing. Note carefully that this shift is purely a matter of storage and processing technology; it does not improve data quality. The payload of the telemetry stream is the same as for classical assurance systems; only the transport differs.
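The poll-versus-push difference can be sketched in a few lines. Here a toy device streams samples for exactly the sensor paths the collector subscribed to, once per second instead of once per polling cycle. The path strings and payload shape are illustrative, not a real telemetry schema.

```python
import random
import time

# Toy device that pushes samples for subscribed sensor paths, instead
# of waiting to be polled. A real device would pace the pushes; here
# the pacing is commented out so the sketch runs instantly.
def telemetry_stream(paths, period_s=1.0, samples=3):
    for _ in range(samples):
        for path in paths:
            yield {"path": path, "ts": time.time(),
                   "value": random.random()}
        # time.sleep(period_s)  # a real device would wait period_s here

# The collector subscribes only to the counters it cares about;
# the device filters at the source, unlike a bulk SNMP walk.
subscription = ["interfaces/eth0/in-octets", "interfaces/eth0/out-octets"]
received = list(telemetry_stream(subscription))
```

The key point the sketch makes: the payload per sample is the same device-centric data as before; only the direction (push) and the rate (seconds apart) have changed.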
What are called FM (fault management) and PM (performance management) systems in classical assurance are now called analytical systems. Basically, this is a new term for the same thing.
Telemetry systems need to be integrated with each telemetry interface, protocol, and payload, so there is still a need for adaptors. Nor do telemetry-based systems add any underlying principle that makes the system more service-aware.
Most AI/ML-based systems fall into this category as well: machine learning algorithms (mathematics) are used to analyze the streaming, device-oriented data. Machine learning can be useful for categorizing and filtering resource-oriented data.
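To make "categorize and filter" concrete, here is a minimal anomaly filter over a counter series: flag any sample that deviates more than three standard deviations from the history seen so far. Real analytical systems use far richer models; the series below is invented for illustration.

```python
import statistics

# Invented stream of device counter samples; the 250 is a spike.
series = [100, 102, 98, 101, 99, 103, 100, 250, 101, 99]

def flag_anomalies(samples, z=3.0):
    """Return indices of samples more than z standard deviations
    away from the mean of all preceding samples."""
    anomalies = []
    for i in range(2, len(samples)):
        history = samples[:i]
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history) or 1.0  # avoid zero sigma
        if abs(samples[i] - mu) > z * sigma:
            anomalies.append(i)
    return anomalies

flagged = flag_anomalies(series)  # [7] — only the 250 spike is flagged
```

Note that the output is still resource-oriented: the filter says "this counter on this device looks odd", not "this customer's service is degraded", which is exactly the limitation the text describes.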
Passive Probes
In contrast to the above tools, passive probes sit on the data plane: they listen to the actual traffic and capture the packets, so they do not depend on what is published on the management plane. Passive probes need to be located at a central point in the network, or in the data center, in order to see the traffic they are meant to study. The probes can calculate statistics on the distribution of protocols, the number of failed logins, and so on. A serious challenge for passive probes going forward is the steadily increasing use of encryption, which breaks the capabilities of passive probing.
Passive probes are useful as after-the-fact tools. They are helpful in elucidating why there was a problem in the past, and sometimes why a specific user had issues. However, by definition they cannot be proactive since they depend on the current user traffic. They also see the network from a central point, not from the point of service usage.
Unlike classical or telemetry-based solutions, passive probes do not need any specific device integration since they connect on the data plane.
With passive probes the data is not focused on individual devices but rather on the actual traffic that flows from users or applications. You can get an understanding of traffic patterns, load and statistics from the data plane as monitored from the central location.
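As a toy illustration of the kind of statistics a passive probe derives, the sketch below computes a protocol distribution from a handful of hypothetical decoded packet headers. The packet list and field names are made up; a real probe would decode frames captured from a mirror/SPAN port or tap.

```python
from collections import Counter

# Hypothetical decoded packet headers, as a passive probe might see
# them on a mirror port; in practice these come from a capture library.
packets = [
    {"proto": "TCP", "dst_port": 443},
    {"proto": "TCP", "dst_port": 443},
    {"proto": "UDP", "dst_port": 53},
    {"proto": "TCP", "dst_port": 22},
]

def protocol_distribution(pkts):
    """Share of observed packets per transport protocol."""
    counts = Counter(p["proto"] for p in pkts)
    total = sum(counts.values())
    return {proto: n / total for proto, n in counts.items()}

dist = protocol_distribution(packets)  # {'TCP': 0.75, 'UDP': 0.25}
```

Notice that the computation depends entirely on what is visible in the captured headers: once payloads (and increasingly headers) are encrypted, there is simply less for the probe to count.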
Active Test and Monitoring Systems
Active test and monitoring systems use traffic generators placed at strategic locations in the network. These generate synthetic traffic on all layers: UDP and TCP sessions, VoIP calls, DNS requests, etc. In this way the system can calculate service KPIs directly. Since the traffic is synthetic, there is no issue with encryption or with the confidentiality of real user data. The traffic generators can also interact with standardized reflector functions in existing devices, such as TWAMP. Active monitoring systems can attain a high degree of precision with hardware timestamping and real throughput tests at line speed.
A challenge for active test systems is the management of the traffic generators. In old-fashioned systems these were hardware-based and expensive; in modern systems they are pure software with only a small footprint, and the management system orchestrates them.
Since an active assurance system sends traffic continuously, it is proactive: for example, it can tell beforehand that VoIP calls will be bad and trigger remedying actions. The continuous measurement results are also stored, enabling reactive/historical analysis.
It is very important to understand that each traffic generator acts as a client on the network: it gets an IP address and sends traffic like an end user, as instructed by the management system.
Active systems do not depend on inventory or topology systems to calculate service KPIs: their packets are routed by the network like any other packets, so the management system does not need to map individual device KPIs to a service topology. There is also no need for adaptors, since the traffic generators sit on the data plane.
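To show why no topology mapping is needed, here is a minimal sketch of how service KPIs fall directly out of synthetic send/receive records. The timestamps are made up; in a real system they come from the traffic generator and its reflector.

```python
# Synthetic probe results: sequence number -> (sent_ts, received_ts),
# with None for packets that never came back. Timestamps in seconds;
# the values below are invented for illustration.
results = {
    1: (0.000, 0.020),
    2: (0.100, 0.135),
    3: (0.200, None),   # lost packet
    4: (0.300, 0.325),
}

def service_kpis(samples):
    """Packet loss ratio and average round-trip time (ms) computed
    directly from the synthetic traffic, with no device data needed."""
    rtts = [rx - tx for tx, rx in samples.values() if rx is not None]
    loss = 1.0 - len(rtts) / len(samples)
    avg_rtt_ms = 1000.0 * sum(rtts) / len(rtts)
    return loss, avg_rtt_ms

loss, avg_rtt_ms = service_kpis(results)  # 25% loss, ~26.7 ms avg RTT
```

Because the KPIs are measured on the same path the user's packets would take, they answer "is the service working?" directly, rather than being inferred from per-device counters.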
And Which Pet Do I Need?
You need three of the four (as the two cats are basically the same thing)…
As the analogy above describes, the two cats are the same species with very few differences: classical and telemetry-based solutions are essentially the same except for some underlying technology and terminology. You need these to monitor your infrastructure. Passive probes, the sheatfish, will help you analyze what is going on in the network and conduct some post-mortem analysis. But in order to focus on what matters for customers – “Is my service working?” – you need active assurance, the bumble bee. This is also the only way to catch problems before customers and users are impacted.
We see most organizations making unbalanced investments in the assurance space, with excessive effort directed towards classical and telemetry-based solutions. In order to be proactive and focus on the services that shape user and customer experience, it is imperative to invest more in active systems.