By: Mats Nordlund, Co-founder & CEO, Netrounds
In telecom, like in most other industries, there is a lot of hype around Big Data, Artificial Intelligence (AI) and Machine Learning. This blog shares a couple of key success factors you should consider whenever you are going to apply these technologies to fix your broken service assurance solutions. Yes, broken assurance solutions. Keep on reading and you will see what is broken.
The key takeaway of this blog is that, instead of putting your trust in ever-growing big data lakes of dirty, noisy data, you need access to high-quality and relevant data in order to succeed with your AI efforts within the service assurance domain.
How service assurance (does not) work today
Service assurance is a very broad topic, with a wide definition. Most often, service assurance solutions are overcomplicated as a consequence of being patched up for the last 20 years. However, the main task for service assurance is very simple: To ensure that the paying customer, or corporate employee, is happy.
Still, this is a common situation:
Here, we have a frustrated user experiencing a sluggish network, being unable to use the services he depends on for his everyday life.
What makes this situation even more frustrating is that at the same time, the customer support and operations teams show a happy face with a friendly smile and report that there are no problems — “everything works just fine!”. Is this situation familiar to you?
Clearly, service assurance is broken. But why do we have this situation? Let us take a look at what customer support sees, and what the operations team sees:
The first problem is that there are way too many alarms. Thousands, or tens of thousands of alarms. Most of them totally irrelevant for customers. But the main problem is that there is no specific operations alarm pinpointing our frustrated customer.
To make the situation even worse, operational efficiency suffers severely, as the operations teams are often misguided by items on the alarm list that do not have any effect on customers at all. This unnecessary work takes focus away from making customers happy.
As shown in the example above, the operations team spends time on a crashed router, but no customer is having an issue because of it. In a real scenario there would normally be a redundant IP path for traffic connected through the router, or even a secondary router taking over. Also, alarm levels are often assigned by the devices themselves, without overall coordination. This means that an alarm can show up with critical severity just because that severity was once hard-coded in the device.
At the core of the problem is the lack of visibility into real customer service quality – resulting in the operations team not working on the right issues. The underlying reason is that the telecom industry, for the last 30 years, has been building service assurance solutions that primarily focus on the health of the underlying device and infrastructure.
Big Data, Artificial Intelligence and Machine Learning
Given the sheer number of operational alarms, there is a growing belief in the industry that big data, artificial intelligence and machine learning will come to the rescue and solve the problems described above. Therefore, it is important to demystify these new technologies to set the right expectations.
Generic Artificial Intelligence, shown to the left in the diagram above, is what the literature calls a machine that can handle unexpected and arbitrary situations and think and function as a human does. In the telecom domain, this would be an AI mystery box that takes in all device and infrastructure data, referred to as telemetry, and uses AI to answer any and all operational and business support questions. This is still pure fiction.
It is interesting to notice that telemetry is marketed as a new powerful data source with high quality, but it is in fact just the same content earlier accessible through syslog and SNMP. Telemetry just uses another subscription and transport method.
In order for AI to really work, it is important to target only a few specific questions to be answered. The second important input is relevant and high-quality data. With these two inputs, combined with an appropriate mathematical function, it is possible to develop an AI solution targeted towards a narrow and specific use case. This is what the literature refers to as Narrow or Specific AI, and is shown to the right in the diagram above.
Specific AI can be brilliant under the right circumstances: Using relevant high-quality data as input, combined with one or a few very specific questions. We see this in our everyday lives when Spotify and Netflix recommend songs and movies that we probably would like.
Same old boring mathematical methods
There is a belief that the “AI box” is something magic, when in fact it consists of pure mathematical functions used to train the machine learning models. The math is well defined and well known, and draws on four disciplines: probability, statistics, calculus and linear algebra. All of them have been around since at least the 1950s. However, recent advancements in CPU/GPU processing power and the capability to manage big data training sets have made it possible to train these machine learning algorithms to a completely new level, using many more layers of neurons in the underlying neural networks.
When you have the right data and a few specific questions to answer, you first need a data scientist, then you need to pick the right algorithm for the job, and finally you need to tune it to give the answers.
An example of Specific AI: Predictive Analysis
Netrounds teamed up with data scientists at Elastisys to answer one specific question: “Can latency issues above a specific threshold be predicted one hour before they happen?”.
The red line shows the real packet jitter and the blue the predicted value, based on input from active measurements from layer 2 to layer 7 in a service provider’s network. The model in this example has been carefully developed by Elastisys data scientists combined with Netrounds domain expertise. The model uses linear regression, taking seasonality into consideration, with runtime retraining and retuning to constantly reduce model errors.
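To make the idea concrete, here is a minimal sketch of a seasonal linear regression that forecasts a KPI one hour ahead and compares the forecast with a threshold. It is not the Elastisys/Netrounds model: the 5-minute sampling interval, the daily sine/cosine seasonality terms, the 30 ms threshold and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

STEP_MIN = 5                      # assumed sampling interval of the measurements
HORIZON = 60 // STEP_MIN          # one hour ahead, expressed in samples
PERIOD = 24 * 60 // STEP_MIN      # samples per day (assumed daily seasonality)
THRESHOLD_MS = 30.0               # hypothetical alerting threshold

def features(values, idx):
    """One regression row per index in idx, used to predict values[idx + HORIZON]."""
    target_phase = 2 * np.pi * ((idx + HORIZON) % PERIOD) / PERIOD
    return np.column_stack([
        np.ones(len(idx)),        # intercept
        np.sin(target_phase),     # time of day at the forecast target
        np.cos(target_phase),
        values[idx],              # most recent measured value
    ])

def fit(values):
    """Ordinary least squares fit of the one-hour-ahead model."""
    idx = np.arange(len(values) - HORIZON)
    coef, *_ = np.linalg.lstsq(features(values, idx), values[idx + HORIZON], rcond=None)
    return coef

def predict_next_hour(values, coef):
    """Forecast the value HORIZON samples after the last measurement."""
    idx = np.array([len(values) - 1])
    return float(features(values, idx)[0] @ coef)

# Synthetic week of latency samples: a daily cycle plus measurement noise.
rng = np.random.default_rng(0)
t = np.arange(7 * PERIOD)
latency_ms = 20 + 8 * np.sin(2 * np.pi * t / PERIOD) + rng.normal(0, 1, t.size)

coef = fit(latency_ms)
forecast = predict_next_hour(latency_ms, coef)
alert = forecast > THRESHOLD_MS   # True would mean: warn operations one hour ahead
```

Calling `fit` again on a sliding window of recent samples would be one simple way to approximate the runtime retraining mentioned above.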
This is a good example where Specific AI helps operations teams help frustrated customers.
When and why AI fails
There are many examples where AI fails. A typical one is a setup in which too many generic questions are combined with input data that has little or no correlation with those questions, as illustrated below. Such a setup is doomed to fail, as no mathematical function in the world can save the situation.
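One way to spot such a mismatch early is to check how strongly each candidate input correlates with the quantity you want to answer questions about. The sketch below is only an illustration: the function name `screen_inputs`, the 0.3 cut-off and the synthetic series are assumptions, and Pearson correlation captures linear relationships only.

```python
import numpy as np

def screen_inputs(inputs, target, min_abs_corr=0.3):
    """Keep inputs whose absolute Pearson correlation with target reaches min_abs_corr."""
    kept = {}
    for name, series in inputs.items():
        r = float(np.corrcoef(series, target)[0, 1])
        if abs(r) >= min_abs_corr:
            kept[name] = r
    return kept

# Synthetic example: one input tracks the customer KPI, the other is unrelated.
rng = np.random.default_rng(1)
n = 500
customer_latency_ms = rng.normal(25, 5, n)
inputs = {
    "active_probe_latency_ms": customer_latency_ms + rng.normal(0, 2, n),
    "fan_speed_rpm": rng.normal(3000, 200, n),
}

relevant = screen_inputs(inputs, customer_latency_ms)
```

In this synthetic case only the probe latency survives the screening; the fan speed carries no information about the customer's experience, however much of it is collected.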
Using more noisy data is not the solution
There is a big misunderstanding in our industry that the broken service assurance solutions can be fixed by simply throwing in more data from devices into what is commonly called big data lakes.
The solution is not to use more noisy data, but instead to use the right data. This is further confirmed by a large survey of 7000 data scientists, carried out by Kaggle last year.
In this survey, the respondents were asked about reasons why their AI/ML efforts failed. As seen, in nearly half of the cases, the reason was dirty data. Dirty data means unreliable, irrelevant or unusable data. Also, in one third of the cases the reason was either unavailable input data or the lack of a clear question to answer. All three of these areas are fundamental and critical for Specific AI solutions to work. Note that respondents did not point to a lack of proper mathematical functions to train and use, or to a shortage of processing power.
Device-related AI can be useful – but will not help the frustrated customer
Existing systems for fault and performance management can provide a useful source of input data to AI and machine learning that targets device-related questions.
A good example could be “In how many weeks will I need to increase capacity on this specific link?” or “I have a network outage; which similar situations have occurred in the past?”. Such questions are useful for planning and debugging. However, they do not help the frustrated customer.
It is important not to give in to the temptation to ask customer-specific questions using device-related input data.
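The capacity question can be sketched as a simple trend extrapolation. The example below assumes weekly peak utilization samples expressed as a fraction of link capacity and an 80% upgrade threshold; both the function name and the numbers are hypothetical, and a straight-line fit is only a first approximation.

```python
import numpy as np

def weeks_until_threshold(weekly_peak_util, threshold=0.8):
    """Weeks from the last sample until a fitted linear trend reaches threshold.

    Returns 0.0 if the last sample is already at or above the threshold,
    and None if the fitted trend is flat or falling.
    """
    util = np.asarray(weekly_peak_util, dtype=float)
    weeks = np.arange(util.size)
    slope, intercept = np.polyfit(weeks, util, 1)   # least-squares straight line
    if util[-1] >= threshold:
        return 0.0
    if slope <= 0:
        return None
    crossing_week = (threshold - intercept) / slope
    return max(0.0, crossing_week - weeks[-1])

# Utilization growing by two percentage points per week from 50%.
history = [0.50, 0.52, 0.54, 0.56, 0.58, 0.60]
remaining = weeks_until_threshold(history)   # roughly 10 weeks for this series
```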
New input data from active testing and monitoring
In order to answer customer-specific questions, there is a need for new input data coming from active testing and monitoring. Measured from end-user locations, across domains and across all protocol layers from L1 to L7, the resulting KPIs are the metrics that matter most to the customer, such as one-way latency, TCP throughput, and voice and video quality.
As active test traffic is sent on the same data path as the customer’s traffic, it also gives visibility into the service quality across unmanaged, off-net parts of the connection. Traditional fault and performance management systems are completely blind here. With input data this closely related to what the customer experiences, it is likely that you can find and tune a machine learning function that maps the input data set to the output data set.
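As a rough illustration of how such KPIs can be derived from active test traffic, the sketch below computes one-way latency, jitter and packet loss from per-packet send and receive timestamps. It is not a description of any product's implementation: the function name and data layout are hypothetical, sender and receiver clocks are assumed to be synchronized, and jitter is simplified to the mean absolute difference between consecutive delays.

```python
from statistics import mean

def one_way_kpis(sent, received):
    """Compute KPIs from {sequence_number: timestamp_in_seconds} dictionaries.

    Assumes synchronized clocks at both ends and at least one sent packet.
    """
    delays_ms = [
        (received[seq] - sent[seq]) * 1000.0
        for seq in sorted(sent)
        if seq in received                      # lost packets have no receive time
    ]
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return {
        "latency_ms": mean(delays_ms) if delays_ms else None,
        "jitter_ms": mean(diffs) if diffs else 0.0,
        "loss_pct": 100.0 * (len(sent) - len(delays_ms)) / len(sent),
    }

# Four test packets sent 20 ms apart; packet 3 never arrives.
sent = {1: 0.000, 2: 0.020, 3: 0.040, 4: 0.060}
received = {1: 0.010, 2: 0.032, 4: 0.071}
kpis = one_way_kpis(sent, received)
```

Series of KPIs like these, collected continuously from end-user locations, are the kind of relevant input data a Specific AI model can be trained on.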
Where to go from here?
In order to fix your broken service assurance solutions, start by bringing your customer care and operations teams together to identify the five most important questions you need to be able to answer. Customer-oriented questions will likely be central.
Then, engage with domain experts and data scientists to look at your existing service assurance solutions and see if they can provide the relevant and high-quality data required to answer those questions. If they cannot, it is time to reconsider where you spend your license money for service assurance. Investing in (yet another) big data platform to store dirty, noisy and useless device-level data is not the solution.
To learn more, download and read our excellent white paper on “Big Data or Small Data.”