Technical support organizations responsible for SaaS, on-premises or appliance-based products face a barrage of demands that is increasingly difficult to manage.
On one hand, it is harder than ever to recruit and retain experienced, skilled professionals willing to work under intense time and workload pressures. On the other hand, customer expectations keep growing, the impact of downtime continues to escalate and churn is a growing risk, since customers will readily switch to competing products if the experience disappoints.
Support teams in SaaS are often focused on observability dashboards, monitoring health and performance metrics and looking for any indications of a problem. The easiest trouble tickets, such as the ones stemming from a customer mistake or lack of knowledge, are quickly triaged through the use of common problem responses or checklists. Recent years have also seen adoption of NLP tools that can match ticket notes with known issues and resolutions in a knowledge base.
More complex technical problems can also be resolved faster by looking for matches against signatures of known problems. Although this approach helps productivity, it requires a big investment in manually creating and updating rules and signatures as new problem types are discovered and resolved. The effort required to build, verify and maintain the signatures so that they work reliably for every new software version is immense. There is almost always a growing backlog of signatures.
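As a rough illustration of the signature approach described above, here is a minimal Python sketch. It assumes a signature is simply a regular expression paired with a known resolution; the `Signature` class, the two example entries and the `match_signatures` function are all hypothetical and not taken from any particular product.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """A hand-written rule for one known problem (hypothetical structure)."""
    name: str        # short label for the known problem
    pattern: str     # regex that must appear in at least one log line
    resolution: str  # canned guidance for the support engineer


# Illustrative entries only; a real catalog is built and maintained by hand.
SIGNATURES = [
    Signature("disk_full", r"No space left on device",
              "Free disk space or expand the volume."),
    Signature("db_conn_refused", r"connection refused.*:5432",
              "Check that the database is running and reachable."),
]


def match_signatures(log_lines, signatures=SIGNATURES):
    """Return every signature whose pattern occurs in any of the log lines."""
    compiled = [(sig, re.compile(sig.pattern, re.IGNORECASE)) for sig in signatures]
    return [sig for sig, rx in compiled if any(rx.search(line) for line in log_lines)]
```

Every new problem type needs a new entry in `SIGNATURES`, and every software release can change the log text a pattern depends on, which is where the maintenance burden comes from.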
At the same time, there is also a class of problems for which signatures can’t or don’t exist. In modern software environments this could be because the problems are newly introduced or infrequent, or because of their nature and complexity and the way they manifest. Although this class of problems might only be a fraction of overall tickets, they can dominate engineering hours and MTTR metrics. Finding the root cause and resolving these problems is a tedious manual process. Often, it is not just a matter of matching a customer symptom with evidence in the metrics or logs. Instead, it might require finding the details that are buried among tens of thousands or even millions of events, across a plethora of data streams.
Troubleshooting such problems requires skill, experience and sometimes educated guessing to narrow the scope and ultimately resolve the problem. A typical approach involves searching logs for errors, looking for new or unexpected events and finding correlations between these. All of this is complicated by the sheer volume and complexity of data that needs to be examined. The pressure to remedy a problem quickly, compounded by the typical shortage of technical staff, makes matters worse. All in all, finding the root cause of many problems is extremely slow and labor intensive.
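The first two manual steps above (searching for errors and spotting new or unexpected events) can be sketched in a few lines of Python. This is a deliberately simple illustration, not a description of how any ML-based tool works: it assumes that masking numbers is enough to turn a log line into a comparable template, and that a log sample from a healthy period is available as a baseline. The function names and the error keywords are hypothetical, and correlating the findings is left out.

```python
import re

# Mask decimal and hex numbers so that lines differing only in IDs compare equal.
_NUM = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+)\b")
# A small, assumed keyword list; real logs need a list tuned to the product.
_ERR = re.compile(r"\b(error|fail(ed|ure)?|exception|fatal)\b", re.IGNORECASE)


def template(line):
    """Reduce a log line to a rough event template by masking numbers."""
    return _NUM.sub("<n>", line).strip()


def unusual_events(baseline_lines, incident_lines):
    """Return (line, is_new, is_error) for incident lines worth a closer look.

    A line is flagged if its template never appeared in the baseline
    or if it contains an error keyword.
    """
    seen = {template(line) for line in baseline_lines}
    findings = []
    for line in incident_lines:
        is_new = template(line) not in seen
        is_error = bool(_ERR.search(line))
        if is_new or is_error:
            findings.append((line, is_new, is_error))
    return findings
```

Even this toy version shows why the manual process scales poorly: with millions of events, the list of flagged lines can itself be very long, and an engineer still has to work out which of them relate to the customer's symptom.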
Fortunately, the benefits of using machine learning for automated root cause analysis are becoming more widely known, especially for solving complex issues, and there is a growing number of accounts of software organizations solving customer product issues more quickly this way while minimizing downtime or disruption. A recently published study shows an impressive accuracy rate of 95.8% for using machine learning to find the root cause of customer problems. With automated root cause analysis, customers stay satisfied due to fast resolution, and at the same time, demands on ops, support and engineering teams decrease. Whether automatically creating signatures for matching a log to a known problem or using ML to hunt among complex interactions showing up in multiple logs, automated root cause analysis provides much needed efficiency and effectiveness.
Of course, engineering organizations are sometimes reluctant to add yet another tool or system to the mix. The latest thinking and best practice is that automated root cause analysis and existing tools and processes need not be two separate things, with separate consoles and dashboards. There is no reason why the functions cannot be integrated, with machine-learning-driven root cause analysis acting as an overlay capability that works through or alongside existing support tools. Toggling between tools no longer has to be an issue, which boosts team productivity and lowers barriers to usage.
The other advantage is that root cause analysis becomes a more natural part of daily procedures and less of a last-resort panic button. It can even evolve from just being a problem solver to also taking on a proactive or preventative role, helping avert trouble or minimize the magnitude of problems. In this way, root cause analysis tools point out potential problems or indicate how to improve some aspect of the system.
Automation using machine learning is now primed to become the new line of defense for software and support teams dealing with high escalation workloads. And, rather than being seen as an escalation or next stage, automated root cause analysis should ideally be an integral part of the systems and procedures for providing a better software experience. By adding automated root cause analysis to existing systems and incorporating its use procedurally, support organizations can streamline their work and gain the greatest effectiveness and efficiency.
About the Author
Ajay Singh is CEO at Zebrium. He is a strong advocate for creating products that “just work” to address real-life customer needs. As Zebrium CEO, he is passionate about building a world-class team focused on using machine learning to build a new kind of log monitoring platform.