The Danger in “Just Data”

In recent years, data analytics has become essential for many markets, including industrial manufacturing.

When allied with domain knowledge, analytics can be key to finding the sources of uptime losses and margin leakage. However, results can prove sensitive to the context of the data, and data analysis stripped of that context can produce faulty outcomes.

Why you need guiderails

I would like to be able to tell you (as the CTO of a startup machine learning company once told me), “just give me the data and I will sort out the problems.” Unfortunately, it does not work like that. Data analysis techniques, including machine learning, are portable across industries; domain knowledge is not – and you need both to succeed.

An analytical solution must correctly separate causation from simple correlation and alert only on true impending issues. But data analysis, including machine learning, is not a silver bullet. Only with ‘guiderails’ can analysis techniques find correct answers. Otherwise, spurious correlations emerge, such as the famous one proclaiming that increased consumption of margarine leads to divorce in the state of Maine. The guiderails come from domain knowledge, translated into contextual data limits that establish reasonable expectations of behaviour and exclude the meaningless correlations that machine learning can find when working in isolation.

Machine learning will find all manner of correlations in data, and some of them are meaningless. Understanding causation requires knowledge and experience. What time, skills and experience will you need to attempt a solution, how long will it take, and will it scale? In a sense, machine learning can only go so far. Using “clustering” techniques, unsupervised learning algorithms can detect and learn recurring patterns. Indeed, in the oil and gas sector, clustering can learn what normal operational behaviour looks like from the signals coming from sensors on and around machines. Any deviations from normal, called anomalies, then highlight operational issues with a piece of equipment.
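To make that concrete, here is a minimal sketch of clustering-based anomaly detection on machine sensor data, assuming scikit-learn. The feature layout, the number of clusters and the 99th-percentile distance threshold are illustrative assumptions, not a description of any particular product.

```python
# Minimal sketch of clustering-based anomaly detection on sensor data.
# Assumes scikit-learn; features, cluster count and threshold are illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Historical readings captured during known-normal operation
# (rows = time samples, columns = e.g. vibration, bearing temperature, load).
normal_data = np.random.default_rng(0).normal(size=(5000, 3))

scaler = StandardScaler().fit(normal_data)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaler.transform(normal_data))

# Distance from each normal sample to its nearest cluster centre defines "normal".
normal_dist = np.min(model.transform(scaler.transform(normal_data)), axis=1)
threshold = np.percentile(normal_dist, 99)  # tolerance chosen purely for illustration

def is_anomalous(sample: np.ndarray) -> bool:
    """Flag a new sensor sample whose distance to every learned cluster exceeds the threshold."""
    dist = np.min(model.transform(scaler.transform(sample.reshape(1, -1))), axis=1)[0]
    return dist > threshold
```

The point of the sketch is that the clusters define “normal” purely from the data, while the choice of which sensors to include and what counts as a meaningful deviation still comes from domain knowledge.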

Another machine learning technique, supervised learning, requires a human to declare an event as the time and date when something happened. Machine learning has no understanding of what happened, only the date and time; it takes domain knowledge and an understanding of the data context to attach meaning to the event. But once an event is declared, the model learns the signature of the precise patterns that led up to the event. For example, in heavy industry an event could be a machine failure due to an exact cause such as a bearing failure. With its learned knowledge of the precise degradation and failure pattern, the AI then tests new incoming patterns to discover recurrences well before the failure occurs. Such early notification allows action to avoid the degradation completely, or provides time to arrange a repair before major damage occurs. The results are much lower maintenance costs and more uptime producing valuable products.
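The sketch below shows, under simplified assumptions, how declared events can drive supervised learning: plant engineers supply the failure timestamps, samples in the window leading up to each event are labelled as pre-failure, and a classifier learns that signature. The 24-hour window, the column layout and the random-forest choice are illustrative placeholders, not the method described above in any exact form.

```python
# Minimal sketch of supervised early-warning detection from declared events.
# Assumes scikit-learn and a time-indexed pandas DataFrame of sensor readings.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def label_windows(sensors: pd.DataFrame, failure_times, horizon: str = "24h") -> pd.Series:
    """Label each sample 1 if a declared failure occurs within `horizon`, else 0."""
    labels = pd.Series(0, index=sensors.index)
    for t in failure_times:  # timestamps a domain expert declared as bearing failures
        labels.loc[t - pd.Timedelta(horizon): t] = 1
    return labels

def train_early_warning(sensors: pd.DataFrame, failure_times):
    """Fit a classifier on the labelled pre-failure windows versus normal operation."""
    y = label_windows(sensors, failure_times)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(sensors.values, y.values)
    return clf  # clf.predict_proba(new_samples) then gives a failure-risk score ahead of the event
```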

At a plant site, expert staff understand the relationships between machine behaviour and subsequent degradation mechanisms, and they provide that insight to direct machine learning towards the proper causation patterns. In addition, we are discovering that our complex first-principles and empirical models can forecast the likely ‘neighbourhood’ of specific results and consequently can also provide guiderails for machine learning to discover exact patterns of degradation. All in all, data context is crucial to correctly label events, select variables and direct the data cleanup. Effective solutions always require marrying what you know about the processes emitting the data with expertise in the analytical techniques. Thus, the guiderails must be tough and robust.
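A minimal sketch of that idea might look like the following, with the model forecast and the 5% tolerance standing in as placeholders for a real first-principles or empirical model and its engineered band.

```python
# Minimal sketch of using an engineering model's forecast "neighbourhood" as a
# guiderail around a machine-learning detector. The forecast value and the 5%
# tolerance are placeholders, not a real plant model.

def within_guiderails(observed: float, model_forecast: float, tolerance: float = 0.05) -> bool:
    """True if the observation sits inside the band the engineering model expects."""
    low, high = model_forecast * (1 - tolerance), model_forecast * (1 + tolerance)
    return low <= observed <= high

def credible_anomaly(observed: float, model_forecast: float, ml_flagged_anomaly: bool) -> bool:
    """Raise an alert only when the ML detector fires AND the value sits outside
    the neighbourhood the engineering model forecast."""
    return ml_flagged_anomaly and not within_guiderails(observed, model_forecast)
```

The design point is that neither signal is trusted alone: the engineering model supplies the expectation, and the data-driven detector supplies the sensitivity.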

Working it through in practice

So how does all this work in practice? Take a two-phased approach. First, do the engineering: learn about the process producing the data, correctly label the important events, and perhaps compute derived constraints such as known physical limits. Use this information as guiderails to cleanse the data and the subsequent event patterns with an understanding of operating modes, as in the sketch below. Then, when the engineering effort is done, switch into data scientist mode.
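Here is a minimal sketch of that engineering phase, assuming pandas; the tag names, physical limits and operating-mode rule are illustrative assumptions rather than recommended values.

```python
# Minimal sketch of the "engineering first" phase: cleanse data with physical
# limits and keep only the operating mode of interest before any modelling.
import pandas as pd

PHYSICAL_LIMITS = {                 # known physical limits supplied by plant engineers
    "bearing_temp_C": (-20.0, 150.0),
    "vibration_mm_s": (0.0, 50.0),
}

def apply_guiderails(df: pd.DataFrame) -> pd.DataFrame:
    clean = df.copy()
    # Drop readings that violate known physical limits (sensor faults, bad tags).
    for col, (lo, hi) in PHYSICAL_LIMITS.items():
        clean = clean[clean[col].between(lo, hi)]
    # Keep only the relevant operating mode, e.g. exclude shutdowns and start-ups.
    clean = clean[clean["motor_speed_rpm"] > 100]
    return clean
```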

Once there, you have supplied the data context, and the algorithms no longer care about your particular problem domain. In the analytical depths, the data, algorithms and patterns do not know whence they came: it’s just data. Scales, engineering units and data sources can be diverse, and they do not matter. In this context, we do not strictly need the rigour of engineering models and their implied complex differential equations.
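As a small illustration of why units stop mattering once the data has been framed, standardisation puts every tag on the same dimensionless scale. The example below assumes scikit-learn and uses made-up readings.

```python
# Minimal sketch: after standardisation the algorithm sees comparable numbers,
# not degrees, millimetres per second or rpm. Readings are illustrative only.
import numpy as np
from sklearn.preprocessing import StandardScaler

readings = np.array([
    [85.0, 12.3, 3200.0],   # bearing temp (°C), vibration (mm/s), speed (rpm)
    [90.0, 14.1, 3150.0],
    [78.0,  9.8, 3400.0],
])

# Each column now has zero mean and unit variance, regardless of its original units.
scaled = StandardScaler().fit_transform(readings)
print(scaled)
```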

In summary, the data input guiderails do matter. It always takes carefully “framed” data sets to secure precise outcomes, and understanding is what frames data with context. So, learn the pertinent process details for each solution and then transition from engineering to data science using the guiderails.

About the Author

Mike Brooks, Senior Director, APM Consulting, AspenTech. Executive leadership in industrial software start-ups; invests in and incubates ventures for technology transfer. Led an initiative for IT based on work process. Raised $15MM+ for an IT startup, uncovering a $500MM+ market… acquired by a multi-national. Led acquisition strategy and technical/business analysis. Developed start-up businesses and raised capital. Rescued a failing software division, created a real-time data warehouse, and sold it.

Featured image: ©BAIVECTOR