Data Quality vs Data Quantity: What’s More Important for AI?

Businesses are enjoying a data revolution

The sheer volume of data being collected from Internet of Things devices has skyrocketed in recent years, and sophisticated artificial intelligence and machine learning tools are able to extract valuable information that would otherwise go unnoticed. However, this glut of data has raised a critical question: How important is the quality of the data being collected?

Artificial intelligence and machine learning can provide remarkable insight. However, AI can’t distinguish between good data and bad data on its own, and the algorithms powering AI can only assume the data being analyzed is reliable. Bad data, at best, will produce results that aren’t actionable or insightful. But there’s an even bigger concern: Bad data can lead to results that are misleading. In addition to the time and money wasted analyzing bad data, AI systems can encourage a company to take steps that are even more wasteful.

Martha Bennett at Forrester believes deriving meaningful insights from data is key to staying competitive.

One concern that often arises in statistics is the erroneous signal. A small bias in a sensor, for example, can cause AI systems to see an effect that isn’t real. The likelihood of a system picking up on an errant signal rises with the volume of data collected: a tiny bias that would be lost in the noise of a small sample can look like a clear, statistically significant effect at the volumes of data common with today’s machine learning systems. Even data of reasonably high quality can therefore lead to erroneous results, potentially leading companies down an unproductive path. This is part of the reason why data scientists are in such high demand. Their ability to implement the right algorithms is clearly important, but it also takes human judgment to make sense of the results AI systems produce. Determining whether a signal is a real effect can be a challenging task.
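To make the point about volume concrete, here is a minimal sketch in Python. It assumes a hypothetical sensor whose readings carry a constant bias of 0.01 units on top of random noise with a standard deviation of 1.0; both numbers, and the function name `z_score_for_bias`, are made up for illustration. It computes the expected z-score of the sample mean, which is one common way to judge whether an average differs from zero by more than chance.

```python
import math


def z_score_for_bias(bias: float, noise_std: float, n_samples: int) -> float:
    """Expected z-score of the sample mean when every reading carries a constant bias."""
    # The standard error of the mean shrinks as the sample grows.
    standard_error = noise_std / math.sqrt(n_samples)
    return bias / standard_error


# Hypothetical sensor: bias of 0.01 units, noise with standard deviation 1.0.
for n in (100, 10_000, 1_000_000):
    z = z_score_for_bias(bias=0.01, noise_std=1.0, n_samples=n)
    # 1.96 is the usual two-sided 5% threshold for a z-score.
    verdict = "looks significant" if abs(z) > 1.96 else "lost in the noise"
    print(f"n={n:>9,}: z = {z:5.2f} ({verdict})")
```

The bias never changes, yet it moves from invisible at a hundred readings to an apparently strong effect at a million. The arithmetic alone can’t say whether that effect is real or an artifact of the sensor, which is where human judgment comes in.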

The power of machine learning is largely due to its ability to learn on its own. In order to get started, however, ML systems need to be trained with a set of data, and this data set needs to be of especially high quality, as even small problems can spoil the algorithms from the beginning. ML often works best when it’s left alone; tweaking the results manually can introduce bias and other problems. However, it’s important to carefully note how the ML system was trained and what data set was used. If problems arise later on, being able to examine the original data can be essential.
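One simple way to keep track of what a system was trained on is to store a small record alongside the model. The sketch below is only an illustration: the function name `fingerprint_training_data`, the field names, and the sample rows are all hypothetical, and a real project might rely on a dedicated data versioning tool instead. It uses Python’s standard `hashlib` module to compute a SHA-256 fingerprint of the training rows, so that any later change to the data can be detected.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint_training_data(rows: list[str], source: str, notes: str = "") -> dict:
    """Build a small provenance record for a training set (illustrative only)."""
    digest = hashlib.sha256()
    for row in rows:
        # Hash each row plus a separator so row boundaries affect the result.
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return {
        "source": source,
        "row_count": len(rows),
        "sha256": digest.hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "notes": notes,
    }


# Hypothetical training rows and file name.
record = fingerprint_training_data(
    rows=["sensor_a,21.4", "sensor_b,21.9"],
    source="hypothetical_readings.csv",
    notes="initial training set",
)
print(json.dumps(record, indent=2))
```

If problems show up later, recomputing the fingerprint on the archived data confirms whether it is the same data set the system originally learned from.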

Relying on AI is important for a growing number of businesses. However, it can be tempting to use AI even when it’s not the appropriate choice, such as when there simply isn’t enough high-quality data for systems to analyze. Before launching an AI project, it’s important to examine the data itself and determine if quality results are even possible. AI systems all have their limitations, and none can make up for a lack of good data to analyze. Again, human expertise is essential. Data scientists and other statistics experts know how to examine data and find out what type of analysis is appropriate.
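A first look at the data doesn’t have to be elaborate. As a rough sketch, the Python below counts how many records are missing a required value and how many are exact repeats. The function name `profile_records`, the field names, and the sample readings are assumptions made for this example, and what counts as an acceptable share is a judgment call for each project.

```python
def profile_records(records: list[dict], required_fields: list[str]) -> dict:
    """Report the share of incomplete and duplicated records (illustrative only)."""
    total = len(records)
    if total == 0:
        return {"total": 0, "incomplete_share": 0.0, "duplicate_share": 0.0}

    # A record is incomplete if any required field is missing or empty.
    incomplete = sum(
        1
        for record in records
        if any(record.get(field) in (None, "") for field in required_fields)
    )

    # A record is a duplicate if its required fields repeat an earlier record.
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(record.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    return {
        "total": total,
        "incomplete_share": incomplete / total,
        "duplicate_share": duplicates / total,
    }


# Hypothetical device readings.
readings = [
    {"device": "a", "temp": 21.4},
    {"device": "b", "temp": None},
    {"device": "a", "temp": 21.4},
    {"device": "c", "temp": 22.0},
]
print(profile_records(readings, ["device", "temp"]))
```

A check like this only surfaces obvious gaps. Deciding whether the remaining data can support the intended analysis still calls for someone who understands both the data and the business question.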

In general, more data leads to better results. Eventually, however, there comes a point where additional data no longer helps, because the data set is already broad enough to get the most out of AI and ML systems. Because data storage and processing power are cheap, it is easy to miss this point and keep gathering data that isn’t needed. Over time, however, costs can creep up and eventually become unsustainable. Cloud storage exacerbates the problem by putting additional storage space only a few clicks away. Before feeding more data into AI and ML systems, organizations should take time to determine all of the associated costs and ask whether doing so is worthwhile. If AI and ML systems are already saturated with data, it may make more sense to cut back rather than expand.
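One rough way to spot saturation is to measure a model’s quality at several data set sizes and watch for the point where the gains flatten out. The Python sketch below assumes such measurements already exist; the function name `saturation_point`, the sizes, the scores, and the 0.005 threshold are all invented for illustration and would need to be chosen for the project at hand.

```python
from typing import Optional


def saturation_point(
    sizes: list[int], scores: list[float], min_gain: float = 0.005
) -> Optional[int]:
    """Return the data set size after which the next increase in size
    improves the score by less than min_gain, or None if gains never flatten."""
    for i in range(1, len(sizes)):
        gain = scores[i] - scores[i - 1]
        if gain < min_gain:
            return sizes[i - 1]
    return None


# Hypothetical accuracy measured at growing data set sizes.
sizes = [1_000, 10_000, 100_000, 1_000_000]
scores = [0.71, 0.84, 0.90, 0.902]

print(saturation_point(sizes, scores))
```

With these made-up numbers, going from 100,000 to 1,000,000 records adds only 0.002 to the score, so the function reports 100,000 as the point where extra data stopped paying off. Whether a gain that small justifies ten times the storage and processing is a cost question as much as a technical one.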

Data is driving today’s tech fields, and there’s no sign of this trend slowing down in the near future. However, it’s important to use the right tools when analyzing data to make the most of it, as misusing data can be wasteful or even dangerous. Before feeding more and more data to AI and ML systems, take some time to determine if there are ways to improve overall quality. A bit of data quality improvement can go a long way toward making the most of AI and ML systems.