Synthetic data is the renewable source we need to accelerate the AI industry

It takes an astonishing 20 weeks to gather and annotate the 100,000 real-world images necessary to train a visual AI system to see and understand the world as humans do.

If you started annotating today, your new year's resolutions would comfortably have been made and broken by the time you finished.

And that's just for one novel task, like training a system to pick out a lost child in a busy shopping mall. It takes even more images to help a delivery robot safely navigate spaces where children are playing.

The data scientists working on these systems can spend up to 80% of their time gathering, cleaning, and manually annotating real-world images for AI systems to digest. That's too long, and it leaves little time for network development or for gleaning insights from the data.

What happens if a project needs the system delivered faster? Or there's a shortage of images? We've recently seen in the UK the real-world consequences of a scarce resource running low: everything grinds to a halt.

Thankfully, there is an alternative, ethical, and endlessly renewable material that can be used to train AI: synthetic data derived from computer-generated images and video. Data produced this way is easily as good as, and sometimes better than, data that comes from real-world images, and working with it cuts the labour-intensive process of gathering and analysing images from months to hours.

Crucially, switching has no impact on the AIs being trained. To an AI, there is no 'real' or 'synthetic'; there's only the data we give it. It's us humans who need to stop seeing synthetic data as some ersatz alternative and start to understand the opportunity in our hands to scale the AI industry exponentially, if engineers are willing to embrace the synthetic data route.

Major players

It doesn't matter if they're start-ups, scale-ups, or global enterprises: teams trying to access enough high-quality images to train a new AI system are competing against the 'Big Four' of Apple, Amazon, Facebook, and Google. Engineers at Google alone have access to more than 4 trillion images stored in Google Photos.

These major players tend to restrict access to this wealth of potential training data because doing so secures the competitive advantage to develop new products and monetise their datasets. They're not totally immune to the issues the rest of the industry faces, though; searching through trillions of images to find the relevant ones is non-trivial, and once found, those images still need annotating.

Every company also has to navigate increasingly enforced data privacy regulations, including the EU's General Data Protection Regulation (GDPR). Just ask Microsoft, which in 2019 deleted its database of 10 million images, at the time the largest publicly available facial recognition dataset, due to data privacy concerns.

These factors combine to create the scarcity of real-world visual data I was talking about, leaving only the very largest tech companies able to compete, driving down competition and, ultimately, the quality of AI systems on the market.

Is synthetic data the AI leveller?

If we want the best AI, the best technology, then we need a competitive landscape made of businesses of all sizes pushing each other forward. Realising that vision requires three things: democratised access to training data, training data that meets privacy regulations, and data that can be annotated faster.

Synthetic data meets all three of these demands. It gives machine learning engineers the ability to create photo-realistic 3D worlds and extract unlimited data to fuel and train their visual AI models. Using synthetic data creation platforms, they can generate the 100,000 high-quality images needed in a couple of days instead of months.
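To make the annotation point concrete, here's a minimal, illustrative Python sketch (deliberately simplified, and not any particular platform's API): because the generator composes the scene itself, the ground-truth annotation falls out of the rendering step for free. A production pipeline would render photo-realistic 3D scenes rather than flat shapes, but the principle is the same.

```python
# Illustrative sketch only: procedurally render simple labelled scenes to show
# why synthetic data needs no manual annotation -- the generator already knows
# the ground truth for every pixel it draws.
import json
import random
from PIL import Image, ImageDraw

def generate_sample(idx: int, size: int = 256) -> dict:
    """Render one synthetic image and return its bounding-box annotation."""
    img = Image.new("RGB", (size, size), color=(200, 200, 200))
    draw = ImageDraw.Draw(img)

    # Place a randomly sized, randomly shaded 'object' in the scene.
    x0, y0 = random.randint(0, size // 2), random.randint(0, size // 2)
    x1 = min(x0 + random.randint(20, size // 2), size - 1)
    y1 = min(y0 + random.randint(20, size // 2), size - 1)
    grey = random.randint(0, 255)
    draw.rectangle([x0, y0, x1, y1], fill=(grey, grey, grey))

    img.save(f"sample_{idx:06d}.png")
    # The annotation is a by-product of generation -- no human labelling step.
    return {"image": f"sample_{idx:06d}.png", "bbox": [x0, y0, x1, y1]}

annotations = [generate_sample(i) for i in range(100)]  # scale up as needed
with open("annotations.json", "w") as f:
    json.dump(annotations, f)
```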

And because the data is computer-generated, there are no privacy concerns, and biases that exist in real-world visual data can be engineered out. In a virtual world, it is much easier to represent different ethnicities, age groups, and variations in clothing colour or sex. And as the real world changes over time, it's far easier to reflect those changes in a virtual environment, keeping data drift from degrading an AI model's performance.
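One common technique behind this kind of control is domain randomisation: scene attributes are drawn from distributions the engineer defines, so balance becomes a design decision rather than an accident of collection. The sketch below is a hypothetical illustration; the parameter names and ranges are assumptions for the example, not any real platform's schema.

```python
# Hedged sketch of domain randomisation: sample controllable scene parameters
# so demographic balance, lighting, and viewpoint are set by design rather
# than inherited from whatever real-world footage happened to exist.
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    age_group: str
    clothing_colour: str
    sun_elevation_deg: float   # lighting: a low sun gives long shadows, dusk scenes
    camera_height_m: float     # viewpoint: drone vs. CCTV vs. body-worn camera

def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one scene configuration; uniform draws give balanced coverage."""
    return SceneParams(
        age_group=rng.choice(["child", "adult", "senior"]),
        clothing_colour=rng.choice(["red", "blue", "dark", "hi-vis"]),
        sun_elevation_deg=rng.uniform(5.0, 85.0),
        camera_height_m=rng.uniform(2.0, 30.0),
    )

rng = random.Random(42)  # seeded for reproducibility
scenes = [sample_scene(rng) for _ in range(10)]
# Shifting these distributions over time is how a virtual pipeline can track
# real-world change and limit data drift in the deployed model.
```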

An upgrade on real-world data

Synthetic data is more than just a substitute for its real-world counterpart; it can be an upgrade. Findings show that synthetic data enhances a machine learning model's accuracy. Last year, McKinsey revealed that 49% of the highest-performing AI companies are already using synthetic data to train their AI models.

On top of this, hundreds of thousands of corner cases or scenarios (camera placement, different lighting conditions, and other variables) that would be hard to capture in the real world can be created quickly and easily in a 3D virtual environment. Extreme, 'nightmare' scenarios, such as gun crime, can also be simulated risk-free to create the kind of data that's difficult to come by from real-world sources.

It's already used in healthcare to train machines to monitor patients recovering from surgery; in security and surveillance systems to detect suspicious objects or unusual patterns of behaviour inside shopping centres and sports arenas; and in delivery drones that need to understand the world around them.

We're not yet at a point where real-world data can be counted out entirely; research suggests the best training results come from data sets of roughly 90% synthetic and 10% real-world data. But we're getting close: according to Deloitte, the accuracy of an AI model trained on 80% synthetic data is close to that of one fuelled entirely by real-world data.
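As a rough illustration of how such a blend might be assembled, here's a hedged Python sketch; the file names, pool sizes, and the choice to sample with replacement are assumptions made for the example, not a prescribed recipe.

```python
# Minimal sketch of the 90/10 blend described above: build a training list
# that draws 90% of samples from a synthetic pool and 10% from a (typically
# much smaller) real-world pool. Pool contents below are placeholders.
import random

def blend(synthetic: list, real: list, real_fraction: float = 0.10,
          total: int = 100_000, seed: int = 0) -> list:
    """Return a shuffled training set with the requested real/synthetic ratio."""
    rng = random.Random(seed)
    n_real = int(total * real_fraction)
    # Sample with replacement so a small real-world pool can still fill its share.
    batch = [rng.choice(real) for _ in range(n_real)]
    batch += [rng.choice(synthetic) for _ in range(total - n_real)]
    rng.shuffle(batch)
    return batch

train_set = blend(synthetic=["syn_0001.png", "syn_0002.png"],
                  real=["real_0001.png"], total=20)
```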

Gartner recently predicted that 60% of data used for AI and data analytics projects will be synthetic by 2024, and by 2030, synthetic data will have completely overtaken real data in AI models. I started this piece talking about resolutions; with predictions like this starting to appear, finding out more about synthetic data should be at the top of everyone’s list.


About the Author

Steve Harris is CEO of Mindtech, with over 30 years' experience in the technology sector. Steve has been instrumental in creating several start-up organisations in Europe and brings with him a track record of success in building strategic relationships and strong revenue streams with tier-one companies worldwide. Prior to joining Mindtech, Steve held senior positions in sales and business development at leading technology companies including Imagination Technologies, Gemstar, Liberate, and Sun Microsystems. Steve holds a master's in Microprocessor Engineering from Manchester University.