One of the great ironies of artificial intelligence (AI) is the technology’s dependence on humans to label or tag data for training purposes. According to AltexSoft:
If there was a data science hall of fame, it would have a section dedicated to labeling. The labelers’ monument could be Atlas holding that large rock symbolizing their arduous, detail-laden responsibilities. ImageNet – an image database – would deserve its own stele. For nine years, its contributors manually annotated more than 14 million images. Just thinking about it makes you tired.
While labeling is not launching a rocket into space, it's still serious business. Labeling is an indispensable stage of data preprocessing in supervised learning, the style of model training that relies on historical data with predefined target attributes (values). An algorithm can only learn to predict target attributes that a human has first mapped.
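A minimal sketch of the point above, using hypothetical data: the algorithm below can only classify new examples because a human has already attached a target label to each historical record. The fish data and the simple nearest-neighbor rule are illustrative assumptions, not any particular vendor's method.

```python
# Each record pairs input features with a human-supplied label -- the
# "predefined target attribute" that supervised learning depends on.
labeled_data = [
    ({"length_cm": 3.0, "weight_g": 2.0}, "minnow"),
    ({"length_cm": 3.2, "weight_g": 2.4}, "minnow"),
    ({"length_cm": 60.0, "weight_g": 4500.0}, "salmon"),
    ({"length_cm": 55.0, "weight_g": 4100.0}, "salmon"),
]

def predict(features, training_set):
    """1-nearest-neighbor: return the label of the closest labeled example."""
    def distance(a, b):
        return sum((a[k] - b[k]) ** 2 for k in a) ** 0.5
    closest = min(training_set, key=lambda pair: distance(features, pair[0]))
    return closest[1]

print(predict({"length_cm": 58.0, "weight_g": 4300.0}, labeled_data))  # salmon
```

Strip the labels out of `labeled_data` and the same algorithm has nothing to learn from, which is exactly why the labeling stage cannot be skipped.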
Data might be the new oil, but it's basically bulky shale if it isn't labeled. Companies that want to mine their data with AI are looking for solutions. Some assign the tasks in-house, others recruit temporary employees, and some use crowdsourcing platforms. Still others are increasingly turning to data labeling specialists, such as Amazon Mechanical Turk, Figure Eight, Hive, Globalme, CloudFactory, Mighty AI, LQA, and DataPure.
Labels for Bulky Shale
These solutions have their challenges. The work is tedious, repetitive, and boring, which makes it hard to keep workers focused and motivated. Consequently, quality control and turnover can become issues. All of these factors increase the cost of data labeling.
Consider, for example, the work labeling startup Scale performs for self-driving car companies. Workers mouse around camera images and 3-D lidar-generated maps collected by the car sensors. They draw boxes around cars, walkers, and cyclists. They ID certain pixels as road, not tire, or flesh, not steel. Or they double-check that Scale's automated system has done all these properly by itself.
… Sometimes, those contractors label the same sorts of data, over and over, for different Scale clients. When those workers see a particularly interesting corner case – a July 4th celebrant who's had too much to drink, or an e-scooter, or the logs tumbling off the back of the truck – they don't alert other clients about what their technology saw. That means self-driving car companies are spending hours and hours of work, and lots and lots of money, collecting and annotating what might be mostly identical road data.
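The bounding boxes described above can be sketched as a simple annotation record. This is an illustrative assumption, not Scale's actual (proprietary) schema, but it shows the kind of artifact a labeler produces, plus intersection-over-union, a standard metric for checking agreement between two labelers or between a human and an automated pre-labeling system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundingBox:
    """One labeler-drawn 2-D box in a camera frame (hypothetical schema)."""
    frame_id: str   # which image the box belongs to
    category: str   # e.g. "car", "pedestrian", "cyclist"
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)

def iou(a: BoundingBox, b: BoundingBox) -> float:
    """Intersection-over-union: 1.0 means identical boxes, 0.0 means no overlap.
    Commonly used to quality-check a human label against an automated one."""
    ix = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    iy = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    inter = max(0.0, ix) * max(0.0, iy)
    union = a.area() + b.area() - inter
    return inter / union if union else 0.0
```

A review pipeline might flag any box whose IoU against a second annotation falls below a threshold, which is one concrete way the quality-control problem discussed earlier gets managed.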
Other companies are looking for help with labeling speech-related data. Consider what AI specialist Explosion AI wrote in a blog post:
The most popular place to source large volumes of annotated data is Amazon Mechanical Turk, the Amazon Cloud of human labour. You can use their platform to publish survey-style “Human Intelligence Tasks” (HIT), which will be completed by workers from all over the world. While this sounds great in theory, it’s often disastrous in practice. The workers make around $5 an hour on average, with no connection to the task, and interfaces reminiscent of early-2000s-style surveys. Incentives are also completely misaligned, so you have to worry about being cheated by the workers – who have to worry about being cheated by you.
So no wonder your data is bad. Don’t expect great data if you’re boring the shit out of underpaid people. The thing is, none of this is news. Our so-called start-up culture is based on the realisation that in order to achieve the best results, we need an engaged team that’s passionate about their work, a motivating work environment, high incentives and fair pay. We know all of this. Yet, when it comes to the absolute core of the application, the training data, all of this knowledge seems to go straight out of the window.
Human Evolution: A Potential Solution?
These challenges bring us back to the notion that human evolution might provide a potential solution to the AI data labeling problem. Juan Enriquez is the founding director of the Life Sciences Project at Harvard Business School and managing director of Excel Venture Management, a life sciences venture capital firm. He has written several books, including Evolving Ourselves: How Unnatural Selection and Nonrandom Mutation are Shaping Life on Earth. In a TED talk called “Will Our Kids Be a Different Species?” delivered in April 2012, Enriquez postulated that human evolution continues, and we may be evolving in real time. As part of his talk, he pointed out that the amount of data most humans process daily is many times greater than what we processed as recently as a generation ago. Enriquez also discussed the plasticity of the brain, which allows it to adapt to such inputs.
As evidence, he shared statistics on autism incidence per 1,000 children. In 2000, the figure was 6.7. It dipped to 6.4 in 2002, then rose to 9.0 in 2006 and 11.4 in 2008 – a 78% increase between 2002 and 2008, and scientists have not figured out why. Enriquez said we do know that potentially, the brain is “reacting in a hyper-reactive and hyper-plastic way, and creating individuals that are hyper-perceptive, hyper-mnemonic, hyper-attentive.” In other words, people who are perfectly suited for AI data labeling.
To that end, Bryan Dai, CEO of the startup Daivergent, was a featured speaker at the AI Summit New York in December 2018. Daivergent sources people on the autism spectrum as a remote workforce for companies seeking, among other things, AI training data labeling and generation. The company is in the early stages of its rollout, so it is too early to share measured results. But isn't it interesting how the unique skill sets and capabilities of highly focused individuals on the autism spectrum could potentially fit so well in enabling breakthrough data science? Further, will our children and grandchildren evolve to become even better at processing and analyzing the ever-growing amount of data we consume?