Browse By

Study Shows That Training Deep Learning Models Using Vocal Labels Demonstrates Significant Performance Improvements


The methods of labeling datasets for use in training complex machine learning models have been relatively unchanged in recent years. Traditional methods simply used textual labels for the purpose of validating model predictions. These textual labels, typically in the form of bits containing the value 0 are 1, are extremely compact and precise which rarely leads researchers to deviate. Previous research into alternative methods for labeling data has been nonexistent prior to the discoveries made by the team from the Columbia University. Fortunately, researchers from the University of California recognized this pattern and decided to pursue alternative methods for labeling neural networks. Researchers from the University of California performed a basic experiment to determine any potential disparity between the accuracy of models trained using traditional bit labels versus vocal labels. The results of the experiment demonstrated that audio labels have the potential to significantly improve the performance of neural networks.


Traditional Neural Networks

Neural networks are a collection of algorithms that are designed to identify patterns in data. These algorithms are loosely modeled off of the learning process occurring in the brain. Typically these algorithms are fed input in the form of images and output a prediction based on those images. These algorithms are known as image classification algorithms and are typically used as benchmarks for many advances made recently to the fundamental aspects of neural networks. Neural networks essentially act as a large web of interconnected nodes which gradually perform transformations on input data to eventually converge on an output label. After the neural network makes a prediction, the prediction is then mathematically compared to correct labels generated by manual labeling. These predictions are traditionally represented as binary values. If, for example, a neural network was attempting to predict whether an image contained a hotdog or not, the algorithm would either converge on 0, not hotdog or 1, signifying the presence of a hotdog. The loss, how close the algorithm was to predicting the correct labels, is then computed and used to update the weights and biases at each node of the neural network. The process of updating biases and weights of a neural network is known as backpropagation, in which the model attempts to modify the weights and biases at each applicable node in an attempt to decrease the loss. Neural networks are constantly evolving, and many aspects of the description above significantly vary between implementations. The only part of this model yet to be significantly improved is the format of the predictions; binary values have always been standard and researchers have yet to pursue alternative methods. Until now.


Advancements Made

Rather than provide correct labels in the form of a binary representation of labels, Hod Lipson and Boyuan Chen provided labels in the form of a vocal recitement of the targeted object pictured in the image. Rather than simply converge on a binary digit, the network generates an audio label which is then compared in a similar fashion to a traditional neural network. Interestingly, training the exact same data using these two different labeling systems yielded significantly different results. These two neural networks were trained on an identical dataset containing approximately 50,000 images. After training for an extended period of time, both models resulted in around 92% accuracy. For a second test, the researchers performed an identical experiment but instead of 50,000 images, they used 2,500 images. This experiment yielded significantly different results. After training for approximately 15 hours, the traditional network fell to a paltry 35% accuracy whereas the neural network trained on audio labels achieved a stunning 70% accuracy. 



The results of the experiment conducted by researchers from the University of California prove that alternative label formats can completely revolutionize the performance of traditional neural networks. The discoveries made by the team will ideally promote other researchers to look into finding optimizing label methods and pushing the performance boundaries of traditional neural networks. 


Columbia University School of Engineering and Applied Science. (2021, April 6). Deep learning networks prefer the human voice — just like us. ScienceDaily. Retrieved April 25, 2021 from

A Beginner’s Guide to Neural Networks and Deep Learning. Pathmind. (n.d.).,Neural%20Network%20Definition,labeling%20or%20clustering%20raw%20input.&text=Neural%20networks%20help%20us%20cluster%20and%20classify. 

Leave a Reply

Your email address will not be published. Required fields are marked *