Why less can be so much more when training a machine learning model

Jayden Kur

Engineer

September 14, 2023

Training a machine learning (ML) model on poor-quality data will inevitably produce poor-quality outputs. This is commonly known as garbage in, garbage out. Developers often build cutting-edge, best-practice ML models with access to all the compute power they could dream of, yet are supplied with an abundance of low-quality training data. They are then questioned about the garbage output. Supplying a greater quantity of poor-quality data is not the solution here and will not guarantee favourable outcomes.

Focus on quality, not quantity

There is a misconception that more data is what makes a model better, when the focus should be on the quality of the data. I have seen this repeatedly in industry, especially with enterprise-level clients in sectors such as production, energy and education. All these industries are, to some degree, guilty of supplying poor-quality data to engineers and expecting magic. This was even the case when I worked for a leader in the AI industry that had implemented an ML solution in one of Australia’s largest mining companies.

I have also experienced this in my own education, where I was using convolutional neural networks (CNNs) to identify morphological properties of galaxies. I was given images so blurry that I could not tell whether they showed a galaxy or a distant streetlamp photographed at night on a flip phone. I quickly realised that if I wanted to produce anything useful, I would need to secure higher-quality training data. I reached out to the owner of a large collection of labelled galactic images in America and secured a much higher-quality, higher-resolution data set to train my model. I was able to achieve an accuracy of close to 100% using drastically fewer training images. The model has since been used to classify millions of morphological properties of galaxies, which are usable by physicists.

I recently participated in our annual company hackathon, where my team created an educator bot with an initial focus on onboarding. We leveraged Generative AI models and intelligent data retrieval to offer a fresh approach to how employees access, process and understand information. Our first, optimistic approach was to give the bot access to all non-critical documents, ask it questions and fact-check the responses. This approach did not work and was, well, garbage. We then became selective about the documents provided and restricted the questions that could be asked of it. This proved to be dramatically better than our initial approach. Overall, we found that supplying less data of higher quality produced more useful and accurate responses for our users.
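To make the "be selective" step a little more concrete, here is a minimal Python sketch of the idea. It is not the code from our hackathon: the `Document` class, the `reviewed` flag, the topic allow-list and the keyword-based scope check are all hypothetical names and simplifications chosen for illustration. It only shows the general shape of filtering a corpus and restricting questions before anything reaches a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    topic: str
    reviewed: bool  # hypothetical flag: a person has checked the document is current


def curate_corpus(documents, allowed_topics):
    """Keep only reviewed documents whose topic is on the allow-list."""
    return [d for d in documents if d.reviewed and d.topic in allowed_topics]


def question_in_scope(question, allowed_topics):
    """Crude scope check: the question must mention at least one allowed topic."""
    text = question.lower()
    return any(topic in text for topic in allowed_topics)


# Illustrative data only; these documents are made up for the example.
documents = [
    Document("Onboarding checklist", "onboarding", reviewed=True),
    Document("Old onboarding draft", "onboarding", reviewed=False),
    Document("Office party photos", "social", reviewed=True),
]
allowed_topics = {"onboarding"}

corpus = curate_corpus(documents, allowed_topics)
print([d.title for d in corpus])
```

In this sketch only the reviewed onboarding document survives the filter, and a question that never mentions an allowed topic would be turned away before retrieval. A real system would need a far better scope check than keyword matching, but the principle is the same: decide what goes in before asking what comes out.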

Key learnings: a smaller quantity of higher quality data is actually more

As we increasingly leverage ML technologies in our lives, the key learning from my work and research is that when training a machine learning model, a smaller quantity of higher-quality data is actually more.

We must be intentional. We cannot throw garbage at these models and expect miracles.

Engineers in industry do not always have the luxury of securing higher-quality data, and often it does not exist at all. Industry leaders should take it upon themselves to secure the highest-quality data available for optimal outcomes.