By Bruce Sussman
Wed | Aug 1, 2018 | 8:30 AM PDT

When it comes to data, meteorologists learned something the hard way over the years while the rest of us laughed. The output from automated forecast models, built on big data, is only as good as the data that goes into them.

Information security leaders face the same challenge right now.

Jon Oliver knows all about this struggle.

He’s a data scientist and senior architect for Trend Micro. And until writing this story, we had no idea that the cybersecurity company started using machine learning back in 2005, long before it became a buzzword.

Big data and machine learning, the potential

Oliver sees two things that are crucial right now in this space.

On the upside, there is a vast and growing amount of threat data being paired with machine learning to automate processes and improve security solutions.

Organizations are becoming both more secure and more efficient as this technology trend helps them identify threats.

“Big data and machine learning go hand in hand in cybersecurity. Threat data provides the necessary information for cybersecurity solutions to work effectively,” says Oliver.

“A large threat data-set enables a machine learning system to spot a wider variety of threats—even variants—and to decide how to best mitigate them before they infect endpoints and networks.”

The hidden risk of big data and machine learning

However, Oliver says that with the world now creating about 2.5 quintillion bytes of data per day, there is a growing problem that spills over into cybersecurity.

The problem is dirty data.

And if this dirty data gets used in machine learning, the output will give your IT security team a distorted view of the risk you face.

Examples of dirty data:

  • Corrupted files: a web download which was prematurely ended
  • Mislabeled data: a web page about "Bikini Atoll" being labelled as intimate apparel
  • Data that people disagree about: a newsletter that you like to read (legitimate email) but that a colleague has unsubscribed from without being removed from the mailing list (spam for them)
  • Data that experts disagree about: some files on VirusTotal are labelled as malware by one expert and as clean by another. Is the file malware, or is the detection a false positive (FP)?

This issue is becoming so significant that, according to The New York Times, data scientists are spending 30-50% of their time on what’s known as data cleansing.
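To make the idea of data cleansing concrete, here is a minimal Python sketch that screens samples for two of the problems listed above: truncated downloads and labels that reviewers disagree about. It is only an illustration, not Trend Micro's method. The `Sample` fields, the `clean_samples` function, and the 80% agreement threshold are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    # All field names here are hypothetical, chosen for illustration.
    name: str
    expected_size: int  # size the server advertised, in bytes
    actual_size: int    # size actually received, in bytes
    labels: List[str]   # one verdict per reviewer, e.g. "malware" or "clean"


def clean_samples(samples, min_agreement=0.8):
    """Split samples into (kept, rejected), recording a reason per rejection."""
    kept, rejected = [], []
    for s in samples:
        # Corrupted file: the download ended before all bytes arrived.
        if s.actual_size != s.expected_size:
            rejected.append((s.name, "truncated download"))
            continue
        if not s.labels:
            rejected.append((s.name, "no labels"))
            continue
        # Disputed label: the most common verdict lacks enough support.
        label, votes = Counter(s.labels).most_common(1)[0]
        if votes / len(s.labels) < min_agreement:
            rejected.append((s.name, "reviewers disagree"))
            continue
        kept.append((s.name, label))
    return kept, rejected


samples = [
    Sample("a.exe", 1000, 1000, ["malware", "malware", "malware"]),
    Sample("b.exe", 1000, 400, ["clean", "clean"]),
    Sample("c.exe", 500, 500, ["malware", "clean"]),
]
kept, rejected = clean_samples(samples)
print(kept)
print(rejected)
```

In this sketch only the first sample would reach a training set. The rejected ones are not discarded silently: each carries a reason, so a human analyst could review the disputed cases rather than letting them distort the model.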

Data cleansing: security vendors must do it

Unfortunately, not all security vendors do an equally good job with this, because it is difficult work that gets expensive quickly.

However, because of Trend Micro’s history of working with machine learning and big data, Oliver says he feels extremely confident in the work his company does for organizations.

“Our years of security research provided us with extensive and accurately labelled threat and malware data, as well as the expertise to continue accurately understanding and labeling new data. We focus as well on ensuring the quality of training data-sets to further optimize the performance of our machine learning systems.”

That sounds like an accuracy level that meteorologists around the country can only dream about.

An accuracy that is helping organizations remain secure.

[More: Read Jon Oliver's post, "Is Big Data Enough for Machine Learning," if you'd like more information on this topic.]