Dirty Data and Machine Learning: 4 Things to Watch For

By SecureWorld News Team

Big data and machine learning, the potential

Jon Oliver of Trend Micro sees two things that are crucial right now in this space.

On the upside, there is a vast and growing amount of threat data being paired with machine learning to automate processes and improve security solutions.

Organizations are becoming both more secure and more efficient as this technology trend helps them identify threats.

“Big data and machine learning go hand in hand in cybersecurity. Threat data provides the necessary information for cybersecurity solutions to work effectively,” says Oliver.

“A large threat data-set enables a machine learning system to spot a wider variety of threats—even variants—and to decide how to best mitigate them before they infect endpoints and networks.”

The hidden risk of big data and machine learning

However, Oliver says that with the world now creating about 2.5 quintillion bytes of data per day, a growing problem is spilling over into cybersecurity.

The problem is dirty data.

And if this dirty data gets used in machine learning, the output will give your IT security team a distorted view of the risk you face.

Examples of dirty data:

Corrupted files: a web download that ended prematurely
Mislabeled data: a web page about "Bikini Atoll" labeled as intimate apparel
Data that people disagree about: a newsletter that is legitimate email to you because you like reading it, but spam to a colleague who unsubscribed and was never removed from the list
Data that experts disagree about: a file on VirusTotal that one expert labels as malware and another says is clean, leaving open whether it is malware or a false positive (FP)
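
To make a couple of these categories concrete, here is a minimal Python sketch of how a cleansing step might hold back suspect records before training. It is an illustration only, not Trend Micro's method: the Sample fields, the dirty_reasons helper, and the 0.8 agreement threshold are all hypothetical. It covers truncated downloads and labeler disagreement; a confidently wrong label such as the "Bikini Atoll" case would pass these checks and still needs human or model-based review.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One training record; every field name here is hypothetical."""
    name: str
    expected_bytes: int   # size the server advertised for the download
    actual_bytes: int     # size actually written to disk
    labels: list = field(default_factory=list)  # one verdict per reviewer


def dirty_reasons(sample, min_agreement=0.8):
    """Return the reasons a sample should be held back for review."""
    reasons = []
    # A size mismatch suggests the download ended early or was corrupted.
    if sample.actual_bytes != sample.expected_bytes:
        reasons.append("truncated or corrupted file")
    if not sample.labels:
        reasons.append("no label")
    else:
        # Share of reviewers who chose the most common verdict.
        top_count = Counter(sample.labels).most_common(1)[0][1]
        if top_count / len(sample.labels) < min_agreement:
            reasons.append("labelers disagree")
    return reasons


samples = [
    Sample("a.exe", 1000, 1000, ["malware", "malware", "malware"]),
    Sample("b.exe", 1000, 412, ["clean", "clean"]),
    Sample("c.exe", 2048, 2048, ["malware", "clean"]),
]

# Only samples with no flagged problems go on to training.
clean = [s.name for s in samples if not dirty_reasons(s)]
held_back = {s.name: dirty_reasons(s) for s in samples if dirty_reasons(s)}
```

In this toy run only a.exe would reach the training set; b.exe is held back for its size mismatch and c.exe because its two reviewers split evenly.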

This issue has become so significant that, as The New York Times reports, data scientists are spending 30-50% of their time on what’s known as data cleansing.

Data cleansing: security vendors must do it

Unfortunately, not all security vendors do an equally good job with this, because it is difficult work that gets expensive quickly.

However, because of Trend Micro’s history of working with machine learning and big data, Oliver says he feels extremely confident in the work his company does for organizations.

“Our years of security research provided us with extensive and accurately labelled threat and malware data, as well as the expertise to continue accurately understanding and labeling new data. We focus as well on ensuring the quality of training data-sets to further optimize the performance of our machine learning systems.”

That sounds like an accuracy level that meteorologists around the country can only dream about.

An accuracy that is helping organizations remain secure.

[More: Read Jon Oliver's post, "Is Big Data Enough for Machine Learning," if you'd like more information on this topic.]

Tags: Big Data, Machine Learning
