Deception and Forensics for the Next Generation

By Dr. Peter Stephenson

Tue | May 5, 2020 | 5:05 AM PDT

I am going to be slinging around a lot of terms in this series, many of which are not well understood or are misunderstood. Since the core of this series is next generation security for next generation attacks, we should get this "next generation" thing under our belts up front. This is a good time to point out that we are on the cusp between current and next generation cyber activity.

The hypothetical in the last post was sort of the "on-the-horizon" view of where things could go. And, don't misunderstand me: I think this horizon is nearer than many of us think. This is not an "event horizon," the point beyond which events cannot affect the observer. This is the real deal, and it will affect us eventually. Because we are on a cusp, we see cyber events that are on our side of the cusp but are of increasing sophistication. At some point, the adversary will jump the gap and start using true next generation tools for cyberattacks. What I will discuss over the next few posts is how we address the "now" as we prepare for the "then."

The first thing to grasp is what I mean by artificial intelligence (AI). AI is a staple of marketing jargon. Many organizations claim that their products use AI when, in fact, they don't. They may be really smart, but that doesn't make them AI. AI is a collection of technologies and techniques—machine learning (ML), for example—that do not require human intervention to function. ML uses a large collection of data to learn about the data and its subject matter. For example, we might feed an ML program thousands of faces and then ask it to create a composite face that embodies certain characteristics. That, in very simple terms, is what a deepfake is.

We have two main types of ML: supervised learning and unsupervised learning. In supervised learning, we give the ML program a large block of labeled data, the training set, and tell it to learn from it. What we want it to learn and how we want it to use the results are up to the ML programmer.

In unsupervised learning, the ML program gets no labeled training set and must train itself based upon observation. This is how humans usually learn, although sometimes we learn in a supervised manner, such as when you memorized the multiplication tables as a kid. Once you learned the tables, you could use them to multiply. That is a rudimentary form of supervised learning.
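To make the distinction concrete, here is a minimal Python sketch. The data, the labels, and the function names are all invented for illustration, and the two techniques shown (nearest centroid and a one-dimensional two-cluster k-means) are simply easy-to-read stand-ins, not what any particular product uses. The supervised learner is handed the answers; the unsupervised one has to find the groups by itself.

```python
from statistics import mean

# Hypothetical data: logins per hour for six hosts.
samples = [2.0, 3.0, 4.0, 40.0, 42.0, 45.0]

# Supervised: a human supplies the answer ("label") for every sample.
labels = ["normal", "normal", "normal",
          "suspicious", "suspicious", "suspicious"]

def train_supervised(xs, ys):
    """Learn one centroid (average value) per human-provided label."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def train_unsupervised(xs, rounds=10):
    """Two-cluster k-means in one dimension: no labels, only observation."""
    lo, hi = min(xs), max(xs)          # initial guesses for the two centers
    for _ in range(rounds):
        near_lo = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        near_hi = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(near_lo), mean(near_hi)
    return lo, hi

centroids = train_supervised(samples, labels)
clusters = train_unsupervised(samples)
print(predict(centroids, 38.0))   # label learned from the human-made answers
print(clusters)                   # two groups found without any labels
```

Note that the unsupervised result is just two cluster centers. It tells us that there are two kinds of behavior, but it is still up to us to decide what each one means.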

For security purposes, we usually want unsupervised ML because humans cannot anticipate every kind of data that the sensors in a detection system might see. In my opinion, the best monitoring tool in this regard is MixMode; it uses unsupervised learning, and I have yet to be able to fool it. However, we will apply unsupervised learning to the BOTsink deception network later in this series for a different purpose.

The benefit that we get from unsupervised ML is that the training set is much harder to compromise since it is being built on the fly by the ML system. As you will see when we look at deception networks—and BOTsink in particular—we can deploy a deception net and it will immediately begin learning the basics of our real network and create a set of decoys that look as if they are part of the network. We can tell it that we want to create email addresses that look like our real email system, and we even can get it to create plausible decoy users and documents. It does all of this by observing the network and its activity and learning from its observations.
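As a rough illustration of a model that is "built on the fly," the sketch below keeps a running baseline of observed values and flags anything far outside it. This is not how BOTsink or MixMode works internally; the class name, the sample traffic, and the three-standard-deviation cutoff are all assumptions chosen to keep the example short. The point is only that there is no stored training set for an attacker to steal: the baseline exists solely as a few running totals that update with every observation.

```python
import math

class OnlineBaseline:
    """Running mean and variance (Welford's method); no stored training set."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean

    def observe(self, x):
        """Fold one new observation into the baseline."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomalous(self, x, threshold=3.0):
        """Flag x if it sits more than `threshold` standard deviations out."""
        if self.n < 2:
            return False            # not enough observations to judge yet
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0:
            return x != self.mean
        return abs(x - self.mean) / std > threshold

baseline = OnlineBaseline()
for connections_per_minute in [10, 12, 11, 9, 10, 13, 11, 10]:  # made-up traffic
    baseline.observe(connections_per_minute)

print(baseline.is_anomalous(11))    # a typical value
print(baseline.is_anomalous(60))    # far outside what has been observed
```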

Finally, we have adversarial machine learning (AML). This uses ML against ML. In my last post, I mentioned querying the oracle as an attack mechanism. That is a form of AML. There are two major kinds of AML attacks: white box and black box. A gray box attack is a sort of hybrid of black and white.

A white box attack works against a supervised training set. Since the training set is a dataset that the ML program uses to learn, an attacker who steals it, introduces subtle perturbations, and then reintroduces the altered dataset can perhaps fool the ML into thinking that an action is acceptable when, in fact, it is not. Obviously, a white box attack won't work well on an unsupervised ML.
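A toy example shows why tampering with the training set matters. In the sketch below, the "model" is nothing more than a cutoff halfway between the average benign score and the average malicious score. The scores, labels, function names, and the size of the perturbation are all hypothetical. Nudging the malicious samples in the stolen copy moves the cutoff just enough that a borderline action is accepted after retraining.

```python
from statistics import mean

def train_threshold(dataset):
    """Learn a cutoff halfway between the average benign and malicious scores."""
    benign = [score for score, label in dataset if label == "benign"]
    malicious = [score for score, label in dataset if label == "malicious"]
    return (mean(benign) + mean(malicious)) / 2

def classify(threshold, score):
    """Anything at or above the learned cutoff is treated as malicious."""
    return "malicious" if score >= threshold else "benign"

# Hypothetical training set: (risk score, label).
clean = [(1.0, "benign"), (2.0, "benign"), (3.0, "benign"),
         (7.0, "malicious"), (8.0, "malicious"), (9.0, "malicious")]

# The altered copy: each malicious score is nudged upward a little.
poisoned = [(score + 1.5 if label == "malicious" else score, label)
            for score, label in clean]

clean_threshold = train_threshold(clean)        # 5.0
poisoned_threshold = train_threshold(poisoned)  # 5.75

event = 5.5   # the action the attacker wants accepted
print(classify(clean_threshold, event))      # malicious
print(classify(poisoned_threshold, event))   # benign
```

Real models have far more than one parameter, but the principle is the same: whoever controls the training data has a hand in where the decision boundary ends up.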

A black box attack sometimes is effective against an unsupervised ML. In fact, there are over 1,500 formal papers published on various forms of AML, including black box attacks. In a black box attack, the attacker uses ML to query the oracle (repeated attacks while observing the response from the target ML) and then attempts to inject the small perturbations, again observing the response. I have tried this technique against a few ML tools with mixed results, so, as you might guess, it is not quite ready for prime time, even though it has been demonstrated in labs under very controlled conditions.
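The "query the oracle" step can be sketched with a deliberately simple stand-in. Here the target is a hypothetical one-number classifier with a hidden cutoff, and the prober uses a plain binary search rather than ML; both are assumptions made to keep the example readable. The prober never sees the cutoff, only the "allowed" or "blocked" responses, yet it locates the boundary in a handful of queries.

```python
def make_oracle(hidden_threshold):
    """Stand-in for the target ML; the caller sees only its answers."""
    calls = {"count": 0}

    def oracle(score):
        calls["count"] += 1
        return "blocked" if score >= hidden_threshold else "allowed"

    return oracle, calls

def probe_boundary(oracle, low, high, tolerance=0.01):
    """Binary-search the decision boundary using only query responses.

    Assumes oracle(low) is "allowed" and oracle(high) is "blocked".
    """
    while high - low > tolerance:
        mid = (low + high) / 2
        if oracle(mid) == "blocked":
            high = mid
        else:
            low = mid
    return low   # largest probed value the oracle still allowed

# The value 6.3 is invented and is never shown to probe_boundary.
oracle, calls = make_oracle(hidden_threshold=6.3)
estimate = probe_boundary(oracle, low=0.0, high=10.0)
print(estimate, calls["count"])
```

A real target has many inputs, noisy responses, and a defender watching for exactly this kind of repeated probing, which fits the mixed results described above.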

Let's move to how, very briefly, ML works. ML uses algorithms to learn. The core algorithm is extremely simple:

Y=f(X)

…where Y is the dependent variable that we are seeking, X is the independent variable that we are given, and f is the function that maps X to Y. Often no function fits the data exactly, so we add an error term, e, to account for the difference:

Y=f(X) + e

These algorithms can get quite complicated, but at their heart they take this rather simple form.
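As a small worked example, the sketch below learns f from a handful of made-up observations by fitting a straight line with ordinary least squares, then computes e as whatever the line fails to explain. A straight line is only one possible choice of f, picked here because it is easy to follow; the data and names are invented.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for f(X) = slope * X + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up observations: X is given, Y is what we want to predict.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(X, Y)

def f(x):
    """The learned function."""
    return slope * x + intercept

# e is whatever f cannot explain: Y = f(X) + e for every observation.
errors = [y - f(x) for x, y in zip(X, Y)]
print(slope, intercept)
print(errors)
```

For this data the fitted line has a slope of about 1.97 and an intercept of about 0.09. The errors are small and sum to roughly zero, which is what a least squares fit with an intercept produces.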

Finally, a little clarification on terms applying to the adversary. A hive is a collection of autonomous bots with a collective intelligence. Autonomous means that they do not require a bot herder or command-and-control server to function. Think of the Borg on the Star Trek series or ant colonies in nature. When we have a very large hive net, we have a swarm or "hives of hives," and their bots are called swarm bots. These autonomous systems use some form of machine learning and, while they certainly are not yet prevalent, they are in the wings waiting to make their entrances.

Now that we have the core jargon under control, we'll move on in the next post to an introduction to deception networks. I'll expand on these simple descriptions as we go along, but there is enough here to get us started.

If you want to take a really deep dive into ML, I recommend Dr. Jason Brownlee's great series, "Machine Learning Mastery." A good starting point is "How Machine Learning Algorithms Work."

Tags: Machine Learning, Forensics


