By Bruce Sussman
Tue | Nov 3, 2020 | 9:59 AM PST

Data is everywhere.

It is at the heart of every organization, across all types, infrastructures, and applications, in the cloud and on premises.

In our ongoing Behind the Scenes interview series, we are uncovering how to achieve sustainable data discovery for privacy, security, and governance by answering the following questions: 

1. How can you continuously assess explicit risk and trust levels in the enterprise?

2. Why is it important for organizations to implement sustainable data discovery for enhancing business agility and dealing with all the regulatory nuances?

3. How do you use AI models to tune the platform and benefit from supervised AI?

4. What are business use cases for security and privacy leaders?

We are speaking with Arun Gandhi and Mariya Saakyants in today's interview. Watch the entire interview here, or read excerpts below:

Will you tell us briefly what your company does, for those unfamiliar with your organization?

[Arun Gandhi] "Yes, absolutely. With large amounts of data moving across the organization, it is very difficult to know where sensitive data is at any instant in time. The companies we are talking to have petabytes of data. Understanding where an organization's sensitive data resides helps it put the proper controls in place for managing the risk.

So what do we do? We empower enterprises with an AI-based platform for sustainable discovery and management of data in their ecosystem for privacy, security, and data governance.

Our flagship platform, Inventa, is a cutting-edge data discovery and classification platform that provides automated, near-real-time discovery, mapping, and tracking of all sensitive data at enterprise scale. We automatically discover and analyze all data usage and lineage, even if you have no idea where your data is located. Every company has a different set of security controls that can leverage this data asset information for risk management, controls assessment, integrated response, and privacy management."

Now, let's dive into our topic. For starters, how can you continuously assess explicit risk and trust levels in the enterprise?

[Arun Gandhi]  "Each organization needs to understand their risk appetite, and calibrate their risk tolerance to leverage this information in their environment.

Inventa assumes Zero Trust in data input, and we determine where to scan in a given environment to support a Zero Trust approach. The platform uses a unique, proprietary passive network packet capture process to identify PII flowing through the organization.

It enables the organization to identify repositories such as databases, applications, file systems, log files, etc., whether on site or off site. Inventa can use packets from a TAP, a SPAN port, or other sniffers. There is no specific pre-processing required before sending this traffic to Inventa. Moreover, the process is out of band and therefore has no impact on production network traffic. The platform then comprehensively scans those repositories to get full visibility into the depth and breadth of the PII.
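The passive capture step Arun describes comes down to inspecting payloads for PII patterns as traffic crosses the wire, then mapping hits back to the repositories that sent them. Here is a minimal, illustrative sketch in Python; the pattern set and function are assumptions for illustration, far simpler than what a real platform like Inventa would use:

```python
import re

# Illustrative PII patterns only -- real detection is context-sensitive
# and far richer than these three regexes.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_payload(payload: str) -> dict:
    """Return the PII types (and matches) found in one captured payload."""
    return {name: pat.findall(payload)
            for name, pat in PII_PATTERNS.items()
            if pat.search(payload)}

# A payload seen on the wire points back to the repository that emitted it.
payload = "INSERT INTO users VALUES ('jane@example.com', '123-45-6789')"
print(scan_payload(payload))
```

Because the capture is a copy of the traffic (from a TAP or SPAN port), a scanner like this can run out of band without touching the production path.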

Finally, it analyzes and consolidates the data identified as a result of those scans into a structure that allows users to see the data lineage, respond to subject access requests, identify production data in non-production locations, and address other privacy, security, and data governance items."

Why is it important for organizations to implement sustainable data discovery to enhance business agility and deal with all the regulatory nuances?

[Arun Gandhi] "There are challenges around people, process, and technology. A great process doesn't work without the best possible tools to support it, or people who can deliver it. That's like a stool with two perfect legs: it can't do the job it is supposed to do.

I see the main challenges as, number one, a lack of understanding of what sensitive data exists within the organization and where, how it is used or processed, who it is shared with, for what purpose, and how long it is needed.

Number two, data leaders spend time and money on mandatory activities like regulatory compliance, incident response, and augmentation at scale. And last, but not least, is inefficient data breach response.

There are large amounts of data within the organization that are unknown and stored not just in structured databases. Critical data can be anywhere: files, user applications, emails, Google Docs, images, and the list goes on. I see this as where the risk lies, as it makes it very difficult for the data guardians of an organization to locate and create an inventory of critical data that forms a holistic view. So it's not just important to discover the sensitive data and its lineage, but to keep that inventory up to date, to enhance capabilities, and to focus on ongoing administration, resource management, and scalability to support business agility.

We have just announced a new solution that lets users build and modify AI models for identifying personal and sensitive information from the discovered data. And the best part is, once we've trained the model, it constantly scans the environment, updates when a new system or API is identified, and continues to build up the inventory."

How do you use AI models to tune the platform and benefit from supervised AI?

[Mariya Saakyants] "There are a lot of companies doing data discovery and data audits, and they use totally different approaches.

Many use manual techniques to configure the discovery process. For example, to enable discovery and get started, you need to find all the repositories that need to be scanned; there can be thousands of them. And you need to point to them manually. Doing just this small operation of pointing to them can take weeks or months, and by the time you finish, the result has already become irrelevant and outdated.

And the reality is that information often flows within the organization so fast that you may not know all the places where the data resides. We deliver solutions that automatically scan for personal data using context-sensitive machine learning, deeply inspecting applications, workstations, file shares, emails, databases, cloud storage, and real-time streams of transactions in order to identify risk and compliance issues.

Based on the latest research, companies with fewer than 1,000 employees run an average of 22 custom applications, and the largest enterprises with more than 50,000 employees run almost 800 applications on average. And all of these applications process, store, and share personal information, and every single piece of it can become the origin of a data breach.

This also means that you wouldn't be able to track the appearance of all sources of personal information manually, even if you wanted to. By adding a multi-layer machine learning analytic engine, we give the platform the ability to read and understand the data and link all the pieces into the full picture, represented in a master catalog.

Also, we connect structured and unstructured data sources and automatically discover and build a relationship map between the personal data and its owner; that is, the PII of a specific person. All we need is to install the product, and then the data sources, repositories, files, and cloud systems will be detected automatically. From that starting point, you have a constant monitoring and auditing tool for the personal information within the organization.
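The "relationship map" Mariya describes can be pictured as grouping every discovered PII field, from whatever source, under the person it belongs to. A minimal sketch, assuming email as a stand-in identity key and an invented record schema (not the product's actual data model):

```python
from collections import defaultdict

# Illustrative records as a discovery pass might emit them; the field
# names and sources are assumptions for this example.
discovered = [
    {"source": "crm_db", "email": "jane@example.com", "field": "phone"},
    {"source": "hr_share", "email": "jane@example.com", "field": "ssn"},
    {"source": "mail_archive", "email": "bob@example.com", "field": "address"},
]

def build_catalog(records):
    """Group every discovered PII field under its owner, keyed here by
    email as a simple stand-in for real identity resolution."""
    catalog = defaultdict(list)
    for rec in records:
        catalog[rec["email"]].append((rec["source"], rec["field"]))
    return dict(catalog)

catalog = build_catalog(discovered)
print(catalog["jane@example.com"])  # every place Jane's PII was found
```

The point of the master catalog is exactly this cross-source view: one person, all the repositories holding something about them.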

It is really critical to have this for the following reasons. The first is the security perspective. You need to protect your users and the people who entrust their personal information to you, because you cannot protect what you are not aware of. The second is regulatory compliance: if you cannot provide reports, you cannot be compliant, which is a financial and reputational risk to the company.

We are building a unique tool that enables non-data scientists to build, modify, and train AI models on how to identify personal and sensitive information. We want to give them flexibility: you can train the system based on already discovered results without any manual configuration, or build it from scratch as well. This means you can build the discovery strategy for personal information in just a couple of clicks, instead of a couple of months.

For example, supervised AI requires minimal user input to identify false positives and to convert candidates into confirmed PII. With this tool we build the relationship map between personal data and its owner. We also give the user visibility into the personal information that was discovered, so they can review and validate a sample of the discovered results, which brings a more accurate and efficient data discovery process to the whole organization.
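The supervised loop described here, where discovery emits candidates and a reviewer flags only the doubtful ones, can be sketched in a few lines. All names below are illustrative, not the product's API:

```python
# Discovery emits PII *candidates*; a human reviews a small sample and
# the verdicts feed back into the catalog.
candidates = [
    {"value": "123-45-6789", "kind": "ssn"},
    {"value": "2023-04-01", "kind": "ssn"},      # a date misread as an SSN
    {"value": "jane@example.com", "kind": "email"},
]

def apply_review(candidates, verdicts):
    """Keep candidates the reviewer confirmed; drop flagged false positives.
    Unreviewed candidates default to confirmed, so user input stays minimal."""
    confirmed, rejected = [], []
    for cand in candidates:
        (confirmed if verdicts.get(cand["value"], True) else rejected).append(cand)
    return confirmed, rejected

# The reviewer only touches the doubtful item.
verdicts = {"2023-04-01": False}
confirmed, rejected = apply_review(candidates, verdicts)
print(len(confirmed), len(rejected))  # prints: 2 1
```

In a real system the rejected samples would also retrain the model so the same false positive stops reappearing.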

I already mentioned that personal data changes over time and that new sources and types of personal information are added regularly. With supervised AI, you will be able to easily extend the discovery scheme so that new information becomes part of the AI's model.

We want to focus on providing four key benefits to our customers from using this tool. First, sustainability and flexibility: build the discovery strategy based on already found results, or from scratch, and do it quickly, in a couple of clicks.

The second is scalability: a quick install procedure and support for a large number of data sources. The third is simplicity: you can easily modify and review discovered results and configure the scanning scheme. The fourth, but not least, is the speed and accuracy with which you accomplish data discovery within the organization."

I'd like to ask you about some business use cases. What are you seeing and hearing from security and privacy leaders?

[Arun Gandhi] "First and foremost, all organizations need to plan how to use data and define it consistently throughout the business to support business outcomes. This means organizations need to consider the who, what, how, when, where, and why of data, not only to ensure security and compliance, but to extract value from all the information collected and stored across the business.

And we are uncovering several use cases as we've talked to many of these large organizations. The problems are real, involving large volumes of data.

Many organizations today have been lumbered with the task of moving data to data lakes, either on premises or in the cloud, leading to a lot of data duplication, merging of legacy infrastructures, and the list goes on.

In addition, some of the organizations we are talking to need granular visibility into the sensitive data to more selectively apply costly security controls.

And some organizations are looking for a better understanding of their sensitive data to gain more leverage, that is, to add specialized capabilities. [They want to] land and expand, with minimal to no impact on the organization's risk exposure. This includes new services, new offerings, expanded offerings, new markets, and, of course, business intelligence."

Listen to our complete interview here, or learn more here.