Over the past decade, leveraging machine intelligence has gone from providing a business edge to becoming a business mandate for organizations that want to stay relevant. Machine learning and predictive analytics have found a variety of applications across the board as organizations strive to achieve digital maturity. The gradual push for cloud adoption has spread enterprise data across on-premise data centers and cloud platforms, and machine intelligence tools have followed suit with solutions that can run where the data resides. The new technologies in this field rely on vast amounts of data to train their models, which creates unique challenges in protecting enterprise data and privacy. This paper explores the data security issues that enterprises must consider when embarking on their digital intelligence journey.
An IBM study estimates that around 2.5 quintillion bytes of enterprise data are created every day by billions of internet-connected devices. This number is expected to grow exponentially, with Gartner predicting that IoT device sales will reach 1 million devices every hour by 2021. This colossal wealth of generated data knows it all: it knows the consumers, it knows the competition, it proves and disproves existing notions, and it uncovers completely new patterns hitherto unknown. Proper utilization of this data can present opportunities in the form of reducing inefficiencies and costs, improving decisions and consumer experiences, and creating completely new product or service offerings. With the rapid surge in adoption of data analytics and machine intelligence, what matters is no longer who has the most data, but who can derive the most insights from it. Deriving such insights is now essential to ensure efficiency, continuity, and the discovery of new business models.
Although enterprises have started the process of digital transformation, the conflicting goals of securing data while making it more available have proven to be a challenge. A machine intelligence solution typically works by consuming large sets of structured and unstructured data to create and train models that comprise inputs, outputs, parameters, constraints, and algorithms. The trained model is then used to classify new data, predict events, and prescribe actions based on what it learned from the training data. More advanced solutions use multiple models and establish a cognitive loop that corrects the model as new data comes in.
While it is important to provide the required data access to the machine learning pipeline, such access brings forth security threats. The inputs may contain sensitive information that needs to be protected at all stages of the pipeline. Enterprises must identify the source and nature of all data and classify it. The classification may vary based on industry and compliance needs such as PCI DSS and SOC, but data can usually be categorized under the following broad classifications:
Secret or High-Risk: Data whose disclosure is prevented by legal requirements. Example: credit card information, HIPAA-protected information.
Sensitive or Internal: Information deemed to be sensitive but not high-risk. Example: NDAs, certain intellectual property.
Personally Identifiable Information (PII): Information that can be used to ascertain an individual's identity. Example: name, social security number, date of birth.
Public: Information intended for public disclosure; sharing it has no bearing on an organization's privacy obligations. Example: public website content.
Data lifecycle policies and access policies are defined for each of these classes of data after assessing the level of risk and business needs. All machine learning and analytics solutions must comply with the existing enterprise privacy and security policies; for instance, confidential data such as credit card information may be required for analysis from a business perspective, but security policies must ensure that the data never leaves the logical network perimeter defined for such data. Alternative solutions must be explored, such as using a random, unidentifiable token as a representative for analysis in place of actual confidential data, or having the algorithm come to the data and run inside a secure network perimeter zone instead of having the data transmitted to the algorithm.
Aside from the traditional data security and privacy concerns, enterprises must also consider challenges unique to machine intelligence and analytics. Attackers could use adversarial techniques to:
- Compromise privacy by reverse engineering inputs from published outputs. For example, an analyzed output such as the average compensation of all employees in a geographical location may be published without any PII. However, if an employee leaves the organization and another public dataset discloses the geographical location of the employee, the difference between the averages published before and after the departure, combined with the corresponding headcounts, can be used to compute the compensation of the employee, thereby breaching privacy. Audits of published data must be conducted as part of security reviews to prevent such accidental exposures. Researchers and practitioners have also prescribed security best practices, such as differential privacy and deferred publishing, that should be adhered to when implementing machine intelligence solutions.
- Compromise predictive outputs by introducing malicious training data. For example, an online marketplace may use publicly generated data, such as reviews and pictures, to predict the ranked relevance of a product for particular keywords. If a malicious user can predict the outcome of the model, they may introduce fake pictures and reviews to poison the learning pipeline and manipulate the output for particular products or keywords.
- Engage in IP theft of the model itself. For example, an online real estate portal can publish a proprietary model for public consumption such that it predicts the listing price of a property when given certain inputs. A malicious attacker may be able to provide a multitude of inputs to such a service and use the outputs to train their own model, thereby duplicating the proprietary learning algorithm without having direct access to it.
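The averaging example in the first item above can be worked through numerically. The following is a minimal sketch, not a production mechanism: the salary figures, the function names, and the clipping bounds are all made up for illustration, and the noisy release uses a simplified Laplace mechanism that assumes the headcount is public and that every value lies within known bounds.

```python
import random


def mean(values):
    return sum(values) / len(values)


def differencing_attack(avg_before, n_before, avg_after, n_after):
    """Recover the single removed value from two published averages
    and the headcounts that go with them."""
    return avg_before * n_before - avg_after * n_after


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two independent
    exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_mean(values, lower, upper, epsilon, rng):
    """Publish a mean with Laplace noise (simplified sketch).

    Values are clipped to [lower, upper], so changing one value moves
    the mean by at most (upper - lower) / n. Smaller epsilon means more
    noise and stronger protection.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    return mean(clipped) + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical compensation data; the last employee later leaves.
before = [70000, 82000, 95000, 61000, 120000]
after = before[:-1]

# Exact averages leak the departed employee's compensation.
recovered = differencing_attack(mean(before), len(before), mean(after), len(after))
print(recovered)  # 120000.0

# With noisy averages the same arithmetic no longer yields the true value.
rng = random.Random()
noisy_recovered = differencing_attack(
    noisy_mean(before, 0, 200000, 0.5, rng), len(before),
    noisy_mean(after, 0, 200000, 0.5, rng), len(after),
)
print(noisy_recovered)
```

The trade-off is accuracy: with only a handful of records, the noise needed to hide one person swamps the statistic, which is one reason published aggregates are usually restricted to sufficiently large groups.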
To ensure a machine intelligence solution is compliant with data security policies, logs and reporting trails must be maintained. Security and business stakeholders share the responsibility of ensuring that safety and privacy standards are met while business needs are served.
Enterprises may also run into data rights and ownership issues when working with cloud service providers. Enterprises need to consider the complexity of these factors when deciding on an on-premise or cloud analytics solution. Running analytics where the data resides is a good strategy for beginning digital intelligence initiatives.
Keeping sensitive data on-premise gives enterprises greater control over their security. As a result, adoption of cloud analytics has been slow, but it will gradually increase as sensitive applications move to the cloud. Starting on-premise also allows enterprises to prove the return on investment of digital intelligence initiatives before scaling up. While on-premise data centers and private cloud servers give enterprises better control over the location of their data, precautions must still be taken to ensure secure storage, transmission, and use of the data while implementing machine intelligence solutions. Logical network boundaries such as DMZs must be locked down strategically, allowing access only through designated endpoints. All ingestion and transmission of data should take place over secure channels such as Transport Layer Security (TLS), as supported by the data source. Sensitive data and the underlying storage infrastructure should be encrypted with industry-accepted standards such as the 256-bit Advanced Encryption Standard (AES-256). Security and data event logs should be maintained on-premise, allowing for a comprehensive reporting trail for audits.
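As an illustration of the secure-channel requirement, the sketch below builds a client-side TLS configuration with Python's standard `ssl` module. The function name `make_ingestion_tls_context` and the TLS 1.2 floor are assumptions made for this example; the paper does not prescribe a protocol version, and the appropriate floor depends on what the data source supports.

```python
import ssl


def make_ingestion_tls_context(ca_file=None):
    """Build a client TLS context for pulling data from a source.

    ca_file is an optional path to a private CA bundle (hypothetical;
    when omitted, the system trust store is used).
    """
    # create_default_context enables certificate verification and
    # hostname checking for client connections.
    ctx = ssl.create_default_context(cafile=ca_file)
    # Assumed policy for this sketch: refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_ingestion_tls_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

A connection would then be wrapped with `ctx.wrap_socket(sock, server_hostname=host)`, so that a data source presenting an untrusted or mismatched certificate is rejected before any data is exchanged.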
An on-premise deployment strategy offers the added advantage of utilizing existing tools for network monitoring, auditing, backup, and recovery to ensure compliance. A good machine intelligence solution supports deployment as a set of logically separated, containerized components. This provides enterprises with options in terms of underlying infrastructure and individual component isolation within defined network zones.
HYBRID AND PUBLIC CLOUD DEPLOYMENT
Hybrid and public cloud solutions present additional data security challenges pertaining to storage and transmission. They also require additional monitoring and auditing frameworks to be in place to ensure compliance with existing enterprise security policies. Cloud infrastructure is generally shared in nature. Lack of visibility and control over this shared infrastructure heightens security risks even though it provides other advantages such as increased agility and speed. Cloud service providers have invested a great deal in security to provide solutions that enable enterprises to comply with standards such as PCI DSS, HIPAA, and SOC, but the responsibility of data security in cloud deployments is shared between the enterprise and the provider. Providers are generally responsible for the security of the cloud and the underlying infrastructure, whereas the enterprise must take on the responsibility of securing data in the cloud.
Data ingestion methods when integrating with cloud components can be connector-based, broker-based, or agent-driven. Regardless of the method of ingestion, communication with the data sources should be encrypted (if the data source allows it) and should take place over a secure channel such as TLS or an IPsec-based virtual private network (VPN). It is also important to reduce the vulnerable network surface area by locking down the network zone and allowing access to processed analytical results through designated endpoints only. Data should be checked against defined security policies for its classification before it is transmitted or stored. Because the underlying storage is also on shared infrastructure, it is good practice to encrypt the storage in addition to encrypting the data itself. Cloud providers generally offer provisions to do this with industry standards such as AES-256 encryption. Another good practice is to manage the encryption keys on a private platform, thereby giving enterprises the control to update or destroy sensitive data stored on cloud infrastructure should such a need arise.
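The classification check before transmission, combined with the random-token substitution mentioned earlier in this paper, might look like the sketch below. Everything here is hypothetical: the `DataClass` names mirror the four classifications above, while `TokenVault`, `prepare_for_transmission`, and the policy itself (tokenize Secret and PII fields, withhold Sensitive fields, pass Public fields) are illustrative choices rather than a recommended policy.

```python
import secrets
from enum import Enum


class DataClass(Enum):
    SECRET = "secret"
    SENSITIVE = "sensitive"
    PII = "pii"
    PUBLIC = "public"


# Hypothetical policy: these classes leave the perimeter only as tokens.
TOKENIZED_CLASSES = {DataClass.SECRET, DataClass.PII}


class TokenVault:
    """Maps real values to random tokens; meant to live inside the perimeter."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        # Reuse the token for a repeated value so joins and counts still work.
        if value not in self._token_by_value:
            token = secrets.token_hex(16)  # random, carries no information
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token):
        return self._value_by_token[token]


def prepare_for_transmission(record, field_classes, vault):
    """Return a copy of record that is safe to send under the assumed policy."""
    outbound = {}
    for field, value in record.items():
        data_class = field_classes.get(field)
        if data_class is None:
            # Unclassified data is refused rather than guessed at.
            raise ValueError(f"field {field!r} has no classification")
        if data_class in TOKENIZED_CLASSES:
            outbound[field] = vault.tokenize(value)
        elif data_class is DataClass.SENSITIVE:
            continue  # withheld from transmission in this sketch
        else:
            outbound[field] = value
    return outbound


vault = TokenVault()
classes = {
    "card_number": DataClass.SECRET,
    "customer_name": DataClass.PII,
    "internal_note": DataClass.SENSITIVE,
    "product": DataClass.PUBLIC,
}
record = {
    "card_number": "4111111111111111",
    "customer_name": "Ada Example",
    "internal_note": "margin review pending",
    "product": "widget",
}
print(prepare_for_transmission(record, classes, vault))
```

Because the vault returns the same token for a repeated value, analytics in the cloud can still group and count by the tokenized field, while the mapping back to real values never leaves the secure zone.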
ACCELERATING DIGITAL TRANSFORMATION WITH DECISIONMINES™
DecisionMines™ is a premise-agnostic digital decisioning platform that can be deployed in a controlled and secure environment, whether data resides in a managed data center or in a private cloud. This ensures that data security is consistent with the existing standards and practices and helps prove out returns on investment before scaling up to hybrid or public cloud solutions.
No matter where you are in your digital journey, DecisionMines™ can help accelerate and leverage your data for business transformation. Our technical and industry expertise, which has been honed in a real business environment for over two decades, enables us to help you with industry-specific embedded business solutions as well as turn-key integrations.
Visit www.decisionmines.com to get a free assessment of your digital strategy or to schedule a demo of our portfolio of solutions.