10 Best Practices for Securing Big Data

Every business wants to collect troves of business intelligence (BI), as much data as executives, marketers, and every other department in the organization can get their hands on. But once you've got that data, the difficulty lies not only in analyzing the massive data lake to find the key insights you're looking for (without being inundated by the sheer volume of information) but also in securing all of that data.

So, while your enterprise IT department and data scientists are running predictive analytics algorithms, data visualizations, and employing an arsenal of other data analysis techniques on the Big Data you've collected, your business needs to make sure there are no leaks or weak spots in the reservoir.

To that end, the Cloud Security Alliance (CSA) recently released The Big Data Security and Privacy Handbook: 100 Best Practices in Big Data Security and Privacy. The handbook's long list of best practices spans 10 categories, so we whittled them down to 10 tips to help your IT department lock down your key business data. These tips draw on data storage, encryption, governance, monitoring, and security techniques.

1. Safeguard Distributed Programming Frameworks

Distributed programming frameworks such as Hadoop make up a huge part of modern Big Data distributions, but they come with serious risk of data leakage. They also come with the problem of "untrusted mappers": map-stage code, often ingesting data from multiple sources, that can produce error-ridden aggregated results.

The CSA recommends that organizations first establish trust by using methods such as Kerberos authentication while ensuring conformity to predefined security policies. Then, you "de-identify" the data by decoupling all personally identifiable information (PII) from it to ensure personal privacy isn't compromised. From there, you authorize access to files with a predefined security policy, and then ensure that untrusted code doesn't leak information via system resources by using mandatory access control (MAC), such as the Sentry tool in Apache HBase. After that, the hard part is over; all that's left is to guard against data leakage with regular maintenance. The IT department should be checking worker nodes and mappers in your cloud or virtual environment, and keeping an eye out for fake nodes and altered duplicates of data.
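As an illustration of the de-identification step, here's a minimal Python sketch; the field names, tokenization scheme, and key handling are assumptions for the example, not the CSA's prescription. It replaces PII values with keyed-hash tokens so records remain joinable for analytics without exposing raw identifiers:

```python
import hashlib
import hmac

# Hypothetical PII fields; a real schema will differ.
PII_FIELDS = {"name", "email", "ssn"}

def deidentify(record: dict, secret_key: bytes) -> dict:
    """Replace PII values with keyed-hash tokens so records stay
    joinable for analytics without exposing raw identifiers."""
    out = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            token = hmac.new(secret_key, str(value).encode(), hashlib.sha256)
            out[field] = token.hexdigest()
        else:
            out[field] = value
    return out

record = {"name": "Jane Doe", "email": "jane@example.com", "purchases": 42}
print(deidentify(record, secret_key=b"store-this-key-in-a-vault"))
```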

2. Secure Your Non-Relational Data

Non-relational (NoSQL) databases are common, but they're vulnerable to attacks such as NoSQL injection; the CSA lists a bevy of countermeasures to protect against this. Start by encrypting or hashing passwords, and ensure end-to-end protection by encrypting data at rest with algorithms such as the Advanced Encryption Standard (AES) and RSA, along with hashing via Secure Hash Algorithm 2 (SHA-256). Transport Layer Security (TLS) and Secure Sockets Layer (SSL) encryption are useful as well.
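To ground the first two measures, here's a hedged Python sketch of salted password hashing and AES-GCM encryption at rest. It assumes the third-party cryptography package is installed; the iteration count and key handling are illustrative choices, not the CSA's:

```python
import os
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

def hash_password(password: str):
    """Salted PBKDF2-HMAC-SHA256; store the salt alongside the digest."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def encrypt_at_rest(plaintext: bytes, key: bytes) -> bytes:
    """AES-256-GCM with a random 12-byte nonce prepended to the ciphertext."""
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

key = AESGCM.generate_key(bit_length=256)
blob = encrypt_at_rest(b"customer record", key)
salt, digest = hash_password("correct horse battery staple")
```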

Beyond those core measures, plus layers such as data tagging and object-level security, you can also secure non-relational data by using what are called pluggable authentication modules (PAM); this is a flexible method for authenticating users while making sure to log transactions by using a tool such as the NIST log. Finally, there are fuzzing methods, which expose cross-site scripting and injection vulnerabilities between NoSQL and the HTTP protocol by using automated data input at the protocol, data node, and application levels of the distribution.
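To show the flavor of that last technique, here's a toy fuzzing loop in Python. The endpoint URL and payloads are hypothetical, and real fuzzers generate far more varied inputs; never point anything like this at systems you don't own or lack written permission to test:

```python
import json
import urllib.request

# Hypothetical endpoint; substitute your own test instance.
TARGET = "http://localhost:8080/api/search"

# A few classic NoSQL-injection and XSS probes (illustrative only).
PAYLOADS = [
    {"user": {"$gt": ""}},                  # operator injection
    {"user": "'; return true; //"},         # JS injection in $where-style filters
    {"user": "<script>alert(1)</script>"},  # cross-site scripting probe
]

for payload in PAYLOADS:
    req = urllib.request.Request(
        TARGET,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            print(payload, "->", resp.status)
    except Exception as exc:  # unexpected 5xx or parser errors are leads
        print(payload, "->", exc)
```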

3. Secure Data Storage and Transaction Logs

Storage management is a key part of the Big Data security equation. The CSA recommends using signed message digests to provide a digital identifier for each file or document, and using a technique called secure untrusted data repository (SUNDR) to detect unauthorized file modifications by malicious server agents.
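As a simple stand-in for signed digests, the following Python sketch produces and verifies a keyed digest per file using HMAC-SHA256; a production deployment would more likely use asymmetric digital signatures, and the key handling here is an assumption for the example:

```python
import hashlib
import hmac

def sign_file(path: str, key: bytes) -> str:
    """Return a keyed digest (HMAC-SHA256) identifying this exact file."""
    h = hmac.new(key, digestmod=hashlib.sha256)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_file(path: str, key: bytes, expected: str) -> bool:
    """compare_digest avoids leaking information via timing side channels."""
    return hmac.compare_digest(sign_file(path, key), expected)
```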

The handbook lists a number of other techniques as well, including lazy revocation and key rotation, broadcast and policy-based encryption schemes, and digital rights management (DRM). However, there's no substitute for simply building your own secure cloud storage on top of existing infrastructure.

4. Endpoint Filtering and Validation

Endpoint security is paramount, and your organization can start by using trusted certificates, doing resource testing, and connecting only trusted devices to your network by using a mobile device management (MDM) solution (on top of antivirus and malware protection software). From there, you can use statistical similarity detection and outlier detection techniques to filter malicious inputs, while guarding against Sybil attacks (i.e., one entity masquerading as multiple identities) and ID-spoofing attacks.
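Here's a minimal example of the outlier-detection idea in Python, using a robust modified z-score (median/MAD). The metric and threshold are illustrative assumptions; a production system would track many signals per endpoint:

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Flag indices whose modified z-score (median/MAD) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

# e.g., bytes uploaded per endpoint in the last hour (toy numbers)
uploads = [120, 135, 110, 128, 9400, 131]
print(flag_outliers(uploads))  # -> [4]
```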

5. Real-Time Compliance and Security Monitoring

Compliance is always a headache for enterprises, and even more so when you're dealing with a constant deluge of data. It's best to tackle it head-on with real-time analytics and security at every level of the stack. The CSA recommends that organizations apply Big Data analytics by using tools such as Kerberos, secure shell (SSH), and internet protocol security (IPsec) to get a handle on real-time data.

Once you're doing that, you can mine logging events, deploy front-end security systems such as routers and application-level firewalls, and begin implementing security controls throughout the stack at the cloud, cluster, and application levels. The CSA also cautions enterprises to be wary of evasion attacks trying to circumvent your Big Data infrastructure, and what's called "data-poisoning" attacks (i.e., falsified data that tricks your monitoring system).
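As a taste of what mining logging events can look like, here's a small Python sketch that counts failed SSH logins per source IP in a syslog-style file. The log path, pattern, and threshold are assumptions; a real deployment would feed events into a SIEM rather than a one-off script:

```python
import re
from collections import Counter

# Matches OpenSSH-style failed-login lines; adjust for your log format.
FAILED = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")

def suspicious_ips(log_path, threshold=10):
    """Count failed-login events per source IP and flag the noisy ones."""
    hits = Counter()
    with open(log_path) as f:
        for line in f:
            match = FAILED.search(line)
            if match:
                hits[match.group(1)] += 1
    return [(ip, n) for ip, n in hits.most_common() if n >= threshold]

# e.g., suspicious_ips("/var/log/auth.log")  # path is an assumption
```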

6. Preserve Data Privacy

Maintaining data privacy in ever-growing datasets is genuinely difficult. The CSA said the key is to be "scalable and composable" by implementing techniques such as differential privacy (maximizing query accuracy while minimizing record identification) and homomorphic encryption to store and process encrypted information in the cloud.

Beyond that, don't skimp on the staples: the CSA recommends incorporating employee awareness training that focuses on current privacy regulations, and maintaining software infrastructure by using authorization mechanisms. Finally, the best practices encourage implementing what's called "privacy-preserving data composition," which controls data leakage from multiple databases by reviewing and monitoring the infrastructure that's linking the databases together.
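To make differential privacy less abstract, here's a tiny Python sketch of the Laplace mechanism for a counting query (which has sensitivity 1). The epsilon value is an illustrative choice; a real deployment also has to track the privacy budget spent across queries:

```python
import random

def dp_count(true_count, epsilon=0.5):
    """Release a count with Laplace noise calibrated to sensitivity 1;
    smaller epsilon means stronger privacy and noisier answers."""
    scale = 1.0 / epsilon  # sensitivity / epsilon
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

print(dp_count(1284))  # e.g., 1286.7 -- varies per run
```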

7. Big Data Cryptography

Mathematical cryptography hasn't gone out of style; in fact, it's gotten far more advanced. By constructing a system to search and filter encrypted data, such as the searchable symmetric encryption (SSE) protocol, enterprises can actually run Boolean queries on encrypted data. Once that's in place, the CSA recommends a variety of cryptographic techniques.

Relational encryption allows you to compare encrypted data without sharing encryption keys by matching identifiers and attribute values. Identity-based encryption (IBE) makes key management easier in public key systems by allowing plaintext to be encrypted for a given identity. Attribute-based encryption (ABE) can integrate access controls into an encryption scheme. Finally, there's convergent encryption, which derives encryption keys from the content itself, helping cloud providers identify duplicate data.
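To illustrate the SSE idea mentioned above, here's a toy Python sketch in which the server-side index maps keyed keyword tokens to encrypted document IDs, so keyword lookups never expose plaintext. It assumes the cryptography package; real SSE schemes handle leakage, updates, and Boolean queries with far more care:

```python
import hashlib
import hmac
import os
from collections import defaultdict
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

class ToySSE:
    """Toy searchable symmetric encryption: the index maps keyed keyword
    tokens to encrypted document IDs, so the party holding the index can
    answer keyword queries without seeing plaintext keywords or IDs."""

    def __init__(self):
        self.index_key = os.urandom(32)                 # stays with the client
        self.data_key = AESGCM.generate_key(bit_length=256)
        self.index = defaultdict(list)                  # held by the server

    def _token(self, word):
        return hmac.new(self.index_key, word.encode(), hashlib.sha256).hexdigest()

    def add(self, doc_id, words):
        for word in words:
            nonce = os.urandom(12)
            blob = nonce + AESGCM(self.data_key).encrypt(nonce, doc_id.encode(), None)
            self.index[self._token(word)].append(blob)

    def search(self, word):
        results = []
        for blob in self.index.get(self._token(word), []):
            nonce, ciphertext = blob[:12], blob[12:]
            results.append(AESGCM(self.data_key).decrypt(nonce, ciphertext, None).decode())
        return results

sse = ToySSE()
sse.add("doc-17", ["invoice", "q3"])
print(sse.search("invoice"))  # -> ['doc-17']
```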

8. Granular Access Control

Access control is about two core things, according to the CSA: restricting user access and granting user access. The trick is to build and implement a policy that chooses the right one in any given scenario. For setting up granular access controls, the CSA has a bunch of quick-hit tips (a minimal label-check sketch in Python follows the list):

  • Normalize mutable elements and denormalize immutable elements,
  • Track secrecy requirements and ensure proper implementation,
  • Maintain access labels,
  • Track admin data,
  • Use single sign-on (SSO), and
  • Use a labeling scheme to maintain proper data federation.
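Here's what a label-based access check might look like in practice. The label hierarchy and clearance model below are assumptions for illustration, not a standard the CSA prescribes:

```python
# Illustrative label hierarchy; the names and levels are assumptions.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

def can_read(user_clearance, record_label):
    """Grant read access only when the user's clearance dominates the label."""
    return LEVELS[user_clearance] >= LEVELS[record_label]

record = {"customer": "ACME", "revenue": 1_200_000, "_label": "confidential"}

for clearance in ("internal", "restricted"):
    print(clearance, "->", can_read(clearance, record["_label"]))
# internal -> False, restricted -> True
```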

9. Audit, Audit, Audit

Granular auditing is a must in Big Data security, particularly after an attack on your system. The CSA recommends that organizations create a cohesive audit view following any attack, and be sure to provide a full audit trail while ensuring there's easy access to that data in order to cut down incident response time.

Audit information integrity and confidentiality are also essential. Audit information should be stored separately and protected with granular user access controls and regular monitoring. Make sure to keep your Big Data and audit data separate, and enable all required logging when you're setting up auditing (in order to collect and process the most detailed information possible). An open-source audit layer or query orchestrator tool such as ElasticSearch can make all of this easier to do.
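As one way to keep audit data separate from application data, here's a small Python sketch that writes structured JSON-lines audit events to a dedicated log, a format that tools like ElasticSearch can index directly. The file path and event fields are illustrative assumptions:

```python
import json
import logging
import time

# Audit events get their own handler and file ("audit.jsonl" is an
# illustrative path) so they stay separate from application logs.
audit = logging.getLogger("audit")
audit.addHandler(logging.FileHandler("audit.jsonl"))
audit.setLevel(logging.INFO)
audit.propagate = False  # keep audit events out of the app's log stream

def audit_event(actor, action, resource, allowed):
    """Write one structured, machine-indexable audit record."""
    audit.info(json.dumps({
        "ts": time.time(),
        "actor": actor,
        "action": action,
        "resource": resource,
        "allowed": allowed,
    }))

audit_event("jdoe", "read", "datalake/pii/customers.parquet", True)
```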

10. Data Provenance

Data provenance can mean a number of different things depending on who you ask. But what the CSA is referring to is provenance metadata generated by Big Data applications. This is a whole other category of data that needs significant protection. The CSA recommends first developing an infrastructure authentication protocol that controls access, while setting up periodic status updates and continually verifying data integrity by using mechanisms such as checksums.
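Here's a minimal Python sketch of checksum-based integrity verification: it re-hashes an artifact and compares the result against the value recorded in its provenance metadata (the streaming chunk size and record format are assumptions):

```python
import hashlib

def checksum(path):
    """SHA-256 of a file, streamed so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_provenance(path, recorded_digest):
    """Compare against the checksum recorded in the provenance metadata."""
    return checksum(path) == recorded_digest
```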

On top of that, the CSA's remaining best practices for data provenance echo the rest of our list: implement dynamic and scalable granular access controls, and implement encryption methods.

There's no one secret trick to ensuring Big Data security across your organization and every level of your infrastructure and application stack. When dealing in data batches this vast, only an exhaustively comprehensive IT security scheme and enterprise-wide user buy-in will give your organization the best chance to keep every last 0 and 1 safe and secure.

This article originally appeared on PCMag.com.