Big Data Basics: How to Build a Data Governance Plan

We've written a lot about the role of data in modern businesses. From startups and small to midsize businesses (SMBs) to large enterprises, data insights and analysis are more accessible to businesses of all sizes than ever before. This is, in part, thanks to the rise of self-service business intelligence (BI) and data visualization tools.

Before you can employ BI tools or run predictive analytics on a data set, however, there are a host of factors to square away. It starts with simply understanding what Big Data is, what it isn't (hint: not a crystal ball), and how to manage data storage, organization, permissions, and security within your enterprise data architecture. This is where data governance comes in. The processes by which you ensure governance within an enterprise differ depending on whom you ask. But, at its core, data governance is about data trust and accountability, married with comprehensive data security best practices.

I talked to Hortonworks and MapR, two of the biggest enterprise Hadoop vendors in the market. Scott Gnau, Chief Technology Officer at Hortonworks, and Jack Norris, Senior Vice President of Data and Applications at MapR, each explained what data governance means to their organizations. They discussed how to tackle the challenge of ensuring data governance within the complex data architectures and organizational hierarchies of a large enterprise.

What Exactly Is Data Governance and Why Do We Need It?

Governance means making sure enterprise data is authorized, organized, and permissioned in a database with as few errors as possible, while maintaining both privacy and security. It's not an easy balance to strike, particularly when the reality of where and how data is housed and processed is constantly in flux. MapR's Norris explained why businesses need to look at data governance from a higher level and focus on the larger data pipeline at play.

"When you start scaling the variety and velocity of the Big Data we're dealing with, you've got to have data governance but it's in this broader context. What's the data you have, who has access to it, and how are you managing the lineage of that data over time?" said Norris. "From a data governance standpoint, you can have different stages of the data that exist within a system that can be snapshotted so you can return at any point in time in the pipeline. It's about building auditability and access control into the data platform to make sure data discovery and analytics are transparent, whether you're a business manager looking at financial data sets or a data scientist working with raw upstream data."

Hortonworks' Gnau keyed in on a similar point. Whether you're dealing with a data warehouse or data lake architecture, data governance is about balancing opposing forces. On one side is unfettered data access to drive innovation and derive insights; on the other are the granular permissions and privacy controls that protect that data end to end.

"Compare and contrast the old world of traditional governance in the data space; it was a little bit easier," said Gnau. "Data used to be well-defined by job role or application. In the new world, you get the most value when data scientists have access to as much data as possible, and finding that happy medium is very important.

"It's driving a whole new paradigm in how you need to approach governance," added Gnau. "In this new world, I consider governance and security topics that need to be covered together. A lot of companies are still struggling to move through that to enable their data scientists to be effective in finding those new use cases while, at the same time, understanding how to handle security, privacy, governance—all the things that are important from a bottom-line perspective and also from a company reputation perspective."

How is an enterprise data governance plan supposed to encompass and satisfy all of those opposing forces? By tackling each requirement methodically, one step at a time.

How to Build a Data Governance Plan

Hortonworks, MapR, and Cloudera are the three biggest independent players in the Hadoop space. The companies have their own spheres of influence when it comes to data governance. MapR has released a number of white papers on the subject and built data governance throughout its Converged Data Platform, while Hortonworks has its own data security and governance solution and co-founded the Data Governance Initiative (DGI) in 2015. This led to the open-source Apache Atlas project that provides an open data governance framework for Hadoop.

But when it comes to how each vendor crafts comprehensive data governance and security strategies, Gnau and Norris both spoke along similar lines. The following are the combined steps that Hortonworks and MapR recommend businesses keep in mind when building a data governance plan.

1. The Big One: Granular Data Access and Authorization

Both companies agree that you can't have effective data governance without granular controls. MapR accomplishes this primarily through Access Control Expressions (ACEs). As Norris explained, ACEs use grouping and Boolean logic to control flexible data access and authorization, with role-based permissions and visibility settings. He said to think of it like a Gartner Hype Cycle model. On the Y-axis at the lower end are strict governance and low agility, and on the X-axis at the top end are higher agility and less governance.

"At the low level, you protect sensitive data by obfuscating it. At the top, you've got confidential contracts for data scientists and BI analysts," Norris said. "We tend to do this with masking capabilities and different views where you lock down raw data at the bottom as much as possible and gradually provide more access until, at the upper end, you're giving administrators broader visibility. But how do you give access to the right people?

"If you look at an access control list today, it'll say something like 'everyone in engineering can access this,'" added Norris. "But if you want a few select directors on a project within IT to have access or everyone except [a certain] person, you have to create a special group. It's an overly complicated and convoluted way to look at access."

That's where granting access rights to different levels and groups comes in, according to Norris. "We've combined ACEs with the various ways you can access data—through files, tables, streams, etc.—and implemented views with no separate copies of the data. So we're providing Views on the same raw data and the Views can have different levels of access. This gives you more integrated security that's more direct."
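The Boolean grouping Norris describes can be sketched in a few lines of Python. This is only an illustration of the idea of combining users, groups, and roles with AND, OR, and NOT; it is not MapR's ACE syntax or API, and every name in it (`User`, `group`, `role`, `all_of`, the sample people and groups) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class User:
    """Hypothetical identity record: a name plus group and role memberships."""
    name: str
    groups: FrozenSet[str] = frozenset()
    roles: FrozenSet[str] = frozenset()


# An access rule is simply a function that answers yes or no for a user.
Rule = Callable[[User], bool]


def user(name: str) -> Rule:
    return lambda u: u.name == name


def group(name: str) -> Rule:
    return lambda u: name in u.groups


def role(name: str) -> Rule:
    return lambda u: name in u.roles


def all_of(*rules: Rule) -> Rule:
    """Boolean AND over rules."""
    return lambda u: all(r(u) for r in rules)


def any_of(*rules: Rule) -> Rule:
    """Boolean OR over rules."""
    return lambda u: any(r(u) for r in rules)


def not_(rule: Rule) -> Rule:
    """Boolean NOT of a rule."""
    return lambda u: not rule(u)


# "Directors within IT, or everyone in engineering except one person",
# expressed directly instead of by creating a special-purpose group.
can_read = any_of(
    all_of(group("it"), role("director")),
    all_of(group("engineering"), not_(user("mallory"))),
)

alice = User("alice", frozenset({"it"}), frozenset({"director"}))
bob = User("bob", frozenset({"engineering"}))
mallory = User("mallory", frozenset({"engineering"}))
carol = User("carol", frozenset({"it"}))

print(can_read(alice), can_read(bob), can_read(mallory), can_read(carol))
```

The point of the sketch is the one Norris makes about access control lists: an exception such as "everyone except a certain person" becomes one extra term in the expression rather than a new group that someone has to maintain.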

Hortonworks handles granular access in a similar fashion. Gnau said the company integrates Apache Atlas for governance with Apache Ranger to handle authorization at an enterprise level through a single pane of glass. The key, he said, is the ability to grant access contextually, both to the database and on specific metadata tags, by using tag-based policies.

"Once someone is in the database, it's about guiding them through the data they should have relevant access to," said Gnau. "Ranger's security policies at the object level, fine-grained, and everywhere in between can handle that. Tying that security into governance is where things get really interesting.

"To scale in large organizations, you need to integrate those roles with governance and metadata tagging," added Gnau. "If I'm logging in from Singapore, perhaps there are different rules based on local privacy laws or corporate strategy. Once a company defines, sets, and understands those rules from a holistic top-down perspective, you can section off access based on specific rule sets while executing everything inside the core platform."
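To make the tag-based idea concrete, here is a minimal Python sketch of a policy check that combines a metadata tag, the caller's role, and the caller's location. It does not use the Apache Ranger or Apache Atlas APIs; the `TagPolicy` class, the `is_allowed` function, the role names, and the Singapore rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy attached to a metadata tag rather than to a table."""
    tag: str
    allowed_roles: FrozenSet[str]
    blocked_regions: FrozenSet[str] = frozenset()


def is_allowed(column_tags: Iterable[str], role: str, region: str,
               policies: Dict[str, TagPolicy]) -> bool:
    """Return True only if every tag on the column permits this role and region.

    A tag with no policy is denied. A column with no tags is treated as
    unrestricted in this sketch, which is a simplifying assumption.
    """
    for tag in column_tags:
        policy = policies.get(tag)
        if policy is None:
            return False
        if role not in policy.allowed_roles:
            return False
        if region in policy.blocked_regions:
            return False
    return True


# Assumed rule set: PII is visible to two roles, but never from region "SG".
POLICIES = {
    "PII": TagPolicy(
        tag="PII",
        allowed_roles=frozenset({"data_scientist", "privacy_officer"}),
        blocked_regions=frozenset({"SG"}),
    ),
}

print(is_allowed(["PII"], "data_scientist", "US", POLICIES))
print(is_allowed(["PII"], "data_scientist", "SG", POLICIES))
print(is_allowed(["PII"], "marketing", "US", POLICIES))
```

Because the policy hangs off the tag, any newly ingested column that is tagged "PII" picks up the same rules without anyone writing a new per-table permission.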

2. Perimeter Security, Data Protection, and Integrated Authentication

Governance doesn't happen without endpoint security. Gnau said it's important to build a good perimeter and firewall around the data that integrates with existing authentication systems and standards. Norris agreed that, when it comes to authentication, it's important for enterprises to sync with tried-and-tested systems.

"Under authentication, it's about how you integrate with LDAP [Lightweight Directory Access Protocol], Active Directory, and third-party directory services," said Norris. "We also support Kerberos username and passwords. The important thing is not to create a whole separate infrastructure, but it's how you integrate with the existing structure and leverage systems like Kerberos."

3. Data Encryption and Tokenization

The next step after securing your perimeter and authenticating all of the granular data access you're granting: Make sure files and personally identifiable information (PII) are encrypted and tokenized from end to end through your data pipeline. Gnau discussed how Hortonworks secures PII data.

"Once you get past the perimeter and have access to the system, being able to protect PII data is extremely important," said Gnau. "You need to encrypt and tokenize that data so, regardless of who has access to it, they can run the analytics they need to without exposing any of that PII data along the line."
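As a rough illustration of what Gnau means by running analytics without exposing PII, the Python sketch below replaces PII fields with deterministic tokens derived from a keyed hash (HMAC-SHA256 from the standard library). This is one simple, non-reversible flavor of tokenization chosen for the example, not a description of how Hortonworks implements it; the key, the field names, and the `tok_` prefix are assumptions.

```python
import hashlib
import hmac

# Assumption: in a real deployment the key would come from a key manager,
# never from source code.
SECRET_KEY = b"demo-key-not-for-production"

# Assumed set of fields that count as PII in this example.
PII_FIELDS = {"email", "ssn"}


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a value to a stable token; the same input always yields the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def tokenize_record(record: dict) -> dict:
    """Return a copy of the record with PII fields replaced by tokens."""
    return {
        field: tokenize(str(value)) if field in PII_FIELDS else value
        for field, value in record.items()
    }


orders = [
    {"email": "ann@example.com", "ssn": "123-45-6789", "amount": 40},
    {"email": "ann@example.com", "ssn": "123-45-6789", "amount": 25},
    {"email": "raj@example.com", "ssn": "987-65-4321", "amount": 10},
]
safe_orders = [tokenize_record(o) for o in orders]

# Analysts can still group by customer because the tokens are stable.
totals = {}
for order in safe_orders:
    totals[order["email"]] = totals.get(order["email"], 0) + order["amount"]
print(totals)
```

Stable tokens keep joins and group-by queries working on the protected data. The trade-off is that identical inputs remain linkable, so a production design would also need encryption at rest and in motion, key rotation, and tight control over who can ever map tokens back to real values.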

As for how you securely access encrypted data both in motion and at rest, MapR's Norris explained that it's important to keep in mind use cases such as backup and disaster recovery (DR) as well. He discussed a concept of MapR's called logical volumes, which can apply governance policies to a growing cluster of files and directories.

"At the lowest level, MapR has architected WAN [wide area network] replication for DR, and time-consistent snapshots across all the data that can be set up at different frequencies by directory or volume," said Norris. "It's broader than just data governance. You can have a physical cluster with directories, and then the logical volume concept is a really interesting management unit and way to group things while controlling for data protection and frequency. It's another arrow in the IT admin's data governance quiver."

4. Constant Auditing and Analytics

Looking at the broader governance picture, both Hortonworks and MapR said the strategy doesn't work without auditing. That level of visibility and accountability into every step of the process is what allows IT to actually "govern" data as opposed to simply setting policies and access controls and hoping for the best. It's also how enterprises can keep their strategies current in an environment where both how we see data and the technologies we use to manage and analyze it are changing every day.

"The final piece of a modern governance strategy is logging and tracking," said Gnau. "We're in the infancy of Big Data and IoT [Internet of Things], and it's critical to be able to track access and recognize patterns in the data so that, as the strategy needs to be updated, we're ahead of the curve."

Norris said auditing and analysis can be as simple as tracking JavaScript Object Notation (JSON) files. Not every piece of data will be worth tracking and analyzing, but your business won't know which pieces are until you identify a game-changing insight, or until a crisis happens and you need to run an audit trail.

"Every JSON log file is opened up for analysis and we have Apache Drill to query JSON files with the schemas, so it's not a manual IT step to set up metadata analysis," said Norris. "When you include all data access events and every administrative action, there's a wide range of analytics possible."
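The kind of analysis Norris describes can be approximated with nothing more than the Python standard library. The sketch below reads newline-delimited JSON audit events and counts denied operations per user. The log layout and field names (`timestamp`, `user`, `operation`, `path`, `status`) are invented for the example and are not MapR's actual audit schema; a tool such as Apache Drill would run the equivalent query directly over the files.

```python
import json
from collections import Counter

# Assumed sample log: one JSON object per line, with hypothetical field names.
SAMPLE_LOG = """\
{"timestamp": "2016-05-01T10:00:00Z", "user": "alice", "operation": "read", "path": "/finance/q1", "status": "ok"}
{"timestamp": "2016-05-01T10:01:00Z", "user": "bob", "operation": "read", "path": "/finance/q1", "status": "denied"}
{"timestamp": "2016-05-01T10:02:00Z", "user": "bob", "operation": "write", "path": "/finance/q1", "status": "denied"}
{"timestamp": "2016-05-01T10:03:00Z", "user": "carol", "operation": "read", "path": "/hr/salaries", "status": "ok"}
"""


def parse_events(text: str) -> list:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def denied_by_user(events: list) -> Counter:
    """Count denied operations per user to surface unusual access patterns."""
    return Counter(e["user"] for e in events if e["status"] == "denied")


def access_trail(events: list, path: str) -> list:
    """Return (timestamp, user, operation) for every event touching a path."""
    return [
        (e["timestamp"], e["user"], e["operation"])
        for e in events
        if e["path"] == path
    ]


events = parse_events(SAMPLE_LOG)
print(denied_by_user(events).most_common(1))
print(access_trail(events, "/hr/salaries"))
```

The first query is the pattern-recognition side of auditing (who keeps hitting a wall?), and the second is the audit trail you would pull after an incident.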

5. A Unified Data Architecture

Ultimately, the technology officer or IT admin who oversees an enterprise data governance strategy should think about the specifics of granular access, authentication, security, encryption, and auditing. But that person shouldn't stop there. He or she should also think about how each of these components feeds into the larger data architecture, and about how that infrastructure needs to be scalable and secure, from data gathering and storage all the way to BI, analytics, and third-party services. Gnau said data governance is as much about rethinking strategy and execution as it is about the tech itself.

"It goes beyond a single pane of glass or a collection of security rules," said Gnau. "It's a single architecture where you create these roles and they sync across the entire platform and all the tools you bring into it. The beauty of securely governed infrastructure is the agility with which new methods are created. At each platform level, or even in a hybrid cloud environment, you've got a single point of reference to understand how you've implemented your rules. All data passes through this layer of security and governance."

This article originally appeared on PCMag.com.