Data Lakes, Explained


The Big Data revolution has redefined the way enterprises work; data underpins everything. Not only have open-source tools such as Apache Hadoop and Spark made vast quantities of data easier to collect, process, and store in real time, but business intelligence (BI) and data visualization tools have begun to help us scratch the surface of analyzing and transforming that data to inform core business decisions.


Yet despite how much Big Data and BI technology has evolved, we're still dealing with such massive volumes of constantly compounding data that finding the right points to analyze still feels like diving for needles in a never-ending haystack. The solution? Redesign the haystack.

Enter data lakes, a newer type of cloud-based enterprise architecture that stores data in a more scalable way, leaving it open to exploration, experimentation, and manipulation rather than locked in rigid schemas and silos. Nasry Angel, an enterprise architecture researcher at Forrester Research, explained why enterprises are embracing data lake architectures.

"It sounds cliché, but when you think about an effective modern data environment, it's a lot more experimental," said Angel. "You need to be able to learn fast and fail fast. In the past, managing data, especially in a warehouse, was all about quality, down to the decimal point; making sure everything was completely accurate and true. It's called chasing a single version of the truth. Then generating a pixel-perfect report and blasting it out to 5,000 users.

"Nowadays, it's a more scientific process. You walk in with a hypothesis about the data you want to test and you want to be able to play with the data, mix and match, to try out different things before you go and productize something."

What's In a Data Lake?
A data lake is a storage repository. But unlike a data warehouse or "data mart," Angel explained, a data lake distributes data over multiple nodes rather than confining it to the fixed, structured, schema-bound environment of a warehouse (see infographic below).


"A data lake allows you to apply a schema when you read the data versus a data warehouse that requires you to do a schema on write. So, essentially, a data warehouse requires you to model the data before you understand its context, which doesn't really make sense," said Angel.


Source: JustOne Database, Inc.

"Typically, in a warehouse, you have IT professionals coming up with what they think are the best data models, and they're not the eventual users of the data. You can quickly see how that hinders productivity and business value," he added. "Ultimately, you and the business users need to be the ones that make decisions about the structure of data, and, in a data lake, you can first explore and figure out what's there and then figure out a schema to best organize it."
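The schema-on-read idea Angel describes can be illustrated in a few lines of plain Python. This is a minimal sketch, not how any particular lake platform works: the records and field names here are hypothetical, standing in for raw, heterogeneous data landing in a lake with no schema enforced at write time, with the analyst projecting a schema only at query time.

```python
import json

# Hypothetical raw records as they might land in a lake: heterogeneous
# shapes, no schema enforced when the data was written.
raw_records = [
    '{"user": "a1", "page": "/picasso", "ts": 1700000000}',
    '{"user": "a2", "clicked": "/monet", "ts": 1700000050, "referrer": "email"}',
]

def read_with_schema(lines, fields):
    """Schema-on-read: project each raw record onto only the fields
    the analyst cares about at query time, tolerating missing keys."""
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

# The same raw data can be re-read later with a different field list,
# without migrating or re-modeling anything.
rows = list(read_with_schema(raw_records, ["user", "ts"]))
```

A warehouse would instead force both records into one agreed-upon table structure before they were ever stored, which is the up-front modeling step Angel argues hinders exploration.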

Data lakes are typically built on Hadoop, and enterprise Hadoop distributions such as Hortonworks and MapR offer data lake architectures. Businesses can also build data lakes by using Infrastructure-as-a-Service (IaaS) clouds including Amazon Web Services (AWS) and Microsoft Azure. Amazon's Elastic Compute Cloud (EC2) supports data lakes while Microsoft has a dedicated Azure Data Lake platform to store and analyze real-time data. Angel said data lakes are maturing to the point within the Big Data space where businesses can begin investing in them with reasonable confidence.

"A few years back, Hadoop was all the rage. Now we're getting to a point where Hadoop is commoditized," said Angel. "The question is not if Hadoop but when, and what you're going to do with it. What types of applications are you going to build on top of Hadoop once you've gotten the data into a common place like a data lake? At this point, it's about using the data to develop applications to meet your specific business needs."

Building Atop a Data Reservoir
The most exciting part about Big Data is all of the possibility it unlocks. Once you've set up a data lake in which to play and experiment with different data combinations and business outcomes, you can begin layering innovative analysis techniques on top.

Machine learning (ML) algorithms are already becoming part of the fabric of cloud infrastructure, and researchers are continually improving deep learning techniques and neural networks to train machines and data systems to recognize complex patterns. Predictive analytics is being baked into more and more data tools and enterprise platforms as well, used for everything from predictive scoring and automated segmentation for customer relationship management (CRM) to identifying financial market trends and preemptively catching mechanical failures in machinery.

All of this happens on top of whatever data store your business is feeding and scaling according to its needs. Angel talked about some of the real-world use cases in which he's seen data lakes change the way organizations function.

"I was working with a publishing company that has a portfolio of different magazines—they have a publication for lawyers, another for accountants, another for consultants, etc.—and each publication had its own data warehouse. Effectively, each publication had its own silo," explained Angel.

"So we extracted all the data from a warehouse and put it into a data lake, and the data lake allowed them to see across silos. They were able to explore the data and do data discovery, and realized that across all these different publications, customers from every magazine were interested in cybersecurity. The readership for cybersecurity was strong across all these different roles. So what did they do? They made cybersecurity the theme of their annual conference."

Angel's second example comes from e-commerce. Another client, an online art retailer, was dumping a ton of information into a data lake and using it not only as a repository but as a canvas of sorts on which to put together business insights. The retailer brought transaction data (orders, invoices, payments, etc.), clickstream data (each website visitor's succession of clicks and pages), and data from the retailer's data warehouse all into the lake, and used them in concert to combat shopping cart abandonment and improve conversions.

"You want to build on top of a data lake and use it to formulate complex business insights," said Angel. "The art retailer was able to look at a customer's clickstream data and match clicks with customer profiles, then use transactional data to see what the customer bought in the past and use those insights to run very specific email campaigns. So, if a customer abandoned their cart, the retailer could follow up two hours later and say, 'We saw you were checking out this Picasso; here's the link if you want to look at it again.'"
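The retailer's pipeline, as Angel describes it, boils down to joining two datasets by customer: clickstream events to spot abandoned carts, and transaction history to personalize the follow-up. The sketch below uses hypothetical, heavily simplified in-memory data to show the shape of that join; a real lake would run this logic at scale over far messier inputs.

```python
# Hypothetical simplified datasets as they might sit in a lake:
# clickstream events and past purchases, both keyed by customer id.
clickstream = [
    {"customer": "c1", "event": "add_to_cart", "item": "picasso-print"},
    {"customer": "c1", "event": "leave_site"},
    {"customer": "c2", "event": "add_to_cart", "item": "monet-print"},
    {"customer": "c2", "event": "checkout"},
]
transactions = {"c1": ["matisse-print"], "c2": []}

def abandoned_carts(events):
    """Flag customers who added an item to a cart but never checked out."""
    carts, completed = {}, set()
    for e in events:
        if e["event"] == "add_to_cart":
            carts[e["customer"]] = e["item"]
        elif e["event"] == "checkout":
            completed.add(e["customer"])
    return {c: item for c, item in carts.items() if c not in completed}

def campaign_targets(events, history):
    """Join abandonment flags with purchase history to build
    the inputs for a personalized follow-up email."""
    return [
        {"customer": c, "abandoned_item": item,
         "past_purchases": history.get(c, [])}
        for c, item in abandoned_carts(events).items()
    ]

targets = campaign_targets(clickstream, transactions)
```

Here only customer "c1" ends up targeted, since "c2" completed checkout; the joined record carries both the abandoned item (for the "we saw you were checking out this Picasso" email) and past purchases (for further tailoring).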

Data lakes apply across all sorts of business use cases. But, for a Chief Technology Officer (CTO) or Chief Information Security Officer (CISO) considering migrating to the architecture, Angel stressed that data warehouses aren't yet obsolete, not by any stretch. For most enterprise organizations, whether they run on a cloud provider or a custom Hadoop distribution, the answer is that they still need both.

Data lakes open up unparalleled insights by removing the requirement to conform data to a particular schema up front, and they come with a much lower total cost of ownership thanks to cheap, flexible cloud storage such as AWS that scales up and down, with payment only for the processing power actually used. Running a data warehouse is more expensive and, consequently, makes IT professionals more selective about what data comes in and out. But for an enterprise's most mission-critical data, that's not a bad thing.

"The data warehouse has advantages in terms of security and being a very easy tool to control data governance," said Angel. "So you still want to keep your most sensitive information in the warehouse, the mission-critical stuff. But when it comes to new business opportunities and discovering hidden insights, you want to be leveraging a data lake."

This article originally appeared on PCMag.com.