AI Databases: What They Are and Why Your Business Should Care

Data and business intelligence (BI) are two sides of the same coin. Advancements in storage, processing, and analysis have democratized data to the point where you don't need to be a database professional or data scientist to work with massive data sets and derive insights. There's still a learning curve, but self-service BI and data visualization tools are redefining the way businesses leverage all of the data they collect into actionable analytics. However, there is a difference between a BI or database company hawking advanced analytics and an artificial intelligence (AI) database that's purpose-built for training machine learning (ML) and deep learning models.

ML algorithms are being woven into the fabric of much of today's software. Consumer experiences are melding with AI through virtual assistants and, in business software, there are examples such as Salesforce Einstein that act as an intelligent layer beneath the company's entire customer relationship management (CRM) portfolio. Technology giants, including Google and Microsoft, are pushing our intelligent future even further, not only with research but by rewriting how their tech works from the ground up with AI.

One of the challenges with training machine and deep learning models is the sheer data volume and processing power you need to train a neural network, for example, on complex pattern recognition in fields such as image classification or natural language processing (NLP). Hence, AI databases are beginning to pop up in the market as a way to optimize the AI learning and training process for businesses. We spoke with GPU-accelerated relational database provider Kinetica, which has built an AI database of its own, and PCMag's resident BI and database expert Pam Baker to demystify what an AI database is and how it works compared to traditional databases. More importantly, we asked for their help to sort through the hype and marketing speak to determine whether or not this emerging tech has real business value.

What Are AI Databases?

The rapidly changing nature of the AI space can make it difficult to establish terminology. You often hear terms such as ML, deep learning, and AI used interchangeably when, in fact, ML and deep learning are still-developing techniques under the larger umbrella of AI. As such, Baker said there are two vastly different definitions of what an AI database is depending on who you talk to: one practical and the other more pie-in-the-sky.

"There's a kind of loose consensus in the industry that an AI database would be one that would work entirely off of natural language queries. The user interface would be such that you wouldn't have to rely on search terms and key phrases to find the information you need, allowing the user to summon data sets with NLP," said Baker. "You could make a very limited argument that IBM Watson can pose natural language queries to the system, but you have to be connected to the data already and choose the data yourself. So, right now, that definition is a stretch."

The more practical definition, and the subject of this explainer, is essentially using a purpose-built database to speed up ML model training. A number of tech companies are already developing dedicated AI chips to alleviate the heavy processing load in new hardware products as vendors roll out more AI-based features that require significant compute power. On the data side, using an AI database can help you better wrangle the volume, velocity, and complex data governance and management challenges associated with training ML and deep learning models to save time and optimize resources.

Image credit: Todd Jaquith at Futurism.com.

"Right now there are a lot of efforts to speed up ML training through several different tactics," explained Baker. "One is to separate the infrastructure from the AI researchers doing the coding, so that automated functions are handling the infrastructure and training the ML model. So, instead of spending something like three months [to train a model], you may be looking at 30 days or 30 minutes."

Kinetica breaks that idea down into an integrated database platform optimized for ML and deep learning modeling. The AI database combines data warehousing, advanced analytics, and visualizations in an in-memory database. Mate Radalj, Vice President and Principal Software Engineer of Kinetica's Advanced Technology Group, explained that an AI database should be able to simultaneously ingest, explore, analyze, and visualize fast-moving, complex data within milliseconds. The goal is to lower costs, generate new revenue, and integrate ML models so that businesses can make more efficient, data-driven decisions.

"An AI database is a subset of a general database," said Radalj. "Right now, AI databases are very popular. But a lot of solutions use distributed components. [Apache] Spark, [Hadoop] MapReduce and HDFS are always spinning back and forth rather than in-memory. They don't have the confluence of factors like our database, which was built from the ground up with tightly integrated CPUs and GPUs on a single platform. The high-level benefit for us is faster provisioning and a lower hardware footprint of model-based training, with a quick turnaround and analytics integrated into the same platform."

How an AI Database Works

There are a number of examples of AI databases in practice. Microsoft Batch AI offers cloud-based infrastructure for training deep learning and ML models running on Microsoft Azure GPUs. The company also has its Azure Data Lake product to make it easier for businesses and data scientists to process and analyze data across a distributed architecture.

Another example is Google's AutoML approach, which is fundamentally re-engineering the way ML models are trained. Google AutoML automates ML model design to generate new neural network architectures based on particular data sets, and then tests and iterates on those thousands of times to code better systems. In fact, Google has reported that models designed by AutoML can outperform those built by human researchers.

"Look at Google AutoML: ML writing ML code so you don't even need people," said Baker. "This gives you an idea of what an extreme difference there is in what vendors are doing. Some are trying to pass off advanced analytics as ML—and it isn't. And others are doing ML at such an advanced level that's beyond what most businesses can comprehend at the moment."

Then there's Kinetica. The San Francisco-based startup, which has raised $63 million in venture capital (VC) funding, provides a high-performance SQL database optimized for fast data ingestion and analytics. Kinetica is what Radalj described as a massively parallel processing (MPP) distributed database and computing platform in which every node features co-located in-memory data, CPU, and GPU.

What makes an AI database different from a traditional database, Radalj explained, comes down to three core elements:

  • Accelerated data ingestion,
  • Co-locality of in-memory data (parallel processing across database nodes), and
  • A common platform for data scientists, software engineers, and database administrators to iterate and test models faster and apply results directly to analytics.

For everyone reading this who isn't a database or AI model training expert, Radalj broke each of these three core elements down and explained how the AI database ties to tangible business value. Data availability and data ingestion are key, he said, because the ability to process real-time streaming data lets businesses take fast action on AI-driven insights.

"We have a retail customer [that] wanted to track selling rates by store, every five minutes," said Radalj. "We wanted to use AI to forecast, based on the last few hours of historical data, whether they should replenish inventory and optimize that process. But to do that machine-driven inventory replenishment requires the [database] to support 600-1200 queries per second. We're a SQL database and an AI database, so we can ingest data at that rate. Us meeting that business mission resulted in an application that drove more ROI [return on investment]."

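To make the retail example concrete, here is a minimal Python sketch of the two steps Radalj describes: rolling sales up into five-minute selling rates per store, then using the most recent windows to decide whether to replenish. It is illustrative only and is not Kinetica's implementation. The function names, the tuple layout of a sale, the three-window average, and the 12-window (one-hour) horizon are all assumptions made for this example; a real deployment would more likely run the aggregation as SQL inside the database and forecast with a trained model rather than a moving average.

```python
from collections import defaultdict
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=5)  # assumed window size, taken from the quote


def bucket_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its five-minute window."""
    return ts - timedelta(
        minutes=ts.minute % 5, seconds=ts.second, microseconds=ts.microsecond
    )


def selling_rates(sales):
    """Sum units sold per (store, five-minute window).

    `sales` is an iterable of (store_id, timestamp, units) tuples
    (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(int)
    for store_id, ts, units in sales:
        totals[(store_id, bucket_start(ts))] += units
    return dict(totals)


def should_replenish(rates, store_id, now, on_hand, windows=3, horizon=12):
    """Naive forecast: mean of the last `windows` completed buckets,
    projected over `horizon` future buckets, compared with stock on hand."""
    current = bucket_start(now)
    recent = [
        rates.get((store_id, current - BUCKET * i), 0)
        for i in range(1, windows + 1)
    ]
    forecast = sum(recent) / windows * horizon
    return forecast > on_hand, forecast
```

Every call to should_replenish is the kind of small, frequent lookup that, multiplied across stores and products, adds up to the hundreds of queries per second Radalj mentions.
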
Baker agreed that ML requires a vast amount of data, so ingesting it quickly would be very important for an AI database. The second factor, the concept of "co-locality of in-memory data," takes a bit more explanation. An in-memory database stores data in the main memory rather than in separate disk storage. It does so to process queries faster, particularly in analytics and BI databases. By co-locality, Radalj means that Kinetica doesn't separate CPU and GPU compute nodes from storage nodes.

As a result, the AI database supports parallel processing, which mimics the human brain's ability to process multiple stimuli, while also remaining distributed across a scalable database infrastructure. This avoids the larger hardware footprint that results from what Radalj called "data shipping," or the need to send data back and forth between different database components.

"Some solutions use an orchestrator like IBM Symphony to schedule work across various components whereas Kinetica stresses function shipping against co-located resources, with advanced optimization to minimize data shipping," said Radalj. "That co-locality lends itself to superior performance and throughput, especially for highly concurrent heavy querying on large data sets."

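The difference between data shipping and function shipping can be shown with a toy Python sketch. The Node class and both functions are hypothetical and do not represent any real Kinetica or IBM API; each Node simply stands in for a database node holding its partition of a table in memory. The second value each function returns counts how many items had to travel to the coordinator.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One hypothetical database node holding a partition in memory."""
    rows: list

    def run(self, fn):
        # Function shipping: the computation runs where the data lives.
        return fn(self.rows)


def data_shipping_sum(nodes):
    """Pull every row to a coordinator, then aggregate there."""
    shipped = [row for node in nodes for row in node.rows]
    return sum(shipped), len(shipped)


def function_shipping_sum(nodes):
    """Send the aggregate to each node; only partial sums travel back."""
    partials = [node.run(sum) for node in nodes]
    return sum(partials), len(partials)
```

Both functions return the same total, but the first moves one item per row while the second moves one item per node. On a table with billions of rows spread over a handful of nodes, that gap is the cost Radalj is describing.
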
In terms of the actual database hardware, Kinetica is partnered with Nvidia, which has an expanding lineup of AI GPUs, and is exploring opportunities with Intel. Radalj also said the company is keeping an eye on emerging AI hardware and cloud-based infrastructure such as Google's Tensor Processing Units (TPUs).

Finally, there's the idea of a unified model training process. An AI database is only effective if those benefits of faster ingestion and processing serve larger, business-oriented goals for a company's ML and deep learning efforts. Radalj refers to Kinetica's AI database as a "model pipeline platform" that performs data science-driven model hosting.

This all lends itself to faster testing and iteration to develop more accurate ML models. On this point, Baker said collaborating in a unified way can help all of the engineers and researchers working to train an ML or deep learning model iterate faster by combining what works, as opposed to continually reinventing all of the steps in the training process. Radalj said the goal is to create a workflow in which the faster batch ingestion, streaming, and querying generate model results that can immediately be applied to BI.

"Data scientists, software engineers, and database administrators have a single platform where work can be cleanly delineated on data science itself, software program writing, and SQL data models and queries," said Radalj. "People work more cleanly together in those various domains when it's a common platform. The goal more often than not with running ML and deep learning [models] is, you want to use the results of that—the co-efficients and variables—in conjunction with analytics, and use the output for things like scoring or to predict something useful."

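As a rough illustration of what using "the co-efficients and variables" for scoring can look like, the Python sketch below applies the coefficients of an already-trained logistic regression model to rows of data and ranks them. The model type, the function names, and the column names in the usage note are assumptions chosen for this example; the article does not say which kinds of models Kinetica's customers run.

```python
import math


def score(coefficients, intercept, features):
    """Apply trained logistic-regression coefficients to one row.

    `coefficients` maps feature name -> weight; features missing
    from the row are treated as 0.
    """
    z = intercept + sum(
        weight * features.get(name, 0.0)
        for name, weight in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def rank_by_score(rows, coefficients, intercept):
    """Score every row and return (id, score) pairs, highest first."""
    scored = [(row["id"], score(coefficients, intercept, row)) for row in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With hypothetical weights such as {"visits": 0.5, "returns": -1.0}, a customer row with more visits and fewer returns ranks higher. The point of a common platform is that this scoring step can run next to the data that the analytics queries already use.
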
Hype or Reality?

The bottom-line value of an AI database, at least in the way Kinetica defines it, is in optimizing compute and database resources. This, in turn, lets you create better ML and deep learning models, train them faster and more efficiently, and maintain a through line to how that AI will be applied to your business.

Radalj gave the example of a fleet management or trucking company. In this instance, an AI database could process massive streams of real-time information from a fleet of vehicles. Then, by modeling that geospatial data and combining it with analytics, the database could dynamically re-route trucks and optimize routes.

"It's easier to quickly provision, prototype, and test. The word 'modeling' is thrown about in AI, but it's all about cycling through different approaches—the more data, the better—[and] running them again and again, testing, comparing, and coming up with the best models," said Radalj. "Neural networks have been given life because there's more data than ever before. And we're learning to be able to compute through it."

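The cycle Radalj describes, trying several approaches, testing, comparing, and keeping the best, can be sketched in a few lines of Python. The three candidate "models" here are deliberately trivial forecasters and the names are made up for this example; in practice each candidate would be a trained ML or deep learning model, but the comparison loop has the same shape.

```python
def mean_abs_error(predict, series, warmup=3):
    """Walk forward through the series, predicting each point from its history."""
    errors = [
        abs(predict(series[:i]) - series[i])
        for i in range(warmup, len(series))
    ]
    return sum(errors) / len(errors)


# Hypothetical candidates: each takes the history so far and predicts the next value.
CANDIDATES = {
    "last_value": lambda h: h[-1],
    "mean_of_3": lambda h: sum(h[-3:]) / 3,
    "linear_trend": lambda h: h[-1] + (h[-1] - h[-2]),
}


def best_model(series):
    """Score every candidate on the same data and return the lowest-error one."""
    scores = {
        name: mean_abs_error(fn, series) for name, fn in CANDIDATES.items()
    }
    return min(scores, key=scores.get), scores
```

The faster a platform can rerun a loop like this over fresh data, the more candidates a team can afford to compare.
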
Ultimately, Kinetica's co-located database and model pipeline platform are but one approach in a space that can mean a lot of different things depending on who you ask. Baker said the challenge for the buyer in a market that's still evolving and experimental is to figure out exactly what an AI database vendor is pitching.

"As a business concept, deep learning, ML, and all of that is a solid concept. What we're working out are tech issues that are solvable, even if we haven't solved them yet," said Baker. "That's not to say this is a mature space because it is definitely not. I would say 'buyer beware' because something pitched as ML may or may not be. It might just be garden-variety advanced analytics."

As to whether AI databases are all hype right now or whether they represent an important trend for where business is going, Baker said it's a bit of both. She said Big Data, as a marketing term, is out of favor now, and that the market is conflating advanced, data-driven analytics with true ML and deep learning algorithms. Regardless, whether you're talking about a database for ML modeling or the self-aware AIs dreamed up by pop culture, it all begins and ends with data.

"Data will be used in business until time ends; it's just that central to doing business," said Baker. "When you're talking in terms of science fiction, AI is a self-realized intelligence. That's when you start talking about singularities and robots taking over the world. Whether that happens or not, I don't know. I'll leave that to Stephen Hawking."

This article originally appeared on PCMag.com.