How to Apply Big Data to the Real World

On this edition of Fast Forward, I spoke to Hicham Oudghiri, the CEO and co-founder of Enigma, a company that specializes in collecting and making sense of large data sets. Enigma is an operational data management and intelligence company for private clients, but it is perhaps best known for Enigma Public, a collection of searchable, publicly accessible datasets that include everything from the salaries of White House Office staffers to New York City restaurant inspections. We spoke about the power of big data, the limits of consumer privacy, and the future of our data-driven world.

Why don't you explain to me a little bit about what being an open data company means today?

Absolutely. We started out just collecting a massive amount of public data anywhere we could find it, with the mission really being to try to connect very disparate facts about the world. We realized, in the process, that just as much as access to this underlying data was broken, this pattern was reverberating for people's own data, for public-private data reporting schemes like in regulatory environments. Really, what we brought was this notion of open data as an operational model everywhere we went.

Our sweet spot today is cultivating this massive asset repository of public data and bringing it to bear in actual problem environments often behind the firewall for enterprises. Though we collect and distribute a tremendous amount of data, we've found that taking the next step forward of actually interpreting that data and linking it to private data really helps scale the impact of some of the problems we wanted to solve.

People hear about open data sets, public data sets, private data sets. What kind of data sets are we talking about here?

We're talking about source data, official data, things that government agencies would publish, things that international agencies would publish, everything that's disparate, from corporate registration records and property assessments to H-1B visas or cargo container shipments. Definitely not talking about things like LinkedIn data, which has been a huge topic of debate recently as to whether or not that's even a public data set. There was that lawsuit with much contention recently.

But we're talking mostly about official source data, where there's been a mandate and a kind of formal legal approval to put this out into the public domain, mostly for increasing transparency in the economic and trade system. It's very important for us to know, for instance, from an accountability standpoint, what our government spends with the various private companies, or how visas are distributed amongst companies. That data is often collected by the government for alternative purposes like reporting, planning, resource allocation, and then given back to the public for this secondary and often tertiary benefit. The most popular example being just weather data, right?

All of the weather data that we collect comes from official sources, or GPS as a technology.

So you take all those public data sets and then you can merge them with private data sets that a company will give you specifically and really see the insights between combining the two?

Yes, very often. Think about a canonical use case where you're trying to do something like figure out if a company is even real. If it's a small company, take, say, a restaurant or a small business. Very often, the sort of profile they would have on them is extremely thin. But if you were to look at things like their liquor licenses or even Department of Labor inspections or health record inspections, you get a much more granular picture of who they are.

Often, that helps these companies kind of instantiate that they're even real for getting their access to credit, for getting insured, these sort of things. Moving from the, "Here's your 18-page application," and a very annoying process through seven different compliance sets, to something that can happen online in an automated way and a less risk-worthy way in general.
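The idea of checking a thin business profile against public records such as liquor licenses and inspections can be sketched roughly as follows. This is a minimal illustration, not Enigma's actual method: the record fields (`source`, `name`, `address`), the sample data, and the exact-match-after-normalization rule are all assumptions made for the example. Real record linkage usually needs fuzzier matching.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace so near-identical strings compare equal."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(cleaned.split())

def corroborating_sources(applicant, public_records):
    """Return the sorted names of public data sets that contain a record matching the applicant."""
    key = (normalize(applicant["name"]), normalize(applicant["address"]))
    return sorted({
        record["source"]
        for record in public_records
        if (normalize(record["name"]), normalize(record["address"])) == key
    })

# Hypothetical sample data, invented for illustration only.
applicant = {"name": "Joe's Diner, LLC", "address": "12 Main St."}
public_records = [
    {"source": "liquor_license", "name": "JOES DINER LLC", "address": "12 Main St"},
    {"source": "health_inspection", "name": "Joe's Diner LLC", "address": "12 main st"},
    {"source": "labor_inspection", "name": "Other Cafe Inc", "address": "99 Elm Ave"},
]

matches = corroborating_sources(applicant, public_records)
print(matches)
```

The more independent sources that corroborate the same name and address, the less thin the profile becomes.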

So instead of just typing them into Google to see if they have a website and that they're real, you can have all these other data sets validate for even basic stuff?

Absolutely.

We were talking, before we went live, about Ozark, so your favorite show, my new favorite show, and the idea of using these data sets for compliance and for financial reporting and even to hunt down money launderers.

Yeah. First of all, one of the best shows out there. Huge plug to Netflix, which has become a first-in-class Hollywood studio.

They've paid for it. They've bought their way into that market.

They certainly have. But the show is about this Jason Bateman character who finds himself as a money launderer to this drug cartel. The catch is that he saves his life by saying that he's going to go to the Ozarks and find new channels to launder money through. He starts buying into these sleepier businesses and then passing through a variety of costs.

The money-laundering problem is a huge theoretical problem in that, honestly, you're looking at patterns of activity amongst different merchants or consumers of financial services and also the connections in between them. So you'll have like a registered agent, obviously, someone like Jason Bateman, who's going around and doing this for a couple of businesses. He's buying-in privately to them and starting to get his name on a variety of different forms, and you'll notice that pattern of activity. This is something that banks have to fight against, obviously, because it's a detriment to the system and they're on the hook for doing this.

Crime has gone just as digital and decentralized as music has. This is a much bigger problem. There isn't one big mob family that the government can be lurking around for months and getting them Capone-style. This is an all-out chase on many fronts. We've helped and worked to bring public data to bear on that problem, but also bringing our technology that we've used to aggregate all of this public data to bear on that problem, just because the banks have a lot of technological uplift to do to merge their own data sets into powerful, contextual clues for these investigators that they have on staff.
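The pattern described above, one person's name showing up on the paperwork of several businesses, can be illustrated with a simple grouping. This is only a sketch under stated assumptions: the filing fields (`agent`, `business`), the sample filings, and the threshold of three businesses are hypothetical, and nothing here reflects how any bank or Enigma actually screens for money laundering.

```python
from collections import defaultdict

def agents_on_many_filings(filings, min_businesses=3):
    """Group filings by agent and keep agents linked to at least `min_businesses` distinct businesses."""
    businesses_by_agent = defaultdict(set)
    for filing in filings:
        businesses_by_agent[filing["agent"]].add(filing["business"])
    return {
        agent: sorted(businesses)
        for agent, businesses in businesses_by_agent.items()
        if len(businesses) >= min_businesses
    }

# Hypothetical filings, invented for illustration only.
filings = [
    {"agent": "Agent A", "business": "Lakeside Bar"},
    {"agent": "Agent A", "business": "Shoreline Motel"},
    {"agent": "Agent A", "business": "Hilltop Chapel Supplies"},
    {"agent": "Agent A", "business": "Lakeside Bar"},  # repeat filing, counted once
    {"agent": "Agent B", "business": "Corner Bakery"},
]

flagged = agents_on_many_filings(filings)
print(flagged)
```

A result like this is a lead for an investigator, not proof of anything: plenty of legitimate registered agents serve many businesses.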

I feel like we're at a point now where we've got all this public data created by government agencies. We've got all these private data sets. Every company has multiple data sets and many different formats, often, within the same company. Yet, there's not a lot of standardization, and making them work together is actually a major challenge.

It's a huge challenge, and probably one of the biggest theses that we have at Enigma is a big divide. One of my investors called it this way - there's a world where data is instrumented in bits and there's a world where it's instrumented in atoms. The tech companies, Google, Facebook, Amazon, they've all done an amazing job taking the data that they get from your activity browsing the web and creating these new services like search and better e-commerce experiences. But that data all exists. It's digitally native. It's just listening to you on the web. The web is a protocol, and those protocols were designed to speak to each other.

But when you have this data that is instrumented in atoms, or the real world, like someone going into a bank in the Ozarks and asking for a small loan, that looks different than someone else walking into a different bank branch, or a cargo container ship coming in that's asking for the name of the company that's doing the shipping. All of this data was designed - or not designed - to speak to each other so there's a huge problem of stitching this data together. I think it will take these less purely tech industries a longer time to reap the benefits of what you've seen in tech with big data. But when they do, I think it'll change a lot of how we live day-to-day in a pretty impactful way.

I also get the sense that, when there's a financial motive to stitching together these data sets and creating these insights, businesses find a way to pay for it and they find a way to get it done. Credit card companies are one of the first companies to be able to identify patterns and identify fraud. I feel like the public sector is pretty far behind when it comes to creating insights from these amounts of data. Is that a fair assessment?

The private sector has always, in some senses, had an edge in operationalizing technology. The financial incentive is huge and also the operating style of a smaller unit. The US government is just factually one of the biggest organizations in the world, and getting anything done is really a people problem. Making sure incentives are aligned, making sure people are taking the right amount of risk.

But we've seen the government do some very innovative things. We collaborated with the City of New Orleans, I think it was like two years ago, to help them basically predict where the slum landlords were, mostly to install smoke detectors in these homes. Post-Katrina, you had this huge amount of blight. A lot of landlords were getting away with leaving people with bad conditions. Honestly, smoke detectors do just a great job of preventing death from fire. Instead of sending a fireman to a random home, what if you used factors like demographics and how old the building was and the last time that there was a certain kind of installation of some sort of infrastructure like telecommunication infrastructure?

You use all of these facts and you get a hit rate of the doors that you're knocking on that's substantially higher. We've seen a lot of this kind of moneyball for local government stuff play out pretty strongly. Obviously, there's been a tremendous amount of data usage in the intelligence community, as you can imagine. We do find that there are pockets of innovation. Again though, it's all about how you operationalize it.
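To make the door-knocking idea concrete, here is a minimal sketch of ranking homes by a risk score built from the kinds of factors mentioned: demographics, building age, and time since the last infrastructure installation. The field names, the weights, and the sample homes are all hypothetical, chosen only for illustration. They are not the model used in New Orleans, which would have been fitted to real data.

```python
def risk_score(home):
    """Combine three illustrative factors into a score between 0 and 1 (weights are assumptions)."""
    age_part = min(home["building_age_years"], 100) / 100 * 0.5
    demographic_part = 0.3 if home["low_income_area"] else 0.0
    infrastructure_part = min(home["years_since_infrastructure_update"], 30) / 30 * 0.2
    return age_part + demographic_part + infrastructure_part

def prioritize(homes, k):
    """Return the k homes with the highest risk score, highest first."""
    return sorted(homes, key=risk_score, reverse=True)[:k]

# Hypothetical homes, invented for illustration only.
homes = [
    {"id": "A", "building_age_years": 80, "low_income_area": True, "years_since_infrastructure_update": 30},
    {"id": "B", "building_age_years": 10, "low_income_area": False, "years_since_infrastructure_update": 2},
    {"id": "C", "building_age_years": 50, "low_income_area": True, "years_since_infrastructure_update": 15},
]

visit_first = [home["id"] for home in prioritize(homes, k=2)]
print(visit_first)
```

Visiting the top-ranked homes first, rather than a random sample, is what raises the hit rate per door knocked.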

You have all those data points but then you have to query it in the appropriate way, look for the patterns. You almost have to search for the correlations, and that's a whole series of questions and answers. It's establishing a relationship with the data that, I think we're just starting to figure out how that works.

Yes. We're starting to figure out how it works from a skillset perspective. And, there's like a mind shift in terms of statistical thinking versus not statistical thinking. There's this saying: "All models are wrong but some are useful," - so it's really about whether you can, without the data, without the algorithms, contextualize a little bit, the parameters of your statistical thinking. I may not get this right, like in the case of the fire, we may not get this right but we may increase our chances of getting it right or we may reduce our surface area of risk or what we have to search for. Bringing that get-it-done attitude to the problem, that's skillset number one when it comes to being able to think statistically. Some folks are locked into, "Well, the only way we can be sure is if we have X, Y, and Z."

I'll give you a case in a private example. Very often in banks, for reasons of historical fraud and compliance, the way they would verify whether someone was real before they issued a credit card was making sure their telephone number and their address matched whatever they had on the application. Not all companies use actual telephone landlines now. Not all companies use their main address as the one they're actually operating out. There's some kind of outmoded realities of people working at WeWork now and people using voice over IP. Getting comfortable with identifying folks through their social presence or through some of the data sets that we bring in at Enigma that provide these ancillary proof points. Looking and running historically the statistics to see whether the likelihood of it being real is strong, versus the guarantee that you would get from these alternative means beforehand.

I think that's an interesting point too, that assumption that all models will be wrong, either largely wrong or wrong in a smaller way, but that's okay because it can still help you make good decisions. Is that a skill that we are doing a good job of teaching our children, and where would they even get that training? I mean, it wouldn't be in math necessarily. It wouldn't be in Social Studies. Where do they get that sensibility?

Statistics has often been sub-classed, like math education in general, but you see it in other places. You see it popping up even in your ESPN feed these days. People are much more comfortable with prediction being part of their lives. Honestly, I love these black swan moments where all of that flies in our face. Take the last election. You had Hillary winning, and you had the world's best data scientists at some of the finest institutions call it wrong.

Winning, but "winning" meant having a 70 percent likelihood to win, and that still means that, roughly one out of three times, Donald Trump wins. And guess what? This was one of those three times.
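The arithmetic behind that point is simple: a 70 percent chance for the favorite leaves a 30 percent chance of an upset, which is a little under one in three. A quick simulation makes it tangible. The function name, the trial count, and the fixed seed below are choices made for this sketch only.

```python
import random

def upset_frequency(favorite_win_prob, trials, seed=0):
    """Simulate repeated contests and return the fraction in which the favorite loses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    upsets = sum(1 for _ in range(trials) if rng.random() >= favorite_win_prob)
    return upsets / trials

frequency = upset_frequency(0.7, trials=100_000)
print(f"Favorite lost in about {frequency:.1%} of simulated contests")
```

With enough trials the printed share settles close to 30 percent, so a forecast like this is not "wrong" just because the less likely outcome occurred once.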

Absolutely. And then there's the education that we're seeing these patterns get people more comfortable. In the classrooms, I think one of the biggest problems that we have is just the applied learning. It's like, I have no idea why they don't teach personal finance in the classroom. I mean, I was an idiot with my money at the age of 18 and the effect on debt and all of that. I'm still amazed that they don't do that, so I feel like we're moving in a world where education will get more and more about the applied stuff and less about the theoretical stuff. But then I worry if we're losing some parts of cultural learning. It's all a trade-off.

I'll go even further down that road and talk about artificial intelligence.

Artificial intelligence, a hugely transformational technology. It seems to me that there's a role for artificial intelligence in helping us make sense of this world of overabundance of data and find those patterns for us. Are you optimistic about AI helping us make sense of that or is that going to be something totally separate from the rest of our human experience?

No. I mean, I'm optimistic in the sense that I'm optimistic about humanity in general. I feel like that's a flip gene thing that happens to folks at some period in time. One of the things that I like the most about the promise of artificial intelligence is that it'll actually help the technology go away because right now, the focus is on technology and data being so present. But in reality, the work of data is very intensive. There's a reason they call it data mining when you're looking for stuff in a data set. It's very nasty. The data sets aren't clean. It's kind of brutish in a sense.

What I like about AI is that it creates these feedback loops from observed experience. Though you're collecting all of this data from all of these places, you don't really necessarily know how it will come together so you start to study the outcomes. Machine learning helps us really be a bit more outcome-oriented in how we get to statistical thinking. I think it'll help us abstract away some of the nastiness of that work and be a bit more outcome-oriented in how we approach it. Now, it's definitely going to be scary in terms of the impact on automation in some areas where, frankly, I think AI should be left alone, like replacing a jury. Will we ever get that emotional intelligence quality? I don't know.

And you'd have to choose and say you want that emotional quality in the jury as opposed to a pure likelihood that this person is guilty or not guilty?

Yes. For me, the underlying humanity, I think is super important. Frankly, just being in the business and seeing how much the human touch is important to even convincing people to start thinking statistically, I'm optimistic that we won't lose that with the advent of AI at scale.

We touched a little bit on whether LinkedIn was a public data set. A lot of people, they sort of sense that they're living in this world where everything about them is available online, from their purchase patterns to their age to their medical history. It makes people uncomfortable. It makes people worry that the government has too much information. I'm personally more worried that private companies have too much information and they're far less regulated.

Yes.

Do we need laws to protect our personal information? Should personal information be treated differently than your government records?

Absolutely. We have very little protection as to the laws that govern the way in which we give our data away. Think about it in certain professions. In the medical profession, it's on lockdown. But for some reason, it's not necessarily on lockdown in other industries. The reason was, back then, there wasn't much you could do with your personal information. Today, they have a really good sense of how to get you to convert or the likelihood that you'll be somewhere. For all intents and purposes, that's actually mostly beneficial to us, in my opinion.

But at the same time, our data still deserves that amount of kind of sanctity in how it's handled. Europe has been coming out with very strong laws. There's a law coming out called GDPR. It's set to take effect in 2018, and it carries everything from making sure companies are tracking the lineage of their personal data, who has it, how is access given to it within the company, right to be forgotten measures. When you say, "Delete my data," are you actually deleting it or are you keeping it for some other piece of information? So there's an exchange, always, in between consumers and the services that they work for. A lot of these services are free and we love them, right?

I would give away part of myself for YouTube access, right? I'm just very happy about it.

And probably, you have.

And probably, I have. But it doesn't mean that that part that I give away shouldn't be put into a safe box and that I know that that box is under a bunker and all of those good things.

Also, the idea of expiration of data, which, in the digital world today, is a relatively new concept. It used to be that there was a certain obscurity. If something happened 30 years ago, it would be difficult to find records and get a profile from back then. But there are kids today who have been online their entire lives, and what they did and posted when they were 13 is going to be there when they're 63.

Yes.

We don't have a legal infrastructure that can deal with that in any meaningful way.

No, we don't and it's a hairy area. It's a hairy area in employment law. It's a hairy area for dating, right?

If you look at someone's Facebook profile - I think that culture will adapt to that, to someone's online presence being public. But it's almost theatrical. It's like your public presence is not the real you. What was that Jim Carrey movie? We all put on a mask, metaphorically speaking. So I think your online presence will be more like this gallery or this art piece that describes you and then there's the real you. But there's still you doing a body shot or something like ... That, you don't want to be ever public. There's a real question of whether people who are young enough have the ability to decide whether it's smart to put that online or not. It's scary, for sure.

Speaking of putting stupid things online, let's talk about the Trump administration. I have heard on multiple fronts ... You're obviously working with a lot of public data sets. You have to go and ask permission to get this information a lot of times, or figure out how to ingest it. Is it easier now? How has the access to public data sets changed since the Trump administration took office?

Yes. My first caveat when I talk about this stuff is, big difference in between the Trump administration and the US government. The US government is by far one of the more transparent institutions I have ever come across in the world. We are wildly transparent relative to our peers for the amount of data that we put out, for how much we fund this sort of stuff, so caveat number one.

When it comes to Trump, I mean, it's been very clear to me that everyone should be very anxious about this administration's stance with transparency and sharing of information. First of all, there is very explicit stuff like taking down the list of visitors to the White House, which was a practice that Obama put in place and I think one of the most central accounting systems of the government. There's been EPA data, there's been climate data, and generally, there's been even debate about some census data being affected by this. You've got to remember, these are no small endeavors. I think the US census is over a $4 billion investment every time it happens, with something over 300,000 volunteers involved.

Some of these things, we'll see their impact in four years, just given the funding cycles of how it happens. Though this administration is certainly not friendly, I think that the transparency backbone in this country is strong enough. Oddly, that comes from both the left and the right. Strong enough to make sure that this movement towards openness of information is here to stay.

And there's a lot riding on these data sets.

Yes. It's how we decide where to put hospitals. It's how we decide how to route ambulances. It's how we decide just so many of the base services, like waste management relies on these sort of things.

Tell people who are looking at the Enigma public data set, which I've visited multiple times ... super, super cool. What should people expect when they go there? What can they get out of it?

One of our commitments is to continuously being honest about this mission of collecting all of the data, but giving it back as much as we can to folks. It's completely free to use for non-commercial purposes, journalistic purposes. We want to make sure that everyone has access to this data. You don't even need to log in or need to give us any information to go ahead and access it. When we founded the company, there was a big premise on access.

As we've learned a lot more through the years, access and interface design and search and credibility have been very important. The other one has been curation and that's the huge focus of Enigma Public, which we re-launched this summer, was this notion that people need to know how this data is being used. People need to know not only best practices for how to work with data but which data sets are good for what. What's new, what's exciting? I think that sort of education is something that we're very excited to be a part of and something that we hope people will get the second they land on the site.

It's definitely worth checking out. I think, again, businesses see that data and they know that they can build businesses on top of it. I think for journalists and for citizens, there's a lot more education that's required.

Absolutely, a lot more education and, hopefully, a whole layer of services on top of it delivering things to people like me and you when we don't geek out, so to speak.

Let me ask you the questions I ask everybody that comes on the show. What technological trend concerns you the most? Is there anything that keeps you up at night?

The trend that concerns me the most or the thing that I think, on the horizon, that we should watch out for the most is this notion of biological programming, so the extent to which we are getting much better at programmatically creating strands of biological living organisms. That has huge impact for good, but also has huge impact for the ability to create small-scale, basically malfeasance through this thing. Wherever technology and bio meet, I'm always a bit concerned as to how that's handled. It's like the next wave for me, post-nuclear, is really our ability to do things like programmatically sequence stuff in a small-scale lab and distribute it.

The challenge is that even if we pass laws here in the United States, that doesn't mean that someone can't do the same research in China or in Russia.

Absolutely - and even from a safety perspective, right? So we really start to have the means now for anyone to DIY their own biological warfare program. So that, for me, is the thing that concerns me the most. But the flipside includes things like personalized medicine, the fact that you can really understand my body, you can almost create this biological version of a software program designed to cure whatever illness I have. Just as concerned as I am, I'm also excited for that.

I think the shortcoming there will be we need some kind of ethical structure to put these new technologies in. We did it with nuclear weapons and nuclear power, barely, but we did it there and I think we're going to need to develop something similar. On a personal level, is there a technology that you use every day that's just transformed your life, that you're amazed by?

This is kind of weird, but just FaceTime. Or video chat. I have some family members abroad and I travel a lot for work. The difference in between a phone call and a video chat just kind of casually on the phone, it's really made me feel the whole promise that the internet has connected everyone. I'm originally from Morocco, so being able, in a matter of 15 seconds, to see someone across the globe and say, "Hey, what are you up to?", seeing what the weather looks like in their environment and how they're dressed and their demeanor, that has really changed how I feel connected to folks around me and made me feel like we all live in this big village a bit more, and I like that feeling.

There's something interesting too that, I watched the video conferencing boom sort of rise. It was going to be the next thing. Nobody would be making phone calls anymore. Video conferencing never really took off but video chat, more personal, profoundly different and not in a work environment, something almost more casual than a telephone call. Like it can be an instantaneous thing.

I have a 3-year-old daughter and she totally has the hang of it. She video chats before she phone calls. She doesn't know what a phone call is. You put a speaker phone and you ask her to chat to someone and she's not at all interested. You put her in front of her grandfather on FaceTime and she could be there for 20 minutes.

It's going to be as strange to her as those rotary phones that kids today don't know how to use. Hicham, how can people follow you online, find out what you're doing, and keep up with Enigma?

Go to enigma.com. Check out Enigma Public for sure, that's public.enigma.com. Check out our website. We have a pretty active Twitter account, no Instagram for us yet.

Never say never.

Never say never. But-

You could do great things with infographics.

Yeah, that's true. We're really huge fans of data vis. We do have this cool part of our site, labs.enigma.com, where it's all of our experiments and some of our pro bono projects like the one I mentioned with New Orleans, so I'd check that out as well.

Very cool. Thanks so much for coming on.

Awesome. Thank you so much for having me.

For more Fast Forward with Dan Costa, subscribe to the podcast. On iOS, download Apple's Podcasts app, search for "Fast Forward" and subscribe. On Android, download the Stitcher Radio for Podcasts app via Google Play.

This article originally appeared on PCMag.com.