Banks have a secret weapon known as synthetic data. Here’s how it works

By Sam Becker

Millions of Americans don’t trust banks and financial institutions. In fact, that was the second-most cited reason that unbanked U.S. households don’t have bank accounts, according to survey data from the FDIC. And if Americans don’t trust banks with their money, they’re likely not going to trust them with their personal data and information, either.

But there’s a limit to how much banks can leverage your data, anyway. That’s because the ways in which banks can use customer data to research tools and methods, or devise new product offerings, are rife with red tape—making it difficult not only for banks to drive revenues, but also to fine-tune their products and services.

That’s why banks are turning to another way of generating the data they need, by creating what’s called “synthetic data.” In a world that’s quickly adopting artificial intelligence and creating meatless meat products, why not embrace phony data too?

Is it real or fake?

But phony isn’t the right way to phrase it, experts say. Synthetic data wasn’t just pulled out of the ether. Instead, synthetic data is an artificial version of real data—it has the same characteristics and structures as real data, and similar statistical properties.

“Synthetic data is realistic-looking data generated by a machine learning model, and we differentiate between fake and synthetic data,” says Kalyan Veeramachaneni, a principal research scientist at the Institute for Data, Systems, and Society at the Massachusetts Institute of Technology, and cofounder of DataCebo, a synthetic data company. Veeramachaneni’s work over the years has been on the cutting edge in the industry. He and his team also developed the Synthetic Data Vault, a tool for creating synthetic data sets, and he says he also works with several large banks.

“Fake data is randomly generated,” he says, “while synthetic data is trying to create data from a machine learning model that looks very realistic.”
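To make that distinction concrete, here is a minimal Python sketch. It is an illustration only, not Veeramachaneni’s or DataCebo’s method: the account types and their counts are invented, and a simple frequency table stands in for a real machine learning model. The "fake" generator picks values at random with no regard for the real data, while the "synthetic" generator first learns how often each value occurs and then samples accordingly.

```python
import random
from collections import Counter

# Hypothetical "real" column: account types for 10 customers (invented values).
real_account_types = ["checking"] * 6 + ["savings"] * 3 + ["brokerage"] * 1


def make_fake(n, rng):
    """Fake data: choose categories uniformly at random, ignoring the real mix."""
    categories = sorted(set(real_account_types))
    return [rng.choice(categories) for _ in range(n)]


def make_synthetic(n, rng):
    """Synthetic data: learn category frequencies from the real column,
    then sample new values in those proportions."""
    counts = Counter(real_account_types)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


def share(values, category):
    """Fraction of values equal to the given category."""
    return values.count(category) / len(values)


if __name__ == "__main__":
    rng = random.Random(0)
    fake = make_fake(10_000, rng)
    synthetic = make_synthetic(10_000, rng)
    print("real checking share:     ", share(real_account_types, "checking"))
    print("fake checking share:     ", round(share(fake, "checking"), 3))
    print("synthetic checking share:", round(share(synthetic, "checking"), 3))
```

In this toy setup, the synthetic column should land close to the real data’s proportions, while the fake column drifts toward an even split across categories.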

It also isn’t new. Synthetic data has been around for years, but its use is picking up steam as generative AI tools become more common in the corporate sphere. Many businesses are now finding ways to put it to work.

As for how they’re doing it, Veeramachaneni says the process starts with creating a generative model, analyzing real data and feeding what is learned into the model, and then letting the model generate synthetic data based on the real stuff. Synthetic data can be generated and analyzed much faster than real data, and there are no privacy or cybersecurity concerns attached to it.
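The workflow described above (analyze real data, fit a model, generate new records) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Synthetic Data Vault’s API or any bank’s pipeline: the income and balance figures are invented, the `fit` and `sample` helpers are hypothetical names, and a two-column Gaussian stands in for the far richer generative models used in practice.

```python
import math
import random
import statistics

# Hypothetical "real" records as (annual income, account balance); invented values.
real_rows = [
    (42_000, 1_800), (55_000, 3_200), (61_000, 2_900), (75_000, 6_100),
    (88_000, 7_400), (93_000, 9_800), (120_000, 15_500), (135_000, 14_200),
]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(rows):
    """Analyze the real data: record each column's mean and spread,
    plus the correlation between the two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    return {
        "mean_x": statistics.fmean(xs), "std_x": statistics.stdev(xs),
        "mean_y": statistics.fmean(ys), "std_y": statistics.stdev(ys),
        "corr": pearson(xs, ys),
    }


def sample(model, n, rng):
    """Generate n synthetic rows from the fitted model. No real row is copied;
    each value is drawn from a correlated pair of normal distributions."""
    rho = model["corr"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["std_x"] * z1
        # Mix z1 into the second column so it stays correlated with the first.
        y = model["mean_y"] + model["std_y"] * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        rows.append((x, y))
    return rows


if __name__ == "__main__":
    model = fit(real_rows)
    synthetic_rows = sample(model, 5, random.Random(1))
    for income, balance in synthetic_rows:
        print(f"income={income:,.0f}  balance={balance:,.0f}")
```

The design point is that the synthetic rows are drawn from the fitted model rather than copied from customers, yet they keep the relationship in the original data: in this toy example, higher incomes tend to go with higher balances.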

The key, he explains, is that the synthetic data “needs to be realistic, and it’s a lot like language—the realism of the generative models also includes breaking down and understanding context.” That might include looking at a dataset containing phone numbers and being able to derive applicable tax rates for households based on area codes, he says. There’s a lot of information that can possibly be gleaned from one piece of data.

The “crash-test dummy” of data

There are two big value-adds for banks and financial institutions when it comes to the generation and use of synthetic data. First, as mentioned, is the ability to analyze and test data without cybersecurity or privacy concerns. Second, it’s able to save an enormous amount of time.

Tobias Hann, the CEO of Mostly.ai, a synthetic data company, says that synthetic data allows for unrestricted internal use—something that was largely unavailable using real data. “Banks are in a very regulated space,” he says. “They have lots of sensitive data. It’s so sensitive that even internally, you restrict access to the data. So, even for data science teams, they need to wait months to get access to data sets and do analyses.”

But by generating synthetic datasets from the real ones, those data science teams “don’t need to wait months—they get it instantly.”

It may be helpful to think of synthetic data as similar to a crash-test dummy used to test safety standards in the automotive space. Those dummies are loaded with special sensors, and are created to mimic injuries that could occur in soft tissues and bones in an actual human being. They’re synthetic recreations based on an actual human body—but, just like synthetic data, they’re manufactured replications.

That allows banks to play around with the data and derive results that can be applied to the real world, just like a crash-test dummy can provide actionable data and information to auto manufacturers.

Hann mentions that synthetic data companies largely operate in one of two areas, dealing with either “structured” or “unstructured” data. “Structured data is anything you can put in Excel,” he says, while unstructured data would include things like audio or images. Currently, almost all enterprises that use synthetic data, banks included, are using structured data.

Better than the real thing

While it may seem a bit worrisome for consumers that banks are using less-than-real data to make decisions, Hann says there’s little to worry about. Using synthetic data means banks are “getting much more insight during the product development process, it can help improve the user experience,” he says. “What we’re working toward at the end of the day is the ability to give data to many more people, from interns to the CEO.” Ultimately, access to data should lead to banks becoming more efficient and user-friendly.

“When you’re using services, there are delays and outages,” explains Veeramachaneni, and that’s largely due to “things operating in the background, like APIs talking to each other.” Eventually, synthetic data may help speed those processes up, allowing bank transfers to execute more quickly and leading to a smoother, better experience for consumers navigating what can feel like an archaic financial system.

“From a consumer’s point of view, it will improve their experience,” he says. “This is the real deal for actual consumers.”

Note: This story was updated to correct the attribution of a quote.

Fast Company
