Sramana Mitra: We have a question from the audience. Sanjeev Munjal is asking, “Does this mean that businesses with a lot of historical data can generate LLMs better for sectors like banking, insurance, etc.?”
Benjamin Narasin: Well, that’s somewhat my point—it’s a temporary advantage if a business has historical data.
I learned long ago that many businesses think their data lakes are valuable. But in reality, some have “data swamps”—data that isn’t useful at all. LLMs might change that, but I believe the advantage of proprietary data for LLM training is temporary.
Even if a business decides to keep its data private, unless it's the only one doing what it does (which is rare in industries like banking and insurance), others will eventually license similar data to train other LLMs. There are varied views on the efficacy of smaller proprietary data sets versus broader ones. Initially, larger data sets seemed to outperform specialized ones, though that's less certain now. Still, it's hard to imagine data so unique that no one else could replicate it. In many ways, businesses overvalue their own proprietary assets. In banking and insurance, for instance, which are among the world's largest industries, so many players are involved that any "special sauce" a bank claims to have can be surpassed by others.
Take fraud detection, for example. Some banks stick to outdated internal systems, which frustrates users with constant alerts. Citibank, for instance, relies on systems that don’t always work effectively. Rather than insisting on proprietary solutions, companies should adopt the best technology available. It doesn’t always have to be invented internally.
Sramana Mitra: I’d like to comment on one point you made and ask a follow-up. Smaller language models are gaining traction, not necessarily because they’re more proprietary but because they tend to hallucinate less.
Benjamin Narasin: That’s true; they’re related but distinct. Smaller models also offer cost advantages. Experiments, such as one Google ran using higher mathematics in training, showed large models initially outperforming smaller, focused ones. But more recent tests have seen focused training sets outperforming larger ones. It’s too early to tell. The question of small vs. large models is relevant to specialized versus broad applications. Smaller models might have fewer hallucinations, though that’s improving across the board.
I often use Perplexity, which has its critics, yet I find it to be a pretty reliable source. I will use multiple engines for comparison, as it's essential to make informed human decisions. For instance, my youngest son takes every 15-second clip he sees on Reddit as gospel handed down in a leather-bound Bible. We get into a lot of arguments about this because I insist on seeing the source data.
Sramana Mitra: The problem with user-generated content is that the source is often not credible; a lot of it is BS.
Benjamin Narasin: Another problem is the homogenization of data and news. Before the internet and various feeds became your primary source of input, each source had a specific format. I used to walk across the street to the library and read the New York Times and the Wall Street Journal. I knew where that data was coming from; I knew the source of that material. But when you're looking at Facebook, for example, whether it's the Wall Street Journal, the New York Times, or Ben's House of Mud's random rant, it's all in the same form. It homogenizes the data such that a piece of content from the New York Times appears exactly the same as the Ben's House of Mud rant.
There was a great example of this: a story circulated around Silicon Valley, and somebody brought it up in our Monday meeting. I wanted to know where it came from. I looked it up online, found it on Facebook, and clicked through. It was from a satirical newspaper like The Onion. It was a made-up piece of comedy, but enough people had read it as if it were news that it spread as a rumor, as if it had actually happened.
Sramana Mitra: That’s really bad.
Benjamin Narasin: LLMs also give us a homogenized end result. One of the things I like about Perplexity is that it shows me the sources. I'll admit I very seldom click on them, but when I do, what I find usually gives me a little more confidence.
This segment is part 3 in the series: 1Mby1M Virtual Accelerator AI Investor Forum: With Benjamin Narasin, Founder and General Partner at Tenacity Venture Capital