Intro. [Recording date: March 25, 2025.]
Russ Roberts: Today is March 25th, 2025, and my guest is podcaster and author Dwarkesh Patel. You can find him on YouTube and on Substack at Dwarkesh.com. He is the author, with Gavin Leech, of The Scaling Era: An Oral History of AI, 2019-2025, which is our topic for today, along with many other things, I suspect. Dwarkesh, welcome to EconTalk.
Dwarkesh Patel: Thanks for having me on, Russ. I’ve been a fan–I was just telling you–ever since, I think, probably before I started my podcast. I’ve been a big fan, so it’s actually really cool to get to talk to you.
Russ Roberts: Well, I really appreciate it. I admire your work as well. We’re going to talk about it some.
Russ Roberts: You start off saying, early in the book–and I should say, this book is from Stripe Press, which produces beautiful books. Unfortunately, I saw it in PDF [Portable Document Format] form; it was pretty beautiful even in PDF form, but I’m sure it’s even nicer in its physical form. You say, ‘We need to see the last six years afresh–2019 to the present.’ Why? What are we missing?
Dwarkesh Patel: I think there’s this perspective in the popular conception of AI [artificial intelligence], maybe even when researchers talk about it, that the big thing that’s happened is we’ve made these breakthroughs in algorithms. We’ve come up with these big new ideas. And that has happened, but the backdrop is just these big-picture trends–most importantly, the buildup of compute and the buildup of data. Even these new algorithms come about as a result of this sort of evolutionary process, where if you have more compute to experiment on, you can try out different ideas. You wouldn’t have known beforehand why the transformer works better than the previous architectures if you didn’t have more compute to play around with.
And then when you look at: why did we go from GPT-2 to GPT-3 to GPT-4 [Generative Pre-trained Transformer] to the models we’re working with now? Again, it’s a story of dumping in more and more compute. Then that raises a bunch of questions about: Well, what is the nature of intelligence such that you just throw a big blob of compute at a wide distribution of data and you get this agentic thing that can solve problems on the other end? It raises a bunch of other questions about what will happen in the future.
But, I think that trend–this 4x-ing [four times] of compute every single year, investment increasing to the point where we’re at hundreds of billions of dollars now for something which was an academic hobby a decade ago–is the missed trend.
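As a rough back-of-the-envelope illustration (the 4x-per-year figure is the trend cited above; the six-year span is an assumption matching the book’s 2019-2025 window), that growth rate compounds to roughly a 4,000-fold increase in compute:

```python
# Back-of-the-envelope compounding of the cited trend: training compute growing
# about 4x per year. The six-year span is an illustrative assumption.
growth_per_year = 4
years = 6  # roughly 2019 through 2025
total_growth = growth_per_year ** years
print(total_growth)  # 4096, i.e., about a 4,000-fold increase over the period
```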
Russ Roberts: I didn’t mention that you’re a computer science major, so you know some things that I really don’t know at all. What is the transformer? Explain what that is. It’s a key part of the technology here.
Dwarkesh Patel: So, the transformer is this architecture that was invented by some Google researchers in 2017, and it’s the fundamental architectural breakthrough behind ChatGPT and the kinds of models that you play around with when you think about an LLM [large language model].
And, what separates it from the kinds of architectures before is that it’s much easier to train in parallel. So, if you have these huge clusters of GPUs [Graphics Processing Units], a transformer is just much more practicable to scale than other architectures. And that allowed us to just keep throwing more compute at this problem of trying to get these things to be intelligent.
And then the other big breakthrough was to combine this architecture with just this really naive training process of: Predict the next word. And you wouldn’t have–now we just know that this is how it works, and so we’re, like, ‘Okay, of course that’s how you get intelligence.’ But it’s actually really interesting that you predict the next word in Wikitext, and as you make it bigger and bigger, it picks up these longer and longer patterns, to the point where now it can just totally pass a Turing Test and can even be helpful in certain kinds of tasks.
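To make that objective concrete, here is a minimal sketch of next-word prediction with a small transformer, assuming PyTorch and a toy random stand-in for a text corpus rather than any lab’s actual training code. The frontier systems run essentially this loop, with vastly more data, parameters, and GPUs:

```python
# Minimal sketch of next-token prediction with a causal transformer (PyTorch).
# Sizes, the toy vocabulary, and the random tokens are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len = 100, 64, 16

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
to_logits = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, seq_len))  # batch of token IDs standing in for real text
# Causal mask: each position may attend only to earlier positions.
causal_mask = torch.triu(torch.full((seq_len - 1, seq_len - 1), float("-inf")), diagonal=1)

hidden = encoder(embed(tokens[:, :-1]), mask=causal_mask)
logits = to_logits(hidden)  # predicted distribution over the next token at each position
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()  # "scaling" is this same loop with far more compute and data
```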
Russ Roberts: Yeah, I think you said it gets “intelligent.” Obviously that was a–you had quotes around it. But maybe not. We’ll talk about that.
At the end of the first chapter, you say, “This book’s knowledge cut-off is November, 2024. This means that any information or events occurring after that time will not be reflected.” That’s, like, two eons ago.
Dwarkesh Patel: That’s right.
Russ Roberts: So, how does that affect the book in the way you think about it and talk about it?
Dwarkesh Patel: Obviously, the big breakthrough since then has been inference scaling–models like o1 and o3, even DeepSeek’s reasoning model. In an important way, it is a big break from the past. Previously, we had this idea that pre-training–which is just making the models bigger, so think GPT-3.5 to GPT-4–is where progress is going to come from. It does seem that that alone is slightly disappointing: GPT-4.5 was released, and it’s better, but not significantly better than GPT-4.
So, the next frontier now is this: How much juice can you get out of taking these smaller models and training them towards a specific objective? So, not just predicting internet text, but: Solve this coding problem for me, solve this math problem for me. And how much does that get you–because those are the kinds of verifiable problems where you know the solution, and you just get to see if the model can reach that solution. Can we get some purchase on slightly harder tasks, which are more ambiguous–probably the kind of research you do–or the kinds of tasks which just require a lot of consecutive steps? The model still can’t use a computer reliably, and that’s where a lot of economic value lies. To automate remote work, you’ve actually got to do remote work. So, that’s the big change.
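A minimal sketch of what ‘verifiable’ means here, in code: the training signal comes from checking the model’s final answer against a known solution. The grading rule below is a toy illustrative assumption, not any lab’s actual reward setup:

```python
# Toy reward function for a verifiable problem: compare the model's final answer
# to the known solution. Real reasoning-model training uses far more careful
# answer extraction and grading; this is only an illustrative assumption.
def verifiable_reward(model_answer: str, known_solution: str) -> float:
    """Return 1.0 if the answers match after light normalization, else 0.0."""
    normalize = lambda s: s.strip().lower().rstrip(".")
    return 1.0 if normalize(model_answer) == normalize(known_solution) else 0.0

# A math problem with a checkable answer supplies a training signal even though
# no one wrote down a "correct reasoning trace" for the model to imitate.
print(verifiable_reward("  42. ", "42"))  # 1.0
print(verifiable_reward("41", "42"))      # 0.0
```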
Russ Roberts: I really appreciate you saying, ‘That’s the kind of research you do.’ The kind of research I do at my age is what is wrong with my sense of self and ego that I still need to do X, Y, Z to feel good about myself? That’s the kind of research I’m looking into. But I appreciate–I’m flattered by your presumption that I was doing something else.
Russ Roberts: Now, I have become enamored of Claude. There was a rumor that Claude is better with Hebrew than other LLMs. I don’t know if that’s true–obviously because my Hebrew is not good enough to verify that. But I think if you ask me, ‘Why do you like Claude?’ it’s an embarrassing answer. The typeface is really–the font is fantastic. The way it looks on my phone is beautifully arrayed. It’s a lovely visual interface.
There are some of these tools that are much better than others for certain tasks. Do we know that? Do the people in the business know that and do they have even a vague idea as to why that is?
So, I assume, for example, some might be better at coding, some might be better at deep research, some might be better at thinking–meaning taking time before answering–and it makes a difference. But for many things that normal people would want to do, are there any differences between them that we know of? And do we know why?
Dwarkesh Patel: I feel like normal people are in a better position to answer that question than the AI researchers. I mean, one question I have is: in the long run, what will be the trend here? It seems to me that the models are kind of similar. And not only are they similar, but they’re getting more similar over time, where now everybody’s releasing a reasoning model; and not only that–when they make a new product, not only do they copy the product, they copy the name of the product. Gemini has Deep Research and OpenAI has Deep Research.
You could think in the long run maybe they’d get distinguished. And it does seem like the labs are pursuing sort of different objectives. It seems like a company like Anthropic may be much more optimizing for this fully autonomous software engineer, because that’s where they think a lot of the value is first unlocked. And then other labs maybe are optimizing more for consumer adoption or for just, like, enterprise use or something like that. But, at least so far–tell me about your impression, but my sense is they feel kind of similar.
Russ Roberts: Yeah, they do. In fact, I think in something like translation, a truly bilingual person might have a preference or a taste. Actually, I’ll ask you what you use it for in your personal life, not your intellectual pursuits of understanding the field. For me, what I use it for now is brainstorming–help me come up with a way to think about a particular problem–and tutoring. I wasn’t sure what a transformer was, so I asked Claude what it was. And I’ve got another example I’ll give in a little bit. I use it for translation a lot, because I think Claude’s much better–it feels better than Google Translate. I don’t know if it’s better than ChatGPT.
Finally, I love asking it for advice on travel. Which is bizarre, that I do that. There’s a zillion sites that say, ‘The 12 best things to see in Rome,’ but for some reason I want Claude’s opinion. And, ‘Give me three hotels near this place.’ I have a trust in it that is totally irrational.
So, that’s what I’m using it for. We’ll come back to what else is important, because those things are nice but they’re not particularly important. What do you use it for in your personal life?
Dwarkesh Patel: Research, because my job as a podcaster is that I spend a week or two prepping for each guest, and having something to interact with as I prep–because, you know, you read stuff and you don’t get a sense of: Why is this important? How does this connect to other ideas? Getting constant engagement with your confusions is super helpful.
The other thing is, I’ve tried to experiment with putting these LLMs into my podcasting workflow, to help me find clips and automate certain things like that. They’ve been, like, moderately useful–honestly, not that useful. But, yeah, they are huge for research. The big question I’m curious about is: when they can actually use the computer, is that a huge unlock in the value they can provide to me or anybody else?
Russ Roberts: Explain what you mean by that.
Dwarkesh Patel: So, right now some labs have rolled out this feature called computer use, but they’re just not that good. They can’t reliably do a thing like book you a flight or organize the logistics for a happy hour, or countless other things like that, right? Sometimes people use this frame of: These models are at high school level; now they’re at college level; now they’re at Ph.D. level. Obviously, a Ph.D.–I mean, a high schooler could help you book a flight. Maybe a high schooler especially, maybe not the Ph.D.
Russ Roberts: Yeah, exactly.
Dwarkesh Patel: So, there’s this question of: What’s going wrong? Why can they be so smart in this–I mean, they can answer frontier math problems with these new reasoning models, but they can’t help me organize–they can’t, like, play a brand new video game. So, what’s going on there?
I think that’s probably the fundamental question that we’ll learn about over the next year or two: whether these common-sense foibles that they have are a sort of intrinsic problem where we’re under–I mean, one analogy is–I’m sure you’ve heard this before–but remember, the sense I get is that when Deep Blue beat Kasparov, there was a sense that a fundamental aspect of intelligence had been cracked. And in retrospect, we realized that actually the chess engine is quite narrow and is missing a lot of the fundamental components that are necessary to, say, automate a worker or something.
I wonder if, in retrospect, we’ll look back at these models–if, in the version where I’m totally wrong and these models aren’t that useful, we’ll just think to ourselves: there was something to this long-term agency and this coherence and this common sense that we were underestimating.
Russ Roberts: Well, I think until we understand them a little bit better, I don’t know if we’re going to solve that problem. You asked the head of Anthropic something about whether they work or not. You said, “Fundamentally, what is the explanation for why scaling works? Why is the universe organized such that if you throw big blobs of compute at a wide enough distribution of data the thing becomes intelligent?” Dario Amodei of Anthropic, the CEO [Chief Executive Officer] said, “The truth is we still don’t know. It’s almost entirely just a [contingent] empirical fact. It’s a fact that you could sense from the data, but we still don’t have a satisfying explanation for it.”
It seems like a large barrier, that unknowing–a large barrier to making them better at actually being a virtual assistant: not just giving me advice on Rome but booking the trip, booking the restaurant, and so on. Without that understanding, how are we going to improve the quirky part, the hallucinating part of these models?
Dwarkesh Patel: Yeah. Yeah. This is a question I feel like we will get a lot of good evidence on in the next year or two. I mean, another question I asked Dario in that interview, which I feel like I still don’t have a good answer for, is: Look, if you had a human who had as much stuff memorized as these LLMs have–they know basically everything that any human has ever written down–even a moderately intelligent person would be able to draw some pretty interesting connections, make some new discoveries. And we have examples of humans doing this. There’s one guy who figured out that, look, if you look at what happens to the brain when there’s a magnesium deficiency, it actually looks quite similar to what happens during a migraine; and so you could solve a bunch of migraines by giving people magnesium supplements or something, right?
So, why don’t we have evidence of LLMs using this unique, asymmetric advantage they have toward some intelligent end in this creative way? There are answers to all these things–people have given me interesting answers–but a lot of questions still remain.