Few journalists explore how AI works and how it's changing our world in such a lucid and accessible manner as Timothy Lee (Bluesky profile). He reports on AI and the future of the economy, and with a computer science master's degree from Princeton he writes about topics that concern us all. In the summer of 2023 he wrote Large language models, explained with a minimum of math and jargon, a piece that truly resonated with a public just wanting answers in plain English.
He writes about self-driving cars, technology, macroeconomics, and a lot of other interesting topics. His work has previously appeared in the Washington Post, Vox.com, and Ars Technica.
Now living in Washington, DC, Tim draws on two decades of experience writing about tech policy, along with a strong background in macroeconomics and full-stack economics. If you haven't heard of his work, let me introduce you to just how readable and relevant he is, for our times and for AI enthusiasts of all walks of life. Here is some of his work that resonated with me and that you might like as well:
Seven big advantages human workers have over AI
Are transformer-based foundation models running out of steam?
Elon Musk wants to dominate robotaxis - first he needs to catch up to Waymo
Six principles for thinking about AI risk
Let's now take a short stroll through the not-so-distant history of AI, to enable us to go back to the future with folks like Geoffrey Hinton, Fei-Fei Li, Jensen Huang, and Yann LeCun.
Why the deep learning boom caught almost everyone by surprise
By Timothy B. Lee of the newsletter Understanding AI.
During my first semester as a computer science graduate student at Princeton, I took COS 402: Artificial Intelligence. Toward the end of the semester there was a lecture about neural networks. This was in the fall of 2008, and I got the distinct impression, both from that lecture and the textbook, that neural networks had become a backwater.
Neural networks had delivered some impressive results in the late 1980s and early 1990s. But then progress stalled. By 2008, many researchers had moved on to mathematically elegant approaches such as support vector machines.
I didn't know it at the time, but a team at Princeton, in the same computer science building where I was attending lectures, was working on a project that would upend the conventional wisdom and demonstrate the power of neural networks. That team, led by Prof. Fei-Fei Li, wasn't working on a better version of neural networks. They were hardly thinking about neural networks at all.
Rather, they were creating a new image dataset that would be far larger than any that had come before: 14 million images, each labeled with one of nearly 22,000 categories.
Li tells the story of ImageNet in her recent memoir, The Worlds I See. As she worked on the project, she faced a lot of skepticism from friends and colleagues.
"I think you've taken this idea way too far," a mentor told her a few months into the project in 2007. "The trick is to grow with your field. Not to leap so far ahead of it."
It wasn't just that building such a large dataset was a massive logistical challenge. People doubted the machine learning algorithms of the day would benefit from such a vast collection of images.
"Pre-ImageNet, people did not believe in data," Li said in a September interview at the Computer History Museum. "Everyone was working on completely different paradigms in AI with a tiny bit of data."
Ignoring negative feedback, Li pursued the project for more than two years. It strained her research budget and the patience of her graduate students. When she took a new job at Stanford in 2009, she took several of those students, and the ImageNet project, with her to California.
ImageNet received little attention for the first couple of years after its release in 2009. But in 2012, a team from the University of Toronto trained a neural network on the ImageNet dataset, achieving unprecedented performance in image recognition. That groundbreaking AI model, dubbed AlexNet after lead author Alex Krizhevsky, kicked off the deep learning boom that has continued until the present day.
AlexNet would not have succeeded without the ImageNet dataset. AlexNet also would not have been possible without a platform called CUDA that allowed Nvidia's graphics processing units (GPUs) to be used in non-graphics applications. Many people were skeptical when Nvidia announced CUDA in 2006.
So the AI boom of the last 12 years was made possible by three visionaries who pursued unorthodox ideas in the face of widespread criticism. One was Geoffrey Hinton, a University of Toronto computer scientist who spent decades promoting neural networks despite near-universal skepticism. The second was Jensen Huang, the CEO of Nvidia, who recognized early that GPUs could be useful for more than just graphics.
The third was Fei-Fei Li. She created an image dataset that seemed ludicrously large to most of her colleagues. But it turned out to be essential for demonstrating the potential of neural networks trained on GPUs.
Subscribe to Understanding AI
Geoffrey Hinton
A neural network is a network of thousands, millions, or even billions of neurons. Each neuron is a mathematical function that produces an output based on a weighted average of its inputs.
Suppose you want to create a network that can identify handwritten decimal digits, such as the number 2. Such a network would take in an intensity value for each pixel in an image and output a probability distribution over the ten possible digits: 0, 1, 2, and so forth.
To train such a network, you first initialize it with random weights. Then you run it on a sequence of example images. For each image, you train the network by strengthening the connections that push the network toward the right answer (in this case, a high probability value for the "2" output) and weakening connections that push toward a wrong answer (a low probability for "2" and high probabilities for other digits). If trained on enough example images, the model should start to predict a high probability for "2" when shown a two, and not otherwise.
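To make that setup concrete, here is a minimal sketch of such a network in Python. The hidden-layer size, the 784-pixel (28x28) input, and the ReLU and softmax functions are illustrative assumptions on my part, not details from the article:

```python
import numpy as np

def softmax(scores):
    # Turn raw scores into a probability distribution over the ten digits.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def forward(pixels, W1, b1, W2, b2):
    # Each neuron computes a weighted sum of its inputs, followed by a simple nonlinearity.
    hidden = np.maximum(0.0, W1 @ pixels + b1)
    return softmax(W2 @ hidden + b2)

# Initialize with random weights, as described above.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 0.01, (128, 784)), np.zeros(128)
W2, b2 = rng.normal(0.0, 0.01, (10, 128)), np.zeros(10)

image = rng.random(784)                       # stand-in for one flattened 28x28 image
probs = forward(image, W1, b1, W2, b2)
print(probs.argmax(), round(probs.sum(), 3))  # predicted digit; probabilities sum to 1.0
```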
In the late 1950s, scientists started to experiment with basic networks that had a single layer of neurons. However, their initial enthusiasm cooled as they realized that such simple networks lacked the expressive power required for complex computations.
Deeper networks (those with multiple layers) had the potential to be more versatile. But in the 1960s, no one knew how to train them efficiently. This was because changing a parameter somewhere in the middle of a multi-layer network could have complex and unpredictable effects on the output.
So by the time Hinton began his career in the 1970s, neural networks had fallen out of favor. Hinton wanted to study them, but he struggled to find an academic home to do so. Between 1976 and 1986, Hinton spent time at four different research institutions: Sussex University, the University of California San Diego (UCSD), a branch of the UK Medical Research Council, and finally Carnegie Mellon, where he became a professor in 1982.
In a landmark 1986 paper, Hinton teamed up with two of his former colleagues at UCSD, David Rumelhart and Ronald Williams, to describe a technique called backpropagation for efficiently training deep neural networks.
Their idea was to start with the final layer of the network and work backwards. For each connection in the final layer, the algorithm computes a gradient: a mathematical estimate of whether increasing the strength of that connection would push the network toward the right answer. Based on these gradients, the algorithm adjusts each parameter in the model's final layer.
The algorithm then propagates these gradients backwards to the second-to-last layer. A key innovation here is a formula, based on the chain rule from high school calculus, for computing the gradients in one layer based on gradients in the following layer. Using these new gradients, the algorithm updates each parameter in the second-to-last layer of the model. Then the gradients get propagated backwards to the third-to-last layer and the whole process repeats once again.
The algorithm only makes small changes to the model in each round of training. But as the process is repeated over thousands, millions, billions, or even trillions of training examples, the model gradually becomes more accurate.
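Here is a compact sketch of that backward pass for the same kind of tiny two-layer network as above. The softmax output, cross-entropy loss, and plain gradient descent are simplifying assumptions of mine; the point is only to show the chain-rule bookkeeping, not the 1986 paper's exact formulation:

```python
import numpy as np

def train_step(pixels, label, W1, b1, W2, b2, lr=0.01):
    """One small update, working backwards from the output layer."""
    # Forward pass.
    hidden = np.maximum(0.0, W1 @ pixels + b1)
    scores = W2 @ hidden + b2
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()

    # Gradient at the output: push the correct digit up, the other digits down.
    target = np.zeros(10)
    target[label] = 1.0
    d_scores = probs - target

    # Gradients for the final layer's parameters.
    dW2, db2 = np.outer(d_scores, hidden), d_scores

    # Chain rule: carry the gradient back through the previous layer.
    d_hidden = (W2.T @ d_scores) * (hidden > 0)
    dW1, db1 = np.outer(d_hidden, pixels), d_hidden

    # Nudge every parameter a small step in the direction that reduces the error.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```

Repeated over many labeled images, those small nudges are what gradually make the model more accurate.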
Hinton and his colleagues weren't the first to discover the basic idea of backpropagation. But their paper popularized the method. As people realized it was now possible to train deeper networks, it triggered a new wave of enthusiasm for neural networks.
Hinton moved to the University of Toronto in 1987 and began attracting young researchers who wanted to study neural networks. One of the first was the French computer scientist Yann LeCun, who did a year-long postdoc with Hinton before moving to Bell Labs in 1988.
Hinton's backpropagation algorithm allowed LeCun to train models deep enough to perform well on real-world tasks like handwriting recognition. By the mid-1990s, LeCun's technology was working so well that banks started to use it for processing checks.
"At one point, LeCun's creation read more than 10 percent of all checks deposited in the United States," wrote Cade Metz in his 2021 book Genius Makers.
But when LeCun and other researchers tried to apply neural networks to larger and more complex images, it didn't go well. Neural networks once again fell out of fashion, and some researchers who had focused on neural networks moved on to other projects.
Hinton never stopped believing that neural networks could outperform other machine learning methods. But it would be many years before he'd have access to enough data and computing power to prove his case.
Read and explore Understanding AI for a whole new context:
Understanding AI
Posts from 2024
Posts from 2023
Dive into tech, economics and policy too.
Jensen Huang
The brain of every personal computer is a central processing unit (CPU). These chips are designed to perform calculations in order, one step at a time. This works fine for conventional software like Windows and Office. But some video games require so many calculations that they strain the capabilities of CPUs. This is especially true of games like Quake, Call of Duty, and Grand Theft Auto that render three-dimensional worlds many times per second.
So gamers rely on GPUs to accelerate performance. Inside a GPU are many execution unitsâessentially tiny CPUsâpackaged together on a single chip. During gameplay, different execution units draw different areas of the screen. This parallelism enables better image quality and higher frame rates than would be possible with a CPU alone.
Nvidia invented the GPU in 1999 and has dominated the market ever since. By the mid-2000s, Nvidia CEO Jensen Huang suspected that the massive computing power inside a GPU would be useful for applications beyond gaming. He hoped scientists could use it for compute-intensive tasks like weather simulation or oil exploration.
So in 2006, Nvidia announced the CUDA platform. CUDA allows programmers to write "kernels," short programs designed to run on a single execution unit. Kernels allow a big computing task to be split up into bite-sized chunks that can be processed in parallel. This allows certain kinds of calculations to be completed far faster than with a CPU alone.
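Production CUDA kernels are written in C or C++, but the idea is easy to sketch in Python with the Numba library (my choice for illustration; the article doesn't mention it, and running it requires an NVIDIA GPU with the CUDA toolkit installed). Each GPU thread executes the same small function on its own slice of the data, so one big calculation becomes a million bite-sized chunks processed in parallel:

```python
import numpy as np
from numba import cuda  # assumes Numba and an NVIDIA GPU are available

@cuda.jit
def scale_and_add(x, y, out):
    # Each thread handles exactly one element of the arrays.
    i = cuda.grid(1)
    if i < out.size:
        out[i] = 2.0 * x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
scale_and_add[blocks, threads_per_block](x, y, out)  # thousands of threads run at once
```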
But there was little interest in CUDA when it was first introduced, wrote Stephen Witt in the New Yorker last year:
When CUDA was released, in late 2006, Wall Street reacted with dismay. Huang was bringing supercomputing to the masses, but the masses had shown no indication that they wanted such a thing.
"They were spending a fortune on this new chip architecture," Ben Gilbert, the co-host of "Acquired," a popular Silicon Valley podcast, said. "They were spending many billions targeting an obscure corner of academic and scientific computing, which was not a large market at the time, certainly less than the billions they were pouring in."
Huang argued that the simple existence of CUDA would enlarge the supercomputing sector. This view was not widely held, and by the end of 2008 Nvidia's stock price had declined by seventy per cent...
Downloads of CUDA hit a peak in 2009, then declined for three years. Board members worried that Nvidiaâs depressed stock price would make it a target for corporate raiders.
Huang wasn't specifically thinking about AI or neural networks when he created the CUDA platform. But it turned out that Hinton's backpropagation algorithm could easily be split up into bite-sized chunks. And so training neural networks turned out to be a killer app for CUDA.
According to Witt, Hinton was quick to recognize the potential of CUDA:
In 2009, Hinton's research group used Nvidia's CUDA platform to train a neural network to recognize human speech. He was surprised by the quality of the results, which he presented at a conference later that year. He then reached out to Nvidia. "I sent an e-mail saying, 'Look, I just told a thousand machine-learning researchers they should go and buy Nvidia cards. Can you send me a free one?'" Hinton told me. "They said no."
Despite the snub, Hinton and his graduate students, Alex Krizhevsky and Ilya Sutskever, obtained a pair of Nvidia GTX 580 GPUs for the AlexNet project. Each GPU had 512 execution units, allowing Krizhevsky and Sutskever to train a neural network hundreds of times faster than would be possible with a CPU. This speed allowed them to train a larger modelâand to train it on many more training images. And they would need all that extra computing power to tackle the massive ImageNet dataset.
Fei-Fei Li
Fei-Fei Li wasn't thinking about either neural networks or GPUs as she began a new job as a computer science professor at Princeton in January of 2007. While earning her PhD at Caltech, she had built a dataset called Caltech 101 that had 9,000 images across 101 categories.
That experience had taught her that computer vision algorithms tended to perform better with larger and more diverse training datasets. Not only had Li found her own algorithms performed better when trained on Caltech 101, other researchers started training their models using Liâs dataset and comparing their performance to one another. This turned Caltech 101 into a benchmark for the field of computer vision.
So when she got to Princeton, Li decided to go much bigger. She became obsessed with an estimate by vision scientist Irving Biederman that the average person recognizes roughly 30,000 different kinds of objects. Li started to wonder if it would be possible to build a truly comprehensive image dataset: one that included every kind of object people commonly encounter in the physical world.
A Princeton colleague told Li about WordNet, a massive database that attempted to catalog and organize 140,000 words. Li called her new dataset ImageNet, and she used WordNet as a starting point for choosing categories. She eliminated verbs and adjectives as well as intangible nouns like "truth." That left a list of 22,000 countable objects, ranging from ambulance to zucchini.
She planned to take the same approach she'd taken with the Caltech 101 dataset: use Google's image search to find candidate images, then have a human being verify them. For the Caltech 101 dataset, Li had done this herself over the course of a few months. This time she would need more help. She planned to hire dozens of Princeton undergraduates to help her choose and label images.
But even after heavily optimizing the labeling process (for example, pre-downloading candidate images so they're instantly available for students to review), Li and her graduate student, Jia Deng, calculated it would take more than 18 years to select and label millions of images.
The project was saved when Li learned about Amazon Mechanical Turk, a crowdsourcing platform Amazon had launched a couple of years earlier. Not only was AMT's international workforce more affordable than Princeton undergraduates, the platform was far more flexible and scalable. Li's team could hire as many people as they needed, on demand, and pay them only as long as they had work available.
AMT cut the time needed to complete ImageNet down from 18 years to two. Li writes that her lab spent two years "on the knife-edge of our finances" as they struggled to complete the ImageNet project. But they had enough funds to pay three people to look at each of the 14 million images in the final dataset.
ImageNet was ready for publication in 2009, and Li submitted it to the Conference on Computer Vision and Pattern Recognition, which was held in Miami that year. Their paper was accepted, but it didn't get the kind of recognition Li hoped for.
"ImageNet was relegated to a poster session," Li writes. "This meant that we wouldn't be presenting our work in a lecture hall to an audience at a predetermined time, but would instead be given space on the conference floor to prop up a large-format print summarizing the project in hopes that passersby might stop and ask questions... After so many years of effort, this just felt anticlimactic."
To generate public interest, Li turned ImageNet into a competition. Realizing that the full dataset might be too unwieldy to distribute to dozens of contestants, she created a much smaller (but still massive) dataset with 1,000 categories and 1.4 million images.
The first year's competition in 2010 generated a healthy amount of interest, with 11 teams participating. The winning entry was based on support vector machines. Unfortunately, Li writes, it was "only a slight improvement over cutting-edge work found elsewhere in our field."
The second year of the ImageNet competition attracted fewer entries than the first. The winning entry in 2011 was another support vector machine, and it just barely improved on the performance of the 2010 winner. Li started to wonder if the critics had been right. Maybe "ImageNet was too much for most algorithms to handle."
"For two years running, well-worn algorithms had exhibited only incremental gains in capabilities, while true progress seemed all but absent," Li writes. "If ImageNet was a bet, it was time to start wondering if we'd lost."
But when Li reluctantly staged the competition a third time in 2012, the results were totally different. Geoff Hinton's team was the first to submit a model based on a deep neural network. And its top-5 accuracy was 85 percent, 10 percentage points better than the 2011 winner.
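For reference, "top-5 accuracy" counts a prediction as correct if the true label appears anywhere among the model's five highest-ranked guesses. Here is a quick sketch of how that metric is computed; the function and variable names are mine, purely for illustration:

```python
import numpy as np

def top5_accuracy(probs, labels):
    # probs: (n_images, n_classes) predicted probabilities; labels: (n_images,) true class ids.
    top5 = np.argsort(probs, axis=1)[:, -5:]        # five highest-scoring classes per image
    hits = (top5 == labels[:, None]).any(axis=1)    # is the true label among them?
    return hits.mean()
```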
Li's initial reaction was incredulity: "Most of us saw the neural network as a dusty artifact encased in glass and protected by velvet ropes."
"This is proof"
The ImageNet winners were scheduled to be announced at the European Conference on Computer Vision in Florence, Italy. Li, who had a baby at home in California, was planning to skip the event. But when she saw how well AlexNet had done on her dataset, she realized this moment would be too important to miss: "I settled reluctantly on a twenty-hour slog of sleep deprivation and cramped elbow room."
On an October day in Florence, Alex Krizhevsky presented his results to a standing-room-only crowd of computer vision researchers. Fei-Fei Li was in the audience. So was Yann LeCun.
Cade Metz reports that after the presentation, LeCun stood up and called AlexNet "an unequivocal turning point in the history of computer vision. This is proof."
The success of AlexNet vindicated Hinton's faith in neural networks, but it was arguably an even bigger vindication for LeCun.
AlexNet was a convolutional neural network, a type of neural network that LeCun had developed 20 years earlier to recognize handwritten digits on checks. (For more details on how CNNs work, see the in-depth explainer I wrote for Ars Technica in 2018.) Indeed, there were few architectural differences between AlexNet and LeCun's image recognition networks from the 1990s.
AlexNet was simply far larger. In a 1998 paper, LeCun described a document recognition network with seven layers and 60,000 trainable parameters. AlexNet had eight layers, but these layers had 60 million trainable parameters.
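For a sense of that scale, torchvision ships a reference implementation of the AlexNet architecture, and summing its parameters lands close to the 60 million figure. This quick check assumes PyTorch and torchvision are installed; it isn't something from the article:

```python
from torchvision.models import alexnet

model = alexnet(weights=None)  # the architecture only, no pretrained weights
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} trainable parameters")  # roughly 61 million
```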
LeCun could not have trained a model that large in the early 1990s because there were no computer chips with as much processing power as a 2012-era GPU. Even if LeCun had managed to build a big enough supercomputer, he would not have had enough images to train it properly. Collecting those images would have been hugely expensive in the years before Google and Amazon Mechanical Turk.
And this is why Fei-Fei Li's work on ImageNet was so consequential. She didn't invent convolutional networks or figure out how to make them run efficiently on GPUs. But she provided the training data that large neural networks needed to reach their full potential.
The technology world immediately recognized the importance of AlexNet. Hinton and his students formed a shell company with the goal of being "acquihired" by a big tech company. Within months, Google purchased the company for $44 million. Hinton worked at Google for the next decade while retaining his academic post in Toronto. Ilya Sutskever spent a few years at Google before becoming a cofounder of OpenAI.
AlexNet also made Nvidia GPUs the industry standard for training neural networks. In 2012, the market valued Nvidia at less than $10 billion. Today, Nvidia is one of the most valuable companies in the world, with a market capitalization north of $3 trillion. That high valuation is driven mainly by overwhelming demand for GPUs like the H100 that are optimized for training neural networks.
Sometimes the conventional wisdom is wrong
"That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time," Li said in a September interview at the Computer History Museum. "The first element was neural networks. The second element was big data, using ImageNet. And the third element was GPU computing."
Today leading AI labs believe the key to progress in AI is to train huge models on vast data sets. Big technology companies are in such a hurry to build the data centers required to train larger models that they've started to lease entire nuclear power plants to provide the necessary power.
You can view this as a straightforward application of the lessons of AlexNet. But I wonder if we ought to draw the opposite lesson from AlexNet: that it's a mistake to become too wedded to conventional wisdom.
"Scaling laws" have had a remarkable run in the 12 years since AlexNet, and perhaps we'll see another generation or two of impressive results as the leading labs scale up their foundation models even more.
But we should be careful not to let the lessons of AlexNet harden into dogma. I think there's at least a chance that scaling laws will run out of steam in the next few years. And if that happens, we're going to need a new generation of stubborn nonconformists to notice that the old approach isn't working and try something different.
Thanks for reading!
Editor's Note
This article first appeared on Understanding AI on November 5th, 2024. To learn more about Understanding AI, please read its About page. If you are interested in tech, economics, and policy, check out his other newsletter, Full Stack Economics.
Full disclosure: Timothy writes the most successful paid newsletter about AI on Substack outside of the older AI magazine TheSequence. If you want to know how long he's been blogging, you might want to check out this WordPress blog I dug up. Timothy's topic choices, philosophical pieces about AI, easy-to-read style, and relatable clarity of thought really elevate the AI genre here on Substack.
"I want to write about the philosophical issues raised by generative AI. Do large language models understand language the way people do? Does it matter if they do?"
I'm blown away by the immediate impact, clarity, and accessibility of his work. I consider him an expert on Waymo and the self-driving car industry, and he's clearly a broad-minded thinker and a uniquely lucid explorer at the intersections of AI, economics, policy, and technology, especially in the context of U.S. society at large. I can't recommend his work enough:
Consider following Tim on Substack, X, Bluesky, and YouTube podcasts related to his work. In my humble opinion, TBL is one of the greatest currently active in independent AI journalism, and his success shows that AI coverage is a viable path to independently supported journalism. In that sense, he is a pioneer.
Read more in AI Supremacy.