One reason my blog writing has fallen off over the past year: I’m ambivalent about bots scraping my data to train AI models. But honestly, I’m not that great a writer, and it’s not like the bots are mining gold. I just need to get over myself and keep sharpening my writing practice, be it on this blog or elsewhere.
I just finished reading The Alignment Problem by Brian Christian. While AI ethics and the dangers posed by advanced AI are the book’s main themes, what I spent time mulling over was the comparison between educating an AI and educating human students. There are differences between human brains and machine-learning neural networks, but the bigger difference is the wetware of the entire human body-organism, which cannot be separated into dry hardware and software.
Christian launches the historical story with Skinner’s behaviorism, Turing’s computing machines, and the artificial neurons of McCulloch and Pitts. (I didn’t know Pitts was such an enigmatic character until reading this book!) These threads set up the framework of reinforcement learning. The reward hypothesis states that “all of what we mean by goals and purposes [is essentially] the maximization of the cumulative sum of a received scalar reward”. Shoot for the high score! Not surprisingly, Atari and other early video games were used as training grounds. (I also learned that Montezuma’s Revenge, a game I played in the 1980s, is particularly tricky for an AI to get good at and represents something of a gold standard.) What made the world pay attention was when AI beat grandmasters at chess and Go.
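To make the reward hypothesis concrete: every goal, on this view, collapses into a single number. Here’s a minimal sketch (my own toy code, not from the book) of the “cumulative sum of reward” an agent tries to maximize, with a discount factor thrown in because points in the far future typically count a little less:

```python
# Toy illustration of the reward hypothesis: the agent's entire "purpose"
# is compressed into one number, the discounted sum of scalar rewards.

def discounted_return(rewards, gamma=0.99):
    """Sum the rewards, discounting each by how far in the future it arrives."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# A sparse, Atari-like reward stream: mostly nothing, then a few points.
episode = [0, 0, 0, 0, 1, 0, 0, 10]
print(discounted_return(episode))  # the single scalar the agent shoots for
```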
I appreciate Christian going through the challenges of any training method. (He also carefully distinguishes reinforcement learning from supervised and unsupervised learning.) These include the terseness of a scalar reward or punishment, compounded by the delay in learning that a much earlier blundering move may have cost the game. It turns out “reinforcement learning is less like learning with a teacher than learning with a critic. The critic may be every bit as wise, but is far less helpful.” There’s an interesting story about the “dopamine puzzle” that leads to a learning model (known as temporal difference) in which what’s really being tracked is the “error in its expectation of future rewards”.
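For the curious, here’s what that temporal-difference idea looks like in code. This is a toy tabular version of my own devising (a five-state corridor with a reward only at the far end), not the neuroscience model itself, but the update rule is the same: nudge each state’s value by the error in its expected future reward.

```python
import random

# Tabular TD(0) on a 5-state corridor: reward arrives only at the final state.
N_STATES, GOAL = 5, 4
V = [0.0] * N_STATES        # value estimates; V[GOAL] stays 0 (terminal)
alpha, gamma = 0.1, 0.9     # learning rate, discount factor

for _ in range(2000):
    s = 0
    while s != GOAL:
        # Wander randomly left or right along the corridor.
        s_next = min(s + 1, GOAL) if random.random() < 0.5 else max(s - 1, 0)
        r = 1.0 if s_next == GOAL else 0.0
        td_error = r + gamma * V[s_next] - V[s]  # error in expected future reward
        V[s] += alpha * td_error                 # nudge the estimate toward reality
        s = s_next

print([round(v, 2) for v in V])  # values climb as states get closer to the goal
```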
The most interesting part for me was Chapter 5 (“Shaping”) on the problem of sparsity. Essentially, “if the reward is defined explicitly in terms of the end goal, or something fairly close to it, then one must essentially wait until random button-pressing, or random flailing around, produces the desired effect. The mathematics show that most reinforcement-learning algorithms will, eventually, get there…” but it’s inefficient and takes too darn long. The solution is to put together a curriculum. That’s what we do as human educators: I break the learning of chemistry down into steps; I set tasks for the students; I try to motivate them; and there’s a reward system in terms of points and a final grade. But creating the right incentives in AI training turns out to be quite tricky. Rewarding specific steps along the pathway often does not produce the desired behavior. Evolution has had hundreds of millions of years to shape humans, dolphins, elephants, and octopuses, all naturally intelligent creatures among many others.
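One principled fix from the reinforcement-learning literature (Ng, Harada, and Russell’s “potential-based” shaping, in the spirit of what Christian describes) adds intermediate hints in a form that provably cannot change which behavior is ultimately optimal. A toy sketch, with a potential function of my own choosing:

```python
# Potential-based reward shaping: add a bonus of the form
#     F(s, s') = gamma * Phi(s') - Phi(s)
# on top of the environment's sparse reward. This densifies the feedback
# without changing which policy is optimal in the end.

GOAL, gamma = 10, 0.99

def phi(state):
    """Toy potential: states closer to the goal 'look' more promising."""
    return -abs(GOAL - state)

def shaped_reward(state, next_state, env_reward):
    return env_reward + gamma * phi(next_state) - phi(state)

# The environment pays nothing for this step (sparse reward)...
print(shaped_reward(3, 4, env_reward=0.0))  # ...but shaping pays for progress now
```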
Can you get beyond external reinforcement strategies? Can you build intrinsic curiosity into a computer? Can you value novelty? There are some clever tricks for doing this. OpenAI (now famous for ChatGPT) is profiled for its early efforts on Atari-arcade-style games. Can we learn from how humans and apes learn? Can computers learn through imitation? Do they learn the same way? I learned that human children in some situations over-imitate compared to chimpanzees; “children are, from a very young age, acutely sensitive to whether the grown-up demonstrating something is deliberately teaching them, or just experimenting.” Why does this work? It “allows the student (be it human or machine) to learn things that are hard to describe.” DeepMind researchers managed to get an AI to beat Montezuma’s Revenge by having it watch YouTube videos of many human players.
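One of the simplest of those curiosity tricks (a textbook toy, to be clear, not OpenAI’s actual method, which involved predicting the outputs of a randomly initialized network) is to pay the agent a novelty bonus that shrinks each time it revisits a familiar state:

```python
from collections import defaultdict
from math import sqrt

# Count-based curiosity: reward visits to states the agent hasn't seen much.
# BETA trades off the novelty bonus against the environment's own reward.
BETA = 0.1
visit_counts = defaultdict(int)

def curious_reward(state, env_reward):
    visit_counts[state] += 1
    novelty_bonus = BETA / sqrt(visit_counts[state])  # shrinks with familiarity
    return env_reward + novelty_bonus

print(curious_reward("room_1", 0.0))  # first visit: full bonus
print(curious_reward("room_1", 0.0))  # repeat visits pay less and less
```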
This may be why taking students through worked examples, then letting them try simpler problems before adding complexity toward a more sophisticated one, is a pedagogical approach that works well, at least in chemistry. Many of these principles came from researchers studying the teaching and learning of math. There’s also a tricky balance between intrinsic and extrinsic motivational approaches; it’s not that one always works better than the other. I’m not sure that final grades, which I assign based on numerical scores, are the best value function for my students to strive toward. I understand that grades loom large for increasingly stressed students in what they perceive to be a cutthroat global career market. My generation did not face the pressures they face now. With AI nipping at their heels as a competitor, the business of educating AI may be existential for them, even if they don’t realize it yet.