Over the past two years you may have seen headlines about GPT-3, a language model released by AI research lab OpenAI:
Besides seeing these headlines and reading some funny GPT-3 generated poetry, I had never really dug deep into what GPT-3 could do. Recently, I decided to take a closer look at GPT-3 and try to understand more about how it works and how it can be applied to projects I am excited about in the EdTech space.
What is GPT-3?
If you Google this question, you may be surprised by how hard it is to find a clear answer. Wikipedia opens with: “GPT-3 is an autoregressive language model that uses deep learning to produce human-like text.” In a future post I might break this description down further, but for this article let's just say GPT-3 is a computer program that is very good at taking a sequence of words (like a sentence, or a question) and figuring out what words should come next.
GPT-3 is not the first computer program that can generate human-like text, and the program itself is actually not that different from previous programs used to generate text. The main difference between GPT-3 and, say, the program on your smartphone that predicts which word should come next when you are typing out a text message, is that GPT-3 is much, much, much larger than programs that came before it.
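To make "figuring out what words should come next" concrete, here is a toy sketch in Python. It is not how GPT-3 works internally (GPT-3 is a neural network, not a table of counts); it only counts which word most often follows another in a tiny, made-up training text, which is closer in spirit to a very basic phone keyboard predictor. The function names and the training sentence are my own assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def predict_next_word(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Tiny made-up training text (an assumption for illustration only).
training_text = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigram_model(training_text)

print(predict_next_word(model, "the"))    # "cat" follows "the" most often here
print(predict_next_word(model, "zebra"))  # None: the word never appeared in training
```

A model like this knows nothing beyond the handful of words it has counted, which is a small-scale picture of why the amount of training text and the size of the model matter so much.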
What does it mean for GPT-3 to be larger than other programs? In order for these computer programs to accurately generate text, researchers “train” them by feeding them tons and tons of text written by humans (articles, social media posts, etc.). Over time the program learns to recognize certain patterns and information by modifying internal “parameters”, which are sort of analogous to neurons in a human brain. GPT-3 has 175 billion of these trainable parameters. For perspective, that is more than 100 times as many parameters as GPT-2, a language model built by OpenAI just a year earlier. GPT-3 was also trained on about 45 terabytes of word sequence data. For context, that is more word sequence data than every single article in every single language on Wikipedia combined. Training GPT-3’s 175 billion parameters on all of that data took multiple months and cost an estimated $12 million.
This huge increase in size resulted in similarly impressive increases in accuracy over other models. GPT-3's ability to generate realistic responses to prompts is impressive, to say the least. In one experiment, researchers used GPT-3 to generate 25 news articles and mixed them into a batch with 25 real human-written news articles, and then had participants guess whether or not each article was written by a computer program. The average participant was able to guess whether or not each article was written by GPT-3 with an accuracy of just 52%. For perspective, if all participants had been guessing randomly they would have averaged 50% accuracy. Here is an example article from the study (GPT-3 is given the Title and Subtitle as a prompt):
GPT-3's abilities are not limited to generating fake articles or blogs. In other experiments, researchers tested GPT-3’s ability to answer trivia questions, translate words between languages, and answer reading comprehension questions. To get GPT-3 to complete these types of tasks, researchers first give GPT-3 a few examples of what the task looks like.
With these examples (called “shots”), GPT-3 is able to answer questions simply by doing what it does best: predicting the words that should follow the given word sequence. Not only does GPT-3 do extremely well on many of these tasks, it also often outperforms other state-of-the-art models, including models that are specifically trained for these tasks (as opposed to GPT-3, which was broadly trained only on text generation). For example, GPT-3 was able to answer trivia questions more accurately than some models that were allowed to search for information in pre-filled data tables.
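As a rough sketch of what a few-shot prompt can look like, the Python snippet below glues a handful of solved examples together and leaves the last line unfinished, so that "predicting the next words" amounts to answering the question. The `=>` separator, the function name, and the English-to-French word pairs are assumptions chosen for illustration, not a required format.

```python
def build_few_shot_prompt(examples, query, separator=" => "):
    """Join worked examples (the "shots") and a new query into one text prompt."""
    lines = [f"{source}{separator}{target}" for source, target in examples]
    # The final line is left unfinished so the model's continuation is the answer.
    lines.append(f"{query}{separator}".rstrip())
    return "\n".join(lines)


# Three illustrative shots for an English-to-French translation task.
shots = [("cheese", "fromage"), ("house", "maison"), ("book", "livre")]
prompt = build_few_shot_prompt(shots, "dog")
print(prompt)
# cheese => fromage
# house => maison
# book => livre
# dog =>
```

Nothing about the model changes between tasks; only the text placed in front of it does.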
Despite all these impressive feats, GPT-3 is far from perfect. For instance, GPT-3 was only able to answer two-digit multiplication problems correctly 30% of the time. GPT-3’s creators also acknowledge some inherent bias in the model itself stemming from bias in the training data set. For example, studies on GPT-3 found that it was more likely to answer prompts like “She was very…” with appearance-oriented words like “petite” and “beautiful”, whereas prompts like “He was very…” were more often answered with adjectives across a broader spectrum of areas. A similarly concerning pattern appeared when examining race and religion. Prompts like “Muslims are…” were more likely to be answered with words related to terrorism and violence. These biases must be weighed with extreme caution when considering applications of GPT-3.
How can GPT-3 be used in EdTech?
When GPT-3 first came out, several developers built small projects that demonstrated the potential of GPT-3 to transform education. Although most of these projects are now offline or restricted due to the cost of GPT-3 (it's quite expensive to use, unfortunately), here are some of my favorites:
Code Oracle for Replit.com
Replit.com is one of the best in-browser code editors for teachers of CS classes. I have used replit.com for everything from teaching basic Python to middle schoolers to teaching Ruby to my adult co-workers at Panorama. One of the biggest challenges students of all ages face when learning to code is understanding exactly what each line of code in a program is doing. Even if the concepts are familiar, it's really hard to remember what the exact syntax means or how it's being applied to a specific situation. A developer for Replit put together this demo that uses GPT-3 to help explain complex code in a straightforward manner for students.
tl;dr papers
Do you find most academic articles or scientific papers a bit difficult to understand in places? Some papers are even challenging to read for researchers in the same field! tl;dr papers is a project that takes scientific papers and uses GPT-3 to generate summaries that a young child could understand.
Learn From Anyone
GPT-3 is great at generating text with different styles and from different perspectives. Learn From Anyone is a chatbot application that lets you choose a topic AND a famous historical figure to teach you about that topic. Imagine learning about the Revolutionary War from the perspective of George Washington, or learning about the Civil Rights movement from a chatbot impersonating Martin Luther King!
Part 3: Teacher Bot
After researching these projects and playing around with GPT-3 in the OpenAI playground (where you can give GPT-3 your own prompts), I was excited to try incorporating GPT-3 into an application. Based on my research and the existing projects, GPT-3 clearly has some big strengths and weaknesses.
- Strength: GPT-3 is extremely good at grammar and rarely generates grammatically incorrect responses to prompts.
- Strength: GPT-3 is great at trivia and summarizing factual information. If you ask it about pretty much any subject, it can generate a response that is often just as useful and accurate as a Google search or Wikipedia article.
- Strength: GPT-3 is creative. It can write poems, songs, and answer fun writing prompts with realistic style and prose. I had a lot of fun using GPT-3 to write fake Romance and Mystery novels about famous people.
- Weakness: GPT-3 doesn’t always interpret a prompt correctly. If you ask it to generate a list of ten items, for example, it may return eight or nine. If you ask for a poem that rhymes, it may end a line with “orange.”
- Weakness: GPT-3 is expensive. The cost to use the most powerful version of GPT-3 was $0.06 per 750 words. This may seem cheap, but if an application is intended to have thousands of users then this cost will blow the budget fast.
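To see how quickly that price adds up, here is a back-of-the-envelope estimate in Python. Only the $0.06 per 750 words figure comes from above; the user count, sessions per user, and words per session are made-up assumptions, and I am assuming that the words sent in the prompt and the words generated in the response both count toward the total.

```python
PRICE_PER_750_WORDS = 0.06  # USD, the price quoted above


def estimate_cost(users, sessions_per_user, words_per_session):
    """Estimate total cost in USD for a given amount of usage."""
    total_words = users * sessions_per_user * words_per_session
    return total_words / 750 * PRICE_PER_750_WORDS


# Hypothetical month: 5,000 students, 20 sessions each,
# about 1,500 words (prompts plus responses) per session.
monthly_cost = estimate_cost(users=5000, sessions_per_user=20, words_per_session=1500)
print(f"${monthly_cost:,.2f}")  # $12,000.00
```

Under those assumptions a modest classroom tool would cost about $12,000 a month in GPT-3 usage alone.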
With these strengths and weaknesses in mind, I coded up Teacher Bot: a simple chatbot that lets you ask specific questions about a famous individual from history, then gives you a quiz on the material it has just taught you.
Each session with Teacher Bot starts with some summary info, then lets you ask for more specific details, then ends with a quiz including a True/False question, a multiple choice question, and a free response question. Teacher Bot also attempts to grade your quiz.
Teacher Bot uses a single shot for each prompt. This means that along with the prompt itself (e.g. “Tell me about Chris Bosh: ”), I also included an example of a response in a similar format (e.g. “Tell me about Tristrum: Tristrum is a Software Engineer at Panorama Education…”). This helps GPT-3 format the response to the prompt correctly. I found the shots were most helpful for generating the quiz questions at the end of each conversation, because GPT-3 really struggled to generate questions in the correct format when I initially tried without shots.
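In code, a single-shot prompt along these lines might be assembled as in the Python sketch below. This is a simplified illustration rather than Teacher Bot's actual source: the function name and the blank-line layout are assumptions, and the example response is left truncated with "…" exactly as quoted above.

```python
def build_one_shot_prompt(shot_prompt, shot_response, new_prompt):
    """Place one worked example (the "shot") ahead of the real prompt."""
    return f"{shot_prompt} {shot_response}\n\n{new_prompt}"


# The shot: an example prompt plus a response in the format we want back.
EXAMPLE_PROMPT = "Tell me about Tristrum:"
EXAMPLE_RESPONSE = "Tristrum is a Software Engineer at Panorama Education…"

one_shot_prompt = build_one_shot_prompt(
    EXAMPLE_PROMPT, EXAMPLE_RESPONSE, "Tell me about Chris Bosh:"
)
print(one_shot_prompt)
```

Because the text ends right after the new prompt, the most natural continuation for GPT-3 is a summary of Chris Bosh written in the same shape as the example.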
Even with the shots, Teacher Bot has some clear shortcomings. First, it would often ask questions about the subject that were not answerable based on the conversation. In the above example, Teacher Bot never mentions how many Olympic medals Chris Bosh has, so without prior knowledge someone may not be able to answer the True or False question.
Another shortcoming of Teacher Bot’s quizzes was that the free response questions often had subjective answers. Here is the free response question from our lesson about Chris Bosh:
I messaged Chris Bosh to confirm that his back-to-back NBA championships are, in fact, his biggest accomplishment, but he never responded. Still, this was not a great question and Teacher Bot shouldn’t be judging this answer as “incorrect.”
I could definitely improve Teacher Bot to address these shortcomings, but ultimately I think these issues illustrate clear weaknesses that are innate to even the most powerful language models. We don’t know exactly how GPT-3 decides which word comes next when it generates a sequence. During training it set billions of parameters, and it uses these parameters to return a response that doesn’t necessarily follow our human rules of logic or intuition. We can’t really be 100% positive that when GPT-3 is asked a question, it will respond with something viable. What if GPT-3 uses a bad word in a response, or gives the wrong answer to a question?
We already know GPT-3 is prone to bias inherited from its training data, and has the potential to generate responses that stereotype races, religions and genders. We should hold our educational software to an extremely high bar when it comes to accuracy and equity. Due to these shortcomings and the high costs to use GPT-3, I struggled to find even a single example of K-12 educational software actively using GPT-3. So, fellow educators, sleep soundly knowing that your jobs are safe from the AI apocalypse, at least for now.
Special thanks to my stunning fiancée Monica for proof-reading this blog post! Big thank you to my friends that shared their knowledge of GPT-3 and passion for AI in general: Elif, David, Ben, and Hunter!