Building a daily joke website with AWS Lambda and GPT-4

Tristrum Tuttle
13 min read · May 22, 2023


I recently had an idea for a very simple website that writes jokes about current events using OpenAI’s GPT-4. My game plan was to write a script in Python that would parse a news website, generate jokes about the news using GPT-4, then store the result as HTML in S3 so it can be easily viewed or shared.

Step 1: Scrape a News Website

I decided to go with AP News, since it has a pretty simple website relative to other news websites. I started by installing Python 3.7, because that is the minimum required version for using OpenAI’s Python library. There are many tutorials for updating or changing your Python version, so I’d recommend looking around Stack Overflow for a tutorial that matches your Operating System (OS) and hardware.

With Python 3.7 installed, I now needed to install Beautiful Soup 4. I was able to use pip install, but this will be highly dependent on your OS. I also installed Requests, a simple HTTP library.
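
For reference, the install commands look something like this (depending on your setup you may need pip3 instead of pip):

pip install bs4
pip install requests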

Now, I can write my script to grab the article text for recent news headlines on AP News. All of my code can be found on GitHub here, but this is the relevant scraping method:

import requests
from bs4 import BeautifulSoup

def scrape_news():
    URL = "https://apnews.com/"
    page = requests.get(URL)

    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find("div", class_="TopStories")
    top_stories = results.find_all("a", {"data-key": "card-headline"})
    minor_stories = results.find_all("a", {"data-key": "related-story-link"})

    # I found a lot of duplicates, so I decided to eliminate those by
    # converting the array to a set
    headline_elements = set(
        ["https://apnews.com" + headline['href'].strip() for headline in (top_stories + minor_stories)]
    )

    for headline_url in headline_elements:
        headline_page = requests.get(headline_url)
        article_soup = BeautifulSoup(headline_page.content, "html.parser")
        article_headline = article_soup.find("h1").text
        article_body = article_soup.find("div", class_="Article")
        article_text = article_body.find_all("p")
        short_article_text = ""
        for paragraph in article_text[0:10]:
            short_article_text += paragraph.text
        print(article_headline)
        print(short_article_text)
        print(headline_url)

scrape_news()

Here is an example article output from this method:

Nuggets on brink of NBA Finals with 119–108 win over Lakers in Game 3

LOS ANGELES (AP) — Nikola Jokic was far from his usual dominant self

[10 paragraphs of text]

https://apnews.com/article/nba-playoffs-2023-lakers-nuggets-e164ab7e5200562166975ec15e08a4a0

Step 2: GPT-4 Writes Jokes

For writing the jokes using GPT-4, I needed to make an account on OpenAI and create an API key. Since this key is linked to your account, be very careful not to share it online. If you do accidentally commit your API key to a public GitHub repository, OpenAI may alert you and rotate the key automatically. I definitely did not learn this from personal experience.

I was able to install the openai library using pip, and I also installed the python-dotenv library to be able to read my API key from a .env file. Alternatively, you can just add your API key and organization key to the code directly, but that is not a safe practice (even for side projects). The dotenv library works alongside the os library, which is part of Python's standard library. If you are having trouble with the local setup, feel free to jump to Step 3 and try building in AWS Lambda directly.

To start, I created a .env file and added my API key and organization ID. You can find your organization ID on OpenAI’s organization settings page. My .env file looks like this:

OPENAI_API_KEY=sk-...
ORG_KEY=org-...

I quickly tested that the environment variables were working correctly by listing the OpenAI models. This also confirmed that I had access to GPT-4:

import os
import openai
from dotenv import load_dotenv

load_dotenv()
openai.organization = os.getenv("ORG_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")
print(openai.Model.list())

# Output
# ...
# {
#   "created": 1678604602,
#   "id": "gpt-4",
#   "object": "model",
#   "owned_by": "openai",
#   "parent": null,
#   "permission": [
#     {
#       "allow_create_engine": false,
#       "allow_fine_tuning": false,
#       "allow_logprobs": false,
#       "allow_sampling": false,
#       "allow_search_indices": false,
#       "allow_view": false,
#       "created": 1684465847,
#       "group": null,
#       "id": "modelperm-HnvVZ1tf2jVawVaM1B3yjZnD",
#       "is_blocking": false,
#       "object": "model_permission",
#       "organization": "*"
#     }
#   ],
#   "root": "gpt-4"
# }

Next, I started writing my method to query GPT-4 and write jokes about my scraped news articles. I used a temperature of 0.3 to get a good mix of consistency and creativity, and set the max_tokens to 200 to limit the size of the response. I also included a system prompt of “You are the punniest person on the planet” to encourage GPT-4 to add some flair to the joke delivery.

import os
import openai
from dotenv import load_dotenv

load_dotenv()
openai.organization = os.getenv("ORG_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")

def get_joke(article):
    prompt = "Write a short joke about the following news article:"
    prompt += article
    prompt += "\n Joke:"
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.3,
        max_tokens=200,
        messages=[
            {"role": "system", "content": "You are the punniest person on the planet."},
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content

This worked pretty well out of the box. The first test article I submitted was the “Nuggets on brink of NBA Finals with 119–108 win over Lakers in Game 3” story, and GPT-4 returned this knee slapper:

Why did the Lakers lose to the Nuggets in Game 3? Because they couldn’t ketchup to Denver’s relish for victory!

I guess it’s sort of a play on the Denver Nuggets and chicken nuggets? Anyway, this was good enough for my MVP.

At this point, I also realized two big issues with my initial idea:

  • 90% of news is about incredibly serious topics that shouldn’t really be taken lightly, like death and war
  • GPT, at least without fine-tuning, generates many jokes that use harmful stereotypes as the main premise

I decided to continue with the project mostly as a technical exercise, but I don’t plan on making any of the results public until I can fine-tune the prompt and add guardrails to catch harmful content.
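
As a first pass at those guardrails, one option is to run each generated joke through OpenAI’s moderation endpoint before publishing it. This is just a sketch of the idea rather than something already wired into my code, and is_joke_safe is a hypothetical helper name:

import openai  # assumes the API key is configured as in the earlier snippets

def is_joke_safe(joke):
    # Hypothetical guardrail helper: ask OpenAI's moderation endpoint
    # whether the generated joke gets flagged for harmful content
    moderation = openai.Moderation.create(input=joke)
    return not moderation["results"][0]["flagged"]

A more complete version would probably also screen the article text itself and skip obviously sensitive stories entirely.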

Step 3: Setting Up AWS

I decided the easiest way to view and share the results of my joke bot daily would be to upload them to a bucket in AWS S3. From there, you can query them from a web application or make them publicly readable, although Amazon recommends against that type of S3 setup.
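
For example, if the bucket stays private, a small backend could hand out temporary links to the daily file using a presigned URL. Here is a rough sketch of that approach (the bucket name and date-based key match what I set up later in this post):

import boto3

s3_client = boto3.client("s3")
# Generate a temporary link to a specific day's joke file (expires in 1 hour)
url = s3_client.generate_presigned_url(
    "get_object",
    Params={"Bucket": "news-jokes-ai", "Key": "2023-05-22.html"},
    ExpiresIn=3600,
)
print(url)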

Rather than bother trying to get the AWS SDK for Python, Boto3, installed and configured locally, I decided to create my AWS Lambda function first because AWS Lambda includes Boto3 by default.

To start, I created a bucket in S3 called “news-jokes-ai”. As long as your AWS account is set up correctly, this is normally pretty straightforward. I used US East as the region and selected all the default options.

Next, I went ahead and created an AWS Lambda function. I titled this function “daily-uploads” since I plan to run the function on a daily cadence, gather the jokes, and store them as a simple HTML file.

I also recommend creating a new role with basic permissions.

Even though my Lambda function will already have Boto3 ready to use for S3 operations, it will not have access to my S3 bucket without some additional configuration. First, I needed to configure my IAM role to be permitted to put files into my S3 bucket. If you selected “Create a new role” in the previous step, you will see a role with a name like “your-lambda-function-dash-role” on the IAM roles page. You can also create a new role from this page directly. I clicked on my function’s role and selected “Add Permissions.”

Here, you can either add an existing policy or create a new one. Even though the existing AmazonS3FullAccess policy would work for our purposes, the best practice for security’s sake is to limit access to only the specific action and bucket our Lambda function will be interacting with. I selected “Create policy” and made a new policy with PutObject permissions for the S3 bucket I made previously.

I clicked continue and gave my policy the name “ai-news-jokes-uploader” in the “Review and create” step. Once the policy was created, I could easily select and add it from the previous “Add Permissions” page.
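
For reference, a minimal version of that policy document looks roughly like this, assuming the news-jokes-ai bucket from earlier (the console may generate a slightly different layout):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::news-jokes-ai/*"
        }
    ]
}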

Step 4: Lambda Function Uploads to S3

Now that I have a Lambda function with S3 permissions configured, I tested uploading an HTML file to S3. I wanted to keep the format of my HTML file simple, with just a news headline and joke per news article. In the “Code source” section of my Lambda function, I added a method to send a file to my S3 bucket. Lambda invokes a specifically named lambda_handler method when triggered, but otherwise there are not many restrictions on the code design.

import boto3
import datetime

html_template_open = "<html><head><title>News Jokes AI</title></head><body>"
html_template_close = "</body></html>"

def upload_news_to_s3(joke_data_list):
    print("Uploading to S3...")
    s3 = boto3.resource("s3")
    my_bucket = s3.Bucket('news-jokes-ai')
    date = datetime.datetime.now().strftime("%Y-%m-%d")
    html_body = ""

    # Each joke_data entry is [headline, source URL, joke]
    for joke_data in joke_data_list:
        html_body += "<div>"
        html_body += "<h2> %s </h2>" % joke_data[0]
        html_body += "<p> %s </p>" % joke_data[2]
        html_body += "<p> Source: <a href=%s>AP News</a></p>" % joke_data[1]
        html_body += "</div>"

    html_heading = "<h1>Jokes for %s </h1>" % date
    html_object = html_template_open + html_heading + html_body + html_template_close
    html_file_name = date + ".html"
    my_bucket.put_object(Key=html_file_name, Body=html_object, ContentType="text/html")
    return "Success!"


def lambda_handler(event, context):
    # Dummy test data in the same [headline, source URL, joke] order
    joke_data_list = [["Headline", "https://apnews.com/", "Joke"]]
    upload_reply = upload_news_to_s3(joke_data_list)
    return upload_reply

After editing, changes to your Lambda function do not take effect for testing or execution until you click the “Deploy” button. I deployed my script, then clicked “Test” to try it out. I used all of the default options on the Test Event creation page, including the hello-world template.

After saving the test event, my test kicked off and I got a “Success!” response. I went to my S3 bucket and confirmed that I now had a file named with today’s date.

No notes!

Step 5: Python Code, Assemble!

To summarize, so far I have:

  1. A Python method that scrapes AP news for news articles
  2. A Python method that takes a news article and returns a GPT-4 generated joke about the article
  3. An S3 bucket
  4. A Lambda function that takes an array of joke items, turns them into a file and uploads it to the S3 bucket

In order for the Lambda function to be able to run my news scraper and AI joke generator, we need to use layers. Layers let AWS Lambda scripts access additional libraries, like Beautiful Soup 4 and OpenAI, that are not included in Lambda by default. To create a layer, I first needed to get a zip file of each of my additional libraries. This process may vary depending on your OS. On my Mac, I was able to follow this tutorial and copy the following steps:

Run the following command to download Beautiful Soup into a temp directory:

pip install bs4 -t temp/python/

Then change into the temp directory and zip up the files into a single archive called bs4.zip:

cd temp
zip -r9 bs4.zip .
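
The openai library follows the same pattern; the directory and zip names below are just my own choices:

pip install openai -t temp_openai/python/
cd temp_openai
zip -r9 openai.zip .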

Once you have a zip file of a library, you can create a layer by navigating to the “Lambda layers” page and clicking “Create layer.”

Once you have created your layers, you should see them on the Layers page. The version defaults to 1, but you can add additional versions for different use cases in the future.

I navigated back to my Lambda function and clicked “Add layers,” then selected my Custom layers. For this project, bs4 and openai were the only two external libraries I needed to add layers for. The other libraries (os, requests, etc) are already available in Lambda.

Now that I had my layers created and added, I was able to add my news scraper method and joke generator method to my Lambda function. One last configuration change needed to get everything working is adding the OpenAI API key and organization ID to the Lambda environment. You can set these by navigating to Configuration → Environment variables.

After adding the environment variables, I was able to access them directly with the os module:

openai.organization = os.getenv("ORG_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")

I also increased the timeout from the default 3 seconds to 5 minutes. The web scraping and joke generation take a decent amount of time, and Lambda is a relatively cheap service to run.

After some minimal tinkering, my code was successfully uploading AI jokes about the news to my S3 bucket.

Very exciting!

You can find the full code for my AWS Lambda on GitHub here, or copied below:

import boto3
import datetime
import requests
from bs4 import BeautifulSoup
import os
import openai

html_template_open = "<html><head><title>News Jokes AI</title></head><body>"
html_template_close = "</body></html>"

openai.organization = os.getenv("ORG_KEY")
openai.api_key = os.getenv("OPENAI_API_KEY")

def upload_news_to_s3(joke_data_list):
    print("Uploading to S3...")
    s3 = boto3.resource("s3")
    my_bucket = s3.Bucket('news-jokes-ai')
    date = datetime.datetime.now().strftime("%Y-%m-%d")
    html_body = ""

    # Each joke_data entry is [headline, source URL, joke]
    for joke_data in joke_data_list:
        html_body += "<div>"
        html_body += "<h2> %s </h2>" % joke_data[0]
        html_body += "<p> %s </p>" % joke_data[2]
        html_body += "<p> Source: <a href=%s>AP News</a></p>" % joke_data[1]
        html_body += "</div>"

    html_heading = "<h1>Jokes for %s </h1>" % date
    html_object = html_template_open + html_heading + html_body + html_template_close
    html_file_name = date + ".html"
    my_bucket.put_object(Key=html_file_name, Body=html_object, ContentType="text/html")
    return "Success!"


def get_joke(article):
    prompt = "Write a short joke about the following news article:"
    prompt += article
    prompt += "\nJoke:"
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.3,
        max_tokens=200,
        messages=[
            {"role": "system", "content": "You are the punniest person on the planet."},
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content


def scrape_news_and_create_jokes():
    URL = "https://apnews.com/"
    page = requests.get(URL)

    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find("div", class_="TopStories")
    top_stories = results.find_all("a", {"data-key": "card-headline"})
    minor_stories = results.find_all("a", {"data-key": "related-story-link"})

    # I found a lot of duplicates, so I decided to eliminate those by
    # converting the array to a set
    headline_elements = set(
        ["https://apnews.com" + headline['href'].strip() for headline in (top_stories + minor_stories)]
    )

    joke_data_list = []

    for headline_url in headline_elements:
        headline_page = requests.get(headline_url)
        article_soup = BeautifulSoup(headline_page.content, "html.parser")
        article_headline = article_soup.find("h1").text
        article_body = article_soup.find("div", class_="Article")
        article_text = article_body.find_all("p")
        short_article_text = ""
        for paragraph in article_text[0:10]:
            short_article_text += paragraph.text

        joke_data_list.append([
            article_headline,
            headline_url,
            get_joke(short_article_text)
        ])

    return joke_data_list


def lambda_handler(event, context):
    joke_data_list = scrape_news_and_create_jokes()
    upload_reply = upload_news_to_s3(joke_data_list)
    return upload_reply

Step 6: Scheduling the Function to Run Daily

The final step was to get my Lambda function to run at a set time every day. AWS makes this super simple with Amazon EventBridge (formerly CloudWatch Events). I considered two options for setting this up: using either an EventBridge scheduled rule or EventBridge Scheduler. AWS recommends EventBridge Scheduler for this use case, because it was developed specifically as an improvement over scheduled rules for invoking targets on a schedule.

To set up a schedule, I navigated to EventBridge → Schedules and clicked “Create schedule.” From here, I set the frequency using a cron expression so that my function would run every day at 9am, and set the flexibility window to 15 minutes as well.
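
For reference, the EventBridge cron expression for 9am every day looks like this (you can adjust the hour or use the scheduler’s time zone setting as needed):

cron(0 9 * * ? *)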

On the Step 2 page, I selected AWS Lambda as the target and specified my Lambda function.

Finally, on the last page, I turned off retries and the dead-letter queue. If I accidentally have a bug in my code or don’t handle an error correctly, I’d prefer to just skip a day’s worth of jokes rather than have my function retry.

I let AWS create a new role for the permissions, and saved the schedule. One nice feature of the EventBridge Scheduler is that it allows you to set up a one-time schedule, which is helpful for testing that your repeating schedule is set up correctly. I quickly created a one-time schedule with the same specs as the cron-based schedule, and set it up to go off in a few minutes. After the schedule’s time elapsed, I checked S3 and confirmed I had a newly updated joke file.

Final Thoughts

My script is currently generating jokes about the news every day! Overall, I learned a lot and really enjoyed this project. S3 and Lambda are extremely cheap, costing barely $0.01 per month (a cost which is dwarfed by my GPT-4 cost of ~$0.05 per day). AWS Lambda also has built-in CloudWatch monitoring for logging errors, which is extremely useful. I am looking forward to building some safety guardrails to eliminate harmful generations and fine-tuning the prompt.

So far this is my favorite generated joke:

Additional Resources

These are some of the tutorials I relied on to get everything in this post set up correctly. Huge shout out to all the creators of these resources!

AWS Lambda and S3

Installing libraries in Lambda (openai, bs4, boto3)

Scheduling in Lambda
