How to Not Build a Search Engine
At a glance, building a search engine from scratch is easy. All you need to do is:
- Write a web crawler to scrape the sites you want to search. In theory, this is as simple as sending a GET request to a list of web pages and storing the response data. In practice, your crawler needs to respect the rules in each site’s robots.txt file, deal with paywalls and anti-bot measures, and traverse chains of links found by parsing HTML (a haphazard process). The crawler needs to run periodically to keep data fresh, and if you aren’t careful it can get stuck pulling copies of the same page or download more pages than your storage solution can handle.
- Write an algorithm to index and rank search results. When a user wants to search through your collection of scraped sites and data, a naive algorithm that just returns exact matches of the searched text in a random order will provide horrible results, if it doesn’t time out entirely. You need to provide results quickly, using an index that can surface pages that contain terms from the search text. Then, you need to rank those surfaced pages so that the most relevant pages are listed first.
- Build a frontend that lets users access the search functionality and view the results of their search. You may need to paginate the results and add thumbnails and some description text parsed out of each result, so users can preview a page before visiting the external link.
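To make the first two steps concrete, here is a minimal sketch of a crawl loop and a ranked index. It is only an illustration under loud assumptions: FAKE_WEB is a hypothetical in-memory stand-in for real HTTP requests, robots.txt handling and HTML parsing are left out entirely, and the ranking is a plain TF-IDF sum rather than anything a production engine would use.

```python
import math
import re
from collections import Counter, defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP and honor robots.txt.
FAKE_WEB = {
    "a.example/": ("school news for students", ["a.example/sports", "a.example/"]),
    "a.example/sports": ("students win the school soccer final", ["a.example/"]),
    "a.example/weather": ("storm news and weather warnings", []),
}

def crawl(start, max_pages=100):
    """Breadth-first crawl that skips seen URLs and stops at max_pages."""
    seen, queue, pages = {start}, deque([start]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url not in FAKE_WEB:
            continue
        text, links = FAKE_WEB[url]
        pages[url] = text
        for link in links:
            if link not in seen:  # avoids pulling the same page again and again
                seen.add(link)
                queue.append(link)
    return pages

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Inverted index: term -> {url: number of times the term appears}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][url] = count
    return index

def search(index, total_pages, query):
    """Rank pages by a simple TF-IDF sum over the query terms."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + total_pages / len(postings))  # rarer terms weigh more
        for url, tf in postings.items():
            scores[url] += tf * idf
    return [url for url, _ in scores.most_common()]

pages = crawl("a.example/")
index = build_index(pages)
results = search(index, len(pages), "school soccer")
```

The `seen` set and the `max_pages` cap address the two failure modes mentioned above (looping on the same page and overrunning storage). Everything else a real crawler needs is still missing, which is the point of the next paragraph.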
On top of these general steps, everyone who uses the internet is accustomed to the speed, accuracy and functionality of Google, a search engine that has devoted huge teams and resources to each of the steps above. Writing a search engine from scratch may take an experienced engineer a few days, but creating a search engine that feels as responsive and smooth as Google can easily take years.
Don’t Reinvent the Wheel
When I took on a project to build a website for searching news articles targeted at K-12 students, I initially considered building a search engine from scratch. I researched options and looked at vendors that could potentially help me bridge some of the gap between my engine and the state of the art, like Algolia and Elasticsearch, but the pricing and implementation costs didn’t seem worth it for my very minimal MVP.
Although building a search engine may be a fun and rewarding technical challenge, it’s not the fastest way to build and validate an MVP.
I realized that I didn’t want to spend years writing a competitor to Google. I really just wanted to query an existing search engine, but limit the results to specific websites and sort by date. Surely there is a free or cheap API, or some sort of open source solution I could self-host?
My investigation led me to this old StackOverflow post which lists several possible options:
- Yahoo BOSS = discontinued
- EntireWeb Search API = discontinued
- Gigablast = discontinued
- SeekStorm (formerly Faroo) = too expensive
- Bing CustomSearch (formerly Bing Search API?) = too expensive
- Google Custom Search API (formerly Web Search API) = too expensive
- Common Crawl = free, but only updates once per year
At this point, I realized that the current state of public search engines looks pretty similar to the current state of other public Web 2.0 APIs. The options that used to be free have now been discontinued, replaced by expensive alternatives with unnecessary AI features, confusing cost structures and documentation that is cobbled together from the previous versions.
Just when I was beginning to lose hope, I discovered that Google’s Custom Search does let you build a custom search engine, without programmatic access, that you can embed directly into a webpage. It’s not the flashiest tech or the most complex platform, but it’s super easy to use and it works! And so, mightynews.org was born.
Creating Mighty News
Out of the gate, Google Custom Search proved to be a pretty good option for my MVP needs. It was extremely easy to add sites for indexing and querying, and we were even able to use regex in URLs to limit to specific subpages hosted on sites like Apple Podcasts and Buzzsprout without including the entire domain.
You can disable the Google branding and add specific autocompletes. There are also a ton of options in the “Look and Feel” section that let you customize the font, color and orientation of the search box as well as the format of the search results themselves. As a final bonus, Google Custom Search provides some basic stats on usage and popular keyword searches.
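As a rough illustration of what limiting to subpages means, the sketch below checks URLs against a few include patterns. The patterns, show names and IDs are all hypothetical, and it uses Python's `re` module rather than Google's own pattern syntax, which may behave differently.

```python
import re

# Hypothetical include patterns: one whole site, plus single shows
# hosted on shared podcast platforms (not the entire platform domain).
INCLUDE_PATTERNS = [
    r"^https://news\.example\.org/.*",
    r"^https://podcasts\.apple\.com/us/podcast/example-show/.*",
    r"^https://www\.buzzsprout\.com/1234567(/.*)?$",
]
_COMPILED = [re.compile(pattern) for pattern in INCLUDE_PATTERNS]

def is_included(url):
    """Return True if the URL matches at least one include pattern."""
    return any(pattern.match(url) for pattern in _COMPILED)
```

With these patterns, an episode page under `example-show` is included, while another show on the same Apple Podcasts domain is not.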
Although this approach was great for an MVP, the downsides were also evident from the start. Since the query isn’t added to the URL by default, reloading the page (or visiting a link and returning) totally wipes the previous results. You also can’t embed multiple search engines on the same page, since the embedded JS runs globally and can’t distinguish between multiple search boxes. This forced me to create separate pages for each of the search engines I wanted to add, like a podcast-only search, instead of using a SPA. Finally, the documentation for the Custom Search Engine was still in the process of being rewritten and it wasn’t clear which features were still accessible from the old version.
I did write a little bit of vanilla JS to change the default results sort method to “Date” and to show a “No results” message when the query returns nothing, but otherwise I was able to spend most of my project time on non-coding tasks like researching accessible fonts and UI/UX design best practices for student users. Overall, using Google Custom Search was the quick solution I needed to spend less time reinventing search and more time building a user-friendly website.
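The usual way around the first problem is to keep the search text in the page URL, so a reload or a back-navigation can restore it. The sketch below shows that round trip with Python's standard `urllib.parse`. The parameter name `q` and the example.org URLs are assumptions for illustration, and on a real page this logic would live in browser-side JS rather than Python.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

def url_with_query(page_url, query, param="q"):
    """Return page_url with the search text stored in its query string."""
    parts = urlsplit(page_url)
    params = parse_qs(parts.query)
    params[param] = [query]  # add or overwrite the search parameter
    return urlunsplit(parts._replace(query=urlencode(params, doseq=True)))

def query_from_url(page_url, param="q"):
    """Read the search text back out of the URL, or '' if it is absent."""
    values = parse_qs(urlsplit(page_url).query).get(param, [])
    return values[0] if values else ""

bookmarkable = url_with_query("https://example.org/search", "solar eclipse")
restored = query_from_url(bookmarkable)
```

Here `restored` holds the same text that went in, which is what a page would use to re-run the search after a reload.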
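The logic behind those two tweaks is small. The sketch below is not the site's actual code, which was vanilla JS against Google's embedded widget. It is a Python illustration that assumes a hypothetical result shape with `title` and `published` fields.

```python
from datetime import date

def newest_first(results):
    """Return results ordered by publication date, most recent first."""
    return sorted(results, key=lambda r: r["published"], reverse=True)

def render(results):
    """Format results as text lines, or a fallback message when empty."""
    if not results:
        return "No results"
    return "\n".join(
        f'{r["published"].isoformat()}  {r["title"]}' for r in newest_first(results)
    )

# Hypothetical sample data for illustration only.
sample = [
    {"title": "Older story", "published": date(2023, 1, 5)},
    {"title": "Newer story", "published": date(2023, 3, 9)},
]
listing = render(sample)
empty_listing = render([])
```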
Go visit mightynews.org today!
Special thank you to Dan Buck, director of the Briarwood lower school in Texas, for pitching this project and for his work to inspire students to learn about the news and current events. Check out his podcast Little News Ears on YouTube!