An inverted index to serve quotes

Sunday, August 15, 2021

I’ve built an app that serves quotes I like: quotes.sebstrug.com.

Specifically, I built an inverted index for the app. The inverted index was really the point of the exercise; saving and serving quotes I like has been a nice side effect.

What is an inverted index?

Inverted indexes let you quickly search a set of documents for a word by building a map from words to the documents that contain them.

For example, say you have the two quotes:

Tact is the knack of making a point without making an enemy.

Isaac Newton

No plan survives first contact with the enemy.

Field Marshal Helmuth von Moltke

As JSON, the inverted index for these two quotes would look like:

{
    "tact": ["quote 1"],
    "is": ["quote 1"],
    ...,
    "enemy": ["quote 1", "quote 2"]
}
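As a minimal sketch (not the app's actual code in src/index.py), building such an index is one pass over the documents, appending each document's ID to the posting list of every word it contains:

```python
from collections import defaultdict

quotes = {
    "quote 1": "Tact is the knack of making a point without making an enemy.",
    "quote 2": "No plan survives first contact with the enemy.",
}

def tokenize(text):
    # Lowercase and strip punctuation so "enemy." matches "enemy".
    return [w.strip(".,!?;:").lower() for w in text.split()]

def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # De-duplicate within a document so each doc appears once per word.
        for word in set(tokenize(text)):
            index[word].append(doc_id)
    return dict(index)

index = build_index(quotes)
# index["enemy"] == ["quote 1", "quote 2"]
```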

For efficiency, each quote is saved as a document named by its index: 1.txt, 2.txt, …

Each word also has an identifier, with a corresponding JSON mapping words to identifiers: {"tact": 1, "is": 2}.

Then our inverted index looks more like:

{
    "1": [1],
    "2": [2],
    ...,
    "n": [1, 2]
}
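The rewrite from words to IDs can be sketched as follows (the variable names here are illustrative, not taken from the repo):

```python
# Start from the word-keyed index shown above (abbreviated).
word_index = {
    "tact": ["quote 1"],
    "is": ["quote 1"],
    "enemy": ["quote 1", "quote 2"],
}

# Assign each word and each document a numeric ID.
word_ids = {word: i for i, word in enumerate(word_index, start=1)}
doc_ids = {"quote 1": 1, "quote 2": 2}

# Rewrite the index purely in terms of IDs.
id_index = {
    word_ids[word]: [doc_ids[d] for d in docs]
    for word, docs in word_index.items()
}
# id_index == {1: [1], 2: [1], 3: [1, 2]}
```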

Searching this index is far faster than scanning every document for a given word: a lookup is a single dictionary access rather than a pass over the whole corpus.
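To illustrate the lookup (a sketch, not the app's search code): answering a query is a dictionary access, and multi-word queries just intersect the posting lists.

```python
# A small ID-based index: word ID -> document IDs.
index = {
    "enemy": [1, 2],
    "plan": [2],
    "tact": [1],
}

def search(index, *words):
    # Intersect the posting lists of every query word.
    postings = [set(index.get(w, ())) for w in words]
    return sorted(set.intersection(*postings)) if postings else []

search(index, "enemy")          # -> [1, 2]
search(index, "enemy", "plan")  # -> [2]
```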

Inverted indexes carry the same disadvantage as any index: adding a value takes time, because the index needs to be updated. If you're adding many documents per second and concurrency is an issue, it may be better not to update the index in real time and instead rebuild it periodically.

Implementation of the index

I implemented the index in src/index.py, using iterators throughout to keep the application as efficient as possible.

Word to ID mapping is stored in a single dictionary in memory. Since there are ~470,000 words in the English dictionary, this only amounts to about 6.5MB as a JSON.

The index is only materialised in memory as a list at the last possible point. However, this isn't a totally robust implementation: with millions, or billions, of files, the index could exceed the available memory.

The solution to this would be to:

  1. Map each English word, in sorted order, to a fixed ID, so that word IDs cannot change between builds.
  2. Save a partial index after iterating through N documents (a million? five million?). Write each partial index to remote storage, and provide a function to combine partial indexes.
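Step 2 above could be sketched like this, assuming each partial index maps a word ID to a sorted list of document IDs (the function name is hypothetical, not from the repo):

```python
from collections import defaultdict

def merge_indexes(*partials):
    # Concatenate posting lists across partial indexes, then sort each one.
    merged = defaultdict(list)
    for partial in partials:
        for word_id, doc_ids in partial.items():
            merged[word_id].extend(doc_ids)
    return {w: sorted(docs) for w, docs in merged.items()}

part_a = {1: [1, 3], 2: [2]}
part_b = {1: [5], 3: [4]}
merge_indexes(part_a, part_b)  # -> {1: [1, 3, 5], 2: [2], 3: [4]}
```

Because word IDs are fixed per build (step 1), the merge never has to reconcile two different IDs for the same word.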

Other implementation details

I used this opportunity to try out some new technologies I’m unfamiliar with:

  1. FastAPI to serve the API, although I appreciate that FastAPI is not generally suited to serving HTML.
  2. pytest, for testing. What an improvement over unittest!
  3. AWS Lambda to build the index on AWS, serverless. I've done this all clickety-click in the console for now; I may provide a serverless.yml file at a later point. The handler is in src/lambda_function.py.
  4. Terraform to provision the AWS S3 bucket. I'd previously only used Sceptre for infrastructure-as-code; it's been fun to learn a more widely used tool.
  5. The app is deployed in a tmux session on an EC2 machine on AWS. I appreciate this isn’t a scalable solution.

Get involved

I’ve provided a POST endpoint to the API where you can fill out an HTML form to add a quote you like. Since any quote you add will be added to my list of quotes, I haven’t made this public! But if you figure out the endpoint from the code, I’d love for you to add a quote.

Feel free to copy all the code and use it for your own purposes, or as a resource for learning technologies you’re unfamiliar with. I’ve provided a quick guide in the README for running it either locally or on AWS. All the AWS resources used are free-tier, so you shouldn’t incur any costs when playing around with this. Contact me with any questions.

Screenshot of the front page: “I have the simplest tastes. I am always satisfied with the best.” — Oscar Wilde