Building an Events Application: Querying events and indexing with GIN
In my last blog post we explored how to expose a simple API using the Echo Go framework and GORM ORM. We also took a look at how I wanted to represent an event in the database, including using tags and geolocation for search and filtering purposes. Great! But how should the data actually be queried when the end user is searching for events? In this post I’m going to discuss my thought process and approach to best find a match to a users search criteria.
Title and tags
You may remember from my last blog post that I chose to have column on the Events db table named tags. My initial thinking has been that most people either would want to search on the title of the event, or some other metadata about the event. I chose to go with tags as people are used to adding these to their posts on social media, and would be comfortable using them to add searchable values to their events.
The next question then, is how should these tags be represented in the database? I need the tags to be easily searchable, with fast querying in the database, and I need every event to be able to hold several tags. One solution would be to have a tags table with a column that holds a string value that is the tag, and a tag ID. Then have a through table that points to a tag and an event in the same row, to register which events hold which tags. However, this would require expensive joins to query for specific tags. A successful application, holding thousands of searchable events, would make for some inefficient querying this way.
So I need locality of the tags, and I need to be able to index said tags as effectively as possible. Because I don’t need tags to be normalised in the database (I don’t care about aggregation, etc), I can hold all of the tags in a single column on each entry.
PostgreSQL has an index type which helps immensely with lookups in arrays called GIN (Generalized inverted index). GIN works really well on JSONB, array types and full text search. It is really effective in speeding up searches on these data types in a way that B-tree indexes cannot, and is especially effective if you are using the @>
operator. It does come with the added cost of extra overhead on writes, but as I expect that there will be considerably many more searches for events than events being created or updated, I am okay with this overhead. There are also ways to mitigate some of the overhead.
Defining the column on the events table in go, using GORM looks like this:
Tags pq.StringArray `gorm:"type:text[];index:tags_idx,type:GIN"`
Here we tell GORM about the datatype using pq.StringArray
and define the column to be a text array with type:text[]
. I then specify the index as well as the indexing type using index:tags_idx,type:GIN
.
The title is more simply defined:
Title string `gorm:"not null"`
Querying events
So now that we have defined our tags and title column, we can look at querying events. Having in mind that a user will be writing their search criteria in a single search bar, it becomes apparent that we need to make our lookups using the same values for both title and tags. This means that the words written in a search could either be a list of tags or an event title. To facilitate this, the words need to be both comma separated for the tags and with spaces for the title. Assuming that the api takes a list of words delimited by commas in the query parameters, we can query the database like so:
type EventSearchItemDTO struct {
ID uuid.UUID `json:"id"`
Title string `json:"title"`
FormattedAddress string `json:"address"`
}
func (es *EventStore) Get(c echo.Context) error {
var events = []EventSearchItemDTO{}
var result *gorm.DB
params := c.QueryParams()
tags := params.Get("tags")
title := strings.ReplaceAll(tags, ",", " ")
//ILIKE makes the LIKE case insensitive
result = es.db.Model(&model.Event{}).Where("tags @> ? OR title ILIKE ?", "{"+tags+"}", "%"+title+"%").Select("id, title, formatted_address").Find(&events)
if result.Error != nil {
return c.String(http.StatusNotFound, "Not Found")
}
return c.JSON(http.StatusOK, events)
}
The resulting database query, searching for “cool party” looks like this:
SELECT id, title, formatted_address FROM "events" WHERE tags @> '{cool,party}' OR title ILIKE '%cool party%'
A thing to note, is that I’m currently calling the search criteria for tags, and then deducting the title search criteria from that, changing commas to spaces. Since it’s both things we’re searching for, the query parameter should probably not be named tags.
Improvements
Currently, searching for event title is not exactly efficient. If a user misspells a word or uses other derivatives of it (for example writing days instead of day), they would not get a hit on an event that would otherwise be a match. So we need a better search functionality for the title. But what about getting hits on some text in the description of the event? Maybe the person even wants to search for events on a specific address?
One potential solution is to add an extra column in the database that converts all relevant columns into a tsvector
for full-text search. This brings a new set of complications though. The tags that I have implemented do not need to be searched for derivatives, making them unsuitable for tsvector
search. Full text search also means that we have to think about settings weights on values in order to determine the most relevant matches and displaying them first.
An alternative I have been thinking about, is getting rid of the tags, and using categories for filtering instead, and combining it with full text search on all columns. If you have any input or suggestions I would love to hear from you in the comments.
Final thoughts
In this post, we explored how to query events using a combination of tags and event titles, leveraging GIN indexing and the less complicated ILIKE
operator. We also discussed alternative approaches to enhance search functionality. If you find these problems interesting or want to learn more about full text search, I recommend checking out pganalyze’s article Understanding Postgres GIN Indexes.
Another method to sort results by relevance is to consider the proximity of the event to the user. In my next post, I will discuss how to perform distance calculations directly in PostgreSQL.
You can access my code and review it in detail on my GitHub repo.