Tech

08.12.2023

How Relational Algebra powers Klaro Cards

The Bmg library at work in Klaro

A couple of years ago, when I joined Klaro Cards, I wrote in general terms about being excited about the way relational algebra is used in Klaro's code base. Today, I'd like to show you some specifics of how that works.

There is no shortage of database wrappers, Object Relational Mappers (ORMs) and similar tools that purport to make it easier to work with SQL databases. I have also noted a bit of a trend of "like SQL but better" languages lately. Unfortunately, very few of those projects take advantage of the solid foundations of the relational model, which relational databases are based on. If they did, they would find that practically all we need is already there.

The Bmg library, which we use for Klaro Cards, does take that path, and is not only more powerful than regular database wrappers, but a true improvement on SQL. Bmg implements a relational algebra, which is a rock-solid mathematical model of working with relational data. Let's see what that looks like in practice!

Our example

When logged into Klaro Cards, you see at the left a row of icons indicating the projects you are a member of. That list is sorted by the time you last interacted with each project, with the most recent one on top. Thus, the task is:

Get the projects of which the current user is a member, in the order of most recent interaction.

Two database tables are involved: project_events and projects. project_events is a log of events recording interactions users have had with projects, such as getting invited, accepting the invite, adding or modifying cards etc. projects contains the actual project information that we want to display. This is pretty straightforward, but still not entirely trivial in SQL and hence most SQL wrappers or ORMs. Let's walk through the process of doing it with relational algebra using Bmg.

Getting a relation

Bmg.sequel(:project_events, SEQUEL_DATABASE)
  .restrict(user: requester.id)

The first thing to notice here is that we're specifying a particular source of relations - namely a relational database accessed through the Ruby Sequel library library. We could just as easily have specified a CSV or spreadsheet file, and all the same operations would have been available to us. In Bmg, it's always "relations in, relations out".

The second line is straightforward: we limit the relation to tuples/rows where user is the current user. Nothing very fancy about it, but let us have a look at the SQL produced by Bmg (with help from the Sequel library):

SELECT
  "t1"."project",
  "t1"."user",
  "t1"."timestamp",
  "t1"."eventType",
  "t1"."nickname",
  ...
FROM
  "project_events" AS "t1"
WHERE
  "t1"."user" = 123

Pretty standard fare. But also notice what we did not have to do: we didn't have to set up any special classes or entities or any other structure that restricts what we are going to do with the data. We are not forced to view our data through any canonical form, and that is in fact one of the original motivations for the relational model half a century ago.

Projection

What's next? We don't need all of those attributes. Let's project the relation into one with fewer attributes:

Bmg.sequel(:project_events, SEQUEL_DATABASE)
  .restrict(user: requester.id)
  .project([:project, :timestamp])

We only need the project and timestamp. Let's look at the generated SQL query:

SELECT DISTINCT
  "t1"."project",
  "t1"."timestamp"
FROM
  "project_events" AS "t1"
WHERE
  "t1"."user" = 123

Here's something interesting: Bmg infers that this query can return duplicate rows, because part of the primary key has been omitted, namely the nickname attribute which together with project forms the primary key:

project_events_pkey PRIMARY KEY, btree (project, nickname)

A true relation has set semantics, which means there can be no duplicates. As the saying goes: saying something twice does not make it more true. In our example, if user 123 interacted with her blog project on Monday 5:21PM, we only need one tuple (blog, Monday 5:21PM) even if she somehow had several project activities at the exact same time. And so Bmg inserts the condition that we only want a set of tuples with distinct values. In fact, Bmg always specifies the DISTINCT condition, except when it can infer from primary keys or unique indexes that it's not necessary.

Predicates

Next, we want to narrow down the relation some more:

Bmg.sequel(:project_events, SEQUEL_DATABASE)
  .restrict(user: requester.id)
  .restrict(Predicate.neq(:eventType, 'destroy') | Predicate.neq(:cancelledAt, NULL))
  .project([:project, :timestamp])

We're adding another condition on selected tuples: we want to filter out events of the destroy type unless they are also cancelled. Notice how the predicate used in the new restrict operation is constructed. The logical disjunction (OR), is not treated as a SQL feature because, well, it isn't. Rather, it's a fundamental operation of logic. In Bmg we can build up predicates in a natural manner, and since they have an actual structure, Bmg can compile them into SQL.

It's worth noting that some ORMs struggle even at this basic level. For example, since their raison d'ĂȘtre is to construct objects from rows of data from the database, they might force you to create a new class of Data Transfer Objects just to be able to specify a select list (like we do in the project operation here). Likewise, since ORMs usually try to map every method call directly to a SQL clause or operator, building complex conditions (like the disjunction in our second restrict operation) can be awkward. The benefits of Bmg are not just various conveniences or an updated syntax, but rather all stem from having a completely unified, algebraic view of data.

Before moving on, let's see where we're at in terms of the generated SQL:

SELECT DISTINCT
  "t1"."project",
  "t1"."timestamp"
FROM
  "project_events" AS "t1"
WHERE
  "t1"."user" = 123
AND (
  ("eventType" != "destroy") OR ("cancelledAt" IS NOT NULL) 
)

Summarize it!

Now let's try something slighly more complicated. We want the tuples ordered by timestamp, remember? In Bmg, we do this with the summarize operation:

Bmg.sequel(:project_events, SEQUEL_DATABASE)
  .restrict(user: requester.id)
  .restrict(Predicate.neq(:eventType, 'destroy') | Predicate.neq(:cancelledAt, NULL))
  .project([:project, :timestamp])
  .summarize([:project], :timestamp => :max)

This can be understood as picking out the project attribute from the relation, and pairing it with a calulated value derived from the same relation. In this case, we want, for each individual project, to get the biggest timestamp, in other words the most recent time at which the user actively engaged with the project.

In SQL, this translates to an aggregation function together with a GROUP BY clause:

SELECT DISTINCT
  "t1"."project",
  max("t1"."timestamp") AS "timestamp"
FROM
  "project_events" AS "t1"
WHERE
  "t1"."user" = 123
AND (
  ("eventType" != "destroy") OR ("cancelledAt" IS NOT NULL) 
GROUP BY
  "t1"."project"

Again, summarize isn't just some clever shorthand for SELECT ... aggregation ... GROUP BY. It's an algebraic operation which maps one relation onto another relation. The whole point here is to think in relations, whereas ORM libraries make you spend your time thinking about how to piece together some code that will result in the SQL you wished you had.

Joining

Now, let's get those actual project attributes for our carefully selected tuples. For that, we first assign a variable to hold the relation we already built:

events = Bmg.sequel(:project_events, SEQUEL_DATABASE)
  .restrict(user: requester.id)
  ...

We then need another relation to join on:

projects = Bmg.sequel(:projects, SEQUEL_DATABASE)

And joining the relations is as easy as:

projects.join(events, :id => :project)

Remember, we're joining a plain relation (projects) with a relation we built by restricting, projecting and summarizing over a plain relation (project_events). How does that translate into SQL?

WITH "t3" AS (
  SELECT DISTINCT
    "t2"."project" AS "id",
    max("t2"."timestamp") AS "timestamp"
  FROM
    "project_events" AS "t2"
  WHERE
    "t2"."user" = 123
  AND (
    ("eventType" != "destroy") OR ("cancelledAt" IS NOT NULL) 
  GROUP BY
    "t2"."project"
)
SELECT
  "t1"."id",
  "t1"."name",
  "t1"."subdomain",
  "t1"."owner",
  -- Lots of attributes
  -- ...
  "t3"."timestamp"
FROM
  "projects" AS "t1"
INNER JOIN "t3" ON ("t1"."id" = "t3"."id")

Woah, what just happened? The entire events relation is now wrapped in a Common Table Expression (inside the parantheses following WITH "t3" AS) - a sort of temporary table or subquery which mimics the derived relation that we've constructed in Bmg. Normally, Bmg tries to merge relations into a single SQL expression, but in this case it determined that it's not possible since one of them contains a GROUP BY clause. The crucial principle, again, is that relations can always be modified, combined and recombined, yielding new relations. As the Bmg documentation states:

Bmg always allows chaining operators. If it does not, it's a bug.

Yes, Bmg can compile relations into SQL, but it's not some half-baked improvement on SQL that ends up being worse than it. Instead of forcing you to reverse-engineer the SQL you wished you had in the first place, Bmg allows us to think and code in terms of relations. It's simply the relational model as it was meant to be.

Conclusion

The astute reader (hi!) will have noticed that we've only attained half of our stated goal:

Get the projects of which the current user is a member, in the order of most recent interaction.

There is no ordering operation here, so we're still not returning projects in sorted order. How come? Again, relations have set semantics, and just like they can't contain duplicates, they aren't ordered. The tuples this code produces does contain the timestamp attribute needed for ordering them, but the actual sorting takes place outside of Bmg.

It would be wonderful if we could have some flavor of relational algebra built into relational databases. But as for now, Bmg is to my knowledge the only implementation of relational algebra used in production, and hardened over a number of years. If you are a Ruby user, you should give it a try! You'll never have to choose between clunky SQL and clunky SQL wrappers again.