Using an ORM adds a handy layer between you and the database tables. Sure, you can write SQL, but most developers find it easier to think of a list of complex objects than they do a collection of rows. However, it's important to remember what's going on under the hood - an ORM just writes SQL for you after all.
This is a tale of a bug I lost a fair chunk of an afternoon to, and the resulting rabbit-hole I dove down to try and work out what on earth was going on.
I recently shipped Similar content for my blog. At the bottom of each post you'll now find a few posts related to this one, which might interest you. Whilst I'm using full-text search for the search page, what I want here isn't quite a search, so I had to implement something myself. My first thought was to simply find the posts with the most tags in common with this one. With Django, that's pretty straightforward:
my_tags = this_page.tags.all()
BlogPostPage.objects.alias(
tags_in_common=Count("tags", filter(tags__in=my_tags))
).filter(tags_in_common__gt=0)
<optimisation>
In an ideal world, I'd retrieve just the ids of this page's tags, and pass those in, to avoid calculating my own tags twice. But that's outside of the scope of this post.
</optimisation>
Or, in English: For each page, count the number of tags, including only those which are also on this_page
, and only give me results if there is more than 1 tag in common. Simple, right?
If another post has 3 tags in common, tags_in_common
will be 3. And that's all done in the database, so it's pretty fast and doesn't require a lot of memory (at least for the application). Let's say instead I wanted to know how similar the tags were for a page, and rather than the number of tags, I wanted to know the percentage in common. For example, if this page has 5 tags, and another post has 3, the similarity would be 0.6 (3 ÷ 5). Django makes our lives a lot easier, by allowing arithmetic operations directly on expressions, so writing this feels very natural:
similar_pages = BlogPostPage.objects.annotate(
tag_similarity=Count("tags", filter(tags__in=my_tags)) / len(my_tags)
).filter(tag_similarity__gt=0)
Here, we take the same query as before, but stick / 5
on the end of the alias, which Django helpfully adds into the SQL for us:
(COUNT("blog_blogpostpage_tags"."blogposttagpage_id") / 5)
But, if you run this, it doesn't work as you'd expect it:
>>> similar_pages.count()
0
You wat?! We've gone from having a post with 3 out of 5 tags in common, to somehow having a similarity of 0? Is that what's happening?
>>> BlogPostPage.objects.annotate(tag_similarity=Count("tags", filter(tags__in=my_tags)) / len(my_tags)).first().tag_similarity
0
Yes, yes it is. len(my_tags)
is definitely 5, and tags_in_common
was definitely 3. 3 ÷ 5 is definitely 0.6, so something else is afoot.
To understand why, we need to think about our good friend: types.
#Integers vs floats
To humans, there's only "numbers". 2 is a number, 83 is a number, 3.14 is a number. Computers are much more fussy. Even before we consider the size of the number (there are numbers above 2.1 billion, apparently), computers care about decimal points, and need to know when there is one. We know that 2 and 2.0 are the same, but a computer doesn't.
But that's enough about the rabbit-hole that is floating point. In short, the "type" of a number matters.
When we mix the 2, interesting things happen.
Python saves us quite a bit, by happily converting between int
s and float
s when it's necessary:
>>> 3 / 5
0.6
>>> type(3), type(5)
(<class 'int'>, <class 'int'>)
>>> type(3 / 5), type(3.0 / 5)
(<class 'float'>, <class 'float'>)
>>> type(6 / 2)
<class 'float'>
Because division frequently doesn't result in a whole number, Python gives us a float
. If we don't want that, we can just convert the value to an int
or use Python's "floor divide" operator (//
) which gives us back an int
instead.
In a more strongly-typed language like Rust, we don't see this hand-holding. If you divide an integer (i*
or u*
), you'll get the same type back:
[src/main.rs:2] 3 / 5 = 0
[src/main.rs:3] 3.0 / 5.0 = 0.6
And mixing types is impossible because your program won't compile:
error[E0277]: cannot divide `{float}` by `{integer}`
--> src/main.rs:2:14
|
2 | dbg!(3.0 / 5);
| ^ no implementation for `{float} / {integer}`
|
= help: the trait `Div<{integer}>` is not implemented for `{float}`
= help: the following other types implement trait `Div<Rhs>`:
<&'a f32 as Div<f32>>
<&'a f64 as Div<f64>>
<&'a i128 as Div<i128>>
<&'a i16 as Div<i16>>
<&'a i32 as Div<i32>>
<&'a i64 as Div<i64>>
<&'a i8 as Div<i8>>
<&'a isize as Div<isize>>
and 54 others
For more information about this error, try `rustc --explain E0277`.
If we look back at that first rust example, 3 / 5 = 0
, is the same symptoms as our original issue - a division not returning the fractional value. Even though we're writing Python, that's not where the issue lies.
#COUNT
Now, back to counting columns. In SQL, counting rows is done with the COUNT
function. I mostly use it for SELECT COUNT(*) FROM table_name;
, but apparently it has other uses.
If we look at the aggregate function reference for PostgreSQL, COUNT
returns a bigint
. What happens if we try our maths using a bigint
?
postgres=# SELECT 3 / 5;
?column?
----------
0
(1 row)
postgres=# SELECT 3::bigint / 5;
?column?
----------
0
(1 row)
<note>
::
is a way of explicitly noting / casting the type of a value in PostgreSQL. In this case, we make sure 3
has the same type it would have if COUNT
returned it.
</note>
And there in lies the problem. When COUNT
returns the number of tags, it's of type bigint
. When we divide, we get back a fractional value, which can't be stored in a bigint
, so it gets truncated to 0. The fact PostgreSQL is written in C (which isn't dynamically-typed) makes no difference here - so is Python (CPython anyway) after all.
#The fix
So we know the problem, what's the solution?
In this case, we just need to make sure that the COUNT
value is a decimal value before we attempt to divide it. Therefore the result of the division can keep its fractional value, and we get the expected value back.
The easiest way to do this is with CAST
(which Django implements with Cast
). We "cast" COUNT
's bigint
value to a double precision
(which supports fractional values), and then divide it.
from django.db.models import FloatField
from django.db.models.functions import Cast
similar_pages = BlogPostPage.objects.annotate(
tag_similarity=Cast(Count("tags", filter(tags__in=my_tags)), output_field=FloatField()) / len(my_tags)
).filter(tag_similarity__gt=0)
<note>
Behind the scenes, Django's FloatField
is represented by PostgreSQL's double precision
.
</note>
It's a small difference (which is more verbose than I'd want), but at least it works as I'd expect it:
>>> similar_pages.first().tag_similarity
0.6
If we peek at the SQL, we can see the difference:
(COUNT("blog_blogpostpage_tags"."blogposttagpage_id"))::double precision / 5)
The COUNT
value is being cast to a double precision
just before the divide, and so tag_similarity
can store a fractional part and it behaves as we'd expect. Distilling this back down to simple SQL:
postgres=# SELECT 3::double precision / 5;
?column?
----------
0.6
(1 row)
#The less-nice fix
The "nice" fix is to cast the value of COUNT
to a double precision
before dividing, but I'd be remiss if I didn't talk about some other solutions, all of which are worse than this one, but still interesting to discuss nonetheless.
First off, you could cast the denominator rather than the numerator in the division. Because 1 of the values is a float, the result is able to be a float too:
postgres=# select 3 / 5::double precision;
?column?
----------
0.6
(1 row)
Or, you could explicitly say the denominator is a float, rather than casting:
postgres=# SELECT 3 / 5.0;
?column?
------------------------
0.60000000000000000000
(1 row)
Sure, the value looks a bit weird, but 0.60000000000000000000
is exactly the same as 0.6
- PostgreSQL considers 5.0
a numeric
:
postgres=# SELECT pg_typeof(5.0);
pg_typeof
-----------
numeric
(1 row)
And finally, my least favorite of all, you can add in a noop operation which changes the type.
postgres=# SELECT (3 + 0.0) / 5;
?column?
------------------------
0.60000000000000000000
(1 row)
This is a pretty janky solution, but it does at least work. If you do choose this path for whatever reason, be sure to add a comment!
# NB: I care more about typing fewer characters than I do about nice, maintainable code.
BlogPostPage.objects.annotate(
tag_similarity=(Count("tags", filter(tags__in=my_tags)) + 0.0) / len(my_tags)
)
#The result
With a now-working query, I can calculate how similar the tags are between this page and another page, and find the pages most-similar. Combining that with a few other metrics, such as how similar the titles are, I can find similar-looking posts to this one, and show some them under a posts. And all in just a single, efficient database query.
<apology>
I'm sure there are better ways of doing this, but I don't want to write a tiny search engine, and couldn't find something out there to easily plug-in to what I already had. If you have have suggestions, please tell me!
</apology>
I wonder what will be shown underneath this post?...