COUNTing is hard: A tale of types in SQL and Django

2023-06-03

5 minutes

#django #programming

Using an ORM adds a handy layer between you and the database tables. Sure, you can write SQL, but most developers find it easier to think of a list of complex objects than they do a collection of rows. However, it's important to remember what's going on under the hood - an ORM just writes SQL for you after all.

This is a tale of a bug I lost a fair chunk of an afternoon to, and the resulting rabbit-hole I dove down to try and work out what on earth was going on.

I recently shipped Similar content for my blog. At the bottom of each post you'll now find a few posts related to this one, which might interest you. Whilst I'm using full-text search for the search page, what I want here isn't quite a search, so I had to implement something myself. My first thought was to simply find the posts with the most tags in common with this one. With Django, that's pretty straightforward:

Python

my_tags = this_page.tags.all()

BlogPostPage.objects.alias(
    tags_in_common=Count("tags", filter(tags__in=my_tags))
).filter(tags_in_common__gt=0)

<optimisation>

In an ideal world, I'd retrieve just the ids of this page's tags, and pass those in, to avoid calculating my own tags twice. But that's outside of the scope of this post.

</optimisation>

Or, in English: For each page, count the number of tags, including only those which are also on this_page, and only give me results if there is more than 1 tag in common. Simple, right?

If another post has 3 tags in common, tags_in_common will be 3. And that's all done in the database, so it's pretty fast and doesn't require a lot of memory (at least for the application). Let's say instead I wanted to know how similar the tags were for a page, and rather than the number of tags, I wanted to know the percentage in common. For example, if this page has 5 tags, and another post has 3, the similarity would be 0.6 (3 ÷ 5). Django makes our lives a lot easier, by allowing arithmetic operations directly on expressions, so writing this feels very natural:

Python

similar_pages = BlogPostPage.objects.annotate(
    tag_similarity=Count("tags", filter(tags__in=my_tags)) / len(my_tags)
).filter(tag_similarity__gt=0)

Here, we take the same query as before, but stick / 5 on the end of the alias, which Django helpfully adds into the SQL for us:

PostgreSQL

(COUNT("blog_blogpostpage_tags"."blogposttagpage_id") / 5)

But, if you run this, it doesn't work as you'd expect it:

Python

>>> similar_pages.count()
0

You wat?! We've gone from having a post with 3 out of 5 tags in common, to somehow having a similarity of 0? Is that what's happening?

Python

>>> BlogPostPage.objects.annotate(tag_similarity=Count("tags", filter(tags__in=my_tags)) / len(my_tags)).first().tag_similarity
0

Yes, yes it is. len(my_tags) is definitely 5, and tags_in_common was definitely 3. 3 ÷ 5 is definitely 0.6, so something else is afoot.

To understand why, we need to think about our good friend: types.

#Integers vs floats

To humans, there's only "numbers". 2 is a number, 83 is a number, 3.14 is a number. Computers are much more fussy. Even before we consider the size of the number (there are numbers above 2.1 billion, apparently), computers care about decimal points, and need to know when there is one. We know that 2 and 2.0 are the same, but a computer doesn't.

But that's enough about the rabbit-hole that is floating point. In short, the "type" of a number matters.

When we mix the 2, interesting things happen.

Python saves us quite a bit, by happily converting between ints and floats when it's necessary:

Python

>>> 3 / 5
0.6
>>> type(3), type(5)
(<class 'int'>, <class 'int'>)
>>> type(3 / 5), type(3.0 / 5)
(<class 'float'>, <class 'float'>)
>>> type(6 / 2)
<class 'float'>

Because division frequently doesn't result in a whole number, Python gives us a float. If we don't want that, we can just convert the value to an int or use Python's "floor divide" operator (//) which gives us back an int instead.
In a more strongly-typed language like Rust, we don't see this hand-holding. If you divide an integer (i* or u*), you'll get the same type back:

[src/main.rs:2] 3 / 5 = 0
[src/main.rs:3] 3.0 / 5.0 = 0.6

And mixing types is impossible because your program won't compile:

error[E0277]: cannot divide `{float}` by `{integer}`
 --> src/main.rs:2:14
  |
2 |     dbg!(3.0 / 5);
  |              ^ no implementation for `{float} / {integer}`
  |
  = help: the trait `Div<{integer}>` is not implemented for `{float}`
  = help: the following other types implement trait `Div<Rhs>`:
            <&'a f32 as Div<f32>>
            <&'a f64 as Div<f64>>
            <&'a i128 as Div<i128>>
            <&'a i16 as Div<i16>>
            <&'a i32 as Div<i32>>
            <&'a i64 as Div<i64>>
            <&'a i8 as Div<i8>>
            <&'a isize as Div<isize>>
          and 54 others

For more information about this error, try `rustc --explain E0277`.

If we look back at that first rust example, 3 / 5 = 0, is the same symptoms as our original issue - a division not returning the fractional value. Even though we're writing Python, that's not where the issue lies.

#`COUNT`

Now, back to counting columns. In SQL, counting rows is done with the COUNT function. I mostly use it for SELECT COUNT(*) FROM table_name;, but apparently it has other uses.

If we look at the aggregate function reference for PostgreSQL, COUNT returns a bigint. What happens if we try our maths using a bigint?

PostgreSQL console (psql)

postgres=# SELECT 3 / 5;
 ?column?
----------
        0
(1 row)

postgres=# SELECT 3::bigint / 5;
 ?column?
----------
        0
(1 row)

<note>

:: is a way of explicitly noting / casting the type of a value in PostgreSQL. In this case, we make sure 3 has the same type it would have if COUNT returned it.

</note>

And there in lies the problem. When COUNT returns the number of tags, it's of type bigint. When we divide, we get back a fractional value, which can't be stored in a bigint, so it gets truncated to 0. The fact PostgreSQL is written in C (which isn't dynamically-typed) makes no difference here - so is Python (CPython anyway) after all.

#The fix

So we know the problem, what's the solution?

In this case, we just need to make sure that the COUNT value is a decimal value before we attempt to divide it. Therefore the result of the division can keep its fractional value, and we get the expected value back.

The easiest way to do this is with CAST (which Django implements with Cast). We "cast" COUNT's bigint value to a double precision (which supports fractional values), and then divide it.

Python

from django.db.models import FloatField
from django.db.models.functions import Cast

similar_pages = BlogPostPage.objects.annotate(
    tag_similarity=Cast(Count("tags", filter(tags__in=my_tags)), output_field=FloatField()) / len(my_tags)
).filter(tag_similarity__gt=0)

<note>

Behind the scenes, Django's FloatField is represented by PostgreSQL's double precision.

</note>

It's a small difference (which is more verbose than I'd want), but at least it works as I'd expect it:

Python

>>> similar_pages.first().tag_similarity
0.6

If we peek at the SQL, we can see the difference:

PostgreSQL

(COUNT("blog_blogpostpage_tags"."blogposttagpage_id"))::double precision / 5)

The COUNT value is being cast to a double precision just before the divide, and so tag_similarity can store a fractional part and it behaves as we'd expect. Distilling this back down to simple SQL:

PostgreSQL console (psql)

postgres=# SELECT 3::double precision / 5;  
?column?    
----------  
     0.6  
(1 row)

#The less-nice fix

The "nice" fix is to cast the value of COUNT to a double precision before dividing, but I'd be remiss if I didn't talk about some other solutions, all of which are worse than this one, but still interesting to discuss nonetheless.

First off, you could cast the denominator rather than the numerator in the division. Because 1 of the values is a float, the result is able to be a float too:

PostgreSQL console (psql)

postgres=# select 3 / 5::double precision;  
?column?    
----------  
     0.6  
(1 row)

Or, you could explicitly say the denominator is a float, rather than casting:

PostgreSQL console (psql)

postgres=# SELECT 3 / 5.0;  
       ?column?           
------------------------  
0.60000000000000000000  
(1 row)

Sure, the value looks a bit weird, but 0.60000000000000000000 is exactly the same as 0.6 - PostgreSQL considers 5.0 a numeric:

PostgreSQL console (psql)

postgres=# SELECT pg_typeof(5.0);  
pg_typeof    
-----------  
numeric  
(1 row)

And finally, my least favorite of all, you can add in a noop operation which changes the type.

PostgreSQL console (psql)

postgres=# SELECT (3 + 0.0) / 5;  
       ?column?           
------------------------  
0.60000000000000000000  
(1 row)

This is a pretty janky solution, but it does at least work. If you do choose this path for whatever reason, be sure to add a comment!

Python

# NB: I care more about typing fewer characters than I do about nice, maintainable code.
BlogPostPage.objects.annotate(
    tag_similarity=(Count("tags", filter(tags__in=my_tags)) + 0.0) / len(my_tags)
)

#The result

With a now-working query, I can calculate how similar the tags are between this page and another page, and find the pages most-similar. Combining that with a few other metrics, such as how similar the titles are, I can find similar-looking posts to this one, and show some them under a posts. And all in just a single, efficient database query.

<apology>

I'm sure there are better ways of doing this, but I don't want to write a tiny search engine, and couldn't find something out there to easily plug-in to what I already had. If you have have suggestions, please tell me!

</apology>

I wonder what will be shown underneath this post?...

Share this page

What's new in Django 3.2 LTS

2021-04-06

15 minutes

#django #programming

It’s that time again, time for another Django LTS release. Since Django 2.2, back in 2019, a lot has changed in tech, in Python and of course in Django. Historically, I’ve worked entirely on LTS versions, hence combining these 3 releases together. Staying on the LTS version is a trade-off…

Cranes on the River Wear, Sunderland, UK

Django 2.2

2019-04-01

9 minutes

#django #programming

April marks the release of Django 2.2, the latest LTS version of the popular Python web framework. Django 2.2 marks almost two years of development since the last LTS release, 1.11 in April 2017, and brings with it some very large improvements and changes which naturally come with a major…

Django ORM Performance

2020-06-07

10 minutes

#django #programming #python

Django already does some pretty incredible things when it comes to performance, especially in the ORM layer. The lazy loading, fluent interface for querying means it’ll only fetch the data you need, when you need it. But it can’t handle everything for you, and often needs some help to work…

COUNTing is hard: A tale of types in SQL and Django

#Integers vs floats

#COUNT

#The fix

#The less-nice fix

#The result

Similar content

What's new in Django 3.2 LTS

Django 2.2

Django ORM Performance

#`COUNT`