GUIDs - How I messed up my RSS feed

2023-09-25

4 minutes

#programming #website

If you're reading this post through an RSS aggregator, or were directed here from one, you may have already seen the issue I'm about to describe, and already cursed my name. If you didn't, I'd recommend subscribing, for completely unbiased reasons.

This is a tale of RSS, and an easily forgotten about but quite handy feature (when you remember it exists...).

An RSS feed is simply a list of pages, ordered by date, with a bit of metadata attached (title, content, etc.). With a simple RSS feed, it's possible to expose a feed of pages, events, resources, all in ways which are easily machine-consumable to be used somewhere else. Reading blog posts, subscribing to YouTube channels, viewing commits on a GitHub repository, all possible through the wonders of RSS. Just plug the feed's URL into your aggregator of choice, and you're off!

rss.xml

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>All Pages Feed :: TheOrangeOne</title>
        <link>https://theorangeone.net/</link>
        <description />
        <language>en-gb</language>
        <lastBuildDate>Tue, 05 Sep 2023 07:57:58 +0000</lastBuildDate>
        <item>
            <title>Efficiently doing nothing in Linux</title>
            <link>https://theorangeone.net/posts/efficient-sleeping/</link>
            <description>Computers exist for doing work, usually useful, often not. In rare instances, it's useful to make a program do nothing at all. My primary use case for this is Docker containers, where it's useful to have the container do nothing, so it can be exec-d into as part of another…</description>
            <pubDate>Tue, 15 Aug 2023 20:03:34 +0000</pubDate>
        </item>
        <!-- And a few more items... -->
    </channel>
</rss>

On an interval, your aggregator will request the feed from wherever the content lies (for example, https://theorangeone.net/feed/), and note down all the pages it sees. If there's a new page, it keeps track of it ready for consumption. If it already knows about a page, it ignores it. This means a feed sync is completely idempotent.

ish
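As a minimal sketch of that sync step when things go well, keyed on each item's link (the `sync` function and the `seen` set are made-up names for illustration, not any real aggregator's internals):

```python
def sync(seen: set, feed_links: list) -> list:
    """Return the links that weren't known before, and remember them."""
    new_links = []
    for link in feed_links:
        if link not in seen:
            seen.add(link)
            new_links.append(link)
    return new_links


# Hypothetical aggregator state: nothing seen yet.
seen = set()
feed = ["https://theorangeone.net/posts/efficient-sleeping/"]

first = sync(seen, feed)   # the post is new, so it's reported
second = sync(seen, feed)  # same feed again: nothing new to report
```

Running the same feed through twice only surfaces the post once, which is the idempotent part.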

For a web browser, and much of the rest of the internet, the URL is the unique identifier. You can get back to the same page if you enter the same URL (for simplicity, let's ignore authentication, cookies, and single-page apps). For example, if you want to find out about room-temperature superconductors, you can visit the page on Wikipedia: https://en.wikipedia.org/wiki/Room-temperature_superconductor. However, the field of superconductivity is changing rapidly. In 6 months' time, what we know may have changed completely. But regardless, you can visit the same URL and find out about room-temperature superconductors, and about what Wikipedia says about them - the URL uniquely identifies that. In the same way, I have a few blog posts, all reachable at URLs, and have been for a long time. Wikipedia has existed for a while, and will for a while to come. My website, hopefully, will too.

RSS aggregators work in a similar way. Page content changes, but URLs rarely do. Aggregators will use the URL as a unique identifier. But if, in a year's time, I want to shuffle around some URLs, how can I do that? For browsers, it's easy: I issue a redirect to the new location. RSS aggregators don't follow redirects when matching items; they rely entirely on the feed itself.

And herein lies the cause of all this, and whilst it's not DNS, it's another three-letter acronym: SEO. As part of helping better understand how much traffic I get from RSS feeds, I added a utm_medium parameter to the URLs in my feed, so when a user clicked the link, it'd be seen as coming through RSS by Plausible (which natively supported utm_* parameters). It was a single commit - a simple commit. I tested it, it worked great, and I left it. The next morning, I opened tt-rss, and spotted just over 100 new posts, in my own feed! In that moment, I knew exactly what the issue was - all the URLs had changed to include ?utm_medium=rss, which my aggregator saw as completely different and added all the new posts to my reading list (and probably many other people's - sorry!).

rss.xml

<link>https://theorangeone.net/posts/efficient-sleeping/?utm_medium=rss</link>
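A quick sketch of why that's a problem, assuming an aggregator that compares links as plain strings (the `seen` set here is hypothetical, standing in for whatever the aggregator has stored):

```python
from urllib.parse import urlsplit

old = "https://theorangeone.net/posts/efficient-sleeping/"
new = "https://theorangeone.net/posts/efficient-sleeping/?utm_medium=rss"

# What the aggregator already knows about, keyed on the link.
seen = {old}

# As strings, the two links are different, so the post looks brand new.
looks_new = new not in seen

# Ignoring the query string, both links point at the same page.
same_page = urlsplit(old)._replace(query="") == urlsplit(new)._replace(query="")
```

To a human it's obviously the same page; to a string comparison, it's a new one.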

All I need is a way to uniquely reference RSS items in a way that allows me to change the URL as much as I want (or need). How hard can it be?

Enter, the guid field. The guid is an optional piece of metadata which can be attached to an RSS item to uniquely identify it. Many aggregators will fall back to the URL if the guid doesn't exist, and many feed generators will use the URL if they don't implement a guid, but it doesn't have to be a URL. The RSS specification defines no format for what the GUID should be, and leaves it down to the source to ensure it's a unique value - aggregators should consider it an opaque string. Because it's a separate field, I can keep the guid as the plain URL, maintaining backwards compatibility, and unlock the freedom to set the link to whatever I like:

rss.xml

<item>
    <title>Efficiently doing nothing in Linux</title>
    <link>https://theorangeone.net/posts/efficient-sleeping/?utm_medium=rss</link>
    <description>Computers exist for doing work, usually useful, often not. In rare instances, it's useful to make a program do nothing at all. My primary use case for this is Docker containers, where it's useful to have the container do nothing, so it can be exec-d into as part of another…</description>
    <pubDate>Tue, 15 Aug 2023 20:03:34 +0000</pubDate>
    <guid>https://theorangeone.net/posts/efficient-sleeping/</guid> <!-- And here's the fix -->
</item>
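On the aggregator side, the lookup might look something like the sketch below. It assumes an aggregator that prefers the guid and falls back to the link when there isn't one; `item_identity` is a hypothetical helper, not part of any real aggregator or library.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def item_identity(item: ET.Element) -> Optional[str]:
    """Pick the value used to recognise an item: guid first, then link."""
    guid = item.findtext("guid")
    if guid and guid.strip():
        return guid.strip()
    link = item.findtext("link")
    return link.strip() if link and link.strip() else None


# The same post before and after the link gained a tracking parameter.
BEFORE = """<item>
    <link>https://theorangeone.net/posts/efficient-sleeping/</link>
    <guid>https://theorangeone.net/posts/efficient-sleeping/</guid>
</item>"""

AFTER = """<item>
    <link>https://theorangeone.net/posts/efficient-sleeping/?utm_medium=rss</link>
    <guid>https://theorangeone.net/posts/efficient-sleeping/</guid>
</item>"""

# An item with no guid at all, so the link is all there is to go on.
NO_GUID = """<item>
    <link>https://theorangeone.net/posts/efficient-sleeping/?utm_medium=rss</link>
</item>"""

before_id = item_identity(ET.fromstring(BEFORE))
after_id = item_identity(ET.fromstring(AFTER))
fallback_id = item_identity(ET.fromstring(NO_GUID))
```

With a guid in place, `before_id` and `after_id` are the same value even though the link changed. Without one, the identity is whatever the link happens to be.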

<implementation>

In Django's case, this is done by defining an item_guid method.

</implementation>

Now, the downside. Because I only figured this out after I made the mistake (as most teachable lessons go), most aggregators out there subscribed to my feed have already consumed and stored the erroneous URLs. RSS aggregators serve 2 purposes. Not only do they aggregate feeds together, but they also store the content themselves, even if it disappears upstream. This means there's nothing I can do to clean up people's feeds - they're stuck like that (again, sorry).

Had I set a guid from the start, none of this would have been an issue. But then, I wouldn't have learned a valuable lesson about RSS, reading RFCs, and the value in looking ahead.

The internet never forgets...


