Today I listened to episode 4283 of the Daily Tech News Show podcast, in which they discussed Reddit’s fight with Microsoft and Perplexity AI over the crawling of content on the Reddit platform. I think Justin Robert Young summarized the whole story succinctly: Microsoft has money, and Reddit wants that money. I also agree with his statement that Reddit, as a business, should sue to obtain whatever money it can while it can. I don’t, however, agree with the statement that this is “their data.”
In full disclosure, I was once an avid Reddit user and an active contributor to several communities, but I stopped using it after the great Reddit blackout. Truthfully, I don’t miss it, although I do miss much of the knowledge stored within that platform. When I signed up for Reddit, I had to agree to their User Agreement. Now, I’m certain it has changed in the years since I agreed to it, but based on the current agreement, I asserted that I owned the content I was submitting and granted the site a non-exclusive license to that content. Naively, I assumed, as Microsoft has argued, that this non-exclusivity meant the content would be freely viewable, usable, and crawlable, since that has been the norm for as long as the World Wide Web (remember that term??) has existed. You know what I didn’t take into account? I own the content, but they own the servers, and they have no obligation to serve the content if they don’t want to.
Let’s face it: every time content is viewed, crawled, or scraped, it costs the company serving that content some small amount. We’re probably talking fractions of a penny each time something is served, but at Reddit’s scale, that can turn into real money. So, of course, they want not just to recoup that cost but to profit from it. I get it. As the owner of that content (by Reddit’s own admission in their User Agreement), I don’t like it, but I get it.
So what am I to do about this? Even though I no longer use Reddit, if there were an option I could set to opt in to my content being visible to all web crawlers for use in any search engine, AI, research initiative, or whatever else, I would log in to enable it. Of course, no such option exists. X recently took the opposite approach on their website, automatically opting users into having their content used to train an AI and giving them the option to opt out. Why the difference? Well, it’s X’s own AI in question, so they’re the ones deriving the benefit. Is it hypocritical of me that I kinda want to dust off my Twitter credentials to log in (for the first time in probably 10 years) and opt out of this while wishing Reddit would let me opt in to being crawled? Maybe it is, but it’s my content, so I have a right to that hypocrisy.
So what’s a user to do? The license I granted is non-exclusive, so, since I still own the content, I could re-post it on my own site. This has a number of problems. First, search engines would see the duplication and bury my copy as unoriginal, so all traffic would still go back to the social networks. Or maybe not: if the social network is using robots.txt to block a particular search engine from crawling that content, then maybe my copy would actually rise to the top. The bigger problem is that much of the content only really makes sense in connection with the rest of the content around it. A comment makes no sense without its context, which I do not have ownership over or license to. Posts lose much of their usefulness without the comments, which I also do not have ownership over or license to. In the end, there’s no benefit to reclaiming and reposting my content. I suppose I could go in and explicitly delete my content, but the license I granted was not just non-exclusive but also perpetual and irrevocable. So, even if I went scorched earth on my own creations, the platforms are well within their rights to restore them.
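For anyone unfamiliar with the mechanism I just mentioned: robots.txt is nothing more than a plain text file of advisory directives served at the root of a site. A minimal, purely hypothetical sketch of a file that welcomes one crawler while turning away another might look like this (the user-agent names are illustrative; I’m not claiming this reflects what Reddit or anyone else actually publishes):

```
# Hypothetical robots.txt for a site that picks and chooses its crawlers.
# A crawler follows the most specific group that matches its user-agent.

User-agent: Googlebot
Disallow:            # empty Disallow = nothing is off-limits for this crawler

User-agent: Bingbot
Disallow: /          # this crawler is asked to stay out entirely

User-agent: *
Disallow: /          # everyone else is asked to stay out too
```

The catch, of course, is that robots.txt is a request, not an enforcement mechanism; a crawler can simply ignore it, which is part of why these disputes end up being settled with licensing deals and lawsuits rather than config files.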
The fact that I’m posting this commentary here, on my own site, points to one thing: I’ve learned my lesson. Yeah, there’s a chance this content will never be discovered by a web crawler, never be used to train an AI, and might never even be read by another human being, but at least I know I retain control over it. Read, crawl, scrape, and use it for training if you wish…or don’t. Either way, you have my permission.