Avatar @zvava

ok so i have found a genuine twt hash collision. what do i do.

internally, bbycll relies on a post lookup table with post hashes as keys, this is really fast but i knew i'd inevitably run into this issue (just not so soon) so now i have to either:
  1) pick the newer post over the other
  2) break from specification and not lowercase hashes
  3) secretly associate canonical urls or additional entropy with post hashes in the backend without a sizeable performance impact somehow

≈2 months ago | #6ishh6q |
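
A minimal sketch of option 3 above, assuming an in-memory lookup table: the short hash stays the key, but each entry is a small bucket that also records the canonical feed URL and timestamp, so two posts with the same hash no longer overwrite each other. The names `PostIndex`, `add`, and `find` are made up for illustration and are not bbycll's actual code.

```javascript
// Hypothetical post index: short hash -> list of posts sharing that hash.
class PostIndex {
  constructor() {
    this.buckets = new Map();
  }

  // Store a post; a colliding hash extends the bucket instead of overwriting.
  add(post) {
    const key = post.hash.toLowerCase(); // hashes are compared in lowercase
    const bucket = this.buckets.get(key) ?? [];
    const exists = bucket.some(
      (p) => p.url === post.url && p.timestamp === post.timestamp
    );
    if (!exists) bucket.push(post);
    this.buckets.set(key, bucket);
  }

  // Look up by hash; the optional feed URL disambiguates a collision.
  find(hash, url) {
    const bucket = this.buckets.get(hash.toLowerCase()) ?? [];
    if (url === undefined) return bucket;
    return bucket.filter((p) => p.url === url);
  }
}

const index = new PostIndex();
index.add({ hash: 'abc1234', url: 'https://a.example/twtxt.txt', timestamp: '2025-01-01T00:00:00Z', content: 'first' });
index.add({ hash: 'abc1234', url: 'https://b.example/twtxt.txt', timestamp: '2025-02-02T00:00:00Z', content: 'second' });
```

The common case (one post per hash) stays a single map lookup; only a collision pays for the short scan of the bucket.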
Replies to #6ishh6q by @zvava
Avatar @prologic | #6ishh6q

@zvava we have to amend the spec and increase the hash length. We just haven't done so yet 😆

≈2 months ago | #vnph4ma |
Avatar @zvava | #6ishh6q

@prologic i just added timeline refresh to bbycll and it is so convincing i almost replied to you from there hehe, can i get a link pretty please :o

≈2 months ago | #bxfzeaa |
Replies to #tm3naga by @lyse
Replies to #zqxcq3a by @prologic
Avatar @zvava | #zqxcq3a

@prologic im unsure how i feel about the hash v2 proposal, given it is completely backward incompatible with hash v1 it doesn't really solve any of the problems with it. it only delays collisions, and still fragments threads on post edits

i skimmed through discussions under the other proposals. i agree humans are very bad at keeping the integrity of the web intact, but hashes done in this way make it impossible even for systems to rebuild threads if any post edits have occurred prior to their deployment

≈2 months ago | #nzs23fa |
Replies to #nzs23fa by @zvava
Avatar @lyse | #nzs23fa

@zvava It is just completely impossible to make v2 backwards-compatible with v1.

Well, breaking threads on edits is considered a feature by some people. I reckon the only approach to reasonably deal with that property is to carefully review messages before publishing them, thus delaying feed updates. Any typos etc., that have been discovered afterwards, are just left alone. That's what I and some others do. I only risk editing if the feed has been published very few seconds earlier. More than 20 seconds and I just ignore it. Works alright for the most part.

≈2 months ago | #axrtzga |
Avatar @zvava | #nzs23fa

@lyse i dont mind if the hash is not backward compatible but im not sure if this is the right way to proceed because the added complexity dealing with two hash versions isnt justified

regular end users wont care to understand how twt hashes are formed, they just want to use twtxt! so i guess i could work on protecting users from themselves by disallowing post edits on old posts or posts with replies, but i'm not fond of this either really. if they want to break a thread, they can just delete the post (though i've noticed yarn handling post deletes dubiously...)

on activitypub i do genuinely find myself looking through several month or even year old posts sometimes and deciding to edit/reword them a little to be slightly less confusing, this should be trivial to handle on twtxt which is an infinitely simpler specification

≈2 months ago | #dvw775q |
Replies to #dvw775q by @zvava
Avatar @lyse | #dvw775q

@zvava There would be only one hash for a message. Some to be defined magic date selects which hash to use. If the message creation timestamp is before this epoch, hash it with v1, otherwise hammer it through v2. Eventually, support for v1 could be dropped as nobody interacts with the old stuff anymore. But I'd keep it around in my client, because why not.

If users choose a client which supports the extensions, they don't have to mess around with v1 and v2 hashing, just like today.

As for the school of thought, personally, I'd prefer something else, too. I'm in camp location-based addressing, or whatever it is called. The more I think about it, a complete redesign of twtxt and its extensions would be necessary in my opinion. Retrofitting has its limits. Of course, this is much more work, though.

≈2 months ago | #tu6eela |
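
The cut-over @lyse describes could be sketched like this. The epoch value and the two hash functions are placeholders: no switch date has been defined in this thread, and `hashV1`/`hashV2` stand in for whatever the real algorithms turn out to be.

```javascript
// Placeholder epoch: the thread does not define a real switch date.
const HASH_V2_EPOCH = new Date('2026-01-01T00:00:00Z');

// Pick the hash version from the message creation timestamp alone.
function hashVersionFor(timestamp, epoch = HASH_V2_EPOCH) {
  const created = new Date(timestamp);
  if (Number.isNaN(created.getTime())) {
    throw new Error(`invalid timestamp: ${timestamp}`);
  }
  return created < epoch ? 'v1' : 'v2';
}

// hashV1 and hashV2 are supplied by the caller (hypothetical implementations).
function hashPost(post, { hashV1, hashV2 }, epoch = HASH_V2_EPOCH) {
  return hashVersionFor(post.timestamp, epoch) === 'v1'
    ? hashV1(post)
    : hashV2(post);
}
```

Because the version depends only on the post's own timestamp, every client that agrees on the epoch derives the same single hash for a given message.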
Avatar @alexonit | #dvw775q

@zvava @lyse I also think a location based reference might be better.

A thread is a single post of a single feed as a root, but the hash has the drawback of not referencing the source, in a distributed network like twtxt it might leave some people out of the whole conversation.

I suggest a simpler format, something like: (#<TIMESTAMP URL>)

This solves three issues:

  • Easier referencing: no need to generate a hash, just copy the timestamp and url, it's also simpler to implement in a client without the risk of collisions when putting things together
  • Fetchable source: you can find the source within the reference and construct the thread from there
  • Allow editing: a timestamp + url reference stays valid after an edit, whereas if a post is modified the hash becomes invalid since it depends on [ timestamp, url, content ]
≈2 months ago | #altkl2a |
Replies to #altkl2a by @alexonit
Avatar @lyse | #altkl2a

@alexonit Personally, I find the reversed order of URL first and then timestamp more natural to reference something. Granted, URL last would be kinda consistent with the mention format. However, the timestamp doesn't act as a link text or display text like in a mention, so, it's somewhat different in my opinion. But yeah.

≈2 months ago | #ro3oydq |
Avatar @alexonit | #altkl2a

@lyse Yeah, the format is just an idea of how it could work.

The order of SOURCE > POST does make more sense indeed.

≈2 months ago | #sqisw6a |
Avatar @zvava | #altkl2a

@alexonit @lyse i really don't understand why this was not the solution in the first place, given how simple twtxt is (meant to be), a reply should be as simple as #<https://example.com/twtxt.txt#2025-09-22T06:45Z> with the timestamp in an anchor link. the need for a mention is avoided like this as well since it's already linking to the replied-to feed!

🐀💭 i should just implement it into bbycll and force it into existence

≈2 months ago | #hdnacjq |
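
A rough sketch of how a client might build and parse the anchor-style reply reference from the post above. The `#<...>` wrapper and the fragment-as-timestamp layout are taken from that example; the function names are invented, and none of this is part of any published extension.

```javascript
// Build a reply reference of the form #<FEED_URL#TIMESTAMP>.
function buildReplyRef(feedUrl, timestamp) {
  return `#<${feedUrl}#${timestamp}>`;
}

// Parse a reply reference at the start of a post; returns null if absent.
function parseReplyRef(text) {
  const match = /^#<([^\s>#]+)#([^\s>]+)>/.exec(text);
  if (!match) return null;
  return {
    feedUrl: match[1],       // the replied-to feed, usable for fetching
    timestamp: match[2],     // kept as the raw string from the feed
    rest: text.slice(match[0].length).trim(),
  };
}
```

The parsed `feedUrl` is enough to fetch the replied-to feed even when the reader does not follow it, which is the "no mention needed" point made above.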
Avatar @prologic | #altkl2a

We've been discussing the idea of changing the threading model from Content-based Addressing to Location-based addressing for years now. The problem is quite complex, but I feel I have to keep reminding y'all of the potential perils of changing this and the pros/cons of each model:

With content-addressed threading, a reply points at something that’s intrinsically identified (hash of author/feed URI + timestamp + content). That ID never changes as long as the content doesn’t. Switching to location-based anchors makes the reply target extrinsic—it now depends on where the post currently lives. In a pull-based, decentralised network, locations drift. The moment they do, thread identity fragments.

≈2 months ago | #r7g45uq |
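
To make the trade-off concrete, here is a small sketch of a content-derived ID. It uses SHA-256 from Node's `crypto` module and a 7-character hex prefix purely as a stand-in; this is not the Twt Hash algorithm from the spec, and `contentId` is a made-up name.

```javascript
const { createHash } = require('node:crypto');

// Stand-in for a content-derived ID (NOT the real twt hash algorithm):
// digest of feed URL + timestamp + content, truncated to a short prefix.
function contentId(url, timestamp, content, length = 7) {
  const digest = createHash('sha256')
    .update(`${url}\n${timestamp}\n${content}`)
    .digest('hex');
  return digest.slice(0, length);
}

const url = 'https://example.com/twtxt.txt';
const ts = '2025-09-22T06:45:00Z';

const original = contentId(url, ts, 'hello wrold');
const edited = contentId(url, ts, 'hello world'); // typo fixed
const moved = contentId('https://example.org/twtxt.txt', ts, 'hello wrold'); // feed URL changed
```

The ID is stable for identical input, but fixing the typo or moving the feed changes the digest, which is why edits and URL changes orphan replies under this model. Truncating to a short prefix is also what makes collisions likely as the number of posts grows.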
Avatar @prologic | #altkl2a

Here is just a small list of things™ that I'm aware will break, some quite badly, others in minor ways:

  1. Link rot & migrations: domain changes, path reshuffles, CDN/mirror use, or moving from txt → jsonfeed will orphan replies unless every reader implements perfect 301/410 history, which they won’t.
  2. Duplication & forks: mirrors/relays produce multiple valid locations for the same post; readers see several “parents” and split the thread.
  3. Verification & spam-resistance: content addressing lets you dedupe and verify you’re pointing at exactly the post you meant (hash matches bytes). Location anchors can be replayed or spoofed more easily unless you add signing and canonicalization.
  4. Offline/cached reading: without the original URL being reachable, readers can’t resolve anchors; with hashes they can match against local caches/archives.
  5. Ecosystem churn: all existing clients, archives, and tools that assume content-derived IDs need migrations, mapping layers, and fallback logic. Expect long-lived threads to fracture across implementations.
≈2 months ago | #3h7w7ca |
Replies to #3h7w7ca by @prologic
Avatar @lyse | #3h7w7ca

@prologic I know we won't ever convince each other of the other's favorite addressing scheme. :-D But I wanna address (haha) your concerns:

  1. I don't see any difference between the two schemes regarding link rot and migration. If the URL changes, both approaches are equally terrible as the feed URL is part of the hashed value and reference of some sort in the location-based scheme. It doesn't matter.

  2. The same is true for duplication and forks. Even today, the "canonical URL" has to be chosen to build the hash. That's exactly the same with location-based addressing. Why would a mirror only duplicate stuff with location- but not content-based addressing? I really fail to see that. Also, who is using mirrors or relays anyway? I don't know of any such software to be honest.

  3. If there is a spam feed, I just unfollow it. Done. Not a concern for me at all. Not the slightest bit. And the byte verification is THE source of all broken threads when the conversation start is edited. Yes, this can be viewed as a feature, but how many times was it actually a feature and not more behaving as an anti-feature in terms of user experience?

  4. I don't get your argument. If the feed in question is offline, one can simply look in local caches and see if there is a message at that particular time, just like looking up a hash. Where's the difference? Except that the lookup key is longer or compound or whatever depending on the cache format.

  5. Even a new hashing algorithm requires work on clients etc. It's not that you get some backwards-compatibility for free. It just cannot be backwards-compatible in my opinion, no matter which approach we take. That's why I believe some magic time for the switch causes the least amount of trouble. You leave the old world untouched and working.

If these are general concerns, I'm completely with you. But I don't think that they only apply to location-based addressing. That's how I interpreted your message. I could be wrong. Happy to read your explanations. :-)

≈2 months ago | #6udv2ja |
Avatar @prologic | #3h7w7ca

@lyse I don't think there's any point in continuing the discussion of Location vs. Content based addressing.

I want us to preserve Content based addressing.

Let's improve the user experience and fix the hash collision problems.

≈1 month ago | #o52fuua |
Avatar @alexonit | #3h7w7ca

@lyse @prologic Can't we find a middle ground and support both?

The thread is defined by two parts:

  1. The hash
  2. The subject

The client/pod generates the hash and indexes it in its database/cache, then it simply queries the subject of other posts to find the related posts, right?

In my own client's current implementation (using hashes), the only calculation is the hash generation; the rest is a verbatim copy of the subject (minus the # character). If this is the commonly implemented approach, then adding the location-based one is fairly simple.

// createHash, addToIndex and getThreadBySubject are the client's own helpers (not shown here)
function setPostIndex(post) {
    // Current hash approach
    const hash = createHash(post.url, post.timestamp, post.content);

    // New location approach
    const location = post.url + '#' + post.timestamp;

    // Unchanged (probably): the subject this post replies to, if any
    const subject = post.subject;

    // Index the post under its own hash and location, and under its subject
    addToIndex(hash, post);
    addToIndex(location, post);
    if (subject) addToIndex(subject, post); // root posts have no subject
}

// Both should work if the index contains both versions
getThreadBySubject('#abcdef'); // => [post1, post2, post3] (hash)
getThreadBySubject('https://example.com#2025-01-01T12:00:00'); // => [post1, post2, post3] (location)

As I said before, the mention is already location based @<example https://example.com/twtxt.txt>, so I think we should keep that in consideration.

Of course this will lead to a bit of fragmentation (without merging the two) but I think this can make everyone happy.

Otherwise, the only other solution I can think of is a different approach where the value doesn't matter, allowing anything to be used as a reference (hash, location, git commit) for greater flexibility and freedom of implementation (this probably needs a fixed "header" for each post, but it can be seen as a separate extension).

≈1 month ago | #7ds2bwq |
Avatar @prologic | #3h7w7ca

@alexonit Yhays kind of love you!! Stance and position on this. If we are going to make chicken changes in the threading model, let's keep content based addressing, but also improve the use of experience. So in fact, in order to answer your question, I think yes, we can do some kind of combination of both.

≈1 month ago | #6xnf3ja |
Avatar @alexonit | #3h7w7ca

@prologic That is really great to hear!

If there are opposing opinions we either build a bridge or provide a new parallel road.

Also, I wouldn't call my opinion a "stance", I just wish for a better twtxt thanks to everyone's effort.

The last thing we need to do is decide a proper format for the location-based version.

My proposal is to keep the "Subject extension" unchanged and include the reference to the mention like this:

// Current hash format: starts with a '#'
(#hash) here's text
(#hash) @<nick url> here's text

// New location format: valid URL-like + '#' + TIMESTAMP (verbatim format of feed source)
(url#timestamp) here's text
(url#timestamp) @<nick url> here's text

I think the timestamp should be referenced verbatim to prevent broken references with multiple variations (especially with the many timezones out there) which would also make it even easier to implement for everyone.

I'm sure we can get @zvava, @lyse and everyone else to help on this one.

I personally think we should also consider allowing a generic format to build on custom references, this would allow for creating threads using any custom source (manual, computed or external generated), maybe using a new "Topic extension", here's some examples.

// New format for custom references: starts with a '!' maybe?
(!custom) here's text
(!custom) @<nick url> here's text

// A possible "Topic" parse as a thread root:
[!custom] start here
[custom] simpler format

This one is just an idea of mine, but I feel it can unleash new ways of using twtxt.

≈1 month ago | #nocwefq |
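
The three subject shapes discussed in the post above could be told apart with a small classifier. This is only a sketch of the proposal as written there: the `!` prefix is explicitly tentative, and `classifySubject` is an invented name.

```javascript
// Classify the leading "(...)" subject of a post's text.
function classifySubject(text) {
  const match = /^\(([^)\s]+)\)/.exec(text);
  if (!match) return { kind: 'none' };
  const ref = match[1];

  if (ref.startsWith('#')) return { kind: 'hash', hash: ref.slice(1) };
  if (ref.startsWith('!')) return { kind: 'custom', topic: ref.slice(1) };

  // Location form: URL + '#' + timestamp, kept verbatim (no normalisation).
  const split = ref.lastIndexOf('#');
  if (split > 0 && /^[a-z][a-z0-9+.-]*:\/\//i.test(ref)) {
    return {
      kind: 'location',
      url: ref.slice(0, split),
      timestamp: ref.slice(split + 1),
    };
  }
  return { kind: 'unknown', raw: ref };
}
```

Keeping the timestamp as the raw string means `2025-01-01T12:00:00+02:00` and `2025-01-01T10:00:00Z` are different keys even though they name the same instant, which matches the "verbatim" suggestion above: a reference only matches if it copies the feed's own spelling.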
Avatar @prologic | #3h7w7ca

@alexonit Holy fuck! 🤣 I just realized how bad my typing was in my reply before 🤣 🤦‍♂️ So sorry about that haha 😆 I blame the stupid iPhone on-screen keyboard ⌨️

≈1 month ago | #ag5xxha |
Replies to #ag5xxha by @prologic
Avatar @alexonit | #ag5xxha

@prologic I admit that I was a bit confused about the meaning of the message, at least I understood it was a "yes" from the last sentence. 😅

≈1 month ago | #bpxub4a |
Avatar @prologic | #3h7w7ca

I was trying to say (badly):

That's kind of my position on this. If we are going to make significant changes in the threading model, let’s keep content based addressing, but also improve the user experience. Answering your question, yes I think we can do some combination of both.

≈1 month ago | #6rwb3za |
Avatar @prologic | #3h7w7ca

I would personally rather see something like this:


2025-09-25T22:41:19+10:00
≈1 month ago | #66opeca |
Avatar @prologic | #3h7w7ca

Of course we still have to fix the hashing algorithm and length.

≈1 month ago | #qitxowq |
Avatar @alexonit | #altkl2a

@prologic I can see the issues mentioned, but I think some can be fixed.

  1. The current hash relies on a url field too, by specification, it will use the first # url = <URL> in the feed's metadata if present, that too can be different from the fetching source, if that field changes it would break the existing hashes too, a better solution would be to use a non-URL key like # feed_id = <UNIQUE_RANDOM_STRING> with the url as fallback.

  2. We can prevent duplications if the reference uses that same url field too or the client "collapses" all references to any of the urls defined in the metadata.

  3. I agree that hashing based on content is good, but we still use the URL as part of the hashing, which is just a field in the feed, easily replicable by a bot, also noting that edits can also break the hash, for this issue an alternative solution (E.g. a private key not included in the feed) should be considered.

  4. For offline reading the source would be downloaded already, the fetching of non followed feeds would fill the gap in the same way mentions do, maybe I'm missing some context on this one.

  5. To prevent collisions there was a discussion on extending the hash (forgot if that was already fixed or not), but without a fallback that would break existing clients too, we should think of a parallel format that maintains current implementations unchanged, we are already backward compatible with the original clients that don't use threads at all, a mention style format for that could be even more user-friendly for those clients.

We should also keep in mind that the current mention format is already location based (@<example https://example.com/twtxt.txt>) so I'm not that worried about threads working the same way.

Hope to see some other thoughts on this matter. 🤓

≈2 months ago | #xan3eva |
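
Points 1 and 2 of the post above could look roughly like this in a client. `# url = ...` is the existing metadata field mentioned there; `# feed_id = ...` is only the field proposed in that post and does not exist in any spec, and the function names are invented for illustration.

```javascript
// Collect "# key = value" metadata comment lines from a feed's text.
function parseMetadata(feedText) {
  const meta = new Map();
  for (const line of feedText.split('\n')) {
    const match = /^#\s*([\w-]+)\s*=\s*(.+?)\s*$/.exec(line);
    if (!match) continue;
    const [, key, value] = match;
    if (!meta.has(key)) meta.set(key, []);
    meta.get(key).push(value);
  }
  return meta;
}

// Identity key: proposed feed_id first, then the first url, then the fetch URL.
function feedIdentity(feedText, fetchUrl) {
  const meta = parseMetadata(feedText);
  return meta.get('feed_id')?.[0] ?? meta.get('url')?.[0] ?? fetchUrl;
}

// Collapse any known URL of a feed onto its identity key.
function canonicalise(refUrl, feedText, fetchUrl) {
  const urls = parseMetadata(feedText).get('url') ?? [];
  const known = refUrl === fetchUrl || urls.includes(refUrl);
  return known ? feedIdentity(feedText, fetchUrl) : refUrl;
}
```

With something like this, a reference to an old or mirrored URL listed in the feed's metadata resolves to the same thread key as a reference to the current one, so readers would not see several "parents" for one post.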