jonny, (edited )
@jonny@neuromatch.social avatar

Hello fedi. i am trying to solve, once and for all, the "fetch all replies" problem that makes the fedi feel a lot more desolate and with a lot more reply guys in it than it should be. this is take two: before, i had it triggered by a button, but now i think it should happen on the server side whenever you expand a post. can anyone help me figure out how to make this more efficient by only fetching posts that the server doesn't already have? i am not sure what the best strategy would be, and if anyone with experience doing efficient rails and SQL stuff could give me some pointers, that would be gr8. the patch is actually extremely simple, it just needs a few nice things to make it not DDoS everyone.

https://github.com/NeuromatchAcademy/mastodon/pull/44

Issue that describes approach: https://github.com/NeuromatchAcademy/mastodon/issues/43
Wiki page: https://wiki.neuromatch.social/Fetch_All_Replies

bkil,

@jonny Someone mentioned this post on Matrix, so I checked it on Friendica, and it automatically fetches all replies and shows them as a comment thread. Welcome to technology from 2010 I guess, or maybe I have overlooked part of the original problem statement?

jonny,
@jonny@neuromatch.social avatar

@bkil
If friendica does it, then great. Masto doesnt.

4censord,
@4censord@unfug.social avatar

@jonny have you considered running this as sidekiq jobs?
So either adding a new queue, or using the pull queue?

This would have the disadvantage that, when queue latency is high, the fetching will be delayed too.
But it would also have the advantage of moving the load of fetching potentially many replies to a separate system, out of the main puma process.
This eases scaling and performance concerns.

On first thread expansion, it'd queue a job to fetch the first n replies to the thread.
When that job completes and there are still more replies, it would queue a new job (maybe even delayed by a few seconds) to fetch the next n replies, and repeat.
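
A rough sketch of that shape; FetchAllRepliesWorker and FetchRepliesService are placeholder names (not existing Mastodon classes), and the paging contract is assumed:

```ruby
# Illustrative worker: each run fetches one page of the remote replies
# collection and re-enqueues itself for the next page.
class FetchAllRepliesWorker
  include Sidekiq::Worker

  sidekiq_options queue: 'pull', retry: 3

  REPLIES_PER_PAGE = 50

  def perform(status_id, page_uri = nil)
    status = Status.find(status_id)

    # Assumed contract: the (hypothetical) service fetches up to
    # REPLIES_PER_PAGE replies starting at page_uri and returns the URI of
    # the next page, or nil when the collection is exhausted.
    next_page = FetchRepliesService.new.call(status, page_uri, limit: REPLIES_PER_PAGE)

    # Spread the remaining pages out over time instead of hammering the
    # origin server all at once.
    self.class.perform_in(5.seconds, status_id, next_page) if next_page
  end
end
```

Putting it on the existing 'pull' queue keeps it behind user-facing work; a dedicated queue would isolate it further at the cost of extra configuration.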

jonny,
@jonny@neuromatch.social avatar

@4censord
Thats what id like to do, but wasnt sure the best way to make it happen!

mike,
@mike@rebel-lion.uk avatar

@jonny Would love to see this fixed! I’m on my own instance and always have to navigate to the original page to see replies outside my follows.

Sorry, I don't know enough about Mastodon internals to be able to help!

jonny,
@jonny@neuromatch.social avatar

@mike doin it for the small instances ;)

efi,
@efi@chitter.xyz avatar

@jonny don't posts have some kind of id when you fetch them from a server?

jonny,
@jonny@neuromatch.social avatar

@efi yep! just need a little help with making an efficient query to check those against the local representations

efi,
@efi@chitter.xyz avatar

@jonny appending the server name to the id, hashing it, and using that as the index would be the most efficient, I think?
tho sql query planners are very good, so maybe indexing on server name first, with a secondary index on post id, would work even better - not sure, it's been a decade since I did sql myself
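
(for reference, the composite-index idea would look something like this as a Rails migration; the table and column names here are hypothetical, not Mastodon's actual schema - see the next reply)

```ruby
# Hypothetical schema: a unique composite index on (origin server, remote id)
# makes "does the server already have this post?" a single indexed lookup.
class AddOriginIndexToPosts < ActiveRecord::Migration[7.0]
  def change
    add_index :posts, [:origin_server, :remote_post_id], unique: true
  end
end
```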

jonny,
@jonny@neuromatch.social avatar

@efi masto already makes an internal snowflake ID for posts and stores the originating post URI as well. i will investigate tmrw what indexes exist between URI and ID, but presumably that data is already all there. i'm mostly concerned with the implementation in rails and the caching system for debouncing/deduplicating requests.
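
a rough sketch of the dedup check (collection_items and FetchReplyWorker are placeholder names, and the shape of the remote collection items is assumed):

```ruby
# collection_items stands in for the items of the remote replies collection
# (either inlined objects or bare URI strings).
remote_uris = collection_items.map { |item| item.is_a?(Hash) ? item['id'] : item }

# statuses.uri has a unique index in Mastodon, so this is one indexed query
# rather than a round trip per reply.
known_uris   = Status.where(uri: remote_uris).pluck(:uri)
missing_uris = remote_uris - known_uris

# Only fetch what the server doesn't already have.
missing_uris.each { |uri| FetchReplyWorker.perform_async(uri) }
```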

smallcircles,
@smallcircles@social.coop avatar

@jonny

Hey super interesting, Jonny!

> and with a lot more reply guys in it than it should be

As it happens, this morning I was coincidentally side-tracked on a self-assigned quest to put together some thoughts on the "Reply Guy" anti-pattern on the #SocialCoding movement's forum.

So far that has turned into this wiki post (and related discussion thread): https://discuss.coding.social/t/wiki-for-sx-anti-pattern-reply-sigh-aka-reply-guy/530

smallcircles,
@smallcircles@social.coop avatar

@jonny

PS. I cross-ref'ed this great thread to the #SX matrix chatroom.

https://matrix.to/#/#socialcoding-foundations:matrix.org

can,
@can@haz.pink avatar

@jonny I unfortunately don't have any knowledge about optimizing this, but I want to thank you for working on this issue. I think this is a very crucial feature that has been ignored for too long and will contribute greatly to the overall usability of Mastodon. The current state clearly feels like a bug: every time I open a post I end up viewing it on the original instance, which is terrible UX. So, thanks!

jonny,
@jonny@neuromatch.social avatar

@can hopefully we get it to work!!! we already had it working in v1, but it was masto-to-masto only, this one should be more general and should blend more seamlessly into normal use on both web and apps.

jonny,
@jonny@neuromatch.social avatar

Pitch

When expanding a post, the instance should fetch all replies from the host server.

This issue is to move the more general conversation out of #8, because i think that's the wrong approach.

Previous context:

Motivation

Two reasons:

  • It's an important discovery mechanism - people should be able to see the conversation around a post (within normal privacy settings, ie. we should not be trying to get followers-only posts, etc.)
  • The "a thousand of the same replies" problem is notorious on fedi and part of what makes it somewhat exhausting, and can quickly feel like brigading if a post becomes even moderately popular.

Approach

Concerns

Privacy has been discussed elsewhere - we will only be getting posts that wouldn't be filtered out by normal post visibility settings. ie. the user would be able to get them on their own by just running a bunch of manual searches.

  • Perf & API Consistency: Having a potentially long-running service call in the context endpoint is undesirable. We should run the service asynchronously. This will mean that later calls will yield different results (ie. as the posts are imported by the async worker). That's really only a problem for programmatic API usage, and just requires a note in the endpoint documentation. In normal web UI usage, it should simply look like posts loading into the interface as they are received. The context endpoint would behave as expected on the first call, and just have extra replies in future calls. We could add an additional option, defaulting to false, to make the reply-fetching service synchronous.
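
As a rough sketch of the async trigger (not the actual patch): the controller and serializer names below mirror Mastodon's context endpoint but are simplified, and FetchAllRepliesWorker is the hypothetical worker described elsewhere in this thread.

```ruby
class Api::V1::StatusesController < Api::BaseController
  def context
    @status = Status.find(params[:id])

    # Fire-and-forget: refresh the reply tree of remote statuses in the
    # background. The first response is unchanged; later calls include the
    # newly imported replies.
    FetchAllRepliesWorker.perform_async(@status.id) unless @status.local?

    @context = Context.new(
      ancestors: @status.ancestors(40, current_account),
      descendants: @status.descendants(40, current_account)
    )
    render json: @context, serializer: REST::ContextSerializer
  end
end
```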
jonny, (edited )
@jonny@neuromatch.social avatar

I think that with a combination of debouncing how frequently the reading server requests from the OP server, and only asking tertiary (replier) servers for the posts that are new in the context response, this isn't any more of a DoS problem than normal masto operation. Recall that masto is already pretty dang inefficient (eg. if you expand a post, masto will already fetch the profile, which includes fetching pinned posts, preview cards for all links in the bio and pinned posts, etc.), and expanding the context of a post would be a directly triggered behavior that i think matches normal expectations: when i look for the replies to that post, i should see the replies to that post. This would tie into any existing privacy controls - a 'followers only' reply wouldn't be reported in the response from the OP -> reading server, the reading server would have to abide by AUTHORIZED_FETCH, blocks would still hold, etc.
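
a minimal sketch of the debounce part, using Rails.cache (Redis-backed in masto); the key format and the 30-minute window are just illustrative, and FetchAllRepliesWorker is a placeholder name:

```ruby
DEBOUNCE_WINDOW = 30.minutes

def debounced_fetch_all_replies(status)
  cache_key = "fetch_all_replies:#{status.id}"

  # Skip if this thread was already refreshed recently.
  return if Rails.cache.exist?(cache_key)

  Rails.cache.write(cache_key, true, expires_in: DEBOUNCE_WINDOW)
  FetchAllRepliesWorker.perform_async(status.id)
end
```

(the exist?/write pair has a small race; with the Redis cache store, writing with unless_exist: true should make the claim atomic)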

The costs of not having fetch-all-replies are pretty bad - first is that fedi can feel vacant on smaller servers. it takes quite a lot of people with quite a lot of follows to start having anything resembling a conversation among ppl more than 1-deep in a social graph. One of the primary criticisms of the fedi (mastodon specifically) is the high number of reply guys, and if you have ever had a post with even a moderate amount of popularity you know how exhausting it is to get exactly the same reply over and over again, and to keep pointing well-intentioned people to information/replies/etc. that already exist elsewhere in the replies.

I'll stop there, but I think that the benefits of having fetch all replies pretty strongly outweigh the costs, and so that's why i want to do it efficiently. This is an especially important behavior if we want to get to a point of making the fedi p2p, where we can make sparse state updates more of the norm <3

jonny,
@jonny@neuromatch.social avatar

this is a patch on top of glitch, and so if we find something that works here the goal would be to pull it upstream, with neuromatchstodon as sort of the live testing instance. so ur work would be respected, credited, and made more general

jonny,
@jonny@neuromatch.social avatar

This is, imo, one of the biggest problems with running a small or single-user fedi instance. This patch would make small fedi instances about a billion times more usable - aka it's directly responsive to the problem of 'fediverse is cool, but actually most accounts are on the largest 3 servers', bc smaller servers see like 0.01% of the fedi.

this is actually, imo, a more efficient behavior compared to the current alternative, which is to make some dummy account (or pollute your home feed) with lots and lots of follows just to be able to see the context around a post. ie. currently you have to pull in many many more posts than you want vs. just requesting the context of the posts you want to see.

also a polite cc to @hrefna, who i have seen write about amplification on activitypub and masto a bunch of times, in case xe has any thoughts here
