mick,
@mick@cosocial.ca avatar

For the first time the Mastodon server has started to struggle just a little bit to keep up with the flow of the Fediverse.

We’ve usually been “push” heavy but we’ve started to see some spikes in “pull” queue latency. The worst of these spikes was today, where we fell behind by at least a couple minutes for most of the afternoon.

1/?

A pair of graphs showing sidekiq queue latency and enqueued jobs. The queue latency peaked at around 6 minutes at roughly 6pm this afternoon, but was between 1-2 minutes most of the afternoon. The maximum number of messages in queue at any one time was 12000. The majority of the traffic is pull queue.

mick,
@mick@cosocial.ca avatar

This is great! It’s exciting to see our community growing.

I’m going to make a simple change to see if we can better keep up.

The system that we’re running on has plenty of headroom for more sidekiq threads.

2/?

mick,
@mick@cosocial.ca avatar

For anyone interested in understanding the guts of Mastodon, I have found this article from Digital Ocean very helpful: https://www.digitalocean.com/community/tutorials/how-to-scale-your-mastodon-server#perfecting-sidekiq-queues

Eventually we’ll grow so big that we’ll need oodles of sidekiq queues and we’ll want to be able to customize how many types of them we want, and will run them as jobs across multiple servers and so-on.

But for now I’m just going to make the number of threads slightly bigger and see what happens.

3/?

thisismissem,
@thisismissem@hachyderm.io avatar

@mick just note that the article is completely out of date when it comes to the streaming server; I made changes & rewrote the scaling docs for that component.

STREAMING_CLUSTER_NUM is no longer a thing.

mick,
@mick@cosocial.ca avatar

@thisismissem Yes I caught that. Not going near streaming. It’s about the best description of sidekiq I’ve found anywhere though.

If there are other good resources that are more up-to-date please share.

thisismissem,
@thisismissem@hachyderm.io avatar

@mick yup, scaling docs absolutely need better info re sidekiq (and monitoring)

mick,
@mick@cosocial.ca avatar

We’ll do this in staging first, because I am a responsible sysadmin (and I am only ever half sure I know what I’m doing).

We’re running the default config that came with our DigitalOcean droplet, which has a single sidekiq service running 25 threads.
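
For the curious, that thread count lives in the sidekiq systemd unit. On a stock install the relevant bits look roughly like this (exact paths and values may differ on the DigitalOcean image):

# /etc/systemd/system/mastodon-sidekiq.service (excerpt)
[Service]
User=mastodon
WorkingDirectory=/home/mastodon/live
Environment="RAILS_ENV=production"
# one DB connection per sidekiq thread
Environment="DB_POOL=25"
# -c sets the number of sidekiq threads
ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 25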

4/?

mick,
@mick@cosocial.ca avatar

That article from DigitalOcean suggests budgeting roughly 1 GB of RAM per 10-15 threads.

We also need to give each thread its own DB connection.

In staging the DB is local, so we don’t need to worry too much about a few extra connections.

In production, we’re connected to a DB pool that will funnel the extra connections into a smaller number of connections to the DB. Our Database server still has oodles of capacity to keep up with all of this.
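
The rule of thumb I’m following: keep DB_POOL in step with the sidekiq thread count, and make sure whatever hands out connections (Postgres itself in staging, the pool in production) can actually cover it. A quick sanity check against a local Postgres, for example:

# total connections Postgres will accept
sudo -u postgres psql -c 'SHOW max_connections;'

# connections currently in use
sudo -u postgres psql -c 'SELECT count(*) FROM pg_stat_activity;'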

5/?

mick,
@mick@cosocial.ca avatar

Staging server only has 2 GB of RAM but it also has virtually no queue activity so let’s give it a shot.

Having confirmed that we have sufficient resources to accommodate the increase and then picked a number out of a hat, I’m going to increase the number of threads to 40.
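
The change itself is just a matter of editing the unit and restarting the service; something like this, assuming the stock unit shown earlier:

sudo systemctl edit --full mastodon-sidekiq.service
# in the editor, bump both values:
#   Environment="DB_POOL=40"
#   ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 40

sudo systemctl restart mastodon-sidekiq.service
sudo systemctl status mastodon-sidekiq.service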

6/?

A webpage showing the sidekiq control panel on cosocial.engineering, featuring 40 threads.

mick,
@mick@cosocial.ca avatar

No signs of trouble. Everything still hunky-dory in staging.

On to production.

If this is the last post you read from our server then something has gone very wrong. 😅

7/?

mick,
@mick@cosocial.ca avatar

Aaaand we’re good. 🎉

I’ll keep an eye on things over the next few days and weeks to see if this has any measurable impact on performance one way or the other.

And that’s enough recreational server maintenance for one Friday night. 🤓

8/?

mick,
@mick@cosocial.ca avatar

This looks better! Pull queue never got more than 41 seconds behind and that was only briefly.

I’m still not clear on what contributed to these spikes, so there’s no way of knowing for sure that yesterday’s changes are enough to keep our queues clear and up to date, but this looks promising.

9/?

Graphs showing sidekiq pull queue performance over the past 48 hours. Today’s performance looks very good when compared to yesterday’s.

mick,
@mick@cosocial.ca avatar

Well, we’re not out of the woods yet.

We fell behind by less than a minute for most of the day yesterday, with some brief periods where we were slower still.

The droplet is showing no signs of stress with the increased Sidekiq threads, so I can toss a bit more hardware at the problem and see if we can reach equilibrium again.

Better would be to get a clearer picture of what’s going on here.

Maybe we need to do both of these things!

10/?

A closer view of the hours from 10 am to 5 pm (EDT) yesterday, clearly showing the rise in queued pull jobs (and a rapid clearing of push jobs)

mick,
@mick@cosocial.ca avatar

This strikes me as an issue.

We have the capacity to run 40 worker threads (following the change I made last week, documented earlier in this thread).

We have a fairly huge backlog of pull queue jobs.

Why aren’t we running every available worker to clear this backlog? 🤔

It might be necessary to designate some threads specifically for the pull queue in order to keep up with whatever is going on here, but I am open to suggestions.

michael,
@michael@thms.uk avatar

@mick I haven’t read the entirety of the thread, so forgive me if that’s already been covered, but have you tried defining your workers with different sequences of queues?

So you could have one service defined as

/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 10 -q pull -q default -q ingress -q push -q mailers

Another as

/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 10 -q default -q ingress -q push -q mailers -q pull

Etc.

That way you would have 10 workers prioritising the pull queue but picking up other queues when capacity is available, and another 10 workers prioritising the default queue but picking up other queues (including pull) when capacity is available.

You could permute this across different combinations of queue priorities.

mick,
@mick@cosocial.ca avatar

@michael that’s where I’m headed next I think.

I’d hoped that just increasing the number of threads for the single service would be enough, but it seems like the default queue prioritization results in a backlog and idle workers.

So dedicating a number of threads per queue seems like the next sensible step.

Thanks for the suggestion!

michael,
@michael@thms.uk avatar

@mick just to be clear, what I’d suggest is not to dedicate them, but to prioritise.

Maybe you mean the same thing, but if you set up a service with

/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 10 -q pull

And another with

/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 10 -q push

Then that first process will sit idle when there is nothing in the pull queue, even if the push queue might be full.

If, on the other hand, you have a service defined as

/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 10 -q pull -q push

And another as

/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 10 -q push -q pull

Then that first command will process the push queue only after the pull queue has been emptied, and the second one will process the pull queue only after the push queue has been emptied, thus potentially wasting fewer resources.
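
In systemd terms that would be a pair of sidekiq services with different queue orders; a rough sketch (service names and DB_POOL values are just placeholders):

# mastodon-sidekiq-pull.service (excerpt)
[Service]
Environment="DB_POOL=10"
ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 10 -q pull -q default -q ingress -q push -q mailers

# mastodon-sidekiq-default.service (excerpt)
[Service]
Environment="DB_POOL=10"
ExecStart=/home/mastodon/.rbenv/shims/bundle exec sidekiq -c 10 -q default -q ingress -q push -q mailers -q pull

Enable and start both, and each group of 10 threads works its preferred queue first and falls back to the others whenever it would otherwise sit idle.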

paul,
@paul@oldfriends.live avatar

@michael @mick

After I was told to do this, it worked like a charm and my queues haven't been backed up at all. Before, I had gotten into a rut where several thousand jobs were backed up in the queue during certain busy hours of the federation day.

rolle,
@rolle@mementomori.social avatar

@mick What monitoring system is that?

mick,
@mick@cosocial.ca avatar

@rolle Prometheus, and this sidekiq-exporter https://github.com/Strech/sidekiq-prometheus-exporter
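
The Prometheus side is pretty minimal; a rough sketch of the scrape config, assuming the exporter is running standalone via rackup on its default port (9292) on the same host:

# prometheus.yml (excerpt)
scrape_configs:
  - job_name: 'sidekiq'
    static_configs:
      - targets: ['localhost:9292']

The graphs earlier in the thread are just the exporter’s per-queue latency and enqueued-jobs metrics plotted over time.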

rolle,
@rolle@mementomori.social avatar

@mick Thanks!

thisismissem,
@thisismissem@hachyderm.io avatar

@mick i wonder if it'd make sense to produce log messages for each instance, actor, activity type, directionality and other properties over time & make that graphable?

What percentage of activities processed or sent in the last day were Deletes? What types of activities does Server Y send me?

(Maybe this also ties into @polotek's question earlier today)

mick,
@mick@cosocial.ca avatar

@thisismissem @polotek That would be helpful. I’d love to be better able to interpret weird traffic spikes like this.

Without being creepy about it. 😅

polotek,
@polotek@social.polotek.net avatar

@mick @thisismissem so the thing that has me thinking about this is I was using activitypub.academy to view some logs. I did a follow to my server and it showed that my server continually sent duplicate "Accept" messages back. I can't tell if that's an issue with my server or with the academy. Because I can't see my logs.

thisismissem,
@thisismissem@hachyderm.io avatar

@polotek @mick yeah, I've seen that too & I'm not sure if it's a bug on the source mastodon server or on ActivityPub.academy server

polotek,
@polotek@social.polotek.net avatar

@thisismissem @mick one thing I know is a problem is retries that build up in sidekiq. Sidekiq will retry jobs basically forever. And when servers disappear, their jobs sit in the retry queue failing indefinitely. I'm sure larger instances with infra teams do some cleanup here. But how are smaller instances supposed to learn about this?

thisismissem,
@thisismissem@hachyderm.io avatar

@polotek @mick there are retry limits
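
You can also see what's currently sitting in the retry and dead sets straight from the Redis that backs Sidekiq; a quick check, assuming redis-cli on that host (keys are prefixed if you run with a REDIS_NAMESPACE):

# jobs waiting to be retried
redis-cli zcard retry

# jobs that have exhausted their retries
redis-cli zcard dead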

polotek,
@polotek@social.polotek.net avatar

@thisismissem @mick where? Are they configurable? And again, how would I know? Is the recommended support channel complaining in mastodon until somebody tells you something that you can’t even verify?

thisismissem,
@thisismissem@hachyderm.io avatar

@polotek @mick in the workers

polotek,
@polotek@social.polotek.net avatar

@mick @thisismissem people told me that ActivityPub was very "chatty". I understand a lot better why that is now. But I now suspect that there's also a ton of inefficiency there, because few people are looking at the actual production behavior.

thisismissem,
@thisismissem@hachyderm.io avatar

@polotek @mick at Hachyderm we do do a lot of monitoring & alerting (helps when the infrastructure team is all really experienced), but there could certainly be more logs & data available

KevinMarks,
@KevinMarks@xoxo.zone avatar

@polotek @mick @thisismissem what happened to the attempt to wire Mastodon up with OpenTelemetry? This kind of thing is what honeycomb.io is really good at exploring

thisismissem,
@thisismissem@hachyderm.io avatar

@KevinMarks @polotek @mick I think it's still in progress
