Postmortem: a Mastodon outage, backup restore and preventive maintenance

After reading about Mastodon UI theming options, I decided to follow the directions from the TangerineUI-for-Mastodon project to give my instance a new look and feel. The directions were clear and short, so I went ahead. But something failed during the asset compilation process, and my Mastodon instance got wrecked.

As a personal “challenge”, I decided to write a software postmortem about this event. The end of the document also summarizes the actions taken during the maintenance phase that followed the backup restoration.

Summary

Date: 2023-08-02 13:33
Author(s): Joel C.
Status: Service restored.
Summary: Mastodon Web UI was partially unavailable for 3H during daylight period.
Impact: No toots could be sent or read from the Web interface. No data were lost. Only user(s) who applied the new theme were impacted.
Root cause(s): The compilation of theme assets failed but the theme was applied by the only user of the Mastodon instance.
Trigger: Errors were treated as warnings and have been discarded.
Resolution: Frontend files and configuration were restored from the last nightly backup.
Detection: Impacted user(s) could not access the Web interface. The Mastodon error message “Something went wrong on our side” was displayed.
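
Since the trigger was a failed step whose errors were ignored, one way to make such a failure impossible to miss is to wrap each step so that a non-zero exit status is reported and propagated. This is only a sketch: `run_or_abort` is a hypothetical helper, not something from the theme's directions, and the command in the usage comment is the usual Mastodon asset precompilation command, which may differ from what the theme installation actually runs.

```shell
#!/bin/sh
# Hypothetical helper: run a command and report loudly if it fails,
# so that a failed step is never mistaken for a warning.
run_or_abort() {
    if "$@"; then
        echo "OK: $*"
    else
        status=$?
        echo "FAILED (exit $status): $*" >&2
        return "$status"
    fi
}

# Usage sketch (assumed command, run as the mastodon user from ~/live):
#   run_or_abort env RAILS_ENV=production bundle exec rails assets:precompile \
#     && echo "Assets compiled, the new theme can be selected."
```

The theme would then only be selected in the Web interface once the compilation step has reported `OK`.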

Corrective action(s)

Item | Type | Owner | Bug
Reminders of the rules for Production operations. | mitigate | Joel C. | N/A DONE
Set up a non-Production environment for staging purpose. | prevent | Joel C. | #00008 TODO
Freeze unplanned feature changes for a month. | process | Joel C. | N/A DONE
Set up daily maintenance script to optimize backup. | improvement | Joel C. | #00009 DONE

Lesson(s) learned

What went well

What went wrong

Where we got lucky

Timeline

2023-08-02 (GMT+2)

Restoring operations

Stop the Mastodon services and remove the failing content:

# systemctl stop mastodon-web mastodon-sidekiq mastodon-streaming
# rm -rf /home/mastodon/live
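
Deleting the broken tree is irreversible. A more cautious variant, sketched here as an alternative and not what was done during the incident, renames the directory so that it can still be inspected once the service is back. `set_aside` is a hypothetical helper name, and the approach assumes there is enough free disk space to keep both copies for a while.

```shell
#!/bin/sh
# Hypothetical helper: rename a directory with a timestamp suffix
# instead of deleting it, and print the new name.
set_aside() {
    dir=$1
    stamp=$(date +%Y%m%d-%H%M%S)
    target="${dir}.broken-${stamp}"
    mv -- "$dir" "$target" && echo "$target"
}

# Usage sketch (as root, once the services are stopped):
#   set_aside /home/mastodon/live
```

The renamed directory can be removed with `rm -rf` after the restore has been validated.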

Transfer the backup data to the Mastodon server. I am using rsnapshot(1), so the restore can be done with a simple tar command over an SSH connection, from the backup server to the Mastodon server:

# eval $(ssh-agent -s)
# ssh-add /home/backup/.ssh/id_backup
# cd /backup/mastodon/daily.0/home/mastodon
# gtar cpf - --exclude=live/public/system/cache live | \
  ssh backup@mastodon "cd /home/mastodon && tar xvpf -"
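
The same tar pipe can be rehearsed locally on throwaway directories before pointing it at a production host. The sketch below is an illustration under assumptions: `copy_tree` is a hypothetical helper, it uses plain `tar` on both sides rather than `gtar` and SSH, and it assumes a tar implementation that supports `--exclude` (GNU tar and bsdtar do).

```shell
#!/bin/sh
# Hypothetical helper: copy directory NAME from SRC_PARENT into DEST_PARENT
# through a tar pipe, skipping one relative path (here: the media cache).
copy_tree() {
    src_parent=$1 name=$2 exclude=$3 dest_parent=$4
    (cd "$src_parent" && tar cpf - --exclude="$exclude" "$name") |
        (cd "$dest_parent" && tar xpf -)
}

# Usage sketch on scratch directories:
#   copy_tree /tmp/src live live/public/system/cache /tmp/dst
```

Checking that the excluded path is absent from the destination, and that a known file is present, gives some confidence in the exclude pattern before the real restore.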

Clear the cache on the Mastodon server:

$ redis-cli -h 192.0.2.10 -n 0 FLUSHDB

$ cd ~/live
$ RAILS_ENV=production ./bin/tootctl cache clear

$ RAILS_ENV=production bin/tootctl preview_cards remove --days 0
14648/14648 |===========================================| Time: 00:01:53
Removed 14648 preview cards (approx. 489 MB)

$ RAILS_ENV=production bin/tootctl media remove --days 0
33553/33553 |===========================================| Time: 00:04:51
Removed 33553 media attachments (approx. 22 GB)

$ RAILS_ENV=production bin/tootctl media remove-orphans
46414/46414 |===========================================| Time: 00:01:57
Removed 53 orphans (approx. 7.49 MB)

$ RAILS_ENV=production bin/tootctl accounts cull
26989/26989 |===========================================| Time: 00:22:30
Visited 26989 accounts, removed 205

Warm the cache for preferred domains:

$ RAILS_ENV=production bin/tootctl accounts refresh --domain bsd.network
$ RAILS_ENV=production bin/tootctl accounts refresh --domain piaille.fr
$ RAILS_ENV=production bin/tootctl accounts refresh --domain fosstodon.org
...
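
When the list of domains grows, the refresh commands can be driven by a small loop. This is a minimal sketch: `warm_domains` is a hypothetical helper, and the `TOOTCTL` variable is an assumption introduced here so that the loop can be rehearsed with a stub instead of the real `tootctl`.

```shell
#!/bin/sh
# Path to tootctl; overridable so the loop can be rehearsed with a stub.
TOOTCTL="${TOOTCTL:-./bin/tootctl}"

# Hypothetical helper: refresh accounts for each domain given as argument,
# continuing with the next domain if one refresh fails.
warm_domains() {
    for domain in "$@"; do
        RAILS_ENV=production "$TOOTCTL" accounts refresh --domain "$domain" ||
            echo "refresh failed for $domain" >&2
    done
}

# Usage sketch (as the mastodon user, from ~/live):
#   warm_domains bsd.network piaille.fr fosstodon.org
```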

Restart the Mastodon services and check their status:

# systemctl start mastodon-web mastodon-sidekiq mastodon-streaming
# systemctl status mastodon-web mastodon-sidekiq mastodon-streaming
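
The output of `systemctl status` is meant to be read by a human. For a scripted check, `systemctl is-active --quiet` returns a usable exit status. The helper below is a sketch; `check_services` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: report each unit's state and fail if any is not active.
check_services() {
    failed=0
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: NOT active"
            failed=1
        fi
    done
    return "$failed"
}

# Usage sketch:
#   check_services mastodon-web mastodon-sidekiq mastodon-streaming
```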

Maintenance operations

Since I configured proxies on my Mastodon instance, I get a lot more traffic and a lot more cache is used. I realised that far too much data was being backed up by the daily process. I decided to tune my rsnapshot(1) configuration and to add a nightly maintenance script that erases “old” cache data. What counts as old is up to you.

The rsnapshot configuration is tweaked to exclude the Mastodon cache directory:

# vi /etc/rsnapshot/mastodon.conf
(...)
backup backup@mastodon:/home/ home/ exclude=mastodon/live/public/system/cache
(...)

A maintenance script runs every night and clears old cached data:

# vi /home/scripts/mastodon_maintenance
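#!/usr/bin/env bash
# Nightly Mastodon maintenance: prune old cached data.
PATH="/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin"
PATH="${PATH}:$HOME/.rbenv/shims:$HOME/.rbenv/bin"
export PATH RAILS_ENV=production LANG=en_US.utf8
# Stop here if the Mastodon directory is missing.
cd ~/live || exit 1
# Random delay of up to 10 minutes before starting.
sleep $((RANDOM%600))
echo "Starting Mastodon maintenance."
# Cached remote media, avatars and headers older than 7 days.
./bin/tootctl media remove --days 7
./bin/tootctl media remove --days 7 --prune-profiles
./bin/tootctl media remove --days 7 --remove-headers
# Media files that are no longer referenced.
./bin/tootctl media remove-orphans
# Preview cards older than 30 days.
./bin/tootctl preview_cards remove --days 30
# Unreferenced statuses older than 7 days.
./bin/tootctl statuses remove --days 7
echo "Mastodon maintenance done."
#EOF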

# chmod 0755 /home/scripts/mastodon_maintenance

# cat > /etc/cron.d/mastodon_maintenance
@daily mastodon /home/scripts/mastodon_maintenance | mailx -s "Mastodon maintenance" root

The storage now stays stable at about 30 GB. No further Mastodon outage has occurred 😮‍💨
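
To notice when storage starts drifting again, the size can be compared against a limit. The following is only a sketch under assumptions: `within_budget` is a hypothetical helper, and both the path and the 35 GB threshold in the usage comment are arbitrary examples, not values taken from the instance.

```shell
#!/bin/sh
# Hypothetical helper: succeed if DIR uses at most LIMIT_KB kilobytes.
within_budget() {
    dir=$1 limit_kb=$2
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    echo "$dir uses ${used_kb} KB (limit ${limit_kb} KB)"
    [ "$used_kb" -le "$limit_kb" ]
}

# Usage sketch (assumed path and threshold):
#   within_budget /home/mastodon/live/public/system $((35 * 1024 * 1024)) \
#     || echo "Mastodon storage is growing" | mailx -s "Mastodon storage" root
```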