Hopefully this is part 1 of many posts describing my progress toward running a PDS across many nodes or devices.
A few weeks ago, someone said something about running a PDS in a browser that got me thinking... What if a PDS was hosted on a device? Your phone, tablet, laptop, etc., as part of an app. It would let anyone host their own PDS without the infrastructure barrier that only technically minded people know how to get over.
The great thing about having it run on your device(s) is that you really do own your data, without worrying about a cloud hosting company going down or the wifi in your house stopping a Raspberry Pi from running (it would also be much cheaper). But what if your phone turns off, I hear you say? Well, what if you have more than one device running a PDS...?
All of that is a dream scenario for me that may or may not happen. It got me thinking though: how hard can it be to have multiple instances of a PDS running across more than one device?
I run my PDS (the Bluesky reference instance) on an UpCloud server and it costs me around £6 a month. I want to run it on my Pi at home, and for the past few months I've been running a test PDS using the Go implementation, Cocoon. It's been fun and I've learnt a ton about how a PDS works and plugs into the atproto architecture.
Most people I know who host a PDS on a Pi use Cloudflare tunnels to allow inbound traffic to their home network. I tried it, but it was too flaky and kept erroring. Instead I use Tailscale, which is something I've used for years for my hobby projects.
I point my test PDS DNS record, pds404.uk, to the same UpCloud server that my real PDS runs on. From there, Caddy routes traffic for pds404.uk to my Pi via the Tailscale network. It's easy to configure, and since I already had Tailscale running on the server and the Pi for ssh access, I didn't need to install anything new.
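The Caddy side of that routing is tiny. Here's a sketch of what the relevant Caddyfile block could look like; `raspberrypi` stands in for the Pi's Tailscale MagicDNS hostname and `3000` for whatever port Cocoon listens on, both of which are placeholders rather than my actual config:

```
# Caddyfile sketch: route the test PDS domain to the Pi over Tailscale.
pds404.uk {
	reverse_proxy raspberrypi:3000
}
```

Because Caddy resolves `raspberrypi` through the Tailscale network, no ports need opening on the home router at all.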
Cocoon runs in a Docker container on my Pi, which means in theory I can just run two or three of them. If I had another Pi, which is looking less likely at the moment given how expensive they've become, I could easily run Cocoon on there as well and use Caddy's load balancer to route to either Pi. So that's what I did, on the one Pi: I ran another Cocoon container, configured Caddy to load balance between the two of them round-robin style, pointed them at the same sqlite and keys files, and it kinda worked.
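The load balancing is just an extension of the same Caddyfile block. A sketch, again with hypothetical hostnames and ports (`3000` and `3001` being the two containers' published ports):

```
# Caddyfile sketch: round-robin between two Cocoon containers.
pds404.uk {
	reverse_proxy raspberrypi:3000 raspberrypi:3001 {
		lb_policy round_robin
	}
}
```

Caddy's `lb_policy round_robin` alternates requests between the upstreams, which is exactly what surfaces the subscription problem described further down.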
I didn't test that too much, because it isn't really distributed if both instances are tied to the same host, sharing a sqlite file. Then I remembered a product I'd heard of called Turso: a sqlite implementation that can run in the cloud. Surely that would make this more distributed? It might introduce more latency, since database queries would now run over a network instead of the file system, but I gave it a shot.
Implementing Turso in Cocoon was actually quite simple. I just had to import another library and pass in the URL and token for the Turso database when opening the sqlite connection. Here is a PR on my fork of Cocoon showing how easy it was.
There is also a Turso CLI tool that let me migrate from the sqlite file on my Pi to the cloud database, which was very straightforward.
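For anyone wanting to try the same migration, it could look something like the following, assuming the `sqlite3` and `turso` CLIs are installed and you're logged in; `cocoon` is a hypothetical database name:

```
sqlite3 cocoon.db .dump > dump.sql   # dump the local sqlite file
turso db create cocoon               # create the cloud database
turso db shell cocoon < dump.sql     # replay the dump into Turso
turso db show cocoon --url           # the URL to hand to the PDS config
```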
Once I had that, I configured both Cocoon instances on my Pi, spun them up, and it worked... kinda. I tested it by creating a post, and sure enough I was able to see it from my real PDS account. I then created another post, but this time it went missing. I used pdsls to look at the repo, and the record was there, which was good news; it also pointed me in the direction of relays.
I took a look at the logs for both containers and spotted something interesting. Both containers had a crawl request going out, but only one had the incoming xrpc/com.atproto.sync.subscribeRepos request, which is the relay opening a websocket connection to collect events. What had happened is that both servers made a crawl request, but because of the round-robin load balancing, one container received both of the resulting subscribe requests. The create-post request must then have been routed to the container the relay wasn't subscribed to, which meant the record was never published to the relay.
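The two halves of that handshake, as they showed up in the logs, can be sketched like this (with `relay.example` as a placeholder for the real relay host):

```
# Outbound from each Cocoon container: ask the relay to crawl this PDS.
curl -X POST https://relay.example/xrpc/com.atproto.sync.requestCrawl \
  -H 'Content-Type: application/json' \
  -d '{"hostname": "pds404.uk"}'

# Inbound from the relay: a long-lived websocket subscription.
# With round-robin balancing, one container ends up holding both of these:
# GET https://pds404.uk/xrpc/com.atproto.sync.subscribeRepos?cursor=...
```

The asymmetry is the whole problem: requestCrawl is a one-shot call either container can make, but subscribeRepos is a persistent connection that lands on whichever container the load balancer happens to pick.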
This is quite a problem, and an interesting one that will need solving if I want to take my distributed PDS theory any further. Thankfully, Fig had already mentioned an idea to me a few weeks ago for a Firehose inverter:
ooh i also have another thing for you maybe! leaflet.pub/d2b6a15e-303... i'd make the server if you'd actually make the PDS that uses it! (though mayyyyybe a tricky fit for phones where serving the redirect might not work)
This could well be the answer I'm looking for. In the meantime, though, I'm going to explore some other ideas of my own, starting with a deep dive into how the subscribing works.
Still, this proves that running a PDS across more than one device should be possible, and that using Turso as the database works nicely. As a side note, Turso also has a replicated mode, where a copy of the database is stored in a file on each server and changes from the cloud primary are replicated down to the local copies. That should help with latency, and it's something else I'll be investigating.
Stay tuned for the next part in my journey to a distributed PDS.