No surprise. About a year ago, I looked at fly.io because of its low pricing and wondered where they were cutting corners to still make some money. Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over if that server dies. Not sure if that part is still in the official documentation.
In practice, that means if a server goes down, they have to load the last snapshot of that instance from backup, push it to a new server, update the network path, and pray that no more servers fail than there is spare capacity for. Otherwise you have to wait for a restore until the datacenter has mounted a few more boxes in the rack.
That explains quite a bit of the randomness in those outage reports, i.e. "my app is down but the other one is fine" and "mine came back in 5 minutes but the other took forever".
As a business on a budget, I think almost anything else, e.g. a small Civo cluster, serves you better.
> a fly instance is hardwired to one physical server and thus cannot fail over
I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
You can run your workload (in this case a VM) on top of a scheduler, so if one node goes down the workload is just spun up on another available node.
> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
They mean the storage part. If your VM's storage(state) is on one server and that server dies, you have to restore from backup. If your VM's storage is on remote shared storage mounted to that server and the server dies, your VM can be restarted elsewhere that has access to that shared storage.
In AWS land it's the difference between instance store (local to a server) and EBS (remote, attached locally).
There's a tradeoff in that shared storage will be slightly slower due to having to traverse networking, and it's harder to manage properly; but the reliability gain is massive.
> Ultimately, I found the answer in their tech docs where it was spelled out clearly that an fly instance is hardwired to one physical server and thus cannot fail over in case that server dies.
Majority of EC2 instance types did not have live migration until very recently. Some probably still don't (they don't really spell out how and when it's supposed to work). It is also not free - there's a noticeable brown-out when your VM gets migrated on GCP for example.
Here's the GCP doc [1]. Other live migration products are similar.
Generally, you have worse performance while in the preparing to move state, an actual pause, then worse performance as the move finishes up. Depending on the networking setup, some inbound packets may be lost or delayed.
The status page tells a story about a high-availability/clustering system failure, so I think in this case the problem is more the complexity of the HA machinery hurting the system's availability, versus something like a simple VPS.
Bad code rarely causes outages at this scale. The culprit is always configuration changes.
Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?
You cannot plan your way out of operational challenges, regardless of what time of year it is.
> Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?
Reading this, I see two routine operational issues, one security issue and one hardware issue.
You can’t plan your way around security issues or hardware failures, but operational issues you both can and should plan around. Holiday schedules like this are fixed points in time, so there’s absolutely no reason why you can’t plan all routine work to be completed either a week in advance of, or a week after, the holiday period.
Certificates don’t need to be near the point of expiry to be renewed. Capacity doesn’t need to be at critical levels to be expanded. Ultimately, this is a risk management question (as a sibling has also commented). Is the organisation willing to take on increased risk in exchange for deferring operational expenses?
If the operational expense is inevitable (the certificate will need renewing), that seems like an easy answer when it comes to risk management over holidays.
If the operational expense is not inevitable (will we really need to expand capacity?), it then becomes a game of probabilities and financials - likelihood of expense being incurred, amount of expense incurred if done ahead of time, impact to business if something goes wrong during a holiday.
I think a good way of looking at it is risk. Is the change (whether it is code or configuration, etc.) worth the risk it brings on?
For example if it's a small feature then it probably makes sense to wait and keep things stable. But, if it's something that itself causes larger imminent danger like security patches / hard disk space constraints, then it's worth taking on the risk of change to mitigate the risk of not doing it.
At the end of the day no system is perfect and it ends up being judgement calls but I think viewing it as a risk tradeoff is helpful to understand.
I think you can't avoid the fact that these holiday weeks are different from regular weeks. If you "change freeze" then you also freeze out the little fixes and perf tuning that usually happens across these systems, because they're not "critical".
And then inevitably it turns out that there's a special marketing/product push, with special pricing logic that needs new code, and new UI widgets, causing a huge traffic/load surge, and it needs to go out NOW during the freeze, and this is revenue, so it is critical to the business leaders. Most of eng, and all of infra, didn't know about it, because the product team was cramming until the last minute, and it was kinda secret. So it turns out you can freeze the high-quality little fixes, but you can't really freeze the flaky brand-new features ...
It's just a struggle, and I still advise to forget the freeze, and try to be reasonable and not rush things (before, during, or after the freeze).
Any big tech company with large peak periods disagrees with you. It's absolutely worth freezing non-critical changes.
Urgent business change needs to go through? Sure, be prepared to defend to a vp/exec why it needs to go in now.
Urgent security fix? Yep same vp will approve it.
It's a no-brainer to stop your typical changes which aren't needed for a couple of weeks. By the way, it doesn't mean your whole pipeline needs to stop. You can still have stuff ready to go to prod or pre prod after the freeze
As a developer I don't see why I would rush out a change before the freeze when I could just wait until after. Maybe a stakeholder that really wants it would press for it to get out but personally I'd rather wait until after so I'm not fixing a bug during my holiday.
And stampeding changes in after the thaw also leads to downtime. So it depends on the org, but doing a freeze is still a reasonable policy. Downtime on December 15th is less expensive than on Black Friday or Cyber Monday for most retailers, so it's just a business decision at that point.
Blip? Microsoft 365 has had an ongoing incident since yesterday morning, European timezone. The reason I know is because I use their compliance tools to secure information in a rather large bankruptcy.
It's not very grey: prod becomes as if you told everyone but your ops team to go home and then sent your ops team on a cruise with pagers. If it's not important enough to merit interrupting their vacation, you don't do it.
Certs shouldn't still be done by hand at this point; if another Heartbleed comes out in the next 7 days then the risk can be examined, escalated, and the CISO can overrule the freeze. If it's a patch for remote root via Bluetooth drivers on a server that has no Bluetooth hardware, it's gonna wait.
you're right that there's a grey line, but crossing that line involves waking up several people and the on call person makes a judgement call. if it's not important enough to wake up several people over, then things stay frozen.
There's still a lot of situations where automatic certificate enrollment and renewal is not possible. TLS is not the only use of X.509 certificates, and even then, public facing HTTPS is not the only use of TLS.
Right, that's basically what I mean. There are a lot of automated changes happening in the background for services. I guess the whole thing I'm saying is that not every breakage is happening because of a code change.
Seems like rolling their own datastore turned out to be a bad bet.
I'm not super familiar with their constraints, but ScyllaDB can do eventual consistency and is generally quite flexible.
CouchDB is also an option for multi-leader replication.
Yep... can confirm my self-hosted Bitwarden there is completely FUBAR connection-wise even though it is in EA, so it looks like a worldwide outage. Lemme guess: some internal tooling error, a consensus split-brain, or did someone leak BGP routes again?
When I worked for a company who worked with big banks / financial institutions we used to run disaster recovery tests. Effectively a simulated outage where the company would try to run off their backup sites. They ran everything from those sites, it was impressive.
Once in a while we'd have a real outage that matched the test we ran as recently as the weekend before.
I was helping a bank switch over to the DR site(s) one day during such a real outage and I left my mic open when someone asked me what the commotion was on the upper floors of our HQ. I said "super happy fun surprise disaster recovery test for company X".
VP of BIG bank was on the line monitoring and laughed "I'm using that one on the executive call in 15, thanks!" Supposedly it got picked up at the bank internally after the VP made the joke and was an unofficial code for such an outage for a long time.
In most BIG banks, "Vice President" is almost an entry-level title. Easily have 1000s of them. For example, this article points out that Goldman Sachs had ~12K VPs out of more than 30K employees: https://web.archive.org/web/20150311012855/https://www.wsj.c...
Just like all Sales folks have heavily inflated titles, no customer wants to think they're dealing with a junior salesperson/loan officer when you're about to hand over your money.
It seems like every vendor sales team I work with is an "executive" or "director of sales" even though in reality they're just regular old salespeople.
VP at Goldman is equivalent to Senior SWE according to levels.fyi and their entry level is Analyst. I'm surprised by the compensation though. I would have thought people working at a place with gold in the name would be making more. Also apparently Morgan Stanley pays their VPs $67k/year.
In fairness to the fly.io folks (who are extremely serious hackers), they’re standing up a whole cloud provider and they’ve priced it attractively and they’re much customer-friendlier than most alternatives.
I don’t envy the difficulty of doing this, but I’m quite confident they’ll iron the bugs out.
I don’t always agree with @tptacek on social/political issues, and I don’t always agree with @xe on the direction of Nix, but these are legends on the technical side of things. And they’re trying to build an equitable relationship between the user of cloud services and the provider, not fund a private space program.
If I were in the market for cloud services I’d highly prize a long-term relationship based on mutual benefit and fair dealing over the short-term nuisance of being an early adopter.
I strongly suspect your investment in fly is going to pay off.
I want to believe, but in the meantime they’re taking down the product I’ve been working hard to build my own customers’ trust in. There is a limit to my idealism, and it’s well and truly in the past.
I suspect that making a cloud service provider run reliably requires tons of grunt work more than it requires technical heroism from a small number of highly talented individuals.
It feels like fly is trying to repeat a growth model that worked 20 years ago: throw interesting toys at engineers, then wait for engineers to recommend their services as they move on in their careers.
Part of that playbook is the old Move Fast & Break Things. That can still be the right call for young projects, but it has two big problems:
1) AWS successfully moved themselves into the position of "safe" hosting choice, so it's much rarer for engineers to have influence on something that's seen by money men as a humdrum, solved problem;
2) engineers are not the internal influencers they used to be, being laid off left and right the last few years, and without time for hobby projects.
(maybe also 3) it's much harder to build a useful free tier on a hosting service, which used to be a necessary marketing expense to reach those engineers).
So idk, I feel like the bar is just higher for hosting stability than it used to be, and novelty is a much harder sell, even here. Or rather: if you're going to brag about reinventing so many wheels, they need to not come off the cart as often.
It's interesting to see this discussion about fly.io's reliability on a day that (after over three days of downtime) Microsoft Azure finally decided the update of Azure Static Web Apps they deployed last Friday is indeed broken for customers using specific authentication settings...
...with not a single status update from Microsoft in sight.
I can’t even log in to my old account. Password reset is timing out, yet I still receive the password reset e-mail. The password reset link is broken, returning a 500 status code.
No dog in this fight, all props to the Fly.io team for having the gumption to do what they are doing, I genuinely hope they are successful...
> It's still 99.99+% SLA
But this is simply not accurate. 99.99% uptime allows at most about 52.6 minutes of downtime annually. They apparently blew well through that today. Looks like they burned through the equivalent of roughly 4 years of a 99.99% downtime budget this evening.
Four nines is so unforgiving that it's almost the case that if people are required to be in the loop at any point during an incident, you will blow the fourth nine for the whole year in a single incident.
Again, I know it's hard. I would not want to be in the space. That fourth nine is really difficult to earn.
In the meanwhile, <hugops> to the Fly team as they work to resolve this (and hopefully get some rest).
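For reference, a quick sketch of the yearly downtime budget each availability tier implies (my own back-of-envelope arithmetic, assuming a 365-day year and ignoring SLA fine print like maintenance windows):

    // Yearly downtime budget implied by each availability tier.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        year := 365 * 24 * time.Hour
        for _, availability := range []float64{0.999, 0.9999, 0.99999} {
            budget := time.Duration((1 - availability) * float64(year))
            fmt.Printf("%.3f%% -> %s of downtime per year\n",
                availability*100, budget.Round(time.Second))
        }
        // Prints roughly 8h45m36s, 52m34s, and 5m15s respectively.
    }
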
A 99.99+% SLA typically means you get some billing credits for downtime exceeding 99.99% availability. So technically you do get a "99.99+% SLA", but you don't get 99.99+% availability.
Other circles use "SLO" (where the O stands for objective).
Earlier in the year they had a catastrophic outage in LHR, we lost all our data. Yes this is also on me, I'm aware. Still, that's a hard nope from me, we migrated.
I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
Examples include basically any PaaS, IaaS, or any company that provides a mission-critical service to another company (B2B SaaS).
If you run a basic B2C CRUD app, maybe it’s not a big deal if your service goes down for 5 minutes. Unfortunately there are quite a few categories of companies where downtime simply isn’t tolerated by customers. (I operate a company with a “zero downtime” expectation from customers - it’s no joke, and I would never use any infrastructure abstraction layer other than AWS, GCP or Azure - preferably AWS us-east-1 because, well, if you know the joke…)
> I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
I refuse to believe that this category still exists, when I need to keep my county's alternate number for 911 in my address book, because CenturyLink had a 6 hour outage in 2014 and a two day outage in 2018. If the phone company can't manage to keep 911 running anymore, I'd be very surprised what does have zero downtime over a ten year period.
Personally, nine nines is too hard, so I shoot for eight eights.
My experience with very large scale B2B SaaS and PaaS has been that customers like to get money, if allowed by contract, by complaining about outages, but that overall, B2B SaaS is actually very forgiving.
Most B2B SaaS solutions have very long sales cycles and a high total cost to implement, so there is a lot of inertia to switching that “a few annoying hours of downtime a year” isn’t going to cover. Also, the metric that will drive churn isn’t actually zero downtime, it’s “nearest competitor’s downtime,” which is usually a very different number.
Every PaaS and IaaS I’ve ever used has had some amount of downtime, often considerably more than 5 minutes, and I’ve run production services on many of them. Plenty of random issues on major cloud providers as well. Certainly plenty of situations with dozens of Twitter posts happening but never any acknowledgement on the AWS status page. Nothing’s perfect.
Yea, when running services where 5 minutes of downtime results in lots of support tickets, you learn to accept that incidents will happen and to manage them well, rather than relying on them never occurring.
You realize all of those services you mention can't give you zero downtime; they would never even advertise that. They have quite good reliability, certainly, but on a long enough time horizon absolutely no one has zero downtime.
All of your examples have had multiple cases of going down, some for multiple days (2011 AWS was the first really long one I think) - or potentially worse, just deleting all customer data permanently and irretrievably.
Meaning empirically, downtime seems to be tolerated by their customers up to some point?
If your app cannot go down ever, then you cannot use a cloud provider either (because even AWS and Azure do fail sometimes; just search for “Azure down” on HN).
But the truth is everybody can afford some level of outage, simply because nobody has the budget to provision an infra that can never fail.
I’ve seen a team try and be truly “multi-cloud” but then ended up with this Frankenstein architecture where instead of being able to weather one cloud going down, their app would die if _any_ cloud had an issue. It was also surprisingly hard to convince people it doesn’t matter how many globally distributed clusters you have if all your data is in us-east.
They are fundamentally different. If Cloudflare provided a way to host docker containers with volumes though, that would be game over for so many paas platforms.
Too much custom stuff too quickly, there is a lot of efficiency in vertical integration and a fully cohesive stack but it takes a very long time to stabilize if you take that route.
We spent months trying to convince them of problems with their H2 implementation in their LB/proxy (they insisted nginx was at fault; spoiler - it wasn't) but had to leave (we also went to CF, which has its own problems). Eventually one of their employees wrote a long blog post about H2 that made it obvious they had finally found and fixed those problems, but months too late for my employer at the time.
It would have been infinitely better for us if they could have just fixed their stability problems, that abstraction suited us as did their LB/proxy impl and SNI pricing.
I wish them well, some really smart folk over there but I can imagine these reliability problems are probably really grinding down morale.
I tried Fly early. I was very excited about this service, but I've never had a worse hosting experience. So I left. Coincidentally I tried it again a few days ago. Surely things must be better. Nope. Auth issues in the CLI, frustrations deploying a Docker app to a Fly machine. I wouldn't recommend it to anyone.
I find their user experience to be exceptional. The only flake I’ve encountered is in uptime and general reliability of services I don’t interface with directly. They’ve done a stellar job on the stuff you actually deal with, but the glue holding your services together seems pretty wobbly.
It's an internal project based on Rust, not a product, so I don't think it matters too much what they name it. It's open source, which is great, but still not a product that they need to market.
I take your point but corrosion-resistant metals such as Aluminum, Titanium, Weathering Steel and Stainless Steel don’t avoid corrosion entirely but form a thin and extremely stable corrosion layer (under the right conditions).
If you mean specifically flyio.net and not just fly.io the company, I'm guessing they host their status page on a separate domain in case of DNS/registrar issues with their primary domain.
IIRC their value prop is that they let you rapidly spin up deployments/machines in regions that are closest to your users, the idea being that it will be lower latency and thus better UX.
Color me not surprised. My few interactions with people there just gave off the impression of them being in a bit over their heads. I don't know how well that translated to their actual ops, but it's difficult to not connect the two when they continue to have major outage after major outage for a product that 'should' be their customer's bedrock upon which they build everything else.
Don’t a bunch of Elixir/Erlang guys work at fly.io? It’s weird to me that that hallmark of reliability is associated with something that the public sees as unreliable. What gives with that association?
My fly.io-hosted website went down for 5 minutes (6 hours ago), but then came right back up, and has been up ever since. I use a free monitoring service that checks it every 5 minutes, so it's possible it missed another short bit of downtime. But fly.io has been pretty reliable overall for me!
Would be fascinated to see your data over a period of months.
Application uptime is flaky, but what was worse were fly deploys failing for no clear reason. Sometimes layers would just hang and eventually fail; I'd run the same command an hour or two later without any changes and it would work as expected.
I'd love to make a monitoring service to deploy a basic app (i.e. run the fly deploy command) every 5 minutes and see how often those deploys fail or hang. I'd guess ~5% inexplicably fail, which is frustrating unless you've got a lot of spare time.
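Something like this minimal sketch would do it. It assumes the fly CLI is already installed and authenticated, and that ./canary-app is a throwaway app directory; the 5-minute interval and 10-minute hang timeout are made-up numbers:

    // Deploy canary: run `fly deploy` on a throwaway app every 5 minutes
    // and log how long it took and whether it failed or hung.
    package main

    import (
        "context"
        "log"
        "os/exec"
        "time"
    )

    func main() {
        for {
            start := time.Now()
            // Treat a deploy that hangs for more than 10 minutes as a failure.
            ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
            cmd := exec.CommandContext(ctx, "fly", "deploy")
            cmd.Dir = "./canary-app" // placeholder: any small, disposable app
            err := cmd.Run()
            cancel()

            elapsed := time.Since(start).Round(time.Second)
            if err != nil {
                log.Printf("deploy FAILED after %s: %v", elapsed, err)
            } else {
                log.Printf("deploy ok in %s", elapsed)
            }
            time.Sleep(5 * time.Minute)
        }
    }
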
I used to run a service that created k8s clusters on GCP for our customers. We did want to check that that functionality kept working and had a prober test it periodically. It was actually broken a lot.
Always good to monitor your dependencies if you have the time. Then when someone complains about an issue in your service, you can check your monitoring to see if your upstream services are broken. If they are, at least you know where to start debugging.
My downtimes from fly are pretty rare but generally global when they happen; in this outage we had no downtime but couldn't deploy for a few hours. I have issues with deploying about once per quarter (and I deploy most days across a few apps).
If that’s the case I suspect fly is getting a lot more reliable. I stopped using them about a year ago so haven’t kept up on their reliability since. Glad to hear, it’s good for a competitive market to have many providers, and fly might have issues but hopefully has a bright future
They are definitely getting more reliable. I was an early user and moved off them to self hosted for quite a while because of the frequent downtime in early days.
Their support still leaves a lot to be desired even as someone that pays for it but the ease of running and deploying a distributed front end keeps bringing me back.
This may be of interest to you: https://news.ycombinator.com/item?id=42243282
I externally monitor fly.io and its docs here: https://flyio.onlineornot.com/
Looks like it lasted 16 minutes for them.
What free monitoring tool do you use?
Same for us, down for ~5 mins, back up and fine, error was 501
Someone said 16 minutes: so it's not even 5 nines service.
Do you mind if I ask what monitoring service that is?
https://github.com/louislam/uptime-kuma
Sure, it's UptimeRobot: https://uptimerobot.com/
Use https://pulsetic.com/
Is it your service?
fly.io publishes their post-mortems here: https://fly.io/infra-log/
The last post-mortem they wrote is very interesting and full of details. Basically, back in 2016 the heart or keystone component of fly.io's production infrastructure was Consul, a shared-state service locked down with mutual TLS: both the server certificate and the client certificate have to be authenticated. Since it was centralized it had scaling issues, so in 2020 fly.io wrote a replacement for it called Corrosion, quickly forgot about Consul, but didn't have the heart to kill it. Then in October 2024 Consul's root signing key expired, which brought down all connectivity, and since it uses bidirectional authentication, they couldn't bring it back online until they had deployed new TLS certificates to every machine in their fleet. Somehow they did this in half an hour, but the chain of dominoes had already been set in motion, revealing other weaknesses in their infrastructure for them to eliminate. There was another internal service whose own independent set of TLS keys had also expired long ago, but they didn't notice until they rebooted it as part of the Consul rekey, since doing so severed the TCP connections it had established back when its certificate was still valid. Plus, the whole time this was happening, their logging tools were DDoSing their network provider. It takes some real heroes to save the company and all their customers too when that many things explode at once.
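Side note on the cert-expiry part of that story: that particular failure mode is cheap to watch for. A rough sketch (the endpoint and 30-day threshold are placeholders; an internal mTLS service like that Consul setup would additionally need a client certificate in the tls.Config):

    // Warn when a TLS endpoint's certificate is approaching expiry.
    package main

    import (
        "crypto/tls"
        "fmt"
        "log"
        "time"
    )

    func main() {
        addr := "consul.example.internal:8501" // placeholder endpoint
        conn, err := tls.Dial("tcp", addr, &tls.Config{})
        if err != nil {
            log.Fatalf("connect %s: %v", addr, err)
        }
        defer conn.Close()

        leaf := conn.ConnectionState().PeerCertificates[0] // leaf certificate
        daysLeft := time.Until(leaf.NotAfter).Hours() / 24
        fmt.Printf("%s expires %s (%.0f days left)\n",
            addr, leaf.NotAfter.Format(time.RFC3339), daysLeft)
        if daysLeft < 30 {
            fmt.Println("WARNING: renew soon")
        }
    }
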
On that Consul outage, Fly Infra concludes, "The moral of the story is, no more half-measures."
On their careers page [1], the Fly team goes, "We're not big believers in tech debt."
As an outsider, reads like a cacophony of contradictions?
[1] https://fly.io/docs/hiring/working/#we-re-ruthless-about-doi...
No one actually lives up to their principles, but it's still important that we have them.
If you actually do live up to yours, then you need to adopt better principles.
Any principle in itself isn't beyond critique, agreed, but it's still the choice to pick this specific principle that tells the whole story. There are so many principles to pick from, and the tech-debt pick is followed up with "We have a 3-month “no refactoring” rule for new hires. This isn’t everyone’s preferred work style! We try to be up front about stuff.", which sounds a bit like an additional perform-or-else principle that just delays ownership of the stuff you're supposed to work with. In the best case that sounds like naive optimism and in the worst case it's gross negligence... neither one says "engineering" to me.
Two contradictory statements do not read like a 'cacophony' of anything to me xD I think you need a whole lot more than two to do that word justice.
“No more half-measures” and “We’re not big believers in tech debt” aren’t even contradictory statements, let alone a cacophony of them.
The comment section doing what it does best!
For brevity I chose to put up only the conclusion from a postmortem (of which I've read plenty by now) and another point from their otherwise comparatively shorter careers page, which imo capture the inherent tension between building out fast & building out right. This is not something I've started complaining about today or yesterday. I've used Fly in prod for 4 years and spilled much ink on this topic on their forums already. Even as I critique, I remain optimistic about Fly despite the seemingly endless list of failure modes that building such complex systems entails: https://community.fly.io/t/fly-down/10224/15
(personally speaking, I stay humble because I can hardly build a toy side-project right!)
"full measures" aren't the same thing as tech debt. Complexity isn't even the same thing as tech debt.
Fly.io seems to be a bit of a mixed bag:
https://news.ycombinator.com/item?id=41917436
https://news.ycombinator.com/item?id=35044516
https://news.ycombinator.com/item?id=34742946
https://news.ycombinator.com/item?id=34229751
If a cloud platform doesn't really provide reliability, I'd say it's probably not worth it. You'd be better off just renting a (virtual) server and saving the cloud tax.
For experiments and hobby projects the value proposition is amazing. Where else can you spin up an independent instance for $1.94 per month?*
*Note this is for an instance with only 256MB RAM (https://fly.io/docs/about/pricing/), but it's definitely possible to run non-trivial projects on that. Rust-based web servers like Rocket require only about 10MB RAM. Basic PHP servers should also fit from what I can find.
There are plenty of better deals as long as you don’t limit yourself to big clouds and clouds with startup-esque landing pages frequently posted to HN. LowEndTalk may be the most well-known place for finding such deals.
(Not saying the typical cheap VPS on LowEndTalk has comparable PaaS features. Only responding to parent’s use case of a single cheap instance.)
Best business model in the world, buy stuff in big bags, put it in smaller ones, sell at a multiple of the original price.
Fly is mostly (to my knowledge) reselling NetActuate and OVH servers; their main innovation is the developer experience on top, using Docker images on a microVM-based approach. Of course that's not all they do, but I think it’s their main differentiator.
Haven’t used that in a while but Scaleway offered ridiculously cheap dedicated ARM hardware close to these price points, not sure if they still do.
Nowhere? Because that's a ridiculously low amount of RAM to offer even in your cheapest offerings?
You can easily get 4 GB of RAM for $5 from the likes of Hetzner or Hostinger, so that's 16x more RAM for 2.5x the price. One relatively unknown provider I have used in the past offers 2 GB of RAM for €3.6/month (if paid monthly, €3 if annually), so 8x more RAM for 1.5-2x the price. I'm sure I could find something even cheaper, but I'm just looking at providers I have personally used.
BTW that dropdown seems to be sorted cheapest > most expensive. If you go to the bottom of the list the price for that same VPS doubles.
> Nowhere? Because that's a ridiculously low amount of RAM to offer even in your cheapest offerings?
There are definitely places that offer it... also 512MB.
I know because I've personally bought such plans, and they were $5-10/yr because I didn't need dedicated IPv4.
Maybe if you're limiting yourself to AWS-wrapper cloud companies. What good is a $2/mo cloud instance if it's down multiple times a month?
Just get a $5/mo VPS instead if you're really concerned about a few dollars a month.
I recommend LowEndTalk. What fly.io is doing is running colocated bare-metal servers and using Firecracker to overcommit (probably via memory ballooning and other on-demand disk compression).
If you are going to haggle over $2/month, then you are better off just connecting your Raspberry Pi with WireGuard or a Cloudflare Tunnel on a residential connection.
I used to use Racknerd for that sort of thing, and the costs were around there -- maybe $1.90/mo for a 512MB instance. It was easy to squeeze several hobby projects onto the machine.
One such microVM per month used to be within the free monthly allowance, is that not the case anymore?
I'm getting a 2GB RAM VPS at OVH for $1 for the first year.
Sounds like a Lambda function....
Oracle's free tier is one 4-core 24GB RAM VPS plus two dual-core AMD VPSes.
And actually, it's the resources that are free (CPU, memory, network) and you're allowed to split them up into multiple VMs if you want to.
One of my VMs had an uptime of more than 1050 days before the infrastructure rebooted it, so in terms of availability they've certainly surprised me.
The only downside I've come across with Oracle Free is that the 'best' regions are typically full. I ended up provisioning my free VMs in another region/country and it works fine.
I suppose another downside (if you want to view it this way) is they will delete idle unused free VMs after a certain time period. You have to add a credit card to your account to "upgrade" it and run free resources indefinitely. While you're not charged for anything, it makes me nervous forking over a CC number to Oracle.
The reliability is very, very bad. It was really insane that twice in the past few months the main dashboard was down while I was demoing something. Not to mention the deploy outages, and almost daily some random thing being unavailable or delayed.
I had to leave a few months ago after the price raises and after my boss saw one issue too many in the project I had with them.
They also deprecated and removed their SQLite backup service. Back to GCP and not worrying about so many outages now.
There are just so many anecdotes/nightmare stories from people using fly.io here, many more than the ones linked by GP.
Expect to see more of these "post-mortem apologies" from fly.io in the future, because this won't be the last.
Now just to worry about GCP getting shut down with a few days' notice. /s
But in all seriousness the gall to raise prices before actually fixing the reliability problems is pretty shocking. I understand it's a bit of a chicken-and-egg thing where you maybe are tight on resources but there's no scenario where it's acceptable to have a product with these kinds of problems and then raise prices on existing customers who are putting up with it.
No /s is needed. Relying on any Google product long term is crazy.
Google's b2b products are relatively stable (relative to their b2c free services). You generally get somewhere like a year of notice if they shut it down.
I don't really understand the value prop of fly.io. They seem to have an impressive engineering team despite the outages, but is edge compute really something that 99.9% of devs need? There are tons of large companies that operate out of a single AWS region and those services are used by millions around the globe. It just strikes me as something that enables premature optimization right out of the box.
It's basically the new Heroku with less lock-in, because it works with Docker.
You get edge computing, autoscaling, and load balancing without additional configuration.
Not as flexible as AWS, but also much easier to set up and maintain.
But the reliability issues suck now and then.
DigitalOcean has been doing this for years, and their value proposition is unmatched IMO
For $5 you get:
Latest gen CPUs and RAM
HTTPS
DDoS protection
Cloudflare CDN
Autoscale
Competent support
I'd say the best part is the predictable monthly prices
And while most people probably don't care, they are an established public company, so there is more chance they will exist in 10 years
are global r/w token permissions still a thing, or did the token scopes thing finally come out of beta?
also, my experience with support was not the same as yours. they were utterly useless for the most part.
For a personal web dev (or similar) project, like, I agree, they’ve got good value.
But having worked in a small biz where DO was what they built everything on — no. Bad idea. Spend more. Use AWS (Graviton EC2 instances) or Azure.
the $5 droplet is underpowered and can't run anything substantial. it's just the price to get you in the door.
You wouldn't be able to run anything substantial with that kind of budget, but Go and PocketBase are on record supporting 10k concurrent requests per second on a low-powered VPS.
It doesn't really need to run anything "substantial" though. Running some janky wordpress site with some scabbed-on ecommerce customizations is like 50% of the internet.
A 1 vCPU / 512MB instance is plenty for most basic cases. Maybe you need one additional machine to act as a background worker. I am sure there are some noisy neighbors, but to say it's underpowered is silly.
I'm calling it underpowered because the $5 one had trouble running my custom SSH daemon. SSH! The cryptography for that shouldn't bog down the server I'm renting from them. A bigger instance from them isn't having the same problems.
> Not as flexible as AWS
Today, Fly.io is more or less in the same market as Lightsail, not AWS. And when you compare it to Lightsail, it blows it away.
Did you count reliability into your assessment here? I'm reading about Fly.io outages multiple times a year, whereas Lightsail seems to be as stable as AWS EC2.
> And when you compare it to Lightsail, it blows it away.
This is a bit of a confusing sentence because there are so many pronouns. Do all of the "it"s refer to Fly.io?
> And when you compare [fly.io] to Lightsail, [fly.io] blows [Lightsail] away.
This is precisely it. The ease of deploy, https domain configuration, scaling.
Additionally, having machines that turn off when not in use is easy to configure, which I never managed on AWS.
> which I never managed on AWS
I haven't looked at it recently, but App Runner could do a few of Fly.io esque things (but slightly more expensive): https://aws.amazon.com/apprunner/
I have asked this multiple times but is anyone really using edge compute and getting value out of it? I am certain there are cases but I have not seen any of them written up before.
We have an embeddable audio player served globally with very low latency. This wouldn't be possible without edge compute/data.
Depends on what you mean by edge compute, but you probably are.
5G towers are a ton of compute on the edge to secure and protect the traffic passing through them.
Or if by edge you mean having stuff close to your consumers, every non trivial operation does that.
I am going to go out on a limb and say there is no real value prop to fly.io. I could be completely wrong, but it always feels like the modern MongoDB: everyone wants to use it, but I am not sure they are extracting value from it; instead it's a shiny toy that is fun to build on.
I have an SSR Astro project. Using Fly makes my project fast.
For dynamic data I use SWR.
I could use Cloudflare workers but it doesn’t play so nice with Astro.
I also have a “form submission service” where I receive a POST and send an email.
I need maximum uptime to avoid revenue loss.
It’s a Go service, so I deploy ~6 machines across the US to ensure I don’t drop any requests.
I haven’t had downtime in years.
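(For anyone curious what such a service amounts to, a rough sketch is below. Every name, address, and SMTP detail here is invented for illustration, not the commenter's actual code.)

    // Rough sketch of a small form-to-email service like the one described above.
    package main

    import (
        "fmt"
        "log"
        "net/http"
        "net/smtp"
    )

    func handleSubmit(w http.ResponseWriter, r *http.Request) {
        if r.Method != http.MethodPost {
            http.Error(w, "POST only", http.StatusMethodNotAllowed)
            return
        }
        body := fmt.Sprintf("Subject: form submission\r\n\r\nname=%s\r\nmessage=%s\r\n",
            r.FormValue("name"), r.FormValue("message"))
        // Placeholder SMTP relay, credentials, and addresses.
        err := smtp.SendMail("smtp.example.com:587",
            smtp.PlainAuth("", "forms@example.com", "password", "smtp.example.com"),
            "forms@example.com", []string{"owner@example.com"}, []byte(body))
        if err != nil {
            http.Error(w, "could not send", http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusNoContent)
    }

    func main() {
        http.HandleFunc("/submit", handleSubmit)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }
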
If half your customers are in New York and half in Sydney, it makes your app faster if you run it in both places.
There are a lot of things we do for our users that we don't strictly need (no one "needs" an SPA, etc.). But if it is easy to make your app faster for your users, why not?
And it is easier than AWS to deploy.
I would take edge compute if it's free and easy. That's fly.io's value prop.
In a world where much web browsing starts with a SYN, SYN-ACK, ACK handshake, it is nice if the server is close to you.
I typed fly launch, fly deploy and my node.js project was deployed. So I guess hobby projects?
I tried out Fly.io and deployed a little test app. I couldn't even access the app, because they put it onto a server that was under "emergency maintenance" and had been that way for twelve days.
fly.io has a very bad reputation for reliability. There doesn't seem to be any damage control beyond Hacker News, and even here the consensus seems to be "don't run anything mission-critical on fly.io or expect data redundancy".
In fact, you can get almost the same thing fly.io does by running Firecracker on your own bare-metal servers, and cheaper too.
I'm afraid the public sentiment towards fly.io has been tainted for good (I can't count how many times they've apologized by now).
Also:
https://news.ycombinator.com/item?id=36808296
Contrary to the title of the post, the Fly.io API remains inaccessible, meaning users still cannot access deploys, databases, etc.
For accurate updates, follow https://community.fly.io/t/fly-io-site-is-currently-inaccess...
Personal experience between Fly.io and Railway.com: Railway wins for me hands down. I have used both, and Railway's support is stellar too, in comparison. Fly.io has never responded to my query about data deletion to this day, despite my emailing their support address.
My Railway app has stayed online to date without any major downtime too. I recommend anyone looking for a decent replacement give them a try.
Fly builds on their own hardware. Is Railway doing the same? If not, that'd explain some of why Railway has relatively fewer outages (they're engineering fewer things).
I understand that end-users want reliability (and Fly gets a bad rep despite pretty significant investment on this front in the past 2 years), but such outages aren't exclusive to one provider & not the other. Building cloud infra is no one's definition of easy.
I've used Railway control panel maybe a total of 10 times in my life and half the time it was having weird issues. Control panel UI not loading or not working, actions failing, deploys randomly failing... I love the idea but in practice it's not something I'd want to use for anything serious.
How does it compare in terms of price?
This is probably the 5th or 6th major outage from Fly.io that I have personally seen. Pretty sure there were many others and some just went unnoticed. I recommended the service to a friend, and within two days he faced two outages.
Fly.io seriously needs to get it together. Why it hasn't happened yet is a mystery to me. They have a good product, but stability needs to be an absolute top priority for a hosting service. Everything else is secondary.
I get this, but I think if people can give GitHub a pass for shitting the bed every two weeks, maybe Fly should get a bit of goodwill here. I am not affiliated with Fly at all, but I do think people should temper their expectations when even a megacorp can't get it right.
I guess the secret is to be the incumbent with no suitable replacement. Then you can be complete garbage in terms of reliability and everyone will just hand wave away your poor ops story
The biggest difference is GitHub in your infrastructure is (nearly always) internal. Fly in your infrastructure is external. Users generally don't see when you have issues with GitHub, but they do generally see when you have issues with Fly.
That's the core difference.
Who's giving GitHub a pass on shitting the bed? They go down often enough that if you don't have an internal git server set up for your CI/CD to hit, that's on you.
My point is made by your very post - getting off GitHub onto alternatives is not seriously discussed as an option - instead it’s “well, why didn’t you prepare better to deal with your vendor’s poor ops story”
I wasn't going to bring up being on an internally hosted gitlab instead of github, but that would be the "not giving them a pass" part.
We left Fly about a year ago due to reliability issues. We now use DigitalOcean Apps and it's working like a charm. Zero downtime with DO.
You mean their App Platform right? How does the pricing compare to fly?
Yes, App Platform. Pricing is a little higher than Fly but way lower than AWS, and it is fully justified. Zero downtime in the last year.
With Fly, we had 3-4 downtime incidents in a span of 4 months in 2023.
Reliability is hard when your volume is (presumably) scaling geometrically.
Can't use the "reliability is hard" excuse when you are quite literally in the business of selling reliability.
It’s just not that big of a mystery. It’s not an excuse; it’s just true. Also, they’re not especially selling reliability as much as they’re selling small geo-distributed deployments.
Does anyone use them beyond the free tier? Same with Vercel for example.
Vercel has revenue of over $100M. So yes at least a few companies use them beyond the free tier.
Which company? GitHub? As far as I know fly.io does not have a free tier.
Suspiciously, Turso started having issues around the same time. Their CEO confirmed on Discord it's due to the Fly outage:
> Ok.I caught up with our oncall and This seems related to the Fly.io incident that is reported in our status page. Our login does call things in the Fly.io API
> we are already in touch with Fly and will see if we can speed this up
Not the first time Turso has gone down because of Fly issues. It must suck to have built a db service and have this downtime.
Apparently Turso are going to offer an AWS tier at some point.
Last month Turso released AWS-hosted databases to the public (still in Beta): https://turso.tech/blog/turso-aws-beta
Thanks!
No surprise. About a year ago, I looked at fly.io because of its low pricing and wondered where they were cutting corners to still make money. Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over if that server dies. Not sure if that part is still in the official documentation.
In practice, that means if a server goes down, they have to load the last snapshot of that instance from backup, push it onto a new server, update the network path, and pray that no more servers fail than there is spare capacity for. Otherwise you wait for a restore until the datacenter mounts a few more boxes in the rack.
That explains quite a bit of the randomness of those outage reports, i.e. "my app is down but this other one is fine" and "mine came back in 5 minutes but the other took forever".
As a business on a budget, I think almost anything else, e.g. a small Civo cluster, serves you better.
Fly.io can migrate vm+volume now: https://fly.io/docs/reference/machine-migration/ / https://archive.md/rAK0V
> a fly instance is hardwired to one physical server and thus cannot fail over
I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
You can run your workload (in this case a VM) on top of a scheduler, so if one node goes down the workload is just spun up on another available node.
You will have downtime, but it will be limited.
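To make that concrete, here's a toy sketch in Python (entirely hypothetical names, nothing to do with Fly's actual scheduler): a workload stays bound to one node at a time, but when a node's health check fails its workloads are re-placed onto whatever healthy node has spare capacity.

    # Toy rescheduler sketch: the binding between workload and node is not
    # permanent; it moves when the node dies (at the cost of a restart).
    nodes = {
        "node-a": {"healthy": True,  "capacity": 2, "workloads": ["vm-1"]},
        "node-b": {"healthy": True,  "capacity": 2, "workloads": ["vm-2"]},
        "node-c": {"healthy": False, "capacity": 2, "workloads": ["vm-3"]},  # just died
    }

    def reschedule(nodes):
        for node in nodes.values():
            if node["healthy"]:
                continue
            for vm in list(node["workloads"]):
                # Find any healthy node with room and move the workload there.
                target = next((n for n in nodes.values()
                               if n["healthy"] and len(n["workloads"]) < n["capacity"]), None)
                if target is None:
                    print(vm, "has no home: out of spare capacity")
                    continue
                node["workloads"].remove(vm)
                target["workloads"].append(vm)
                print(vm, "restarted on a surviving node")

    reschedule(nodes)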
> so if one goes down ... just spun up on another
On Fly, one can absolutely set this up. Multiple ways: https://fly.io/docs/apps/app-availability / https://archive.md/SJ32K
> I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?
They mean the storage part. If your VM's storage (its state) is on one server and that server dies, you have to restore from backup. If your VM's storage is on remote shared storage mounted to that server and the server dies, your VM can be restarted anywhere else that has access to that shared storage.
In AWS land it's the difference between instance store (local to a server) and EBS (remote, attached locally).
There's a tradeoff in that shared storage will be slightly slower due to having to traverse networking, and it's harder to manage properly; but the reliability gain is massive.
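A rough sketch of the EBS side of that with boto3 (placeholder IDs, error handling omitted): because the volume lives independently of the host, losing the server means reattaching the same volume to a replacement instance rather than restoring from backup.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    VOLUME_ID = "vol-0123456789abcdef0"           # placeholder
    REPLACEMENT_INSTANCE = "i-0123456789abcdef0"  # placeholder

    # Force-detach from the dead host and wait for the volume to free up...
    ec2.detach_volume(VolumeId=VOLUME_ID, Force=True)
    ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

    # ...then attach the very same volume (same data) to the replacement.
    ec2.attach_volume(VolumeId=VOLUME_ID,
                      InstanceId=REPLACEMENT_INSTANCE,
                      Device="/dev/sdf")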
> Ultimately, I found the answer in their tech docs where it was spelled out clearly that an fly instance is hardwired to one physical server and thus cannot fail over in case that server dies.
The majority of EC2 instance types did not have live migration until very recently. Some probably still don't (AWS doesn't really spell out how and when it's supposed to work). It is also not free - there's a noticeable brown-out when your VM gets migrated on GCP, for example.
Can you shed some more light on this "browning out" phenomenon?
Here's the GCP doc [1]. Other live migration products are similar.
Generally, you have worse performance while in the preparing to move state, an actual pause, then worse performance as the move finishes up. Depending on the networking setup, some inbound packets may be lost or delayed.
[1] https://cloud.google.com/compute/docs/instances/live-migrati...
If you want HA on Fly you need to deploy an app to multiple regions (multiple machines).
Fly might still go down completely if their proxy layer fails but it's much less common.
The proxy layer was the cause of yesterday's outage according to support.
Yes but the previous comment was about hardware failure.
The status page tells a story about a high-availability/clustering system failure, so I think in this case the problem is the complexity of the HA machinery hurting the system's availability, versus something like a simple VPS.
A recurring pattern I notice is that outages tend to occur during the week of major holidays in the US.
- MS 365/Teams/Exchange had a blip in the morning
- Fly.io with complete outage
- then a handful of sites and services impacted due to those outages
I usually advocate against “change freezes”, but I think a change freeze around major holidays makes sense. Give all teams a recharge/pause/whatever.
Don’t put too much pressure on the B-squads that were unfortunate enough to draw the short straw.
Bad code rarely causes outages at this scale. The culprit is always configuration changes.
Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?
You cannot plan your way out of operational challenges, regardless of what time of year it is.
> Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?
Reading this, I see two routine operational issues, one security issue and one hardware issue.
You can’t plan your way around security issues or hardware failures, but operational issues you both can and should plan around. Holiday schedules like this are fixed points in time, so there’s absolutely no reason why you can’t plan all routine work to be completed either a week in advance of, or a week after, the holiday period.
Certificates don’t need to be near the point of expiry to be renewed. Capacity doesn’t need to be at critical levels to be expanded. Ultimately, this is a risk management question (as a sibling has also commented). Is the organisation willing to take on increased risk in exchange for deferring operational expenses?
If the operational expense is inevitable (the certificate will need renewing), that seems like an easy answer when it comes to risk management over holidays.
If the operational expense is not inevitable (will we really need to expand capacity?), it then becomes a game of probabilities and financials - likelihood of expense being incurred, amount of expense incurred if done ahead of time, impact to business if something goes wrong during a holiday.
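A crude way to put numbers on that trade-off (made-up figures; the structure of the comparison is the point, not the values):

    # Expected cost of deferring routine work into a holiday freeze,
    # versus just doing it the week before. All numbers are invented.
    p_needed_during_freeze = 0.10    # chance the capacity bump is needed that week
    cost_do_it_early       = 500     # do the work ahead of the holiday
    cost_change_in_freeze  = 500     # same work, done during the freeze...
    p_incident             = 0.05    # ...with some chance it goes wrong on a skeleton crew
    cost_holiday_incident  = 50_000  # business impact of an outage on a peak day

    expected_cost_defer = p_needed_during_freeze * (
        cost_change_in_freeze + p_incident * cost_holiday_incident
    )
    print("do it early:", cost_do_it_early)            # 500
    print("defer:      ", round(expected_cost_defer))  # 300
    # Raise the odds that the work is unavoidable, or the impact of a holiday
    # outage, and the answer flips -- which is exactly the judgement call above.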
I think a good way of looking at it is risk. Is the change (whether it is code or configuration, etc.) worth the risk it brings on.
For example if it's a small feature then it probably makes sense to wait and keep things stable. But, if it's something that itself causes larger imminent danger like security patches / hard disk space constraints, then it's worth taking on the risk of change to mitigate the risk of not doing it.
At the end of the day no system is perfect and it ends up being judgement calls but I think viewing it as a risk tradeoff is helpful to understand.
This is a good observation. Do you have any resources I can read up on to make this safer?
I think you can't avoid the fact that these holiday weeks are different from regular weeks. If you "change freeze" then you also freeze out the little fixes and perf tuning that usually happens across these systems, because they're not "critical".
And then inevitably it turns out that there's a special marketing/product push, with special pricing logic that needs new code, and new UI widgets, causing a huge traffic/load surge, and it needs to go out NOW during the freeze, and this is revenue, so it is critical to the business leaders. Most of eng, and all of infra, didn't know about it, because the product team was cramming until the last minute, and it was kinda secret. So it turns out you can freeze the high-quality little fixes, but you can't really freeze the flaky brand-new features ...
It's just a struggle, and I still advise forgetting the freeze and trying to be reasonable about not rushing things (before, during, or after it).
Any big tech company with large peak periods disagrees with you. It's absolutely worth freezing non-critical changes.
Urgent business change needs to go through? Sure, be prepared to defend to a vp/exec why it needs to go in now.
Urgent security fix? Yep same vp will approve it.
It's a no-brainer to stop your typical changes which aren't needed for a couple of weeks. By the way, it doesn't mean your whole pipeline needs to stop; you can still have stuff queued up and ready to go to prod or pre-prod after the freeze.
Some shops conduct game days as the freeze approaches.
https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-2... / https://archive.md/uaJlR
Then you just get devs rushing out changes before the freeze…
As a developer I don't see why I would rush out a change before the freeze when I could just wait until after. Maybe a stakeholder that really wants it would press for it to get out but personally I'd rather wait until after so I'm not fixing a bug during my holiday.
Congrats on not working for the product team I work for
and stampeding changes in after the thaw, also leading to downtime. So it depends on the org, but a freeze is still a reasonable policy. Downtime on December 15th is less expensive than on Black Friday or Cyber Monday for most retailers, so at that point it's just a business decision.
Blip? 365 has an ongoing incident since yesterday morning, european timezone. The reason I know is because I use their compliance tools to secure information in a rather large bankruptcy.
What do "Freezes" mean? Like, do you stop renewing your certificates? Do you stop taking in security updates for your software?
Sure maybe "unnecessary" changes, but the line gets very gray very fast.
No unnecessary code deployments.
It's not very grey: prod becomes as if you told everyone but your ops team to go home and then sent your ops team on a cruise with pagers. If it's not important enough to merit interrupting their vacation, you don't do it.
Certs shouldn't still be done by hand at this point; if another Heartbleed comes out in the next 7 days, the risk can be examined and escalated, and the CISO can overrule the freeze. If it's a patch for remote root via Bluetooth drivers on a server that has no Bluetooth hardware, it's going to wait.
You're right that there's a grey line, but crossing that line involves waking up several people, and the on-call person makes a judgement call. If it's not important enough to wake up several people over, then things stay frozen.
There's still a lot of situations where automatic certificate enrollment and renewal is not possible. TLS is not the only use of X.509 certificates, and even then, public facing HTTPS is not the only use of TLS.
It needs to get better but it's not there yet.
Right, that's basically what I mean. There are a lot of automated changes happening in the background for services. I guess the whole thing I'm saying is that not every breakage is happening because of a code change.
The series of outages early in 2023 also had some Corrosion-related pain: https://community.fly.io/t/reliability-its-not-great/11253
Seems like rolling their own datastore turned out to be a bad bet.
I'm not super familiar with their constraints, but ScyllaDB can do eventual consistency and is generally quite flexible. CouchDB is also an option for multi-leader replication.
Oof, hugops to the team.
Yep... can confirm my self-hosted Bitwarden there is completely FUBAR connection-wise, even though it is in EA, so it should be a worldwide outage... lemme guess: some internal tooling error, a consensus split-brain, or did someone leak BGP routes again?
It was a consensus split-brain (“database replication failure”) it seems
Mine is in Asia and it's still accessible.
DNS. It's always DNS. /s
https://github.com/jart/cosmopolitan/blob/master/third_party...
Might be! Shameless plug of a DNS tool i wrote years ago for anyone this pushes to learn more about DNS
https://dug.unfrl.com/
fly.io just has the weirdest outages. It has issues so regularly that we don't even need to run mock outages to make sure our failovers work.
When I worked for a company that served big banks / financial institutions, we used to run disaster recovery tests: effectively a simulated outage where the company would try to run off their backup sites. They ran everything from those sites; it was impressive.
Once in a while we'd have a real outage that matched the test we ran as recently as the weekend before.
I was helping a bank switch over to the DR site(s) one day during such a real outage and I left my mic open when someone asked me what the commotion was on the upper floors of our HQ. I said "super happy fun surprise disaster recovery test for company X".
VP of BIG bank was on the line monitoring and laughed "I'm using that one on the executive call in 15, thanks!" Supposedly it got picked up at the bank internally after the VP made the joke and was an unofficial code for such an outage for a long time.
In most BIG banks, "Vice President" is almost an entry-level title; they easily have thousands of them. For example, this article points out that Goldman Sachs had ~12K VPs out of more than 30K employees: https://web.archive.org/web/20150311012855/https://www.wsj.c...
Just like all sales folks have heavily inflated titles - no customer wants to think they're dealing with a junior salesperson/loan officer when they're about to hand over their money.
It seems like every vendor sales team I work with is an "executive" or "director of sales" even though in reality they're just regular old salespeople.
VP at Goldman is equivalent to Senior SWE according to levels.fyi and their entry level is Analyst. I'm surprised by the compensation though. I would have thought people working at a place with gold in the name would be making more. Also apparently Morgan Stanley pays their VPs $67k/year.
Tech outstripped big finance corps tech a while ago.
Traders make loads, not the SWEs
That VP comp number seems quite low fwiw
Yes how much longer till we see Morgan Stanley VPs picketing outside demanding a living wage and humming The Internationale.
Thankfully your comment was positive!
In fairness to the fly.io folks (who are extremely serious hackers), they’re standing up a whole cloud provider and they’ve priced it attractively and they’re much customer-friendlier than most alternatives.
I don’t envy the difficulty of doing this, but I’m quite confident they’ll iron the bugs out.
The tech is impressive and the pricing is attractive which is why we use them. I just wish there was less black magic.
I don’t always agree with @tptacek on social/political issues, and I don’t always agree with @xe on the direction of Nix, but these are legends on the technical side of things. And they’re trying to build an equitable relationship between the user of cloud services and the provider, not fund a private space program.
If I were in the market for cloud services I’d highly prize a long-term relationship built on mutual benefit and fair dealing over the short-term nuisance of being an early adopter.
I strongly suspect your investment in fly is going to pay off.
Xe here. As a sibling comment said, I didn't survive layoffs. If you're looking for someone like me, I'm on the market!
Hiring people is above my pay grade, but I can vouch to my lords and masters and anyone else who cares what I think that a legend is up for grabs.
b7r6@b7r6.net
I'd email but I'm about to pass out in bed. Please see https://xeiaso.net/contact/ in case I don't get back to you in the morning.
I want to believe, but in the meantime they’re killing the product I’ve been working hard to build trust in with my own customers. There is a limit to my idealism, and it’s well and truly in the past.
FWIW Xe was let go from Fly earlier this year during a round of layoffs.
Unfortunate. Xe rocks.
I suspect that making a cloud service provider run reliably requires tons of grunt work more than it requires technical heroism from a small number of highly talented individuals.
It is not reflected in their status page, but fly.io itself is not even loading.
https://fly.io/ loading for me
Confirmation ;)
It feels like fly is trying to repeat a growth model that worked 20 years ago: throw interesting toys at engineers, then wait for engineers to recommend their services as they move on in their careers.
Part of that playbook is the old Move Fast & Break Things. That can still be the right call for young projects, but it has two big problems:
1) AWS successfully moved themselves into the position of "safe" hosting choice, so it's much rarer for engineers to have influence on something that's seen by money men as a humdrum, solved problem;
2) engineers are not the internal influencers they used to be, being laid off left and right the last few years, and without time for hobby projects.
(maybe also 3) it's much harder to build a useful free tier on a hosting service, which used to be a necessary marketing expense to reach those engineers).
So idk, I feel like the bar is just higher for hosting stability than it used to be, and novelty is a much harder sell, even here. Or rather: if you're going to brag about reinventing so many wheels, they need to not come off the cart as often.
It's interesting to see this discussion about fly.io's reliability on a day that (after over three days of downtime) Microsoft Azure finally decided the update of Azure Static Web Apps they deployed last Friday is indeed broken for customers using specific authentication settings...
...with not a single status update from Microsoft in sight.
I can’t even log in to my old account. Password reset is timing out, yet I still receive the password reset e-mail. The password reset link is broken, returning a 500 status code.
I'm grateful to HN for keeping me well aware of Fly's issues. I'll never use them.
It's still 99.99+% SLA? Would you really pay 100% more for <0.01% more uptime?
No dog in this fight, all props to the Fly.io team for having the gumption to do what they are doing, I genuinely hope they are successful...
> It's still 99.99+% SLA
But this is simply not accurate. 99.99% uptime allows less than roughly 52 minutes 34 seconds of downtime annually. They apparently blew well through that today - it looks like they burned through the equivalent of about four years of four-nines downtime budget this evening.
Four nines is so unforgiving that it's almost the case that if people are required to be in the loop at any point during an incident, you will blow the fourth nine for the whole year in a single incident.
Again, I know it's hard. I would not want to be in the space. That fourth nine is really difficult to earn.
In the meanwhile, <hugops> to the Fly team as they work to resolve this (and hopefully get some rest).
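For reference, the four-nines arithmetic above as a back-of-envelope calculation (assuming a plain 365-day year):

    # Downtime budget for a given availability target over one 365-day year.
    YEAR_SECONDS = 365 * 24 * 3600

    for availability in (0.999, 0.9999, 0.99999):
        budget_min = (1 - availability) * YEAR_SECONDS / 60
        print(f"{availability:.3%} -> {budget_min:6.1f} minutes of downtime per year")

    # 99.900% ->  525.6 minutes of downtime per year (~8.8 hours)
    # 99.990% ->   52.6 minutes of downtime per year
    # 99.999% ->    5.3 minutes of downtime per year
    # A single multi-hour incident blows several years' worth of the 99.99% budget.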
A 99.99+% SLA typically means you get some billing credits when downtime pushes availability below 99.99%. So technically you do get a "99.99+% SLA", but you don't get 99.99+% availability.
Other circles use "SLO" (where the O stands for objective).
(Anyone know what the details in fly.io SLA are?)
You are correct in the legal/technical sense!
Technically, anyone could offer five- or six-nines and just depend on most customers not to claim the credits :-D
Actually hitting/exceeding four nines is still tough.
This is not my experience at all, as a former paying customer.
You say that like it's their only issue.
Earlier in the year they had a catastrophic outage in LHR and we lost all our data. Yes, this is also on me, I'm aware. Still, that's a hard nope from me; we migrated.
I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
Examples include basically any PaaS, IaaS, or any company that provides a mission-critical service to another company (B2B SaaS).
If you run a basic B2C CRUD app, maybe it’s not a big deal if your service goes down for 5 minutes. Unfortunately there are quite a few categories of companies where downtime simply isn’t tolerated by customers. (I operate a company with a “zero downtime” expectation from customers - it’s no joke, and I would never use any infrastructure abstraction layer other than AWS, GCP or Azure - preferably AWS us-east-1 because, well, if you know the joke…)
> I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”
I refuse to believe that this category still exists, when I need to keep my county's alternate number for 911 in my address book because CenturyLink had a 6-hour outage in 2014 and a two-day outage in 2018. If the phone company can't manage to keep 911 running anymore, I'd be very surprised by anything that has had zero downtime over a ten-year period.
Personally, nine nines is too hard, so I shoot for eight eights.
My experience with very large scale B2B SaaS and PaaS has been that customers like to get money, if allowed by contract, by complaining about outages, but that overall, B2B SaaS is actually very forgiving.
Most B2B SaaS solutions have very long sales cycles and a high total cost to implement, so there is a lot of inertia to switching that “a few annoying hours of downtime a year” isn’t going to cover. Also, the metric that will drive churn isn’t actually zero downtime, it’s “nearest competitor’s downtime,” which is usually a very different number.
Every PaaS and IaaS I’ve ever used has had some amount of downtime, often considerably more than 5 minutes, and I’ve run production services on many of them. Plenty of random issues on major cloud providers as well. Certainly plenty of situations with dozens of Twitter posts happening but never any acknowledgement on the AWS status page. Nothing’s perfect.
Yea, when running services where 5 minutes of downtime results in lots of support tickets, you learn to accept that incidents will happen and to manage them, rather than relying on them never occurring.
You realize all of those services you mention can't give you zero downtime; they would never even advertise that. They have quite good reliability, certainly, but over a long enough time horizon absolutely no one has zero downtime.
All of your examples have had multiple cases of going down, some for multiple days (the 2011 AWS outage was the first really long one, I think) - or, potentially worse, deleting all customer data permanently and irretrievably.
Meaning empirically, downtime seems to be tolerated by their customers up to some point?
If your app cannot ever go down, then you cannot use a cloud provider either (because even AWS and Azure fail sometimes; just search for “Azure down” on HN).
But the truth is everybody can afford some level of outage, simply because nobody has the budget to provision an infra that can never fail.
I’ve seen a team try and be truly “multi-cloud” but then ended up with this Frankenstein architecture where instead of being able to weather one cloud going down, their app would die if _any_ cloud had an issue. It was also surprisingly hard to convince people it doesn’t matter how many globally distributed clusters you have if all your data is in us-east.
We switched from Fly to CF workers a while ago, and never looked back
They are fundamentally different. If Cloudflare provided a way to host Docker containers with volumes, though, that would be game over for a lot of PaaS platforms.
Can't wait: https://blog.cloudflare.com/container-platform-preview/
wow, this will be huge
Only if they can sort out their atrocity of a documentation website.
I switched from apples to oranges and never looked back.
Our stuff on CF Workers has been working non stop for years now.
About 6 months ago we migrated our most critical stuff from Fly to CF and boy every time Fly has issues I'm so glad we did.
Too much custom stuff too quickly. There is a lot of efficiency in vertical integration and a fully cohesive stack, but it takes a very long time to stabilize if you go that route.
We spent months trying to convince them of problems with their H2 implementation in their LB/proxy (they insisted nginx was at fault; spoiler - it wasn't) but had to leave (we also went to CF, which has its own problems). Eventually one of their employees wrote a long blog post about H2 that made it obvious they had finally found and fixed those problems, but months too late for my employer at the time.
It would have been infinitely better for us if they could have just fixed their stability problems; that abstraction suited us, as did their LB/proxy implementation and SNI pricing.
I wish them well - there are some really smart folks over there - but I can imagine these reliability problems are really grinding down morale.
How are they equivalent?
Congrats on not developing a playbook for the time you have to 'look back'.
Providers will fail. Good contingencies won't.
...hears faint sound...I SAID GOOD, QUIET YOU!
HUGOPS
Everything is going to be 200 OK!
I tried Fly early. I was very excited about this service, but I've never had a worse hosting experience. So I left. Coincidentally I tried it again a few days ago. Surely things must be better. Nope. Auth issues in the CLI, frustrations deploying a Docker app to a Fly machine. I wouldn't recommend it to anyone.
I find their user experience to be exceptional. The only flake I’ve encountered is in uptime and general reliability of services I don’t interface with directly. They’ve done a stellar job on the stuff you actually deal with, but the glue holding your services together seems pretty wobbly.
My apps on Fly have not gone down this time.
Kinda funny that they've named their global state store "Corrosion"... not really a word I'd associate with stability and persistence.
It's an internal project written in Rust, not a product, so I don't think it matters too much what they name it. It's open source, which is great, but it's still not a product they need to market.
And to be fair, it’s a bit of a cute meme to name rust projects things that relate to it. Oxide, etc
I stored important data in mnesia, so who would I be to talk. :p
amnesia means forget, so mnesia means remember, I would guess?
https://community.fly.io/t/reliability-its-not-great/11253
https://github.com/superfly/corrosion
I take your point, but corrosion-resistant metals such as aluminum, titanium, weathering steel and stainless steel don’t avoid corrosion entirely; they form a thin and extremely stable corrosion layer (under the right conditions).
Gold and platinum really are corrosion resistant though (but have questionable mechanical properties…)
What exactly does flyio.net do?
If you mean specifically flyio.net and not just fly.io the company, I'm guessing they host their status page on a separate domain in case of DNS/registrar issues with their primary domain.
IIRC their value prop is that they let you rapidly spin up deployments/machines in regions that are closest to your users, the idea being that it will be lower latency and thus better UX.
It’s basically what Heroku used to be but with CDN-like presence.
Hosting service that has a lot of interesting distributed features.
WEB 2.0. SEE. TOLD YA! THEY SHOULDA UPGRADED TO THAT NEWFANGLED 3.0! ;)
Color me not surprised. My few interactions with people there gave off the impression of them being a bit in over their heads. I don't know how well that translated to their actual ops, but it's difficult not to connect the two when they continue to have major outage after major outage for a product that should be their customers' bedrock, upon which they build everything else.
Don’t a bunch of Elixir/Erlang guys work at fly.io? It’s weird to me that that hallmark of reliability is associated with something that the public sees as unreliable. What gives with that association?
I was considering these guys the other day until I saw their pricing page: https://fly.io/pricing/
(There's not a single price on there, why even create the page?)
There's a link to what appears to be the actual pricing page https://fly.io/docs/about/pricing/
There's also a link to the pricing calculator https://fly.io/calculator
Is that calculator hourly or monthly?
Literally says "Monthly Costs" in the green panel on the right that calculates the total.
It's right there: "Monthly Cost"
OMG, that's hilarious. I use them, and I know what my prices are, but I'd never noticed that the page called pricing doesn't actually have any.
The prices are just one click deeper. Hardly a nefarious dark pattern.