Fly.io outage – resolved

(status.flyio.net)

230 points | by punkpeye 19 hours ago

213 comments

  • benhoyt 18 hours ago

    My fly.io-hosted website went down for 5 minutes (6 hours ago), but then came right back up, and has been up ever since. I use a free monitoring service that checks it every 5 minutes, so it's possible it missed another short bit of downtime. But fly.io has been pretty reliable overall for me!

    • nomilk 17 hours ago

      Would be fascinated to see your data over a period of months.

      Application uptime is flaky, but what was worse was fly deploys failing for no clear reason. Sometimes layers would just hang and eventually fail; I'd run the same command an hour or two later without any changes and it would just work as expected.

      I'd love to make a monitoring service to deploy a basic app (i.e. run the fly deploy command) every 5 minutes and see how often those deploys fail or hang. I'd guess ~5% inexplicably fail, which is frustrating unless you've got a lot of spare time.
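
A deploy prober along these lines could be sketched as follows. This is only a minimal illustration, not a tested monitoring service: `probe_once` and `failure_rate` are hypothetical helper names, and the timeout is an assumption. The command to probe would be whatever you want to watch (the thread suggests `fly deploy`).

```python
import subprocess
import sys
from collections import Counter


def probe_once(command, timeout_s):
    """Run one probe command and classify the outcome as ok, fail, or hang."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hang"
    except OSError:
        return "fail"  # the command could not be started at all
    return "ok" if result.returncode == 0 else "fail"


def failure_rate(outcomes):
    """Fraction of probes that did not succeed (failed or hung)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return 0.0 if total == 0 else (total - counts["ok"]) / total


# Hypothetical usage: call probe_once(["fly", "deploy"], timeout_s=600)
# from cron every 5 minutes, append each outcome to a log, and report
# failure_rate over the logged outcomes.
```

Separating "hang" from "fail" matters here because the complaint is about deploys that stall rather than exit with an error.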

      • jrockway 8 hours ago

        I used to run a service that created k8s clusters on GCP for our customers. We did want to check that that functionality kept working and had a prober test it periodically. It was actually broken a lot.

        Always good to monitor your dependencies if you have the time. Then when someone complains about an issue in your service, you can check your monitoring to see if your upstream services are broken. If they are, at least you know where to start debugging.

      • sanswork 15 hours ago

        My downtimes from fly are pretty rare but generally global when they happen. In this outage we had no downtime but couldn't deploy for a few hours. I have issues with deploying about once per quarter (I deploy most days across a few apps).

        • nomilk 15 hours ago

          If that’s the case I suspect fly is getting a lot more reliable. I stopped using them about a year ago so haven’t kept up on their reliability since. Glad to hear, it’s good for a competitive market to have many providers, and fly might have issues but hopefully has a bright future

          • sanswork 14 hours ago

            They are definitely getting more reliable. I was an early user and moved off them to self hosted for quite a while because of the frequent downtime in early days.

            Their support still leaves a lot to be desired even as someone that pays for it but the ease of running and deploying a distributed front end keeps bringing me back.

      • rozenmd 13 hours ago

        This may be of interest to you: https://news.ycombinator.com/item?id=42243282

    • rozenmd 13 hours ago

      I externally monitor fly.io and its docs here: https://flyio.onlineornot.com/

      Looks like it lasted 16 minutes for them.

    • dprotaso 3 hours ago

      What free monitoring tool do you use?

    • davidgl 11 hours ago

      Same for us, down for ~5 mins, back up and fine, error was 501

      • TacticalCoder 5 hours ago

        Someone said 16 minutes: so it's not even 5 nines service.
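
The "nines" arithmetic behind that remark can be checked with a short sketch. It assumes a 365-day year and treats the 16 minutes reported above as the only downtime in that year, which is an assumption rather than a measured figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(nines):
    """Allowed downtime per year for an availability with `nines` nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def availability_after(downtime_minutes):
    """Yearly availability if total downtime equals `downtime_minutes`."""
    return 1 - downtime_minutes / MINUTES_PER_YEAR


for n in (3, 4, 5):
    print(f"{n} nines allows about {downtime_budget_minutes(n):.2f} minutes per year")
print(f"16 minutes down gives {availability_after(16):.5%} availability")
```

Five nines leaves roughly 5.26 minutes per year, so a single 16-minute outage exceeds that budget but still fits within four nines (about 52.6 minutes).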

    • beezlewax 14 hours ago

      Do you mind if I ask what monitoring service that is?

  • jart 16 hours ago

    fly.io publishes their post-mortems here: https://fly.io/infra-log/

    The last post-mortem they wrote is very interesting and full of details. Basically back in 2016 the heart or keystone component of fly.io production infrastructure was called consul, which is a highly secure TLS server that tracks shared state and requires that both the server certificate and the client certificate be authenticated. Since it was centralized, it had scaling issues, so fly.io wrote a replacement for it in 2020 called corrosion, and quickly forgot about consul, but didn't have the heart to kill it. Then in October 2024 consul's root key signing key expired, which brought down all connectivity, and since it uses bidirectional authentication, they couldn't bring it back online until they deployed new SSL certificates to every machine in their fleet. Somehow they did this in half an hour, but the chain of dominoes had already been set in motion to reveal other weaknesses in their infrastructure that they could eliminate. There was this other internal service whose own independent set of TLS keys had also expired long ago, but they didn't notice until they tried rebooting it as part of the consul rekey, since doing so severed the TCP connections it had established way back when its certificate was valid. Plus the whole time this was happening, their logging tools were DDoSing their network provider. It took some real heroes to save the company and all their customers too when that many things explode at once.

    • ignoramous 15 hours ago

      On that Consul outage, Fly Infra concludes, "The moral of the story is, no more half-measures."

      On their careers page [1], the Fly team goes, "We're not big believers in tech debt."

      As an outsider, reads like a cacophony of contradictions?

      [1] https://fly.io/docs/hiring/working/#we-re-ruthless-about-doi...

      • jart 15 hours ago

        No one actually lives up to their principles, but it's still important that we have them.

        If you actually do live up to yours, then you need to adopt better principles.

        • whilenot-dev 13 hours ago

          Any principle in itself isn't without critique, agreed, but it's still the choice to pick this specific principle that tells the whole story. There are so many principles to pick from, and the tech debt pick follows up with a "We have a 3-month “no refactoring” rule for new hires. This isn’t everyone’s preferred work style! We try to be up front about stuff.", which sounds a bit like an additional "perform or else..." principle that just delays ownership of the stuff you're supposed to work with. In the best case that sounds like naive optimism and in the worst case that's gross negligence... neither one speaks "engineering" to me.

      • Aeolun 13 hours ago

        Two contradictory statements do not read like a 'cacophony' of anything to me xD I think you need a whole lot more than two to do that word justice.

        • JimDabell 7 hours ago

          “No more half-measures” and “We’re not big believers in tech debt” aren’t even contradictory statements, let alone a cacophony of them.

        • mattgreenrocks 6 hours ago

          The comment section doing what it does best!

          • ignoramous 5 hours ago

            For brevity I chose to put up only the conclusion from a postmortem (of which I've read plenty by now) and another point from their otherwise comparatively shorter careers page, which imo capture the inherent tension between building out fast & building out right. This is not something I've started complaining about today or yesterday. I've used Fly in prod for 4 years and spilled much ink on this topic on their forums already. Even if I critique, I remain optimistic about Fly despite the seemingly endless list of failure modes building such complex systems entail: https://community.fly.io/t/fly-down/10224/15

            (personally speaking, I'm humble enough because I can hardly build a toy side-project right!)

      • bdcravens 6 hours ago

        "full measures" aren't the same thing as tech debt. Complexity isn't even the same thing as tech debt.

  • cryptos 12 hours ago

    Fly.io seems to be a bit of a mixed bag:

    https://news.ycombinator.com/item?id=41917436

    https://news.ycombinator.com/item?id=35044516

    https://news.ycombinator.com/item?id=34742946

    https://news.ycombinator.com/item?id=34229751

    If a cloud platform doesn't really provide reliability, I'd say it's probably not worth it. You could better just rent a (virtual) server and save the cloud tax.

    • huijzer 11 hours ago

      For experiments and hobby projects the value proposition is amazing. Where else can you spin up an independent instance for $1.94 per month?*

      *Note this is for an instance with only 256MB RAM (https://fly.io/docs/about/pricing/), but it's definitely possible to run non-trivial projects on that. Rust-based web servers like Rocket require only about 10MB RAM. Basic PHP servers should also fit from what I can find.

      • oefrha 10 hours ago

        There are plenty of better deals as long as you don’t limit yourself to big clouds and clouds with startup-esque landing pages frequently posted to HN. LowEndTalk may be the most well-known place for finding such deals.

        (Not saying the typical cheap VPS on LowEndTalk has comparable PaaS features. Only responding to parent’s use case of a single cheap instance.)

      • throwaway63467 8 hours ago

        Best business model in the world, buy stuff in big bags, put it in smaller ones, sell at a multiple of the original price.

        Fly is mostly (to my knowledge) reselling Netactuate and OVH servers, their main innovation is the developer experience on top, using Docker on a MicroVM based approach. Of course not only that, but I think it’s their main differentiator.

        Haven’t used that in a while but Scaleway offered ridiculously cheap dedicated ARM hardware close to these price points, not sure if they still do.

      • input_sh 10 hours ago

        Nowhere? Because that's a ridiculously low amount of RAM to offer even in your cheapest offerings?

        You can easily get 4 GB of RAM for $5 from the likes of Hetzner or Hostinger, so that's 16x more RAM for 2.5x the price. One relatively unknown provider I have used in the past offers 2 GB of RAM for €3.6/month (if paid monthly, €3 if annually), so 8x more RAM for 1.5-2x the price. I'm sure I could find something even cheaper, but I'm just looking at providers I have personally used.

        BTW that dropdown seems to be sorted cheapest > most expensive. If you go to the bottom of the list the price for that same VPS doubles.
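
The ratios quoted above can be reproduced with a few lines. The figures are the ones given in the thread ($1.94 for 256 MB, $5 for 4 GB); the dictionary layout and the `mb_per_dollar` helper are only illustrative.

```python
# Figures quoted in the thread: Fly's smallest instance vs. a typical $5 VPS.
FLY = {"ram_mb": 256, "usd_per_month": 1.94}
VPS = {"ram_mb": 4096, "usd_per_month": 5.00}


def mb_per_dollar(plan):
    """RAM in MB bought by one dollar per month."""
    return plan["ram_mb"] / plan["usd_per_month"]


ram_ratio = VPS["ram_mb"] / FLY["ram_mb"]
price_ratio = VPS["usd_per_month"] / FLY["usd_per_month"]
value_ratio = mb_per_dollar(VPS) / mb_per_dollar(FLY)

print(f"RAM: {ram_ratio:.0f}x, price: {price_ratio:.2f}x, RAM per dollar: {value_ratio:.1f}x")
```

By these numbers the $5 VPS gives 16x the RAM for about 2.58x the price, or roughly 6.2x more RAM per dollar.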

        • KomoD 9 hours ago

          > Nowhere? Because that's a ridiculously low amount of RAM to offer even in your cheapest offerings?

          There's definitely places that offer it... also 512m

          I know because I've personally bought such plans and that was $5-10/yr because I didn't need dedicated ipv4.

      • pc86 6 hours ago

        Maybe if you're limiting yourself to AWS-wrapper cloud companies. What good is a $2/mo cloud instance if it's down multiple times a month?

        Just get a $5/mo VPS instead if you're really concerned about a few dollars a month.

      • pajeetz 2 hours ago

        I recommend LowEndTalk. What fly.io is doing is running colocated bare-metal servers and using Firecracker to overcommit (probably via memory ballooning and on-demand disk compression).

        if you are going to haggle over $2/month then you are better off just connecting your raspberry pi with wireguard/cloudflare tunnel on a residential connection

      • hansvm 6 hours ago

        I used to use Racknerd for that sort of thing, and the costs were around there -- maybe $1.90/mo for a 512MB instance. It was easy to squeeze several hobby projects onto the machine.

      • hobo_mark 10 hours ago

        One such microVM per month used to be within the free monthly allowance, is that not the case anymore?

      • kelvinjps10 9 hours ago

        I'm getting 1$ for a 2gb ram vps in ovh for the first year

      • belter 10 hours ago

        Sounds like a Lambda function....

      • TiredOfLife 9 hours ago

        Oracle free is one 4 core 24gb ram vps + 2 dualcore amd vps.

        • treesknees 6 hours ago

          And actually, it's the resources that are free (CPU, memory, network) and you're allowed to split them up into multiple VMs if you want to.

          One of my VMs had an uptime of more than 1050 days before the infrastructure rebooted it, so in terms of availability they've certainly surprised me.

          The only downside I've come across with Oracle Free is that the 'best' regions are typically full. I ended up provisioning my free VMs in another region/country and it works fine.

          I suppose another downside (if you want to view it this way) is they will delete idle unused free VMs after a certain time period. You have to add a credit card to your account to "upgrade" your account and run free resource indefinitely. While you're not charged for anything, it makes me nervous forking over a CC number to Oracle.

    • zackify 8 hours ago

      The reliability is very very bad. It was really insane that 2 times in the past few months the main dashboard was down as I’m demoing something. Not to mention the deploy outages and almost daily some random thing was unavailable or delayed.

      I had to leave a few months ago after the price raises and how many times my boss saw some issue in the project I had with them.

      They also deprecated and removed their sqlite backup service. Back to GCP and not worrying about so many outages now.

      • pajeetz 2 hours ago

        There are just so many anecdotes/nightmare stories from people using fly.io here, many more than the ones linked by GP.

        expect to see more of these "post-mortem apologies" from fly.io in the future because it won't be the last

      • pc86 6 hours ago

        Now just to worry about GCP getting shut down with a few days' notice. /s

        But in all seriousness the gall to raise prices before actually fixing the reliability problems is pretty shocking. I understand it's a bit of a chicken-and-egg thing where you maybe are tight on resources but there's no scenario where it's acceptable to have a product with these kinds of problems and then raise prices on existing customers who are putting up with it.

        • encom 6 hours ago

          No /s is needed. Relying on any Google product long term is crazy.

          • sofixa 4 hours ago

            Google's b2b products are relatively stable (relative to their b2c free services). You generally get somewhere like a year of notice if they shut it down.

    • qeternity 11 hours ago

      I don't really understand the value prop of fly.io. They seem to have an impressive engineering team despite the outages, but is edge compute really something that 99.9% of devs need? There are tons of large companies that operate out of a single AWS region and those services are used by millions around the globe. It just strikes me as something that enables premature optimization right out of the box.

      • k__ 11 hours ago

        It's basically the new Heroku with less lock-in, because it works with Docker.

        You get edge computing, autoscaling, and load balancing without additional configuration.

        Not as flexible as AWS, but also much easier to setup and maintain.

        But the reliability issues suck now and then.

        • gurgunday 8 hours ago

          DigitalOcean has been doing this for years, and their value proposition is unmatched IMO

          For $5 you get:

          Latest gen CPUs and RAM

          HTTPS

          DDoS protection

          Cloudflare CDN

          Autoscale

          Competent support

          I'd say the best part is the predictable monthly prices

          And while most people probably don't care, they are an established public company, so there is more chance they will exist in 10 years

          • dijksterhuis 8 hours ago

            are global r/w token permissions still a thing, or did the token scopes thing finally come out of beta?

            also, my experience with support was not the same as yours. they were utterly useless for the most part.

            for a personal web dev (or similar) project, like, i agree, they’ve got good value.

            but having worked in a small biz where DO was what they built everything on — no. bad idea. spend more. use aws (graviton ec2 instances)/azure.

          • fragmede 7 hours ago

            the $5 droplet is underpowered and can't run anything substantial. it's just the price to get you in the door.

            • pajeetz 2 hours ago

              you wouldn't be able to run anything substantial with that kind of budget

              but Go and PocketBase are on record as supporting 10k concurrent requests per second on a low-powered VPS

            • yabones 7 hours ago

              It doesn't really need to run anything "substantial" though. Running some janky wordpress site with some scabbed-on ecommerce customizations is like 50% of the internet.

            • infecto 7 hours ago

              A 1 vCPU, 512 MB instance is plenty for most base cases. Maybe you need one additional machine to act as a background worker. I am sure there are some noisy neighbors but to say it's underpowered is silly.

              • fragmede 6 hours ago

                I'm calling it underpowered because the $5 one had trouble running my custom ssh daemon. ssh! the cryptography for that shouldn't chug down the server I'm renting from them. a bigger instance from them isn't having the same problems.

        • ignoramous 8 hours ago

          > Not as flexible as AWS

          Today, Fly.io is more or less in the same market as Lightsail, not AWS. And when you compare it to Lightsail, it blows it away.

          • watermelon0 7 hours ago

            Did you count reliability into your assessment here? I'm reading about Fly.io outages multiple times a year, whereas Lightsail seems to be as stable as AWS EC2.

          • mtlynch 8 hours ago

            > And when you compare it to Lightsail, it blows it away.

            This is a bit of a confusing sentence because there are so many pronouns. Do all of the "it"s refer to Fly.io?

            • dijksterhuis 8 hours ago

              > And when you compare [fly.io] to Lightsail, [fly.io] blows [Lightsail] away.

        • nikodotio 9 hours ago

          This is precisely it. The ease of deploy, https domain configuration, scaling.

          Additionally, having machines that turn off when not in use is easy to configure, which I never managed on AWS.

        • infecto 7 hours ago

          I have asked this multiple times but is anyone really using edge compute and getting value out of it? I am certain there are cases but I have not seen any of them written up before.

          • pier25 6 hours ago

            We have an embeddable audio player served globally with very low latency. This wouldn't be possible without edge compute/data.

          • sofixa 2 hours ago

            Depends on what you mean by edge compute, but you probably are.

            5G towers are a ton of compute on the edge to secure and protect the traffic passing through them.

            Or if by edge you mean having stuff close to your consumers, every non trivial operation does that.

      • infecto 7 hours ago

        I am going to go out on a limb and say there is no real value prop to fly.io. I could be completely wrong, but it always feels like the modern MongoDB. Everyone wants to use it, but I am not sure they are extracting value from it; instead it's a shiny toy that is fun to build on.

      • austinpena 7 hours ago

        I have an SSR Astro project. Using Fly makes my project fast.

        For dynamic data I use SWR.

        I could use Cloudflare workers but it doesn’t play so nice with Astro.

        I also have a “form submission service” where I receive a Post and send an email.

        I need maximum uptime to avoid revenue loss.

        It’s a go service so I deploy ~6 machines across the US to ensure I don’t drop any requests.

        I haven’t had downtime in years.

      • victorbjorklund 10 hours ago

        If half your customers are in New York and half in Sydney, it makes your app faster if you run it in both places.

        There are a lot of things we do for our users that we don't need (no one "needs" SPA etc). But if it is easy to make your app faster for your users, why not?

      • jrockway 8 hours ago

        I would take edge compute if it's free and easy. That's fly.io's value prop.

        In a world where much web browsing starts with a SYN, SYN-ACK, ACK handshake, it is nice if the server is close to you.

      • brainzap 8 hours ago

        I typed fly launch, fly deploy and my node.js project was deployed. So I guess hobby projects?

    • ARCarr 2 hours ago

      I tried out Fly.io and deployed a little test app. I couldn't even access the app, because they put it onto a server that was under "emergency maintenance" and had been that way for twelve days.

    • pajeetz 2 hours ago

      fly.io has a very bad reputation for reliability. There doesn't seem to be any damage control beyond Hacker News, and even here the consensus seems to be "don't run anything mission critical on fly.io or expect data redundancy".

      in fact, you can almost get the same thing fly.io does by running firecracker on your own bare metal servers and cheaper too.

      I'm afraid the public sentiment towards fly.io has been tainted for good (I can't count how many times they apologized now).

    • akoculu 10 hours ago
  • punkpeye 13 hours ago

    Contrary to the title of the post, Fly.io API remains inaccessible. Meaning, users still cannot access deploys/databases, etc.

    For accurate updates, follow https://community.fly.io/t/fly-io-site-is-currently-inaccess...

  • neya 14 hours ago

    Personal experience between Fly.io and Railway.com: Railway wins for me hands down. I have used both, and Railway's support is stellar in comparison too. Fly.io has never responded to my query about data deletion to date, despite my emailing their support address.

    My Railway app has also stayed online to date without any major downtime. I recommend anyone looking for a decent replacement to try them.

    • ignoramous 8 hours ago

      Fly builds on their own hardware. Is Railway doing the same? If not, that'd explain some of why Railway has relatively fewer outages (they're engineering fewer things).

      I understand that end-users want reliability (and Fly gets a bad rep despite pretty significant investment on this front in the past 2 years), but such outages aren't exclusive to one provider & not the other. Building cloud infra is no one's definition of easy.

    • andai 13 hours ago

      I've used Railway control panel maybe a total of 10 times in my life and half the time it was having weird issues. Control panel UI not loading or not working, actions failing, deploys randomly failing... I love the idea but in practice it's not something I'd want to use for anything serious.

    • punkpeye 14 hours ago

      How does it compare in terms of price?

  • shubhamjain 18 hours ago

    This is probably 5th or 6th major outage from Fly.io that I have personally seen. Pretty sure there were many others and some just went unnoticed. I recommended the service to a friend, and within two days he faced two outages.

    Fly.io seriously needs to get it together. Why it hasn’t happened yet is a mystery to me. They have a good product but stability needs to be an absolute top for a hosting service. Everything else is secondary.

    • SOLAR_FIELDS 17 hours ago

      I get this but I think if people can give GitHub a pass for shitting the bed every two weeks maybe Fly should get a bit of goodwill here. I am not affiliated with Fly at all but I do think that people should temper their expectations when even mega corp can’t get it right

      I guess the secret is to be the incumbent with no suitable replacement. Then you can be complete garbage in terms of reliability and everyone will just hand wave away your poor ops story

      • ojame 17 hours ago

        The biggest difference is GitHub in your infrastructure is (nearly always) internal. Fly in your infrastructure is external. Users generally don't see when you have issues with GitHub, but they do generally see when you have issues with Fly.

        That's the core difference.

      • fragmede 17 hours ago

        Who's giving GitHub a pass on shitting the bed? They go down often enough that if you don't have an internal git server setup for your CICD to hit, that's on you.

        • SOLAR_FIELDS 16 hours ago

          My point is made by your very post - getting off GitHub onto alternatives is not seriously discussed as an option - instead it’s “well, why didn’t you prepare better to deal with your vendor’s poor ops story”

          • fragmede 16 hours ago

            I wasn't going to bring up being on an internally hosted gitlab instead of github, but that would be the "not giving them a pass" part.

    • adityapatadia 18 hours ago

      We left it about a year ago due to reliability issues. We now use DigitalOcean apps and it's working like a charm. Zero downtime with DO.

      • subarctic 14 hours ago

        You mean their App Platform right? How does the pricing compare to fly?

        • adityapatadia 14 hours ago

          Yes, App Platform. Pricing is a little higher than Fly's but way lower than AWS's, and it is fully justified. Zero downtime in the last year.

          With Fly, we had 3-4 downtimes in 2023 in a span of 4 months.

    • mcqueenjordan 18 hours ago

      Reliability is hard when your volume is (presumably) scaling geometrically.

      • paxys 18 hours ago

        Can't use the "reliability is hard" excuse when you are quite literally in the business of selling reliability.

        • mcqueenjordan 17 hours ago

          It’s just not that big of a mystery. It’s not an excuse; it’s just true. Also, they’re not especially selling reliability as much as they’re selling small geo-distributed deployments.

    • ilrwbwrkhv 18 hours ago

      Does anyone use them beyond the free tier? Same with Vercel for example.

      • gk1 18 hours ago

        Vercel has revenue of over $100M. So yes at least a few companies use them beyond the free tier.

      • dizhn 15 hours ago

        Which company? GitHub? As far as I know fly.io does not have a free tier.

  • HellsMaddy 18 hours ago

    Suspiciously, Turso started having issues around the same time. Their CEO confirmed on Discord it's due to the Fly outage:

    > Ok.I caught up with our oncall and This seems related to the Fly.io incident that is reported in our status page. Our login does call things in the Fly.io API

    > we are already in touch with Fly and will see if we can speed this up

    • pier25 17 hours ago

      Not the first time Turso goes down because of Fly issues. It must suck to have built a db service and have this downtime.

      Apparently Turso are going to offer an AWS tier at some point.

  • marvin-hansen 17 hours ago

    No surprise. About a year ago, I looked at fly.io because of its low pricing and I was wondering where they were cutting corners to still make some money. Ultimately, I found the answer in their tech docs where it was spelled out clearly that a fly instance is hardwired to one physical server and thus cannot fail over in case that server dies. Not sure if that part is still in the official documentation.

    In practice, that means if a server goes down, they have to load the last snapshot of that instance from the backup and push it to a new server, update the network path, and pray to god that no more servers fail than there is spare capacity available. Otherwise you have to wait for a restore until the datacenter has mounted a few more boxes in the rack.

    That explains quite a bit of the randomness of those outage reports, e.g. my app is down while another is fine, or mine came back in 5 minutes while another took forever.

    As a business on a budget, I think anything else, e.g. a small civo cluster, serves you better.

    • ignoramous 17 hours ago

      Fly.io can migrate vm+volume now: https://fly.io/docs/reference/machine-migration/ / https://archive.md/rAK0V

      > a fly instance is hardwired to one physical server and thus cannot fail over

      I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?

      • mzi 14 hours ago

        > I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?

        You can run your workload (in this case a VM) on top of a scheduler, so if one node goes down the workload is just spun up on another available node.

        You will have downtime, but it will be limited.

      • sofixa an hour ago

        > I'm having trouble understanding how else this is supposed to be? I understand that live migration is a thing, but even in those cases, a VM is "hardwired" to some physical server, no?

        They mean the storage part. If your VM's storage(state) is on one server and that server dies, you have to restore from backup. If your VM's storage is on remote shared storage mounted to that server and the server dies, your VM can be restarted elsewhere that has access to that shared storage.

        In AWS land it's the difference between instance store (local to a server) and EBS (remote, attached locally).

        There's a tradeoff in that shared storage will be slightly slower due to having to traverse networking, and it's harder to manage properly; but the reliability gain is massive.
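
The difference between the two storage models can be illustrated with a toy scheduler. This is a deliberately naive sketch, not how Fly.io, AWS, or any real scheduler works: the `VM` type, the `handle_node_failure` function, and the "pick the first healthy node" placement are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    node: str
    storage: str  # "local" (on the node's own disk) or "shared" (remote volume)


def handle_node_failure(vms, failed_node, healthy_nodes):
    """Return (restarted, needs_restore) VM names after `failed_node` dies.

    VMs on shared storage are restarted on a healthy node right away.
    VMs on local storage lost their disk with the node, so they must
    wait for a restore from backup.
    """
    restarted, needs_restore = [], []
    for vm in vms:
        if vm.node != failed_node:
            continue  # unaffected by this failure
        if vm.storage == "shared" and healthy_nodes:
            vm.node = healthy_nodes[0]  # naive placement: first healthy node
            restarted.append(vm.name)
        else:
            needs_restore.append(vm.name)
    return restarted, needs_restore


fleet = [VM("a", "n1", "local"), VM("b", "n1", "shared"), VM("c", "n2", "shared")]
print(handle_node_failure(fleet, "n1", ["n2", "n3"]))
```

Both affected VMs see downtime, but only the one on local storage depends on a backup being recent and restorable.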

    • dilyevsky 17 hours ago

      > Ultimately, I found the answer in their tech docs where it was spelled out clearly that an fly instance is hardwired to one physical server and thus cannot fail over in case that server dies.

      Majority of EC2 instance types did not have live migration until very recently. Some probably still don't (they don't really spell out how and when it's supposed to work). It is also not free - there's a noticeable brown-out when your VM gets migrated on GCP for example.

      • ixaxaar 16 hours ago

        Can you shed some more light on this "browning out" phenomenon?

        • toast0 16 hours ago

          Here's the GCP doc [1]. Other live migration products are similar.

          Generally, you have worse performance while in the preparing to move state, an actual pause, then worse performance as the move finishes up. Depending on the networking setup, some inbound packets may be lost or delayed.

          [1] https://cloud.google.com/compute/docs/instances/live-migrati...

    • pier25 16 hours ago

      If you want HA on Fly you need to deploy an app to multiple regions (multiple machines).

      Fly might still go down completely if their proxy layer fails but it's much less common.

      • sb8244 7 hours ago

        The proxy layer was the cause of yesterday's outage according to support.

        • pier25 6 hours ago

          Yes but the previous comment was about hardware failure.

    • fulafel 17 hours ago

      The status tells a story about a high-availability/clustering system failure so I think in this case the problem is rather the complexity of the HA machinery hurting the system's availability vs something like a simple VPS.

  • xyst 16 hours ago

    Recurring pattern I notice is outages tend to occur the week of major holidays in US.

    - MS 365/Teams/Exchange had a blip in the morning

    - Fly.io with complete outage

    - then a handful of sites and services impacted due to those outages

    Usually advocate against “change freezes” but I think a change freeze around major holidays makes sense. Give all teams a recharge/pause/whatever.

    Don’t put too much pressure on the B-squads that were unfortunate to draw the short stick.

    • paxys 16 hours ago

      Bad code rarely causes outages at this scale. The culprit is always configuration changes.

      Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?

      You cannot plan your way out of operational challenges, regardless of what time of year it is.

      • oarsinsync 14 hours ago

        > Sure you can try and reduce those as well during the holiday season, but what if a certificate has to be renewed? What if a critical security patch needs to be applied? What if a set of servers need to be reprovisioned? What if a hard disk is running out of space?

        Reading this, I see two routine operational issues, one security issue and one hardware issue.

        You can’t plan your way around security issues or hardware failures, but operational issues you both can and should plan around. Holiday schedules like this are fixed points in time, so there’s absolutely no reason why you can’t plan all routine work to be completed either a week before, or a week after, the holiday period.

        Certificates don’t need to be near the point of expiry to be renewed. Capacity doesn’t need to be at critical levels to be expanded. Ultimately, this is a risk management question (as a sibling has also commented). Is the organisation willing to take on increased risk in exchange for deferring operational expenses?

        If the operational expense is inevitable (the certificate will need renewing), that seems like an easy answer when it comes to risk management over holidays.
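
        A minimal sketch of that planning rule, assuming a one-week buffer on either side of the freeze. The function name, the buffer, and the dates are all hypothetical, chosen only to illustrate the idea:

```python
from datetime import date, timedelta
from typing import Optional


def renewal_deadline(expiry: date, freeze_start: date, freeze_end: date,
                     buffer: timedelta = timedelta(days=7)) -> Optional[date]:
    """Return the date by which a renewal should be done, or None if it can
    safely wait until after the freeze.

    A renewal is pulled ahead when the certificate expires before the
    freeze ends, or within `buffer` after it ends.
    """
    if expiry > freeze_end + buffer:
        return None  # expires well after the freeze; no need to pull it ahead
    # Finish at least `buffer` before both the freeze and the expiry itself.
    return min(freeze_start, expiry) - buffer


# Hypothetical freeze window and expiry dates, for illustration only.
freeze_start, freeze_end = date(2024, 12, 20), date(2025, 1, 2)
print(renewal_deadline(date(2024, 12, 28), freeze_start, freeze_end))
print(renewal_deadline(date(2025, 2, 1), freeze_start, freeze_end))
```

        The same check works for any routine task with a known due date, not just certificates.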

        If the operational expense is not inevitable (will we really need to expand capacity?), it then becomes a game of probabilities and financials - likelihood of expense being incurred, amount of expense incurred if done ahead of time, impact to business if something goes wrong during a holiday.
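
        As a rough illustration of that trade-off, here is a simple expected-cost comparison. It assumes a single probability and a single impact figure, which is a big simplification, and every number below is made up:

```python
def should_do_ahead(p_needed: float, cost_ahead: float, holiday_impact: float) -> bool:
    """Compare the certain cost of doing the work early against the
    expected cost of deferring it.

    p_needed       -- estimated probability the work turns out to be needed
                      during the holiday (assumption)
    cost_ahead     -- cost of doing the work ahead of time
    holiday_impact -- business impact if it goes wrong during the holiday
    """
    expected_deferral_cost = p_needed * holiday_impact
    return cost_ahead < expected_deferral_cost


# Hypothetical figures: a 10% chance of a 200,000 incident vs. 5,000 spent up front.
print(should_do_ahead(0.10, 5_000, 200_000))
# Same costs, but only a 1% chance of needing the capacity.
print(should_do_ahead(0.01, 5_000, 200_000))
```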

      • jimmyl02 15 hours ago

        I think a good way of looking at it is risk. Is the change (whether it is code or configuration, etc.) worth the risk it brings on?

        For example if it's a small feature then it probably makes sense to wait and keep things stable. But, if it's something that itself causes larger imminent danger like security patches / hard disk space constraints, then it's worth taking on the risk of change to mitigate the risk of not doing it.

        At the end of the day no system is perfect and it ends up being judgement calls but I think viewing it as a risk tradeoff is helpful to understand.

      • bobsyourbuncle 15 hours ago

        This is a good observation. Do you have any resources I can read up on to make this safer?

    • ploxiln 16 hours ago

      I think you can't avoid the fact that these holiday weeks are different from regular weeks. If you "change freeze" then you also freeze out the little fixes and perf tuning that usually happens across these systems, because they're not "critical".

      And then inevitably it turns out that there's a special marketing/product push, with special pricing logic that needs new code, and new UI widgets, causing a huge traffic/load surge, and it needs to go out NOW during the freeze, and this is revenue, so it is critical to the business leaders. Most of eng, and all of infra, didn't know about it, because the product team was cramming until the last minute, and it was kinda secret. So it turns out you can freeze the high-quality little fixes, but you can't really freeze the flaky brand-new features ...

      It's just a struggle, and I still advise to forget the freeze, and try to be reasonable and not rush things (before, during, or after the freeze).

      • willsmith72 5 hours ago

        Any big tech company with large peak periods disagrees with you. It's absolutely worth freezing non-critical changes.

        Urgent business change needs to go through? Sure, be prepared to defend to a vp/exec why it needs to go in now.

        Urgent security fix? Yep same vp will approve it.

        It's a no-brainer to stop your typical changes which aren't needed for a couple of weeks. By the way, it doesn't mean your whole pipeline needs to stop. You can still have stuff ready to go to prod or pre prod after the freeze

      • ignoramous 16 hours ago
    • vrosas 16 hours ago

      Then you just get devs rushing out changes before the freeze…

      • subarctic 15 hours ago

        As a developer I don't see why I would rush out a change before the freeze when I could just wait until after. Maybe a stakeholder that really wants it would press for it to get out but personally I'd rather wait until after so I'm not fixing a bug during my holiday.

        • vrosas 15 hours ago

          Congrats on not working for the product team I work for

      • fragmede 15 hours ago

        and stampeding changes in after the thaw, also leading to downtime. So it depends on the org, but doing a freeze is still reasonable policy. Downtime on December 15th is less expensive than on Black Friday or Cyber Monday for most retailers, so it's just a business decision at that point.

    • cess11 14 hours ago

      Blip? 365 has had an ongoing incident since yesterday morning, European timezone. The reason I know is because I use their compliance tools to secure information in a rather large bankruptcy.

    • aaomidi 16 hours ago

      What do "Freezes" mean? Like, do you stop renewing your certificates? Do you stop taking in security updates for your software?

      Sure maybe "unnecessary" changes, but the line gets very gray very fast.

      • vrosas 16 hours ago

        No unnecessary code deployments.

      • Spivak 15 hours ago

        It's not very grey, prod becomes as if you told everyone but your ops team to go home and then sent your ops team on a cruise with pagers. If it's not important enough to merit interrupting their vacation you don't do it.

      • fragmede 15 hours ago

        Certs shouldn't still be done by hand at this point; if another Heartbleed comes out in the next 7 days then the risk can be examined, escalated, and the CISO can overrule the freeze. If it's a patch for remote root via Bluetooth drivers on a server that has no Bluetooth hardware, it's gonna wait.

        you're right that there's a grey line, but crossing that line involves waking up several people and the on call person makes a judgement call. if it's not important enough to wake up several people over, then things stay frozen.

        • kbolino 6 hours ago

          There's still a lot of situations where automatic certificate enrollment and renewal is not possible. TLS is not the only use of X.509 certificates, and even then, public facing HTTPS is not the only use of TLS.

          It needs to get better but it's not there yet.

        • aaomidi 13 hours ago

          Right, that's basically what I mean. There are a lot of automated changes happening in the background for services. I guess the whole thing I'm saying is that not every breakage is happening because of a code change.

  • akshayshah 17 hours ago

    The series of outages early in 2023 also had some Corrosion-related pain: https://community.fly.io/t/reliability-its-not-great/11253

    • __turbobrew__ 16 hours ago

      Seems like rolling their own datastore turned out to be a bad bet.

      I'm not super familiar with their constraints, but ScyllaDB can do eventual consistency and is generally quite flexible. CouchDB is also an option for multi-leader replication.

  • arusahni 18 hours ago

    Oof, hugops to the team.

  • stevefan1999 18 hours ago

    Yep...can confirm my self hosted Bitwarden there is completely FUBAR connection wise even if it is in EA, so it should be a worldwide outage...lemme guess, some internal tooling error, consensus split brain, or if it looks like someone leaked BGP routes again?

  • redslazer 18 hours ago

    fly.io just has the weirdest outages. It has issues so regularly we don't even need to run mock outages to make sure our system failovers work.

    • duxup 18 hours ago

      When I worked for a company who worked with big banks / financial institutions we used to run disaster recovery tests. Effectively a simulated outage where the company would try to run off their backup sites. They ran everything from those sites, it was impressive.

      Once in a while we'd have a real outage that matched the test we ran as recently as the weekend before.

      I was helping a bank switch over to the DR site(s) one day during such a real outage and I left my mic open when someone asked me what the commotion was on the upper floors of our HQ. I said "super happy fun surprise disaster recovery test for company X".

      VP of BIG bank was on the line monitoring and laughed "I'm using that one on the executive call in 15, thanks!" Supposedly it got picked up at the bank internally after the VP made the joke and was an unofficial code for such an outage for a long time.

      • latch 16 hours ago

        In most BIG banks, "Vice President" is almost an entry-level title. Easily have 1000s of them. For example, this article points out that Goldman Sachs had ~12K VPs out of more than 30K employees: https://web.archive.org/web/20150311012855/https://www.wsj.c...

        • SteveNuts 6 hours ago

          Just like all Sales folks have heavily inflated titles, no customer wants to think they're dealing with a junior salesperson/loan officer when you're about to hand over your money.

          It seems like every vendor sales team I work with is an "executive" or "director of sales" even though in reality they're just regular old salespeople.

        • jart 14 hours ago

          VP at Goldman is equivalent to Senior SWE according to levels.fyi and their entry level is Analyst. I'm surprised by the compensation though. I would have thought people working at a place with gold in the name would be making more. Also apparently Morgan Stanley pays their VPs $67k/year.

          • philipwhiuk 9 hours ago

            Tech outstripped big finance corps' tech a while ago.

            Traders make loads, not the SWEs

          • bormaj 8 hours ago

            That VP comp number seems quite low fwiw

            • jart 2 hours ago

              Yes how much longer till we see Morgan Stanley VPs picketing outside demanding a living wage and humming The Internationale.

      • NetOpWibby 17 hours ago

        Thankfully your comment was positive!

    • benreesman 18 hours ago

      In fairness to the fly.io folks (who are extremely serious hackers), they’re standing up a whole cloud provider and they’ve priced it attractively and they’re much customer-friendlier than most alternatives.

      I don’t envy the difficulty of doing this, but I’m quite confident they’ll iron the bugs out.

      • redslazer 17 hours ago

        The tech is impressive and the pricing is attractive which is why we use them. I just wish there was less black magic.

        • benreesman 17 hours ago

          I don’t always agree with @tptacek on social/political issues, and I don’t always agree with @xe on the direction of Nix, but these are legends on the technical side of things. And they’re trying to build an equitable relationship between the user of cloud services and the provider, not fund a private space program.

          If I were in the market for cloud services I’d highly prize a long-term relationship on mutual benefit and fair dealings over a short-term nuisance of being an early adopter.

          I strongly suspect your investment in fly is going to pay off.

          • xena 15 hours ago

            Xe here. As a sibling comment said, I didn't survive layoffs. If you're looking for someone like me, I'm on the market!

            • benreesman 15 hours ago

              Hiring people is above my pay grade, but I can vouch to my lords and masters and anyone else who cares what I think that a legend is up for grabs.

              b7r6@b7r6.net

          • verelo 16 hours ago

            I want to believe, but in the meantime they’re killing the product I’ve been working hard to build trust in with my own customers. There is a limit to my idealism, and it’s well and truly in the past.

          • reissbaker 15 hours ago

            FWIW Xe was let go from Fly earlier this year during a round of layoffs.

          • foldr 10 hours ago

            I suspect that making a cloud service provider run reliably requires tons of grunt work more than it requires technical heroism from a small number of highly talented individuals.

  • punkpeye 19 hours ago

    It is not reflected in their status page, but fly.io itself is not even loading.

  • mattbee 9 hours ago

    It feels like fly is trying to repeat a growth model that worked 20 years ago: throw interesting toys at engineers, then wait for engineers to recommend their services as they move on in their careers.

    Part of that playbook is the old Move Fast & Break Things. That can still be the right call for young projects, but it has two big problems:

    1) AWS successfully moved themselves into the position of "safe" hosting choice, so it's much rarer for engineers to have influence on something that's seen by money men as a humdrum, solved problem;

    2) engineers are not the internal influencers they used to be, being laid off left and right the last few years, and without time for hobby projects.

    (maybe also 3) it's much harder to build a useful free tier on a hosting service, which used to be a necessary marketing expense to reach those engineers).

    So idk, I feel like the bar is just higher for hosting stability than it used to be, and novelty is a much harder sell, even here. Or rather: if you're going to brag about reinventing so many wheels, they need to not come off the cart as often.

  • Huppie 3 hours ago

    It's interesting to see this discussion about fly.io's reliability on a day that (after over three days of downtime) Microsoft Azure finally decided the update of Azure Static Web Apps they deployed last Friday is indeed broken for customers using specific authentication settings...

    ...with not a single status update from Microsoft in sight.

  • xyst 16 hours ago

    I can’t even log in to my old account. Password reset is timing out, yet I still receive the password reset e-mail. The password reset link is broken, with a 500 status code.

  • teaearlgraycold 17 hours ago

    I'm grateful to HN for keeping me well aware of Fly's issues. I'll never use them.

    • kachapopopow 17 hours ago

      It's still 99.99+% SLA? Would you really pay 100% more for <0.01% more uptime?

      • runako 17 hours ago

        No dog in this fight, all props to the Fly.io team for having the gumption to do what they are doing, I genuinely hope they are successful...

        > It's still 99.99+% SLA

        But this is simply not accurate. 99.99% uptime allows < 52m 9.8s of downtime annually. They apparently blew well past that today. Looks like they essentially burned the equivalent of 4 years of 99.99% downtime budget this evening.
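
        A back-of-the-envelope sketch of that arithmetic, assuming a plain 365-day year. The exact figure shifts slightly with the year-length convention used, and the 3.5-hour outage below is a hypothetical input, not a measured duration:

```python
def downtime_budget_seconds(availability_pct: float, period_days: float = 365.0) -> float:
    """Seconds of downtime allowed over `period_days` at the given availability."""
    return (1 - availability_pct / 100) * period_days * 86_400


def years_of_budget_used(outage_seconds: float, availability_pct: float) -> float:
    """How many years' worth of downtime budget a single outage consumes."""
    return outage_seconds / downtime_budget_seconds(availability_pct)


budget = downtime_budget_seconds(99.99)
minutes, seconds = divmod(budget, 60)
print(f"four nines allows about {int(minutes)}m {seconds:.0f}s per year")

# Hypothetical 3.5-hour outage, for illustration only.
print(f"{years_of_budget_used(3.5 * 3600, 99.99):.1f} years of budget")
```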

        Four nines is so unforgiving that it's almost the case that if people are required to be in the loop at any point during an incident, you will blow the fourth nine for the whole year in a single incident.

        Again, I know it's hard. I would not want to be in the space. That fourth nine is really difficult to earn.

        In the meanwhile, <hugops> to the Fly team as they work to resolve this (and hopefully get some rest).

        • fulafel 17 hours ago

          99.99+% SLA typically means you get some billing credits for downtime beyond what 99.99+% availability allows. So technically you do get a "99.99+% SLA", but you don't get 99.99+% availability.

          Other circles use "SLO" (where the O stands for objective).

          (Anyone know what the details in fly.io SLA are?)

          • runako 16 hours ago

            You are correct in the legal/technical sense!

            Technically, anyone could offer five- or six-nines and just depend on most customers not to claim the credits :-D

            Actually hitting/exceeding four nines is still tough.

      • mrcwinn 17 hours ago

        This is not my experience at all, as a former paying customer.

      • PUSH_AX 11 hours ago

        You say that like it's their only issue.

        Earlier in the year they had a catastrophic outage in LHR, we lost all our data. Yes this is also on me, I'm aware. Still, that's a hard nope from me, we migrated.

      • cj 17 hours ago

        I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”

        Examples include basically any PaaS, IaaS, or any company that provides a mission-critical service to another company (B2B SaaS).

        If you run a basic B2C CRUD app, maybe it’s not a big deal if your service goes down for 5 minutes. Unfortunately there are quite a few categories of companies where downtime simply isn’t tolerated by customers. (I operate a company with a “zero downtime” expectation from customers - it’s no joke, and I would never use any infrastructure abstraction layer other than AWS, GCP or Azure - preferably AWS us-east-1 because, well, if you know the joke…)

        • toast0 16 hours ago

          > I think what a lot of people fail to understand is that there are certain categories of apps that simply “can never go down”

          I refuse to believe that this category still exists, when I need to keep my county's alternate number for 911 in my address book, because CenturyLink had a 6 hour outage in 2014 and a two day outage in 2018. If the phone company can't manage to keep 911 running anymore, I'd be very surprised if anything does have zero downtime over a ten year period.

          Personally, nine nines is too hard, so I shoot for eight eights.

        • bri3d 15 hours ago

          My experience with very large scale B2B SaaS and PaaS has been that customers like to get money, if allowed by contract, by complaining about outages, but that overall, B2B SaaS is actually very forgiving.

          Most B2B SaaS solutions have very long sales cycles and a high total cost to implement, so there is a lot of inertia to switching that “a few annoying hours of downtime a year” isn’t going to cover. Also, the metric that will drive churn isn’t actually zero downtime, it’s “nearest competitor’s downtime,” which is usually a very different number.

        • macNchz 17 hours ago

          Every PaaS and IaaS I’ve ever used has had some amount of downtime, often considerably more than 5 minutes, and I’ve run production services on many of them. Plenty of random issues on major cloud providers as well. Certainly plenty of situations with dozens of Twitter posts happening but never any acknowledgement on the AWS status page. Nothing’s perfect.

          • cj 16 hours ago

            Yea, when running services where 5 minutes of downtime results in lots of support tickets, you learn to accept that the incident will happen and learn to manage the incident rather than relying that it will never occur.

        • MobiusHorizons 16 hours ago

          you realize all of those services you mention can't give you zero downtime, they would never even advertise that. They have quite good reliability certainly, but on long enough time horizons absolutely no-one has zero downtime.

        • sgrove 13 hours ago

          All of your examples have had multiple cases of going down, some for multiple days (2011 AWS was the first really long one I think) - or potentially worse, just deleting all customer data permanently and irretrievably.

          Meaning empirically, downtime seems to be tolerated by their customers up to some point?

        • littlestymaar 16 hours ago

          If your app cannot go down ever, then you cannot use a cloud provider either (because even AWS and Azure do fail sometimes, just look up “Azure down” on HN).

          But the truth is everybody can afford some level of outage, simply because nobody has the budget to provision an infra that can never fail.

          • vrosas 16 hours ago

            I’ve seen a team try to be truly “multi-cloud” but then end up with this Frankenstein architecture where instead of being able to weather one cloud going down, their app would die if _any_ cloud had an issue. It was also surprisingly hard to convince people it doesn’t matter how many globally distributed clusters you have if all your data is in us-east.
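
            To put rough numbers on that, here is a small sketch comparing the two shapes. It assumes each cloud fails independently, which real outages often violate, and the 99.9% figures are hypothetical:

```python
from math import prod


def series_availability(parts: list[float]) -> float:
    """App needs every dependency up: availabilities multiply."""
    return prod(parts)


def parallel_availability(parts: list[float]) -> float:
    """App survives as long as at least one replica is up."""
    return 1 - prod(1 - a for a in parts)


clouds = [0.999, 0.999]  # hypothetical per-cloud availability

# "Dies if any cloud has an issue": worse than a single cloud.
print(f"series:   {series_availability(clouds):.6f}")
# True redundancy: better than a single cloud.
print(f"parallel: {parallel_availability(clouds):.6f}")
```

            A shared data store in one region acts like a series component on top of either shape, so it caps the overall figure no matter how many clusters sit in front of it.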

  • DataOverload 18 hours ago

    We switched from Fly to CF workers a while ago, and never looked back

    • punkpeye 17 hours ago

      They are fundamentally different. If Cloudflare provided a way to host docker containers with volumes though, that would be game over for so many paas platforms.

    • frakkingcylons 17 hours ago

      I switched from apples to oranges and never looked back.

    • pier25 17 hours ago

      Our stuff on CF Workers has been working non stop for years now.

      About 6 months ago we migrated our most critical stuff from Fly to CF and boy every time Fly has issues I'm so glad we did.

      • jpgvm 7 hours ago

        Too much custom stuff too quickly, there is a lot of efficiency in vertical integration and a fully cohesive stack but it takes a very long time to stabilize if you take that route.

        We spent months trying to convince them of problems with their H2 implementation in their LB/proxy (they insisted nginx was at fault, spoiler - it wasn't) but had to leave (we also went to CF, which has its own problems). Eventually one of their employees wrote a long blog post about H2 that made it obvious they finally found and fixed those problems but months too late for my employer at the time.

        It would have been infinitely better for us if they could have just fixed their stability problems, that abstraction suited us as did their LB/proxy impl and SNI pricing.

        I wish them well, some really smart folk over there but I can imagine these reliability problems are probably really grinding down morale.

    • rstupek 17 hours ago

      How are they equivalent?

    • eek2121 17 hours ago

      congrats on not developing a playbook for the time you have to 'look back'.

      Providers will fail. good contingencies won't.

      ...hears faint sound...I SAID GOOD, QUIET YOU!

  • gigapotential 16 hours ago

    HUGOPS

    Everything is going to be 200 OK!

  • mrcwinn 18 hours ago

    I tried Fly early. I was very excited about this service, but I've never had a worse hosting experience. So I left. Coincidentally I tried it again a few days ago. Surely things must be better. Nope. Auth issues in the CLI, frustrations deploying a Docker app to a Fly machine. I wouldn't recommend it to anyone.

    • steve_adams_86 17 hours ago

      I find their user experience to be exceptional. The only flake I’ve encountered is in uptime and general reliability of services I don’t interface with directly. They’ve done a stellar job on the stuff you actually deal with, but the glue holding your services together seems pretty wobbly.

  • pier25 17 hours ago

    My apps on Fly have not gone down this time.

  • MaxfordAndSons 18 hours ago

    Kinda funny that they've named their global state store "Corrosion"... not really a word I'd associate with stability and persistence.

    • lordofgibbons 18 hours ago

      It's an internal project based on Rust, not a product. So I don't think it matters too much what they name it. It's open source, which is great, but still not a product that they need to market.

      • SOLAR_FIELDS 17 hours ago

        And to be fair, it’s a bit of a cute meme to name rust projects things that relate to it. Oxide, etc

    • toast0 16 hours ago

      I stored important data in mnesia, so who would I be to talk. :p

      • throwawaymaths 15 hours ago

        amnesia means forget, so mnesia means remember, I would guess?

    • kermatt 18 hours ago
    • dumah 17 hours ago

      I take your point but corrosion-resistant metals such as Aluminum, Titanium, Weathering Steel and Stainless Steel don’t avoid corrosion entirely but form a thin and extremely stable corrosion layer (under the right conditions).

      • littlestymaar 16 hours ago

        Gold and platinum really are corrosion resistant though (but have questionable mechanical properties…)

  • EGreg 18 hours ago

    What exactly does flyio.net do?

    • HellsMaddy 17 hours ago

      If you mean specifically flyio.net and not just fly.io the company, I'm guessing they host their status page on a separate domain in case of DNS/registrar issues with their primary domain.

    • stackghost 17 hours ago

      IIRC their value prop is that they let you rapidly spin up deployments/machines in regions that are closest to your users, the idea being that it will be lower latency and thus better UX.

    • vachina 17 hours ago

      It’s basically what Heroku used to be but with CDN-like presence.

    • michaelbuckbee 17 hours ago

      Hosting service that has a lot of interesting distributed features.

    • eek2121 17 hours ago

      WEB 2.0. SEE. TOLD YA! THEY SHOULDA UPGRADED TO THAT NEWFANGLED 3.0! ;)

  • theideaofcoffee 16 hours ago

    Color me not surprised. My few interactions with people there just gave off the impression of them being in a bit over their heads. I don't know how well that translated to their actual ops, but it's difficult to not connect the two when they continue to have major outage after major outage for a product that 'should' be their customer's bedrock upon which they build everything else.

  • travisgriggs 6 hours ago

    Don’t a bunch of Elixir/Erlang guys work at fly.io? It’s weird to me that that hallmark of reliability is associated with something that the public sees as unreliable. What gives with that association?

  • veggieWHITES 18 hours ago

    I was considering these guys the other day until I saw their pricing page: https://fly.io/pricing/

    (There's not a single price on there, why even create the page?)

    • rascul 18 hours ago

      There's a link to what appears to be the actual pricing page https://fly.io/docs/about/pricing/

      There's also a link to the pricing calculator https://fly.io/calculator

      • totetsu 18 hours ago

        Is that calculator hourly or monthly?

        • radicalriddler 17 hours ago

          Literally says "Monthly Costs" in the green panel on the right that calculates the total.

        • eviks 17 hours ago

          It's right there: "Monthly Cost"

    • Aeolun 12 hours ago

      OMG, that's hilarious. I use them, and I know what my prices are, but I'd never noticed that the page called pricing doesn't actually have any.

    • schmichael 18 hours ago

      The prices are just one click deeper. Hardly a nefarious dark pattern.