BREAKING: Multiple global websites shown as offline, including Amazon Web Services and Reddit

seriesn · June 2021

In today’s episode of “I will use big providers because they never have downtime. Small webhosting company Blyat”

jon617 · June 2021

Did anyone try turning it off and on? Power cycling usually fixes the Internet for me.

DataIdeas-Josh · June 2021

@PeterP said:
To bring this thread back on-topic, has Fastly mentioned anything yet as to what caused the outage this morning? I'm yet to see anything as I just got home for the night, but I'm sure I'll come across something if not here.

probably was bgp...

PeterP · June 2021

@DataIdeas-Josh said:

@PeterP said:
To bring this thread back on-topic, has Fastly mentioned anything yet as to what caused the outage this morning? I'm yet to see anything as I just got home for the night, but I'm sure I'll come across something if not here.

probably was bgp...

Swap out DNS for BGP and you've got yourself a new meme

raindog308 · June 2021

I'm confused...why would Amazon Web Services use Fastly? They have their own CDN, CloudFront.

lentro · June 2021

@raindog308 said:
I'm confused...why would Amazon Web Services use Fastly? They have their own CDN, CloudFront.

“ Amazon’s own retail website actually runs through Fastly, rather than CloudFront, and has done since May 2020.”

https://www.theguardian.com/technology/2021/jun/08/edge-cloud-error-tuesday-internet-outage-fastly-speed

AWS is even too expensive for Amazon lol, I can’t come up with another explanation. Reminds me of a time when an employee of a big tech company with a respected cloud service rented dedicated servers from me because their employee discount still cost 2x low-end rates lol.

jetchirag · June 2021

@raindog308 said:
I'm confused...why would Amazon Web Services use Fastly? They have their own CDN, CloudFront.

More POPs:
https://aws.amazon.com/about-aws/global-infrastructure/ vs
https://www.fastly.com/network-map

a-super-random-user · June 2021

I suspect it was Chia.

Levi · June 2021

@raindog308 said:
I'm confused...why would Amazon Web Services use Fastly?

Fastly is more advanced? And if your own CDN goes down, what will you do?

Maounique · June 2021

@LTniger said: And if your own CDN goes down, what will you do?

And web site. Use Twitter... If it is not on AWS
So that is why Paypal wouldn't take my password? It shouldn't be related...

dahartigan · June 2021

@sdglhm said:
I suspect it was Chia.

The Chia virus involucrated their servers

jsg · June 2021

@PeterP said:
To bring this thread back on-topic, has Fastly mentioned anything yet as to what caused the outage this morning?

Fastly SVP said:
What happened?
On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.

Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.

Translation: They incompetently committed two arch-sins:

they don't properly check user input/user provided data
their design isn't properly compartmentalized and so one single customer could bring down virtually the whole service by (presumably inadvertently) making a change in their config.
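
The compartmentalization point above can be sketched in a few lines. This is only an illustration under loud assumptions, not Fastly's architecture: `EdgeNode`, `Tenant`, `push_config`, and `serve` are hypothetical names, and a "config" is modelled as a plain Python function. The idea is that a shared process probes a new config before activating it and contains any later failure to the customer who owns it.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical model: a compiled customer config is a function
# that maps a request path to a response string.
Handler = Callable[[str], str]


@dataclass
class Tenant:
    active: Handler      # last config that passed the probe
    failures: int = 0    # errors contained so far for this customer


class EdgeNode:
    """Toy single-process edge node shared by many customers."""

    def __init__(self) -> None:
        self.tenants: Dict[str, Tenant] = {}

    def push_config(self, customer: str, handler: Handler, probe: str = "/") -> bool:
        """Activate a new config only if one probe request survives it."""
        try:
            handler(probe)
        except Exception:
            return False  # rejected: the previous config keeps serving
        tenant = self.tenants.get(customer)
        if tenant is None:
            self.tenants[customer] = Tenant(active=handler)
        else:
            tenant.active = handler
        return True

    def serve(self, customer: str, path: str) -> str:
        """Serve one request; an error affects only this customer."""
        tenant = self.tenants[customer]
        try:
            return tenant.active(path)
        except Exception:
            tenant.failures += 1
            return "503 (customer config error)"
```

A config that passes the probe but fails only under specific circumstances still slips through `push_config`, which mirrors the incident as described; the difference is that `serve` then returns an error for that one customer instead of for everyone. Catching exceptions does not contain crashes or memory corruption in native code, so a real system would need a stronger sandbox than this sketch.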

Maounique · June 2021

@jsg said: they don't properly check user input/user provided data

I don't see anything like that in the post-mortem. In fact they specify the config was valid, just applying it triggered a bug previously introduced (long ago, in fact, so it was a very rare thing/combination of things, presumably).

bulbasaur · June 2021

@jsg nice throwing around the words "incompetent", but:

they don't properly check user input/user provided data

This seems like an edge case in a valid configuration change; how do you prevent that from happening with input validation?

their design isn't properly compartmentalized and so one single customer could bring down virtually the whole service by (presumably inadvertently) making a change in their config.

PoPs are usually shared with event loop based software running, how do you perform isolation with such software? (Please don't say threads or per customer IPs. None of those scale for a CDN service.)

yoursunny · June 2021

@stevewatson301 said:

their design isn't properly compartmentalized and so one single customer could bring down virtually the whole service by (presumably inadvertently) making a change in their config.

PoPs are usually shared with event loop based software running, how do you perform isolation with such software? (Please don't say threads or per customer IPs. None of those scale for a CDN service.)

Run code that requires isolation in eBPF or WebAssembly.
I can reach 100Gbps on six CPU cores using eBPF (and a lot of non eBPF code).
Also, my software is based on continuous polling, not event loops.

jsg · June 2021

Fastly said:
On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.

Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.

I accept that that may sound normal/OK to many. From an IT-security and proper system and software design POV however this sounds like a confession of bloody incompetence.

Now, one may discuss whether the bug is the problem or the fact that one customer "pushing a config" can bring down virtually the whole network or ... but to me it all says the same: too much marketing and large corp blabla and too little tech competence.

Note that I did not say "run away!" or "switch over to competitor XYZ!". Simple reason: while one may exist I do not know of any major CDN provider with sound engineering down to the core.

bulbasaur · June 2021

@jsg said: too little tech competence.

And yet I'm about to hear from you how you would prevent such a situation, apart from giving each customer their own thread, IP (or maybe containers), none of which actually scale for a CDN service.

If a certain engineering methodology is nearly impossible to put in practice, you can hardly claim incompetence.

jsg · June 2021

@stevewatson301 said:

@jsg said: too little tech competence.

And yet I'm about to hear from you how you would prevent such a situation,

By properly designing and engineering.

If a certain engineering methodology is nearly impossible to put in practice, you can hardly claim incompetence.

When they have bugs and when a bug can bring down virtually their whole operation, one can claim incompetence.

Btw, Stockholm syndrome? Or why do we have this discussion? Hello, earth to some users here: they f_cked up. Big time. And your position is "No, one must not call them incompetent!" and protecting them?

Sorry, but contrary to what seems to be popular belief, bugs and shoddy design and engineering are not somehow God-given and unavoidable.

bulbasaur · June 2021

jsg be like:

default · June 2021

The end is nigh.

jsg · June 2021

@stevewatson301

I did and do not promote HostSolutions - no matter what you feel the reality to be
You repeatedly try to treat me as the matter, which I'm not. This thread has a topic.
Thanks for demonstrating your lack of arguments and your way of dealing with it: attacking the guy who disagrees with you.
Thanks for demonstrating your "logic": you don't care about the fact that Fastly had a clusterf_ck but you promote them based on them being a $6 billion company.
Btw, I did not say "don't buy from Fastly". Explicitly. But hey, don't let reality get in your way ...

[self censored for the sake of politeness]

alilet · June 2021

@stevewatson301 said:
jsg be like:

jsg be like:

jsg · June 2021

Thanks a lot for confirming my sig.

Evoxt · June 2021

Oh wow, I always thought Amazon used their own CDN. Didn't know they were using Fastly as well.

TimboJones · June 2021

@jsg said:

@stevewatson301 said:

@jsg said: too little tech competence.

And yet I'm about to hear from you how you would prevent such a situation,

By properly designing and engineering.

If a certain engineering methodology is nearly impossible to put in practice, you can hardly claim incompetence.

When they have bugs and when a bug can bring down virtually their whole operation, one can claim incompetence.

Btw, Stockholm syndrome? Or why do we have this discussion? Hello, earth to some users here: they f_cked up. Big time. And your position is "No, one must not call them incompetent!" and protecting them?

Sorry, but contrary to what seems to be popular belief, bugs and shoddy design and engineering are not somehow God-given and unavoidable.

That's why there's QA, because non-zero bugs are unavoidable and QA is there to catch and prevent catastrophic events. This is a QA failure, not an engineering failure.

Did it work as designed or did it not? If it worked according to design but led to catastrophe, it was a design fault. If it didn't function according to design, it's QA's fault. The way it was described is a QA fail.
