
Re: [Per-Entity] distribution of aggregate metadata


  • From: Patrick Radtke <>
  • To: Nick Roy <>
  • Cc: Rhys Smith <>, Scott Cantor <>, Thomas Scavo <>, Per-Entity Metadata Working Group <>
  • Subject: Re: [Per-Entity] distribution of aggregate metadata
  • Date: Wed, 10 Aug 2016 16:01:46 -0700

On Wed, Aug 10, 2016 at 12:53 PM, Nick Roy <> wrote:
> "at least 99.9% [uptime]..." That is two nines short of what this group has
> been bouncing around. So - how does that sit with you all?

TL;DR: Total uptime is less important than the impact of a specific
outage and the process for minimizing the duration of an outage in
progress. We're entering the world of distributed computing: things
will go wrong, the network will fail, and someone will forget to pay
the Amazon bill (that happened at a previous job).

---------------

I don't think five 9s is really achievable or practically measurable.
One second of downtime each day already puts you below five 9s, and it
would likely go unnoticed by your monitoring solution.
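
To put numbers on that (a quick back-of-the-envelope sketch, nothing
more):

    # Downtime budget implied by an availability target of N nines.
    def downtime_budget_seconds(nines):
        unavailability = 10 ** -nines        # e.g. five nines -> 0.00001
        per_day = 86400 * unavailability
        per_year = 365 * per_day
        return per_day, per_year

    for n in (3, 4, 5):
        day, year = downtime_budget_seconds(n)
        print(f"{n} nines: {day:.2f} s/day, {year / 60:.1f} min/year")
    # 3 nines: 86.40 s/day, 525.6 min/year
    # 4 nines:  8.64 s/day,  52.6 min/year
    # 5 nines:  0.86 s/day,   5.3 min/year  <- a 1 s blip per day blows this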

Amazon had a multi-hour search outage on Prime Day, so they can't even
reach four 9s for the year. The two universities I've worked at
couldn't provide four 9s of network connectivity due to router bugs
and backhoes.

From my perspective, uptime measurements on their own shouldn't drive
the decision, since they don't take user impact into account. A daily
outage of 20 seconds at 4 am may go unnoticed, while a 2-hour outage
one day a year will result in a lot of complaints, even though both
yield approximately the same uptime measurement for the year.
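
Rough arithmetic behind that comparison (just a sketch):

    # Two outage patterns with nearly identical annual uptime numbers
    seconds_per_year = 365 * 86400
    daily_blips   = 20 * 365       # 20 s every day  -> 7,300 s/year
    single_outage = 2 * 3600       # one 2 h outage  -> 7,200 s/year
    print(1 - daily_blips / seconds_per_year)    # ~0.99977
    print(1 - single_outage / seconds_per_year)  # ~0.99977, very different user impact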

I see the likely causes of service failure (assuming a CDN) as:
1) Network outage between CDN and consumer
2) Outage at the CDN
3) Publishing error that poisons the CDN.

For 1) and 2), a failure at a specific CDN location would likely
affect only a small portion of users. AWS has 20 CloudFront
distribution sites in the US. The aggregate could also be published to
both AWS and Azure, with round-robin DNS and a short TTL, to ensure
that if an outage affected one CDN provider, an SP/IdP would
eventually be able to resolve (and cache) metadata from the other.
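
As a rough illustration of the dual-publish side of that (hypothetical
bucket and object names; the Azure copy would be an analogous upload,
and consumers still verify the XML signature regardless of source):

    import boto3

    # Push the signed aggregate to the S3 bucket behind the CloudFront
    # distribution; a second, analogous upload would go to Azure.
    s3 = boto3.client("s3")

    def publish_aggregate(xml_bytes):
        s3.put_object(
            Bucket="md-aggregate-example",          # hypothetical bucket name
            Key="metadata/aggregate.xml",           # hypothetical object key
            Body=xml_bytes,
            ContentType="application/samlmetadata+xml",
            CacheControl="max-age=300",             # short TTL so stale copies age out
        )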

For 3), I think you'll want to be prepared for how you would flush the
CDN and start fresh.
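
If the CDN is CloudFront specifically, the flush is a single API call
(a sketch; Azure's CDN has an equivalent purge operation):

    import time
    import boto3

    cloudfront = boto3.client("cloudfront")

    def flush_cdn(distribution_id):
        # Invalidate every cached path so the next request goes back to origin.
        cloudfront.create_invalidation(
            DistributionId=distribution_id,
            InvalidationBatch={
                "Paths": {"Quantity": 1, "Items": ["/*"]},
                "CallerReference": str(time.time()),  # must be unique per request
            },
        )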

I know we don't want to put intelligence into the client since the
client base is too heterogeneous, but that is the approach Netflix took
with all of their internal services to achieve high perceived uptime.
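
For what it's worth, the client-side version of that idea is small
(hypothetical mirror URLs; not a suggestion that we require it of
deployers):

    import urllib.request

    # Hypothetical endpoints serving the same signed metadata.
    MIRRORS = [
        "https://md1.example.org/entities/",
        "https://md2.example.org/entities/",
    ]

    def fetch_metadata(path):
        last_err = None
        for base in MIRRORS:
            try:
                with urllib.request.urlopen(base + path, timeout=10) as resp:
                    return resp.read()      # signature verification still happens downstream
            except OSError as err:          # DNS failure, timeout, HTTP 5xx
                last_err = err              # fall through to the next mirror
        raise last_err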

-Patrick


