1
00:00:00,040 --> 00:00:01,960
Quick note, this episode isn't 
sponsored. 

2
00:00:02,880 --> 00:00:06,560
I'm building a new kind of IDE 
called Rex because existing ones

3
00:00:06,560 --> 00:00:09,040
make it hard to work across 
multiple projects in parallel. 

4
00:00:09,800 --> 00:00:11,760
I'm sharing it to get feedback 
from listeners. 

5
00:00:11,920 --> 00:00:13,440
I'd really love to hear your 
thoughts. 

6
00:00:14,000 --> 00:00:16,800
The link is in the description. 
And now let's move on with 

7
00:00:16,800 --> 00:00:18,600
today's super interesting 
episode. 

8
00:00:20,400 --> 00:00:24,720
Welcome back to the Deep Dive. 
Today we are tackling a subject 

9
00:00:24,720 --> 00:00:27,560
that I think keeps a lot of 
engineers, you know, up at night.

10
00:00:27,960 --> 00:00:31,440
And I want to start with a bit 
of a provocation. 

11
00:00:31,520 --> 00:00:33,800
A counterintuitive premise if 
you will. 

12
00:00:33,800 --> 00:00:35,280
I love a good provocation. 
Let's hear it. 

13
00:00:35,360 --> 00:00:39,440
OK, in modern cloud 
architectures, perfect code, 

14
00:00:39,680 --> 00:00:44,000
and I'm talking structurally 
sound, bug-free, beautiful 

15
00:00:44,000 --> 00:00:45,560
logic, the kind of stuff you'd, 
like, 

16
00:00:45,880 --> 00:00:49,680
frame on a wall, can still create
catastrophic failure. 

17
00:00:49,680 --> 00:00:51,320
Oh, absolutely. 
That is not hyperbole. 

18
00:00:51,320 --> 00:00:53,680
I mean that is a Tuesday in 
distributed systems. 

19
00:00:53,680 --> 00:00:54,920
Right. 
And that's terrifying. 

20
00:00:54,920 --> 00:00:58,000
We all grow up, as, you know, 
junior engineers thinking if I 

21
00:00:58,000 --> 00:01:00,280
write the function correctly, 
the software works. 

22
00:01:00,400 --> 00:01:02,960
It's a very binary sort of true,
false view of the world. 

23
00:01:03,080 --> 00:01:04,680
It is. 
But you're telling me that in 

24
00:01:04,680 --> 00:01:08,160
the systems we're discussing 
today, production outages rarely

25
00:01:08,160 --> 00:01:11,760
happen because of a syntax error
or a logic bug inside a single 

26
00:01:11,760 --> 00:01:13,280
function? 
Rarely. 

27
00:01:13,360 --> 00:01:16,600
I mean, sure, bugs happen. People
make typos, they have off-by-one 

28
00:01:16,600 --> 00:01:18,880
errors, they, you know, forget to
close a bracket. 

29
00:01:18,880 --> 00:01:21,560
But the really nasty ones. 
The ones that page you at 3:00 

30
00:01:21,560 --> 00:01:23,440
AM. 
Exactly, the ones that take down 

31
00:01:23,440 --> 00:01:26,960
an entire platform for hours and
make the CTO sweat through their

32
00:01:26,960 --> 00:01:28,960
shirt. 
Those usually happen in the 

33
00:01:28,960 --> 00:01:30,520
empty space between the 
services. 

34
00:01:30,520 --> 00:01:32,560
The space between. 
It's not about what the code is 

35
00:01:32,560 --> 00:01:35,440
doing, it's about what the code 
assumes the rest of the world is

36
00:01:35,440 --> 00:01:37,480
doing. 
And that is exactly our mission

37
00:01:37,480 --> 00:01:40,520
for this deep dive. 
We are taking a whole stack of 

38
00:01:40,520 --> 00:01:44,080
research: architecture guides 
from AWS, some academic papers 

39
00:01:44,080 --> 00:01:46,480
on things like metastable 
failures and Bayesian 

40
00:01:46,480 --> 00:01:50,320
frameworks, and some really deep
analysis on microservice 

41
00:01:50,320 --> 00:01:52,640
pitfalls. 
Specifically looking at a paper 

42
00:01:52,640 --> 00:01:56,240
titled The Hidden Danger of 
Assumption Drift in Distributed 

43
00:01:56,240 --> 00:02:00,120
Systems, and we are going to use 
it to, well, upgrade your 

44
00:02:00,120 --> 00:02:01,680
brain. 
We're moving from junior 

45
00:02:01,680 --> 00:02:05,360
thinking, which is, you know, is
my function correct, to senior 

46
00:02:05,360 --> 00:02:08,240
thinking, which is what 
assumptions is my system making 

47
00:02:08,240 --> 00:02:10,160
and when will they inevitably be
wrong? 

48
00:02:10,240 --> 00:02:12,440
Exactly. 
And honestly, if you are 

49
00:02:12,440 --> 00:02:15,960
preparing for a high level 
system design interview or a 

50
00:02:15,960 --> 00:02:19,840
coding interview for a senior 
role, this is the secret sauce. 

51
00:02:19,920 --> 00:02:23,880
This is the stuff that separates
the pass from the strong hire. 

52
00:02:23,960 --> 00:02:27,360
I want you to walk out of this 
deep dive ready to stand at a 

53
00:02:27,360 --> 00:02:29,920
whiteboard and just blow an 
interviewer's mind. 

54
00:02:29,920 --> 00:02:32,080
It's the difference between 
knowing how to code and knowing 

55
00:02:32,080 --> 00:02:36,040
how to engineer, and there's a 
massive, massive gap between 

56
00:02:36,040 --> 00:02:38,360
those two things. 
So to guide us through this, 

57
00:02:38,360 --> 00:02:41,360
we're going to anchor everything
on that concept I mentioned a 

58
00:02:41,360 --> 00:02:44,480
second ago: assumption drift. 
Assumption drift. 

59
00:02:44,760 --> 00:02:47,280
It's a great term. 
It basically describes what 

60
00:02:47,280 --> 00:02:52,640
happens when different parts of 
your distributed system invent 

61
00:02:52,640 --> 00:02:54,920
their own version of reality. 
I love that phrasing. 

62
00:02:54,920 --> 00:02:57,560
Inventing their own reality. 
It sounds like a sci-fi plot, 

63
00:02:57,560 --> 00:02:59,200
but it's really just bad 
architecture. 

64
00:02:59,200 --> 00:03:02,120
It sounds philosophical, but 
it's completely mechanical. 

65
00:03:02,440 --> 00:03:05,760
Service A assumes a request 
takes 2 seconds, Service B 

66
00:03:05,760 --> 00:03:07,920
assumes it can take 30. 
Their clocks are different, they

67
00:03:07,920 --> 00:03:09,160
drift apart. 
And in that gap? 

68
00:03:09,400 --> 00:03:13,200
In that gap, failure happens. 
And not just failure, silent 

69
00:03:13,200 --> 00:03:14,880
failure. 
The kind that's almost 

70
00:03:14,880 --> 00:03:18,520
impossible to debug because 
every individual part thinks 

71
00:03:18,520 --> 00:03:21,440
it's doing the right thing. 
So here is the road map for 

72
00:03:21,440 --> 00:03:22,920
today. 
We're going to build this up 

73
00:03:22,920 --> 00:03:25,880
gradually, piece by piece. 
We'll start with the foundation 

74
00:03:25,880 --> 00:03:29,120
layer: why distributed systems 
are just fundamentally weird 

75
00:03:29,120 --> 00:03:31,040
compared to the code running on 
your laptop. 

76
00:03:31,200 --> 00:03:34,840
Then we'll dive deep into this 
assumption drift, how it creates

77
00:03:34,840 --> 00:03:37,000
these silent killers in your 
architecture. 

78
00:03:37,080 --> 00:03:40,440
And finally, we'll arm you with 
the defensive toolkit. We need to

79
00:03:40,440 --> 00:03:43,560
talk about patterns like jitter,
circuit breakers, and deadline 

80
00:03:43,560 --> 00:03:46,080
propagation. 
Not just what they are, but how 

81
00:03:46,080 --> 00:03:48,960
to explain them in an interview 
to show you actually know how 

82
00:03:48,960 --> 00:03:50,600
the sausage is made. 
Let's do it. 

83
00:03:50,840 --> 00:03:52,520
OK, let's start with the 
weirdness. 

84
00:03:52,520 --> 00:03:55,880
You brought a classic analogy to
the table from some of the AWS 

85
00:03:55,880 --> 00:03:59,800
source material, Pac-Man. 
Yes, the Pac-Man analogy. 

86
00:03:59,960 --> 00:04:03,640
It's perfect for explaining the 
leap from local to distributed 

87
00:04:03,640 --> 00:04:05,800
execution. 
It really crystallizes the 

88
00:04:05,800 --> 00:04:07,000
problem. 
Walk us through it. 

89
00:04:07,000 --> 00:04:11,280
Why is Pac-Man relevant to, you 
know, modern cloud architecture?

90
00:04:11,400 --> 00:04:15,000
OK, so imagine you're writing 
the code for the original 

91
00:04:15,000 --> 00:04:17,440
Pac-Man game. 
It's the 1980s. 

92
00:04:17,560 --> 00:04:19,399
It runs on a single arcade 
machine. 

93
00:04:19,680 --> 00:04:22,520
You have a line of code, 
something like board dot find 

94
00:04:22,560 --> 00:04:23,840
Pac-Man. 
Simple enough. 

95
00:04:23,840 --> 00:04:26,400
Find the yellow guy on the grid.
Right, and what happens? 

96
00:04:26,400 --> 00:04:29,880
The CPU looks at the memory, 
finds the coordinates and 

97
00:04:29,880 --> 00:04:31,880
returns them. 
It's nearly instantaneous 

98
00:04:31,880 --> 00:04:35,040
nanoseconds, but more 
importantly, the code shares 

99
00:04:35,120 --> 00:04:38,520
fate with the machine. 
Fate sharing. That sounds 

100
00:04:38,520 --> 00:04:39,880
dramatic. 
It is dramatic. 

101
00:04:39,880 --> 00:04:42,760
It's a critical concept. 
It means if the power cord gets 

102
00:04:42,760 --> 00:04:46,960
pulled, the whole thing dies. 
The board, the ghosts, Pac-Man, 

103
00:04:46,960 --> 00:04:49,520
the code looking for Pac-Man, it
all goes dark together. 

104
00:04:50,320 --> 00:04:52,880
So there's no situation where 
just one part fails. 

105
00:04:52,880 --> 00:04:54,680
Exactly. 
There's no scenario where the 

106
00:04:54,680 --> 00:04:57,720
find function keeps running, but
the board itself has, you know, 

107
00:04:57,800 --> 00:04:59,400
vaporized. 
They live together, they die 

108
00:04:59,400 --> 00:05:01,560
together. 
It is a single, cohesive 

109
00:05:01,560 --> 00:05:03,280
universe. 
One brain, one body. 

110
00:05:03,280 --> 00:05:05,520
OK, that makes perfect sense. 
It's all one machine. 

111
00:05:06,160 --> 00:05:09,000
Precisely. 
Now let's teleport to today. 

112
00:05:09,800 --> 00:05:13,960
You're building Cloud Pac-Man. 
The board is on a server in a 

113
00:05:13,960 --> 00:05:17,160
data center in Virginia. 
The user is on a phone in 

114
00:05:17,160 --> 00:05:19,640
London. 
The game logic is running on a 

115
00:05:19,640 --> 00:05:22,520
server in Tokyo. 
Now you run that same line of 

116
00:05:22,520 --> 00:05:26,840
code board dot find Pac-Man. 
Syntactically it looks exactly 

117
00:05:26,840 --> 00:05:29,680
the same in my editor. 
It looks the same in your IDE, 

118
00:05:29,960 --> 00:05:33,400
but under the hood it has just 
exploded in complexity. 

119
00:05:34,000 --> 00:05:36,640
That single function call isn't 
a simple memory lookup anymore,

120
00:05:36,840 --> 00:05:39,760
it's a network request. 
And network requests don't just 

121
00:05:39,760 --> 00:05:42,120
work or fail, they go on a 
journey. 

122
00:05:42,240 --> 00:05:43,880
A perilous journey, it sounds 
like. 

123
00:05:43,920 --> 00:05:46,200
Very. 
In the AWS Architecture guide 

124
00:05:46,200 --> 00:05:49,280
they break this down into what I
like to call the 8 failure modes

125
00:05:49,280 --> 00:05:52,000
of the apocalypse. 
And for an interview you need to

126
00:05:52,000 --> 00:05:54,400
know these. 
Not just memorize them, but feel

127
00:05:54,400 --> 00:05:56,560
them in your bones. 
OK, let's run through these 

128
00:05:56,560 --> 00:05:59,920
because I think people, myself 
included, really underestimate 

129
00:05:59,920 --> 00:06:02,200
how many physical steps are 
actually involved in just 

130
00:06:02,200 --> 00:06:04,320
sending a message. 
OK, so let's trace the packet. 

131
00:06:04,320 --> 00:06:07,680
Step one: post request. 
You, the client, try to put the 

132
00:06:07,680 --> 00:06:10,040
message onto the network. 
And that can fail right out of 

133
00:06:10,040 --> 00:06:12,960
the gate. 
Immediately. Your network card, 

134
00:06:12,960 --> 00:06:17,240
your NIC, could be fried. 
The local router could be down, 

135
00:06:17,800 --> 00:06:20,760
the operating system kernel 
could be out of memory for 

136
00:06:20,760 --> 00:06:23,320
network buffers. 
You haven't even left your own 

137
00:06:23,320 --> 00:06:25,040
machine yet, and you've already 
failed. 

138
00:06:25,160 --> 00:06:28,240
OK, so failure mode one is we 
trip over our own shoelaces 

139
00:06:28,240 --> 00:06:29,720
before leaving the house. 
You got it. 

140
00:06:29,720 --> 00:06:33,720
Step two: deliver request. 
So the message is on the wire. 

141
00:06:33,720 --> 00:06:36,640
It's traveling under the ocean 
in a fiber optic cable, but it 

142
00:06:36,640 --> 00:06:38,320
has to actually get to the 
server. 

143
00:06:38,400 --> 00:06:39,800
And things can happen on the 
way. 

144
00:06:40,040 --> 00:06:42,880
A lot of things. 
A backhoe could cut that fiber 

145
00:06:42,880 --> 00:06:44,800
optic cable, which happens 
surprisingly often. 

146
00:06:44,840 --> 00:06:47,360
Yeah, a cosmic ray could flip a 
bit in transit. 

147
00:06:47,960 --> 00:06:51,440
Or, and this is a subtle one, the 
server could crash literally 

148
00:06:51,440 --> 00:06:55,320
milliseconds after receiving the
packet but before the operating 

149
00:06:55,320 --> 00:06:56,760
system hands it to your 
application. 

150
00:06:56,840 --> 00:07:00,000
So the message arrives, but the 
person who is supposed to sign 

151
00:07:00,000 --> 00:07:02,840
for it is dead on arrival. 
Perfect analogy. 

152
00:07:02,840 --> 00:07:04,800
OK, step three: 
validate request. 

153
00:07:04,880 --> 00:07:07,360
The server application wakes up,
it gets the message and it looks

154
00:07:07,360 --> 00:07:09,160
at it and says: 
I have no idea what this is. 

155
00:07:09,160 --> 00:07:11,280
A version mismatch or something?
Exactly. 

156
00:07:11,520 --> 00:07:14,640
Maybe you upgraded the client to
send JSON, but the server is 

157
00:07:14,640 --> 00:07:17,720
still expecting XML. 
That's a valid message, just not

158
00:07:17,720 --> 00:07:22,040
one it understands. Rejected. 
OK, so we're three failure modes

159
00:07:22,040 --> 00:07:24,880
deep and we haven't even started
to do any actual work yet. 

160
00:07:24,880 --> 00:07:28,360
Not a thing. 
Yeah. Now step four: update server 

161
00:07:28,360 --> 00:07:29,840
state. 
This is the first time the 

162
00:07:29,840 --> 00:07:31,800
server tries to actually do the 
thing you asked. 

163
00:07:32,120 --> 00:07:34,520
Move Pac-Man. 
But maybe the database is 

164
00:07:34,520 --> 00:07:37,600
locked, Maybe the disk is full. 
The application logic itself 

165
00:07:37,600 --> 00:07:39,320
fails. 
OK, but let's say it works. 

166
00:07:39,360 --> 00:07:41,160
Pac-Man moves. 
We're good, right? 

167
00:07:41,200 --> 00:07:44,240
No, you're only halfway there. 
Now the server has to tell you 

168
00:07:44,240 --> 00:07:46,560
that it worked. 
This whole thing has to happen 

169
00:07:46,560 --> 00:07:47,920
in reverse. 
Oh man, OK. 

170
00:07:48,240 --> 00:07:51,600
Step 5. 
Post reply. The server's network 

171
00:07:51,600 --> 00:07:54,920
card might die just as it's 
about to send the success 

172
00:07:54,920 --> 00:07:55,880
message. 
You're kidding me? 

173
00:07:55,880 --> 00:07:57,160
That's just cruel. 
It happens. 

174
00:07:57,160 --> 00:07:59,000
Step 6. 
Deliver reply. 

175
00:07:59,480 --> 00:08:01,160
The network goes down on the 
return trip. 

176
00:08:01,480 --> 00:08:03,640
A shark bites the undersea cable
on the way back. 

177
00:08:04,000 --> 00:08:06,360
Your reply is lost at sea. 
This is brutal. 

178
00:08:06,360 --> 00:08:08,120
It's a miracle anything works at
all. 

179
00:08:08,240 --> 00:08:11,520
It really is. Step seven: 
validate reply. 

180
00:08:12,800 --> 00:08:15,800
The client gets the success 
message back, but it's garbled. 

181
00:08:16,200 --> 00:08:19,360
Or maybe the client app crashed 
and rebooted and doesn't even 

182
00:08:19,360 --> 00:08:21,120
remember asking the question in 
the first place. 

183
00:08:21,120 --> 00:08:23,320
So it gets an answer to a 
question it forgot it asked. 

184
00:08:23,320 --> 00:08:25,920
Right. 
And finally step 8, update 

185
00:08:25,920 --> 00:08:28,600
client state. 
The client gets the success 

186
00:08:28,600 --> 00:08:31,480
message but then fails to 
actually process it and update 

187
00:08:31,480 --> 00:08:34,240
the screen. 
The logic is done, but the user 

188
00:08:34,240 --> 00:08:37,039
never sees it. 
So that one innocent looking 

189
00:08:37,039 --> 00:08:42,039
line of code board dot find 
Pac-Man is actually a gauntlet 

190
00:08:42,039 --> 00:08:43,640
of eight discrete death 
traps. 
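
[Editor's sketch] The eight steps just listed can be written down as an ordered enumeration. This is only an illustration of the walkthrough above: the names `Step` and `server_state_changed` are made up for this sketch, and it makes the simplifying assumption that a failure at step four itself leaves the server state untouched.

```python
from enum import IntEnum


class Step(IntEnum):
    """The eight steps of one request/reply round trip, in order."""
    POST_REQUEST = 1
    DELIVER_REQUEST = 2
    VALIDATE_REQUEST = 3
    UPDATE_SERVER_STATE = 4
    POST_REPLY = 5
    DELIVER_REPLY = 6
    VALIDATE_REPLY = 7
    UPDATE_CLIENT_STATE = 8


def server_state_changed(failed_at: Step) -> bool:
    """True if the server had already applied the update when the failure hit.

    Simplifying assumption: a failure during UPDATE_SERVER_STATE leaves the
    state unchanged, so only failures in steps 5 to 8 happen after the change.
    """
    return failed_at > Step.UPDATE_SERVER_STATE


if __name__ == "__main__":
    for step in Step:
        print(step.value, step.name, "server changed:", server_state_changed(step))
```

Half of the steps fail after the server has already done the work, and the sketch makes that visible.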

191
00:08:43,640 --> 00:08:46,560
Precisely. And here is the 
crucial insight for the 

192
00:08:46,560 --> 00:08:48,280
interview. 
This is the thing that trips up 

193
00:08:48,280 --> 00:08:50,400
almost everyone who hasn't 
really lived this pain. 

194
00:08:50,440 --> 00:08:52,320
Lay it on us. 
The unknown state. 

195
00:08:52,320 --> 00:08:54,360
The unknown state. 
This is different from true or 

196
00:08:54,360 --> 00:08:56,360
false. 
Fundamentally different. In local

197
00:08:56,360 --> 00:08:58,600
code, 
a function returns true, false, 

198
00:08:58,600 --> 00:09:02,360
or it throws an error. Three states. 
In distributed systems there is 

199
00:09:02,360 --> 00:09:05,400
a fourth, and it is the most 
important one, unknown. 

200
00:09:05,520 --> 00:09:07,760
Explain that in practical terms.
Give me an example. 

201
00:09:07,920 --> 00:09:09,360
Let's say you're building a 
banking app. 

202
00:09:09,480 --> 00:09:12,680
You make a request to transfer 
$100 from your checking to your 

203
00:09:12,680 --> 00:09:15,280
savings. 
You send the request, you wait, 

204
00:09:15,920 --> 00:09:18,400
and nothing. 
The connection just times out. 

205
00:09:18,400 --> 00:09:20,400
OK, so it failed. 
I'll try again. 

206
00:09:20,520 --> 00:09:21,600
Did it? 
Well, I didn't get a 

207
00:09:21,600 --> 00:09:24,160
confirmation, I didn't get a 
success message, so I assume it 

208
00:09:24,160 --> 00:09:26,720
failed. 
And that assumption is one of 

209
00:09:26,720 --> 00:09:29,120
the most dangerous assumptions 
in software engineering. 

210
00:09:29,760 --> 00:09:31,120
Look at our list of failure 
modes. 

211
00:09:32,000 --> 00:09:34,920
Maybe the request got there, the
bank moved the money, but the 

212
00:09:34,920 --> 00:09:37,880
reply got lost. 
That was failure mode 6. 

213
00:09:37,880 --> 00:09:40,560
So the money is gone from my 
checking account. 

214
00:09:40,560 --> 00:09:42,560
Right. 
Or maybe the request never got 

215
00:09:42,560 --> 00:09:44,320
there at all. 
That was failure mode two. 

216
00:09:44,880 --> 00:09:46,880
In that case, the money is still
there. 

217
00:09:47,360 --> 00:09:50,480
You as a client are sitting 
there holding a timeout 

218
00:09:50,480 --> 00:09:52,560
exception. 
You do not know. 

219
00:09:52,560 --> 00:09:54,840
You cannot know if the money was
transferred or not. 

220
00:09:55,120 --> 00:09:57,600
That is incredibly 
uncomfortable. 

221
00:09:57,600 --> 00:10:00,640
Humans hate it. 
We are wired for binary 

222
00:10:00,640 --> 00:10:03,480
outcomes. 
We want to know, did it happen? 

223
00:10:03,720 --> 00:10:06,640
And the system is just shrugging
its shoulders and saying maybe. 

224
00:10:06,720 --> 00:10:09,920
So if I'm in an interview and 
the interviewer asks what 

225
00:10:09,920 --> 00:10:13,240
happens if this request to the 
payment service times out, the 

226
00:10:13,240 --> 00:10:16,760
wrong answer is we assume it 
failed and retry. 

227
00:10:17,000 --> 00:10:18,960
That is a fatal error in the 
interview. 

228
00:10:19,560 --> 00:10:22,720
If you assume it failed and you 
retry, you might transfer the 

229
00:10:22,720 --> 00:10:25,440
money twice. 
If you assume it succeeded and 

230
00:10:25,440 --> 00:10:28,040
you don't retry, you might never
transfer the money. 

231
00:10:28,240 --> 00:10:29,760
You have to design for the 
unknown. 
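
[Editor's sketch] One way to make "design for the unknown" concrete is to refuse to collapse a timeout into a plain failure. The sketch below is hypothetical: `Outcome` and `classify` are illustrative names, and treating every non-timeout exception as a definite failure is an assumption that real clients would need to refine.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"  # the extra state that only shows up across a network


def classify(call: Callable[[], object]) -> Outcome:
    """Run a remote call and report what the caller actually knows afterwards."""
    try:
        call()
    except TimeoutError:
        # No reply arrived: the server may or may not have done the work.
        return Outcome.UNKNOWN
    except Exception:
        # Assumption for this sketch: any other error is a definite failure.
        return Outcome.FAILURE
    return Outcome.SUCCESS


def transfer_that_times_out() -> None:
    raise TimeoutError("no reply within the deadline")


if __name__ == "__main__":
    print(classify(transfer_that_times_out))
```

The caller then has to handle `Outcome.UNKNOWN` on purpose, for example by checking whether the transfer happened before sending it again.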

232
00:10:30,000 --> 00:10:32,360
And this all comes back to that 
concept you mentioned earlier, 

233
00:10:32,360 --> 00:10:36,240
independent failure. Yes. 
Unlike the local Pac-Man machine

234
00:10:36,240 --> 00:10:39,400
where the CPU and the memory 
share fate, in a distributed 

235
00:10:39,400 --> 00:10:42,600
system, the network can be 
perfectly fine while the server 

236
00:10:42,600 --> 00:10:44,840
dies. 
The server can be perfectly fine

237
00:10:44,840 --> 00:10:47,800
while the network dies. 
They fail independently of each 

238
00:10:47,800 --> 00:10:49,560
other. 
So we've established that the 

239
00:10:49,560 --> 00:10:52,320
environment is hostile, the 
network is unreliable, and we 

240
00:10:52,320 --> 00:10:55,080
can't trust timeouts to tell us 
the actual truth of what 

241
00:10:55,080 --> 00:10:57,120
happened. 
Correct, that's the foundation. 

242
00:10:57,120 --> 00:10:59,960
Now let's talk about how we as 
engineers make this whole 

243
00:10:59,960 --> 00:11:01,920
situation so much worse for 
ourselves. 

244
00:11:02,280 --> 00:11:03,720
Let's talk about assumption 
drift. 

245
00:11:04,400 --> 00:11:07,840
We have this source, The Hidden 
Danger of assumption drift, and 

246
00:11:07,840 --> 00:11:11,240
it introduces the idea of the 
almost correct system. 

247
00:11:11,440 --> 00:11:15,720
The almost correct system. 
It's actually far more dangerous

248
00:11:15,920 --> 00:11:18,880
than a blatantly broken system. 
How so? 

249
00:11:19,400 --> 00:11:22,160
That seems backwards. 
Well, if a system is broken, 

250
00:11:22,160 --> 00:11:25,200
like it crashes on startup, you 
fix it. 

251
00:11:25,520 --> 00:11:28,640
It fails loud. 
Failing loud is a feature. 

252
00:11:28,800 --> 00:11:31,160
It's obvious there's a problem. 
Right, you get an alert, you 

253
00:11:31,160 --> 00:11:34,720
find the bug, you push a fix. 
An almost correct system is 

254
00:11:34,720 --> 00:11:36,720
different. 
It passes all the unit tests. 

255
00:11:36,720 --> 00:11:39,320
It works perfectly in staging, 
it works when you demo it to 

256
00:11:39,320 --> 00:11:42,440
your boss. 
But it has these hidden latent 

257
00:11:42,440 --> 00:11:46,400
assumptions that only drift 
apart and cause a catastrophe 

258
00:11:46,640 --> 00:11:48,560
under a very specific kind of 
load. 

259
00:11:48,560 --> 00:11:51,320
And the classic example of this 
which the source walks through 

260
00:11:51,320 --> 00:11:54,040
is a timeout chain disaster. 
This is a distributed systems 

261
00:11:54,040 --> 00:11:56,880
horror story, and it's a true 
story that happens all the time.

262
00:11:57,400 --> 00:12:01,400
Let's imagine a simple chain. 
You have a user, we'll call them

263
00:12:01,400 --> 00:12:04,040
the client, a load balancer in 
the middle and a back end 

264
00:12:04,040 --> 00:12:06,920
service doing the work. 
OK, a standard 3 tier setup. 

265
00:12:07,040 --> 00:12:09,280
Now let's look at their timeout
configurations. 

266
00:12:09,760 --> 00:12:12,120
These might have been set by 
different teams at different 

267
00:12:12,120 --> 00:12:14,800
times and they all look 
reasonable individually. 

268
00:12:15,240 --> 00:12:19,200
The client, say a mobile app, 
wants a snappy experience, so 

269
00:12:19,200 --> 00:12:21,600
its developer sets a timeout 
of two seconds. 

270
00:12:22,040 --> 00:12:23,880
It makes sense. 
I don't want to wait forever for

271
00:12:23,880 --> 00:12:25,800
a page to load. 
If it takes longer than two 

272
00:12:25,800 --> 00:12:28,440
seconds, I'm probably going to 
refresh or just leave the app. 

273
00:12:28,440 --> 00:12:31,200
Exactly. 
The load balancer, which is a 

274
00:12:31,320 --> 00:12:34,240
piece of infrastructure, needs 
to be a bit more patient to 

275
00:12:34,240 --> 00:12:36,680
manage traffic, so it has a 
timeout of five seconds. 

276
00:12:36,760 --> 00:12:39,760
OK, still seems fine. 
And the back end service which 

277
00:12:39,760 --> 00:12:42,400
does the heavy lifting, 
maybe it's a complex database 

278
00:12:42,400 --> 00:12:45,680
query, has a timeout of 30 
seconds to make sure it always 

279
00:12:45,680 --> 00:12:47,720
finishes its jobs, even the big 
ones. 

280
00:12:47,880 --> 00:12:50,880
So we have two seconds on the 
client, 5 on the load balancer 

281
00:12:50,880 --> 00:12:54,280
and 30 on the back end. 
Individually these numbers seem 

282
00:12:54,280 --> 00:12:56,480
perfectly fine. 
The back end needs time to 

283
00:12:56,480 --> 00:12:58,720
crunch numbers. 
The client needs to be fast. 

284
00:12:58,960 --> 00:13:00,640
But watch what happens when you 
put them together. 

285
00:13:01,120 --> 00:13:03,960
The client sends a request, it 
goes through the load balancer, 

286
00:13:03,960 --> 00:13:06,280
hits the back end, and the back 
end starts processing. 

287
00:13:06,760 --> 00:13:08,960
But let's say the back end is 
having a slow day. 

288
00:13:09,120 --> 00:13:11,520
Maybe there's a cold cache, 
maybe the database is under 

289
00:13:11,520 --> 00:13:13,360
load. 
It's going to take 10 seconds to

290
00:13:13,360 --> 00:13:14,240
do the work. 
OK. 

291
00:13:14,240 --> 00:13:17,000
So it's going to be slow. 
Now at the two second mark, what

292
00:13:17,000 --> 00:13:20,560
does the client do? 
The client times out, it gives 

293
00:13:20,560 --> 00:13:23,280
up. 
In its world, the operation has 

294
00:13:23,280 --> 00:13:25,680
failed. 
Right, the client says this 

295
00:13:25,680 --> 00:13:28,240
didn't work. 
And typically, what does a well 

296
00:13:28,240 --> 00:13:31,360
behaved client application do 
when something fails? 

297
00:13:31,480 --> 00:13:32,880
It retries. 
It retries. 

298
00:13:32,880 --> 00:13:35,080
It sends a second request for 
the exact same thing. 

299
00:13:35,080 --> 00:13:37,280
But here is the critical moment 
of drift. 

300
00:13:37,960 --> 00:13:40,680
Does the back end know the 
client has given up and left? 

301
00:13:40,760 --> 00:13:44,240
No, of course not. 
The back end is still working on

302
00:13:44,240 --> 00:13:46,720
the first request. 
It has a 30-second timeout. 

303
00:13:47,120 --> 00:13:49,800
It thinks it has 28 seconds left
to finish the job. 

304
00:13:49,840 --> 00:13:51,760
Exactly. 
So the back end is chugging 

305
00:13:51,760 --> 00:13:56,200
along processing request one. 
Suddenly request 2 arrives from 

306
00:13:56,200 --> 00:13:58,840
the same user. 
Now the back end is processing 

307
00:13:58,840 --> 00:14:01,440
both requests. 
But the client is only listening

308
00:14:01,440 --> 00:14:02,880
for the answer to the second 
one. 

309
00:14:03,240 --> 00:14:06,480
The first one is an orphan. 
It's an orphan, it's a ghost, 

310
00:14:06,840 --> 00:14:10,360
it's doing work for nobody, and 
if the back end is slow because 

311
00:14:10,360 --> 00:14:12,680
it's overloaded, we just doubled
the load on it. 

312
00:14:12,800 --> 00:14:15,800
And if that second request also 
takes more than two seconds. 

313
00:14:15,840 --> 00:14:18,840
The client retries again. 
Request 3 arrives. 

314
00:14:19,160 --> 00:14:21,160
Now the back end is doing triple
the work. 

315
00:14:21,480 --> 00:14:24,760
The system isn't just failing, 
it's actively tearing itself 

316
00:14:24,760 --> 00:14:27,240
apart in a feedback loop. 
And the crazy part is that the 

317
00:14:27,240 --> 00:14:29,560
code in the back end is 
technically correct. 

318
00:14:29,800 --> 00:14:32,720
It's dutifully processing every 
request it received. 

319
00:14:32,720 --> 00:14:34,320
It is. 
It's just processing a request 

320
00:14:34,320 --> 00:14:37,080
that nobody wants anymore. 
This is assumption drift in its 

321
00:14:37,080 --> 00:14:40,160
purest form. 
The client assumes if I timed 

322
00:14:40,160 --> 00:14:42,200
out, the request is dead and 
gone. 

323
00:14:42,640 --> 00:14:45,800
The back end assumes if I'm 
processing a request, someone is

324
00:14:45,800 --> 00:14:47,240
patiently waiting for the 
answer. 

325
00:14:48,040 --> 00:14:50,920
Both are totally wrong. 
Their realities have drifted 

326
00:14:50,920 --> 00:14:53,000
apart. 
The client lives in a 2 second 

327
00:14:53,000 --> 00:14:55,080
reality. 
The back end lives in a 30-second 

328
00:14:55,080 --> 00:14:56,840
reality. 
They're not in the same universe

329
00:14:56,840 --> 00:14:58,280
anymore. 
And in that void between 

330
00:14:58,280 --> 00:15:01,360
universes, your system dies. 
So how do we fix this? 

331
00:15:01,360 --> 00:15:04,160
What's the senior engineer move 
here at the whiteboard? 

332
00:15:04,320 --> 00:15:08,920
The big shift is moving from 
these implicit contracts, like "I 

333
00:15:08,920 --> 00:15:11,080
guess it will finish quickly," to 
explicit ones. 

334
00:15:11,640 --> 00:15:14,360
The first step is what you could
call the alignment rule, which 

335
00:15:14,400 --> 00:15:17,120
is, generally, you want your 
timeouts to decrease as you go 

336
00:15:17,120 --> 00:15:19,600
down the stack. 
The client should be the most 

337
00:15:19,600 --> 00:15:21,840
patient. 
If the client is willing to wait

338
00:15:21,840 --> 00:15:24,640
30 seconds, then the load 
balancer should be willing to 

339
00:15:24,640 --> 00:15:28,040
wait say 29 seconds and the back
end 28. 

340
00:15:28,440 --> 00:15:30,800
The system gives up from the 
back, not from the front. 

341
00:15:30,800 --> 00:15:33,520
So you shed load at the 
source, not at the client. 
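
[Editor's sketch] The alignment rule can be checked mechanically: walking from the caller toward the back end, each hop should have a shorter timeout than the hop that called it, as in the 30, 29, 28 example. The function name `misaligned_hops` and the hop names are made up for illustration.

```python
from typing import List, Tuple

Hop = Tuple[str, float]  # (name, timeout in seconds), ordered caller to callee


def misaligned_hops(chain: List[Hop]) -> List[Tuple[str, str]]:
    """Return every (caller, callee) pair where the callee outlasts its caller."""
    problems = []
    for (caller, caller_t), (callee, callee_t) in zip(chain, chain[1:]):
        if callee_t >= caller_t:
            # The callee may keep working after the caller has given up.
            problems.append((caller, callee))
    return problems


if __name__ == "__main__":
    drifted = [("client", 2.0), ("load_balancer", 5.0), ("backend", 30.0)]
    aligned = [("client", 30.0), ("load_balancer", 29.0), ("backend", 28.0)]
    print(misaligned_hops(drifted))
    print(misaligned_hops(aligned))
```

Run against the 2, 5, 30 configuration from the horror story, both hops are flagged. The aligned configuration comes back clean.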

342
00:15:33,600 --> 00:15:35,840
Exactly. 
Yeah, but the gold standard, the

343
00:15:35,840 --> 00:15:38,240
thing that really shows you know
what you're talking about, is 

344
00:15:38,240 --> 00:15:40,680
deadline propagation. 
Deadline propagation? 

345
00:15:40,680 --> 00:15:43,200
That sounds fancy. 
It's incredibly elegant. 

346
00:15:43,560 --> 00:15:46,560
Instead of every service having 
its own static timeout number 

347
00:15:46,560 --> 00:15:49,920
like 30 seconds, the first 
service in the chain, the client

348
00:15:49,920 --> 00:15:52,640
or the front end, says, I have a 
total budget of 2 seconds for 

349
00:15:52,640 --> 00:15:55,600
this entire operation. 
It then passes that deadline down 

350
00:15:55,600 --> 00:15:57,320
the chain, usually in a request 
header. 

351
00:15:57,720 --> 00:16:01,720
So if the load balancer takes 0.5 
seconds to do its part, it looks

352
00:16:01,720 --> 00:16:05,080
at the budget and tells the back
end you have 1.5 seconds left, go.

353
00:16:05,400 --> 00:16:09,320
Precisely. And if the back end 
gets a request and sees that it 

354
00:16:09,320 --> 00:16:13,040
only has 0.1 seconds left in the 
budget, but it knows the job 

355
00:16:13,040 --> 00:16:15,920
takes at least a second, it 
doesn't even start, it fails 

356
00:16:15,920 --> 00:16:18,120
fast. 
It immediately returns an error 

357
00:16:18,120 --> 00:16:20,720
saying I can't possibly do this 
in time. 

358
00:16:20,720 --> 00:16:23,880
That saves so much wasted work. 
The ghost work. 

359
00:16:23,880 --> 00:16:26,120
It saves the system, it aligns 
all the assumptions. 

360
00:16:26,200 --> 00:16:28,000
Everyone is looking at the exact
same clock. 
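
[Editor's sketch] A minimal model of deadline propagation, assuming the budget travels as "seconds remaining" with each request. The names `forward_budget`, `backend_handle`, and `DeadlineExceeded` are hypothetical, and the half second spent at the intermediate hop is an illustrative number. The 2 second budget, the 0.1 seconds remaining, and the one second minimum job come from the conversation.

```python
class DeadlineExceeded(Exception):
    """Raised when the remaining budget cannot cover the work."""


def forward_budget(budget_s: float, spent_s: float) -> float:
    """Budget to pass downstream after this hop has used `spent_s` seconds."""
    return max(0.0, budget_s - spent_s)


def backend_handle(budget_s: float, min_cost_s: float = 1.0) -> str:
    """Fail fast instead of starting work nobody will wait for."""
    if budget_s < min_cost_s:
        raise DeadlineExceeded(
            f"need at least {min_cost_s}s but only {budget_s}s remain"
        )
    return "done"


if __name__ == "__main__":
    remaining = forward_budget(2.0, spent_s=0.5)  # intermediate hop took 0.5s
    print(remaining, backend_handle(remaining))
    try:
        backend_handle(0.1)
    except DeadlineExceeded as exc:
        print("rejected:", exc)
```

The design choice here is that no hop owns a static timeout. Each hop only subtracts what it spent and hands the rest on.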

361
00:16:28,200 --> 00:16:31,520
It stops the ghost work, the 
work being done for clients who 

362
00:16:31,520 --> 00:16:34,000
have already left the building. 
OK, that covers the timing 

363
00:16:34,000 --> 00:16:37,600
assumption, but you mentioned 
that the client retries and it 

364
00:16:37,600 --> 00:16:40,160
seems like retries are a huge 
double-edged sword. 

365
00:16:40,160 --> 00:16:43,240
Oh, retries are selfish. 
That's the phrasing from the 

366
00:16:43,240 --> 00:16:45,680
Amazon Builders' Library, and I 
absolutely love it. 

367
00:16:45,760 --> 00:16:48,120
Selfish. 
How is code selfish? 

368
00:16:48,480 --> 00:16:50,240
Think about it from the server's
perspective. 

369
00:16:50,720 --> 00:16:54,560
When you retry, you're demanding
more resources from a server 

370
00:16:54,560 --> 00:16:56,280
that is likely already 
struggling. 

371
00:16:56,760 --> 00:17:00,400
You're essentially saying I know
you're busy or slow or failing, 

372
00:17:00,880 --> 00:17:03,720
but stop what you're doing and 
process my request again right 

373
00:17:03,720 --> 00:17:05,319
now. 
It's like shouting at a waiter 

374
00:17:05,319 --> 00:17:07,599
in a slammed restaurant who is 
already dropping plates. 

375
00:17:07,599 --> 00:17:10,440
It doesn't help. 
It's the exact same dynamic, and

376
00:17:10,440 --> 00:17:13,520
this leads to these catastrophic
events called retry storms. 

377
00:17:14,400 --> 00:17:18,640
If you have a deep architecture, 
service A calls B, which calls C

378
00:17:18,640 --> 00:17:22,680
which calls the database, and 
they all have a default retry 3 

379
00:17:22,680 --> 00:17:24,920
times policy. 
You can do the math. 

380
00:17:25,040 --> 00:17:26,680
Oh boy, I think I see where this
is going. 

381
00:17:26,720 --> 00:17:30,480
If the database at the bottom 
slows down, service C retries 

382
00:17:30,480 --> 00:17:32,120
its call to the database three 
times. 

383
00:17:32,160 --> 00:17:34,000
OK. 
But from service B's 

384
00:17:34,000 --> 00:17:37,960
perspective, its call to 
service C is just slow, so it 

385
00:17:37,960 --> 00:17:41,720
retries its call to service C 3 
times, and each of those calls 

386
00:17:41,720 --> 00:17:43,400
will trigger three database 
retries. 

387
00:17:43,400 --> 00:17:45,120
That's nine database queries 
now. 

388
00:17:45,120 --> 00:17:48,200
And service A sees that service 
B is slow, so it retries its 

389
00:17:48,200 --> 00:17:51,720
call to service B three times. 
So you have 3 * 3 * 3. 

390
00:17:51,720 --> 00:17:54,560
That's 27 queries to the 
database for what was originally

391
00:17:54,560 --> 00:17:57,680
a single user action. 
If you have 5 layers of services

392
00:17:57,680 --> 00:18:01,000
all retrying 3 times, the load 
on the database increases by a 

393
00:18:01,000 --> 00:18:04,480
factor of 243. 
You turn a small hiccup into a 

394
00:18:04,480 --> 00:18:07,320
self-inflicted distributed 
denial of service attack. 
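
Editor's sketch of the retry arithmetic above. The function name `worst_case_calls` is hypothetical, and it follows the episode's simplification that each layer makes 3 calls for every call it receives.

```python
def worst_case_calls(calls_per_layer: int, layers: int) -> int:
    """Calls reaching the bottom when every layer multiplies its input."""
    return calls_per_layer ** layers


three_layers = worst_case_calls(3, 3)   # services A, B, C
five_layers = worst_case_calls(3, 5)
```

The growth is exponential in depth, which is why one common mitigation is to retry at a single layer only.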

395
00:18:07,320 --> 00:18:10,040
A complete meltdown. 
You've amplified a tiny problem 

396
00:18:10,040 --> 00:18:13,040
into a catastrophic failure. 
So do we just not retry? 

397
00:18:13,040 --> 00:18:14,920
Is that the answer? 
That seems wrong too. 

398
00:18:15,160 --> 00:18:17,760
No, we have to retry. 
Networks are flaky. 

399
00:18:17,760 --> 00:18:19,520
We all know that from the 8 
failure modes. 

400
00:18:19,960 --> 00:18:21,840
We need to handle transient 
failures. 

401
00:18:22,080 --> 00:18:25,280
A dropped packet shouldn't kill 
the whole operation, but we need

402
00:18:25,280 --> 00:18:28,240
to do it defensively. 
We need to do it with back off 

403
00:18:28,320 --> 00:18:30,520
and jitter. 
OK, let's unpack these. 

404
00:18:30,520 --> 00:18:33,520
Back off is pretty standard. 
I think most people have heard 

405
00:18:33,520 --> 00:18:35,360
of that. 
Exponential back off. 

406
00:18:35,560 --> 00:18:38,120
It's the most common type. 
Instead of retrying immediately,

407
00:18:38,120 --> 00:18:39,920
you wait one second. 
If that fails, you wait 2 

408
00:18:39,920 --> 00:18:41,600
seconds. 
If that fails, you wait 4 

409
00:18:41,600 --> 00:18:44,480
seconds, then 8, and so on. 
You give the server breathing 

410
00:18:44,480 --> 00:18:46,680
room. 
You're being polite instead of 

411
00:18:46,680 --> 00:18:48,840
selfish. 
But the AWS source makes a 

412
00:18:48,840 --> 00:18:52,560
really big deal about jitter. 
Why isn't exponential back off 

413
00:18:52,560 --> 00:18:55,920
enough on its own? 
Because of synchronization, this

414
00:18:55,920 --> 00:18:57,800
is a subtle but critical failure
mode. 

415
00:18:58,120 --> 00:19:01,360
Imagine a popular mobile app and
the server it talks to reboots. 

416
00:19:01,760 --> 00:19:05,320
So you have thousands of clients
all at the exact same time 

417
00:19:05,600 --> 00:19:09,240
getting a connection failure at 
12:00:00. 

418
00:19:09,240 --> 00:19:11,120
OK, they all fail 
simultaneously. 

419
00:19:11,120 --> 00:19:13,160
They all have the same back off 
logic so they all back off for 

420
00:19:13,160 --> 00:19:16,720
one second. 
What happens at 12:00:01? 

421
00:19:16,720 --> 00:19:19,000
They all hit the server again at
the exact same time. 

422
00:19:19,080 --> 00:19:20,680
Wham. 
They all fail again. 

423
00:19:20,680 --> 00:19:25,240
They all back off for 2 seconds.
What happens at 12:00:03?

424
00:19:26,240 --> 00:19:28,640
You get these perfectly 
synchronized spikes of traffic. 

425
00:19:28,640 --> 00:19:32,040
It looks like a metronome. 
Tick slam, Tick slam. 

426
00:19:32,120 --> 00:19:34,560
And the server can never recover
because it just gets hammered in

427
00:19:34,560 --> 00:19:36,400
these synchronized waves. It 
can't get its feet under it. 

428
00:19:36,400 --> 00:19:39,920
Never. Jitter is the solution.
Jitter adds randomness. 

429
00:19:40,120 --> 00:19:42,920
Instead of waiting exactly one 
second, you wait one second, 

430
00:19:42,920 --> 00:19:44,800
plus or minus a few hundred 
milliseconds. 

431
00:19:45,360 --> 00:19:49,240
Or a better strategy is to pick 
a random number between zero and

432
00:19:49,240 --> 00:19:51,320
your current exponential back 
off cap. 
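
Editor's sketch of the second strategy above: a random delay between zero and the current exponential ceiling. The name `backoff_delay`, the 1-second base, and the 60-second cap are assumptions for illustration.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    # Exponential ceiling: 1 s, 2 s, 4 s, 8 s, ... up to the cap.
    ceiling = min(cap, base * 2 ** attempt)
    # Randomizing over the whole range de-synchronizes clients.
    return rng.uniform(0.0, ceiling)


# A seeded generator keeps this demonstration repeatable.
demo_rng = random.Random(0)
delays = [backoff_delay(n, rng=demo_rng) for n in range(10)]
```

Each client draws its own delay, so clients that failed at the same instant come back spread across the window instead of all at once.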

433
00:19:51,480 --> 00:19:53,720
So instead of a metronome. 
You want rain. 

434
00:19:53,800 --> 00:19:56,560
You want the requests to patter 
against the server randomly, 

435
00:19:56,800 --> 00:19:59,360
spreading the load out over time
instead of concentrating into 

436
00:19:59,360 --> 00:20:01,680
these sharp spikes. 
I love that visual, metronome 

437
00:20:01,680 --> 00:20:03,520
versus rain. 
That's a great way to explain it

438
00:20:03,520 --> 00:20:06,560
in an interview. 
It is, and if you are designing 

439
00:20:06,560 --> 00:20:09,280
a system in an interview and you
mention retries without also 

440
00:20:09,280 --> 00:20:12,840
mentioning jitter, you are 
setting yourself up for failure.

441
00:20:13,640 --> 00:20:15,680
The interviewer will look at you
and ask what about the 

442
00:20:15,680 --> 00:20:18,160
thundering herd problem and 
you'll have to backtrack. 

443
00:20:18,240 --> 00:20:19,680
It shows a gap in your 
knowledge. 

444
00:20:19,960 --> 00:20:22,240
Noted. 
OK, there's another huge piece 

445
00:20:22,240 --> 00:20:25,840
to this puzzle. 
If we are retrying requests, we 

446
00:20:25,840 --> 00:20:28,680
run the risk of doing things 
twice, like that bank transfer 

447
00:20:28,680 --> 00:20:29,800
we talked about at the 
beginning. 

448
00:20:29,920 --> 00:20:32,120
This is where we have to talk 
about idempotency. 

449
00:20:32,120 --> 00:20:34,840
It's a fun word to say, 
idempotency. 

450
00:20:34,840 --> 00:20:38,360
And it basically just means 
doing something more than once 

451
00:20:38,360 --> 00:20:40,840
has the same effect as doing it 
just once. 

452
00:20:40,840 --> 00:20:42,680
Correct. 
The mathematical definition is 

453
00:20:42,760 --> 00:20:46,600
f(f(x)) = f(x). 
In a distributed system, 

454
00:20:46,880 --> 00:20:50,840
if you have retries, which we've
established you absolutely must,

455
00:20:51,560 --> 00:20:53,520
then your operations must be 
idempotent. 

456
00:20:53,520 --> 00:20:55,920
There's no way around it. 
But not all operations are 

457
00:20:55,920 --> 00:20:58,680
naturally idempotent, right? 
Like adding an item to a 

458
00:20:58,680 --> 00:20:59,520
shopping cart. 
Is. 

459
00:20:59,520 --> 00:21:01,600
No, it's not. So we have to 
categorize them. 

460
00:21:01,720 --> 00:21:03,480
First, there's natural 
idempotency. 

461
00:21:03,480 --> 00:21:06,520
This is the easy stuff. 
An operation like set user 

462
00:21:06,520 --> 00:21:08,560
status to active. 
You can call that 10 times in a 

463
00:21:08,560 --> 00:21:10,920
row, the status is still active.
That's safe to retry. 

464
00:21:11,160 --> 00:21:13,560
Then there's what you might call
business idempotency. 

465
00:21:14,280 --> 00:21:16,720
This relies on a natural key in 
your business domain. 

466
00:21:16,840 --> 00:21:19,520
The classic example is create 
customer e-mail. 

467
00:21:19,880 --> 00:21:21,800
The e-mail address is the unique
key. 

468
00:21:22,000 --> 00:21:26,280
If I try to create a customer 
for bob@example.com twice, the 

469
00:21:26,280 --> 00:21:30,200
second one should either fail 
with a user already exists error

470
00:21:30,240 --> 00:21:31,840
or just return the existing 
record. 

471
00:21:32,440 --> 00:21:34,240
The system state doesn't change 
incorrectly. 

472
00:21:34,240 --> 00:21:36,200
OK, that makes sense. 
You rely on the uniqueness of 

473
00:21:36,200 --> 00:21:38,840
the data itself. 
But the hard one, and the one 

474
00:21:38,840 --> 00:21:41,480
that always comes up in system 
design interviews, is operations

475
00:21:41,480 --> 00:21:43,240
that are inherently not 
idempotent. 

476
00:21:43,400 --> 00:21:46,000
The classic one is charge this 
credit card $50. 

477
00:21:46,320 --> 00:21:48,000
Right. 
The credit card company is more 

478
00:21:48,000 --> 00:21:50,360
than happy to let me charge a 
card $50 twice. 

479
00:21:50,640 --> 00:21:53,000
That's a perfectly valid 
sequence of operations. 

480
00:21:53,000 --> 00:21:54,480
Exactly. 
So you can't rely on the data 

481
00:21:54,480 --> 00:21:57,280
itself for these cases. 
You need to introduce an 

482
00:21:57,280 --> 00:22:01,120
artificial key, a unique ID or 
an idempotency key. 

483
00:22:01,200 --> 00:22:03,440
How does that work? 
When the client decides to make 

484
00:22:03,440 --> 00:22:06,880
the payment, it first generates 
a unique ID, usually a UUID. 

485
00:22:07,080 --> 00:22:09,640
Then it sends the request saying
charge $50 and use this 

486
00:22:09,640 --> 00:22:14,120
transaction ID 123ABC456. 
And the server has to remember 

487
00:22:14,120 --> 00:22:17,440
that it's seen 123ABC456 
before. 

488
00:22:17,560 --> 00:22:21,440
Yes, the server stores that 
idempotency key, maybe in a 

489
00:22:21,440 --> 00:22:24,440
dedicated table or right 
alongside the transaction record

490
00:22:24,600 --> 00:22:27,520
for some period of time. 
If it receives a request with 

491
00:22:27,520 --> 00:22:30,680
that same ID again, it doesn't 
try to charge the card again. 

492
00:22:30,840 --> 00:22:33,720
It just looks up the result of 
the first attempt and says oh I 

493
00:22:33,720 --> 00:22:36,320
already did this, here is the 
receipt from the first time. 

494
00:22:36,720 --> 00:22:39,240
Ah, so that solves the unknown 
failure mode perfectly. 

495
00:22:39,520 --> 00:22:41,760
If I get a timeout on my 
payment request, I just retry 

496
00:22:41,760 --> 00:22:43,800
with the exact same ID. 
Exactly. 

497
00:22:44,040 --> 00:22:46,440
If the first one failed before 
anything happened, the second 

498
00:22:46,440 --> 00:22:48,720
one goes through. 
If the first one actually 

499
00:22:48,720 --> 00:22:51,520
succeeded but the reply was 
lost, the second one doesn't 

500
00:22:51,520 --> 00:22:54,440
double charge, it just returns 
the saved success message. 

501
00:22:54,920 --> 00:22:56,600
It handles the ambiguity 
perfectly. 
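
Editor's sketch of the idempotency-key flow above, using an in-memory dict in place of the dedicated table. `PaymentServer` and its fields are hypothetical, and a real service would also need atomic writes and key expiry, which this leaves out.

```python
import uuid


class PaymentServer:
    """Toy server that remembers the result stored under each key."""

    def __init__(self) -> None:
        self._results = {}      # idempotency key -> stored receipt
        self.charges_made = 0   # how many times the card was really charged

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # Seen this key before: replay the first receipt, charge nothing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = {"charge_id": self.charges_made,
                   "amount_cents": amount_cents,
                   "status": "succeeded"}
        self._results[idempotency_key] = receipt
        return receipt


server = PaymentServer()
key = str(uuid.uuid4())            # generated once by the client
first = server.charge(key, 5000)   # reply assumed lost in transit
retry = server.charge(key, 5000)   # same key, so no second charge
```

The client has to reuse the same key on every retry of one payment and generate a fresh key for each new payment.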

502
00:22:56,960 --> 00:23:01,000
So the "so what" for the interview
is if you design a payment 

503
00:23:01,000 --> 00:23:05,040
system or an ordering system or 
anything transactional and you 

504
00:23:05,040 --> 00:23:08,120
don't explicitly mention 
idempotency keys alongside your 

505
00:23:08,120 --> 00:23:11,280
retry strategy with back off and
jitter, you're essentially 

506
00:23:11,280 --> 00:23:14,080
designing a system that will 
either lose or steal money. 

507
00:23:14,240 --> 00:23:16,200
You fail the interview, plain 
and simple. 

508
00:23:16,840 --> 00:23:19,400
You cannot build reliable 
distributed systems without 

509
00:23:19,400 --> 00:23:22,400
these fundamental patterns. 
You're building a casino, not a 

510
00:23:22,400 --> 00:23:24,400
bank. 
OK, we're getting into the 

511
00:23:24,400 --> 00:23:27,120
really deep water now. 
We've covered the basics of 

512
00:23:27,120 --> 00:23:29,680
defensive engineering. 
Now I want to talk about 

513
00:23:29,680 --> 00:23:33,360
something even scarier. 
Metastability. 

514
00:23:34,120 --> 00:23:37,280
Metastability. 
This is one of my favorite 

515
00:23:37,280 --> 00:23:39,360
topics because it's so 
non-intuitive. 

516
00:23:39,560 --> 00:23:43,080
It comes from a fantastic paper 
called Metastable Failures in 

517
00:23:43,080 --> 00:23:46,760
Distributed Systems and it 
describes a specific type of 

518
00:23:46,760 --> 00:23:49,000
nightmare scenario. 
The nightmare that doesn't end 

519
00:23:49,000 --> 00:23:51,160
when you wake up. 
Pretty much, yeah. 

520
00:23:51,360 --> 00:23:54,840
We often think of systems as 
being in one of two states, up 

521
00:23:54,840 --> 00:23:57,440
or down. 
Metastability introduces a 

522
00:23:57,440 --> 00:24:01,280
terrifying third state that's, 
well, kind of both and neither. 

523
00:24:01,280 --> 00:24:04,320
And the definitions from the 
paper here are stable, 

524
00:24:04,320 --> 00:24:06,480
vulnerable and metastable. 
Correct. 

525
00:24:06,840 --> 00:24:10,080
Stable is the happy state. 
Normal operation, traffic is 

526
00:24:10,080 --> 00:24:12,280
flowing, latency is low, 
everything is fine. 

527
00:24:12,280 --> 00:24:13,480
OK. 
Business as usual. 

528
00:24:13,600 --> 00:24:15,120
Vulnerable is the interesting 
one. 

529
00:24:15,280 --> 00:24:17,000
The system looks fine from the 
outside. 

530
00:24:17,000 --> 00:24:20,080
The dashboards are green. 
It's serving traffic, but it has

531
00:24:20,080 --> 00:24:22,560
lost its safety margin. 
It's like an airplane that has 

532
00:24:22,560 --> 00:24:24,520
lost one of its engines but is 
still flying. 

533
00:24:25,000 --> 00:24:27,280
It's walking on the edge of a 
cliff, but it hasn't fallen yet.

534
00:24:27,440 --> 00:24:31,240
Then comes the trigger. 
It's often something small, a 

535
00:24:31,240 --> 00:24:36,240
momentary network blip, a minor 
traffic spike, a single server 

536
00:24:36,240 --> 00:24:41,120
rebooting, and that little push 
sends the system over the edge 

537
00:24:41,600 --> 00:24:44,360
into the metastable state. 
And this is where it gets really

538
00:24:44,360 --> 00:24:47,040
weird. 
In a normal failure, you remove 

539
00:24:47,040 --> 00:24:50,120
the trigger, you fix the bug, or
the traffic spike ends and the 

540
00:24:50,120 --> 00:24:52,080
system recovers and goes back to
normal. 

541
00:24:52,560 --> 00:24:55,640
But in a metastable failure, 
removing the trigger does not 

542
00:24:55,640 --> 00:24:58,360
fix the system. 
The system stays down. 

543
00:24:58,560 --> 00:25:00,080
It gets stuck. 
Why? 

544
00:25:00,600 --> 00:25:03,040
What's holding it down? 
It's held down by the sustaining

545
00:25:03,040 --> 00:25:05,080
effect. 
There's some kind of feedback 

546
00:25:05,080 --> 00:25:08,200
loop in the system's own 
recovery or error handling logic

547
00:25:08,360 --> 00:25:10,400
that keeps it pinned in the 
failed state. 

548
00:25:10,640 --> 00:25:12,760
The analogy I liked from the 
source was getting stuck in a 

549
00:25:12,760 --> 00:25:14,760
hole. 
Yes, that's the perfect way to 

550
00:25:14,760 --> 00:25:17,200
think about it. 
Tripping on a small rock is the 

551
00:25:17,200 --> 00:25:19,000
trigger. 
It causes you to fall into a 

552
00:25:19,000 --> 00:25:22,320
deep hole, but once you're at 
the bottom of the hole, the rock

553
00:25:22,320 --> 00:25:24,280
doesn't matter anymore. 
You can remove the rock, but 

554
00:25:24,280 --> 00:25:26,440
you're still in the hole. 
The depth of the hole is the 

555
00:25:26,440 --> 00:25:28,600
sustaining effect. 
Let's make this concrete. 

556
00:25:28,840 --> 00:25:32,200
The lookaside cache failure is 
the absolute classic example of 

557
00:25:32,200 --> 00:25:34,080
this. 
This has happened to basically 

558
00:25:34,080 --> 00:25:35,840
every large tech company at some
point. 

559
00:25:36,280 --> 00:25:39,720
You have a standard setup, a web
app, a cache like Redis and a 

560
00:25:39,720 --> 00:25:41,040
database. 
Very standard. 

561
00:25:41,040 --> 00:25:44,280
In the stable state, the cache 
is warm and has a 90% hit rate. 

562
00:25:44,720 --> 00:25:49,320
The app is handling say 3000 
requests per second, or QPS. With 

563
00:25:49,320 --> 00:25:53,080
a 90% hit rate, 2700 of 
those requests are served 

564
00:25:53,080 --> 00:25:57,040
instantly by the fast cache. 
Only 300 QPS actually hit the 

565
00:25:57,040 --> 00:25:59,760
slow database. 
And let's say the database can 

566
00:25:59,760 --> 00:26:01,640
handle 500 QPS. 
It's happy. 

567
00:26:01,760 --> 00:26:04,520
It's not even breaking a sweat. 
OK, the system is stable, but 

568
00:26:04,520 --> 00:26:08,160
its hidden capacity limit is 
500 QPS at the database layer. 

569
00:26:08,200 --> 00:26:10,160
It's vulnerable. 
Exactly, it's vulnerable. 

570
00:26:10,160 --> 00:26:12,920
Now the trigger, the cache 
server cluster has to be 

571
00:26:12,920 --> 00:26:15,360
rebooted, maybe for a security 
patch or an upgrade. 

572
00:26:15,640 --> 00:26:18,400
For a few moments the cache is 
completely empty. 

573
00:26:18,400 --> 00:26:20,120
So the cache hit rate drops from 
90% to 0%. 

574
00:26:20,760 --> 00:26:23,440
Then what happens to the 3000 
QPS from the app? 

575
00:26:23,560 --> 00:26:24,960
They all go straight to the 
database. 

576
00:26:25,000 --> 00:26:28,880
All 3000 QPS hit the database. 
The database was built to handle

577
00:26:28,880 --> 00:26:31,000
500. 
It's now getting 6 times its 

578
00:26:31,000 --> 00:26:33,200
capacity. 
It immediately overloads. 

579
00:26:33,200 --> 00:26:36,760
CPU goes to 100%, connections 
are refused, queries start 

580
00:26:36,760 --> 00:26:38,520
timing out. 
The database melts. 

581
00:26:38,680 --> 00:26:42,280
Now here is the metastable trap.
The cache servers are back 

582
00:26:42,280 --> 00:26:44,080
online. 
Now they're up and running, 

583
00:26:44,080 --> 00:26:47,640
ready to be filled. 
How do you refill a lookaside 

584
00:26:47,640 --> 00:26:50,600
cache? 
You have to read the data 

585
00:26:50,600 --> 00:26:52,840
from the database and then you 
write it into the cache. 

586
00:26:53,320 --> 00:26:54,840
But you can't read from the 
database. 

587
00:26:54,840 --> 00:26:58,320
It's overloaded, it's on fire. 
Every request to it is timing 

588
00:26:58,320 --> 00:26:59,240
out. 
Oh no. 

589
00:26:59,240 --> 00:27:02,600
So the cache stays empty. 
The cache stays empty. 

590
00:27:03,160 --> 00:27:07,040
Because the cache is empty, all 
3000 QPS keep hitting the 

591
00:27:07,040 --> 00:27:09,000
database. 
Because the traffic keeps 

592
00:27:09,000 --> 00:27:11,280
hitting the database, the 
database stays overloaded and 

593
00:27:11,280 --> 00:27:13,480
down. 
Because the database is down, 

594
00:27:13,600 --> 00:27:16,320
you can't refill the cache. 
It's a perfect vicious 

595
00:27:16,320 --> 00:27:17,760
loop. 
It's a death spiral. 

596
00:27:17,760 --> 00:27:20,080
Yeah, even though the original 
trigger, the cache reboot, is 

597
00:27:20,080 --> 00:27:23,440
long gone, the system is stuck. 
The sustaining effect is the 

598
00:27:23,440 --> 00:27:26,800
application's own logic for 
refilling the cache, which now 

599
00:27:26,800 --> 00:27:29,080
acts as a weapon that keeps the 
database down. 

600
00:27:29,400 --> 00:27:32,080
And the system was vulnerable 
the whole time because it 

601
00:27:32,080 --> 00:27:34,720
secretly relied on that 90% cache 
hit rate to survive. 

602
00:27:34,720 --> 00:27:37,920
It had hidden capacity. 
You thought your system could 

603
00:27:37,920 --> 00:27:42,800
handle 3000 QPS, but in reality 
your database, the hard limit, 

604
00:27:42,800 --> 00:27:46,360
could only handle 500. 
The cache was just masking that 

605
00:27:46,360 --> 00:27:48,880
vulnerability. 
That is genuinely terrifying. 
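
Editor's sketch of the hidden-capacity arithmetic from the lookaside cache example, using the episode's numbers (3000 QPS, 90% hit rate, 500 QPS database). The helper name `database_qps` is hypothetical.

```python
DB_CAPACITY_QPS = 500


def database_qps(total_qps: int, hit_rate_percent: int) -> int:
    """Requests per second that miss the cache and reach the database."""
    # Integer percentages keep the arithmetic exact.
    return total_qps * (100 - hit_rate_percent) // 100


warm = database_qps(3000, 90)        # cache warm
cold = database_qps(3000, 0)         # cache empty after the reboot
overload = cold // DB_CAPACITY_QPS   # multiple of database capacity
```

Checking the cold-cache load against database capacity is a quick way to spot this vulnerability before a trigger finds it.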

606
00:27:48,880 --> 00:27:50,520
How do you even get out of that 
state? 

607
00:27:50,600 --> 00:27:53,000
It's very painful. 
You usually have to manually 

608
00:27:53,000 --> 00:27:55,560
intervene and stop all traffic 
at the edge at the load 

609
00:27:55,560 --> 00:27:57,280
balancer. 
Like turn off the website for 

610
00:27:57,280 --> 00:27:58,680
everyone and let the database 
recover. 

611
00:27:59,040 --> 00:28:03,720
Then very slowly run scripts to 
warm up the cache by hand, then 

612
00:28:03,920 --> 00:28:06,880
cautiously let a small 
percentage of users back in. 

613
00:28:06,960 --> 00:28:10,320
It's a major, major outage. 
Another aspect of this 

614
00:28:10,320 --> 00:28:13,920
sustaining effect that the paper
mentions is work amplification. 

615
00:28:13,920 --> 00:28:16,760
Yes, this is another one of 
those super counterintuitive 

616
00:28:16,760 --> 00:28:18,680
things. 
We often assume that handling an

617
00:28:18,680 --> 00:28:20,880
error is cheap. 
You know, you have a try- 

618
00:28:20,880 --> 00:28:23,160
catch block: try, catch, 
return error. 

619
00:28:23,280 --> 00:28:25,720
Right, it seems like the catch 
block does less work, it's just 

620
00:28:25,720 --> 00:28:28,000
creating an error message. 
How hard can that be? 

621
00:28:28,120 --> 00:28:31,160
But in many complex systems, the
error path is actually more 

622
00:28:31,160 --> 00:28:34,040
expensive than the success path.
How is that possible? 

623
00:28:34,160 --> 00:28:36,720
Think about everything that 
might happen when you catch an 

624
00:28:36,720 --> 00:28:40,320
important exception. 
You might capture a full stack 

625
00:28:40,320 --> 00:28:42,640
trace. 
That's CPU intensive. 

626
00:28:42,960 --> 00:28:46,600
You might write a huge detailed 
log entry to a file on disk. 

627
00:28:46,920 --> 00:28:50,080
That's IO. 
You might do a DNS lookup to log

628
00:28:50,080 --> 00:28:51,760
the client's host name for 
debugging. 

629
00:28:52,000 --> 00:28:54,720
That's a blocking network call. 
You might send an alert to a 

630
00:28:54,720 --> 00:28:57,240
monitoring system. 
Oh wow, so if the system is 

631
00:28:57,240 --> 00:28:59,760
overloaded and it starts 
throwing a lot of errors. 

632
00:28:59,760 --> 00:29:04,000
The very act of handling those 
errors consumes more CPU, more 

633
00:29:04,000 --> 00:29:07,440
IO, and more network resources 
than the success path would 

634
00:29:07,440 --> 00:29:10,520
have, which makes the overload 
worse, which causes more errors.

635
00:29:10,520 --> 00:29:12,600
It's another one of these 
feedback loops. 

636
00:29:12,600 --> 00:29:15,480
Another loop. To be a senior 
engineer, you have to look at 

637
00:29:15,480 --> 00:29:19,000
your error-handling code and ask 
is this cheap, is this lean? 

638
00:29:19,280 --> 00:29:22,760
Or am I accidentally building a 
mechanism that will DDoS myself 

639
00:29:22,760 --> 00:29:24,920
with stack traces when things go
wrong? 

640
00:29:25,200 --> 00:29:28,560
This is incredibly heavy stuff. 
It really really shifts the 

641
00:29:28,560 --> 00:29:32,800
perspective from does my code 
work to how does my system fail.

642
00:29:32,880 --> 00:29:35,120
How does it fail and can it 
recover on its own? 

643
00:29:35,880 --> 00:29:37,480
That is the only question that 
matters. 

644
00:29:37,600 --> 00:29:39,880
OK, let's talk about the 
advanced patterns. 

645
00:29:40,240 --> 00:29:43,320
We've identified these horrible 
problems, retry storms, 

646
00:29:43,440 --> 00:29:47,400
metastable loops. 
What are the heavy duty tools we

647
00:29:47,400 --> 00:29:49,800
use to fight back? 
The first one in the toolkit, 

648
00:29:50,160 --> 00:29:52,040
and it's a big one, is the 
circuit breaker. 

649
00:29:52,320 --> 00:29:54,000
Wikipedia defines this pretty 
clearly. 

650
00:29:54,040 --> 00:29:57,320
It's a direct analogy to the 
electrical fuse or circuit 

651
00:29:57,320 --> 00:29:59,760
breaker in your house. 
It's the exact same principle. 

652
00:30:00,240 --> 00:30:03,320
If you plug too many things into
an outlet, the wiring gets too 

653
00:30:03,320 --> 00:30:06,640
hot and the breaker trips to cut
the power and save the house 

654
00:30:06,640 --> 00:30:09,280
from burning down. 
In software. 

655
00:30:09,320 --> 00:30:12,800
If a downstream service is 
failing too often, we trip the 

656
00:30:12,800 --> 00:30:15,200
circuit breaker to protect it. 
What does that look like in 

657
00:30:15,200 --> 00:30:16,400
code? 
What are the mechanics? 

658
00:30:16,400 --> 00:30:18,520
It's a simple state machine. 
There are three states. 

659
00:30:18,760 --> 00:30:21,560
The normal state is closed. 
This means the circuit is 

660
00:30:21,560 --> 00:30:25,640
complete. Electricity, or in our 
case network traffic, is flowing.

661
00:30:25,640 --> 00:30:29,000
OK, that's the happy path. 
Then if the breaker detects too 

662
00:30:29,000 --> 00:30:32,320
many failures, say more than 50%
of requests are failing in the 

663
00:30:32,320 --> 00:30:35,640
last 10 seconds, it transitions 
to the open state. 

664
00:30:36,280 --> 00:30:39,560
This means the wire is cut. When 
a new request comes in, 

665
00:30:39,560 --> 00:30:41,760
the circuit breaker doesn't even
try to call the downstream 

666
00:30:41,760 --> 00:30:43,880
service. 
It fails it immediately with an 

667
00:30:43,880 --> 00:30:45,920
error. 
And this directly prevents the 

668
00:30:45,920 --> 00:30:48,560
retry storm we talked about. 
It gives the struggling 

669
00:30:48,560 --> 00:30:51,720
downstream service a break, some 
breathing room to recover. 

670
00:30:51,720 --> 00:30:54,400
Exactly. 
It stops the bleeding, but you 

671
00:30:54,400 --> 00:30:57,520
can't stay open forever. 
You need a way to know when it's

672
00:30:57,520 --> 00:31:00,720
safe to close the circuit again.
That's the third state, half 

673
00:31:00,720 --> 00:31:01,640
open. 
The scout. 

674
00:31:01,680 --> 00:31:04,200
The scout. 
After a configured timeout, say 

675
00:31:04,440 --> 00:31:08,640
30 seconds, the breaker lets 
one single request through, a 

676
00:31:08,640 --> 00:31:10,760
probe. 
If that single request succeeds,

677
00:31:10,880 --> 00:31:13,400
the breaker assumes the 
downstream service has recovered

678
00:31:13,640 --> 00:31:15,080
and it moves back to the closed 
state. 

679
00:31:15,480 --> 00:31:18,320
Everything is back to normal. 
And if that scout request fails?

680
00:31:18,360 --> 00:31:20,560
It goes right back to the open 
state and starts the timeout 

681
00:31:20,560 --> 00:31:22,560
timer again. 
It's a really elegant little 

682
00:31:22,560 --> 00:31:26,080
state machine, but it directly 
solves that sustaining effect of

683
00:31:26,080 --> 00:31:29,840
the database overload. 
It does, but here is the nuance,

684
00:31:29,840 --> 00:31:32,680
the senior interview detail that
separates the candidates. 

685
00:31:33,160 --> 00:31:35,720
It's the mini circuit breaker 
problem, or the problem of 

686
00:31:35,720 --> 00:31:37,040
granularity. 
What's that? 

687
00:31:37,320 --> 00:31:40,720
Imagine you have a service that 
calls a database that is sharded

688
00:31:40,720 --> 00:31:45,400
into say, 100 shards. 
Shards A through Y are perfectly

689
00:31:45,400 --> 00:31:49,840
healthy, but Shard Z is on a 
faulty machine and is melting 

690
00:31:49,840 --> 00:31:51,800
down. 
OK, a partial failure. 

691
00:31:51,800 --> 00:31:55,080
If you have one big global 
circuit breaker for the entire 

692
00:31:55,080 --> 00:31:58,560
database dependency and Shard Z 
starts failing, what happens? 

693
00:31:58,800 --> 00:32:01,160
You trip the main breaker. 
Now nobody can talk to the 

694
00:32:01,160 --> 00:32:03,000
healthy shards A through Y 
either. 

695
00:32:03,160 --> 00:32:05,440
You've taken down the whole 
system because of one small 

696
00:32:05,440 --> 00:32:07,360
partial failure. 
So you need to have a circuit 

697
00:32:07,360 --> 00:32:09,400
breaker per shard. 
Or partition. 

698
00:32:09,400 --> 00:32:10,920
Or per instance you're talking 
to. 
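
Editor's sketch of the three-state breaker described earlier, kept one per shard. It is a simplification: it trips on a count of consecutive failures rather than the 50%-in-10-seconds rate from the example, and the class and method names are hypothetical.

```python
import time
from collections import defaultdict


class CircuitBreaker:
    """Closed -> open -> half-open state machine."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a request may be sent downstream."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"   # let one scout through
                return True
            return False                   # fail immediately
        if self.state == "half_open":
            return False                   # scout already in flight
        return True

    def record_success(self) -> None:
        self.state = "closed"
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        # A failed scout, or too many failures, opens the circuit.
        if (self.state == "half_open"
                or self.failures >= self.failure_threshold):
            self.state = "open"
            self.opened_at = self.clock()


# One breaker per shard, created on first use.
breakers = defaultdict(CircuitBreaker)
for _ in range(5):
    breakers["Z"].record_failure()   # only shard Z is failing
```

With this layout `breakers["Z"].allow()` is False while `breakers["A"].allow()` stays True, so the healthy shards keep serving traffic.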

699
00:32:11,800 --> 00:32:14,680
The granularity of your circuit 
breaker matters immensely. 

700
00:32:14,760 --> 00:32:17,880
If you make it too broad, you 
create unnecessary outages. 

701
00:32:18,040 --> 00:32:21,840
If you make it too narrow, say a
breaker for every single 

702
00:32:21,840 --> 00:32:24,960
customer ID, you might use too 
much memory tracking the state 

703
00:32:24,960 --> 00:32:26,760
of all those breakers. 
It's always a trade off. 

704
00:32:26,800 --> 00:32:28,920
Always, yeah. 
Discussing that trade off is 

705
00:32:28,920 --> 00:32:31,400
what gets you the job. 
Now let's go even deeper down 

706
00:32:31,400 --> 00:32:34,160
the rabbit hole. 
We have a paper here from Google

707
00:32:34,160 --> 00:32:37,400
Research on D2TCP and deadline
awareness. 

708
00:32:38,160 --> 00:32:40,320
This feels like the absolute 
cutting edge. 

709
00:32:40,320 --> 00:32:42,560
This is for when you're 
designing extremely high 

710
00:32:42,560 --> 00:32:44,720
performance, low latency 
systems. 

711
00:32:45,080 --> 00:32:48,520
Think Google search, high 
frequency trading or ad bidding 

712
00:32:48,520 --> 00:32:50,680
platforms. 
And the specific problem they 

713
00:32:50,680 --> 00:32:55,360
describe in the paper is the 
OLDI problem, which stands for 

714
00:32:55,360 --> 00:32:57,600
Online Data Intensive 
Applications. 

715
00:32:57,640 --> 00:32:59,880
Right, and the classic example 
is Google search. 

716
00:33:00,040 --> 00:33:02,800
You type a query. 
A query hits a root server. 

717
00:33:03,280 --> 00:33:06,800
That root server then fans out 
your query to maybe 1000 

718
00:33:06,800 --> 00:33:09,320
leaf servers. 
Each of those servers is 

719
00:33:09,320 --> 00:33:11,360
responsible for searching its 
own little chunk of the 

720
00:33:11,360 --> 00:33:13,240
Internet. 
And they all find their results 

721
00:33:13,240 --> 00:33:14,880
and reply. 
And they all reply at roughly 

722
00:33:14,880 --> 00:33:16,840
the same time. 
Which creates a traffic jam. 

723
00:33:17,080 --> 00:33:20,840
The paper calls this incast 
congestion, a fan-in burst. 

724
00:33:20,840 --> 00:33:22,920
Exactly. 
You have 1000 responses hitting 

725
00:33:22,920 --> 00:33:24,360
the same network switch all at 
once. 

726
00:33:24,920 --> 00:33:27,760
The switch's memory buffers fill
up instantly, packets start 

727
00:33:27,760 --> 00:33:30,440
getting dropped. 
And standard TCP, the protocol 

728
00:33:30,440 --> 00:33:32,800
that the entire Internet has run
on for decades. 

729
00:33:33,000 --> 00:33:34,760
How does it handle this packet 
loss? 

730
00:33:34,920 --> 00:33:38,320
Poorly for this use case. 
TCP is designed for one thing 

731
00:33:38,320 --> 00:33:42,360
above all else, fair share. 
It tries to be fair to everyone 

732
00:33:42,360 --> 00:33:44,920
using the pipe. 
When it detects packet loss, its

733
00:33:44,920 --> 00:33:48,280
algorithm tells everyone to slow
down their sending rate, usually

734
00:33:48,280 --> 00:33:50,880
by cutting it in half. 
But in this scenario, not 

735
00:33:50,880 --> 00:33:53,240
everyone is equal. 
Some responses might be more 

736
00:33:53,240 --> 00:33:55,640
important than others. 
It's not about importance, it's 

737
00:33:55,640 --> 00:33:59,040
about time. 
In an OLDI app you have a hard 

738
00:33:59,040 --> 00:34:01,440
deadline. 
We must return search results to

739
00:34:01,440 --> 00:34:05,560
the user in 200 milliseconds. 
If a packet containing a piece 

740
00:34:05,560 --> 00:34:08,719
of a search result is going to 
arrive at 250 milliseconds, it is 

741
00:34:08,719 --> 00:34:10,960
useless. 
The root server has already 

742
00:34:10,960 --> 00:34:12,600
given up and sent back the 
results it has. 

743
00:34:12,600 --> 00:34:15,800
That late packet is garbage. 
So standard TCP might slow down 

744
00:34:15,800 --> 00:34:18,280
a packet that only has 10 
milliseconds left on its clock, 

745
00:34:18,280 --> 00:34:21,440
causing it to miss the deadline 
while giving that bandwidth to a

746
00:34:21,440 --> 00:34:23,360
packet that has plenty of time 
to spare. 

747
00:34:23,400 --> 00:34:26,440
Yes. 
TCP has no concept of time. 

748
00:34:26,679 --> 00:34:28,760
It only knows about bytes and 
fairness. 

749
00:34:29,120 --> 00:34:31,880
It will happily make everyone 
late in the name of being fair. 

750
00:34:31,880 --> 00:34:33,960
So what does D2TCP do 
differently? 

751
00:34:33,960 --> 00:34:36,639
How does it fix this? 
It adds deadline awareness 

752
00:34:36,920 --> 00:34:40,120
directly into the congestion 
control algorithm at the kernel 

753
00:34:40,120 --> 00:34:42,400
level. 
It uses a mathematical trick 

754
00:34:42,400 --> 00:34:45,199
they call gamma correction. 
Gamma correction sounds like 

755
00:34:45,199 --> 00:34:48,480
something from video editing. 
It's a math term, but the 

756
00:34:48,480 --> 00:34:50,080
intuition is actually pretty 
simple. 

757
00:34:50,360 --> 00:34:55,239
It modifies TCP's backoff 
behavior based on how close a 

758
00:34:55,239 --> 00:34:57,800
flow is to its deadline. 
How does it decide? 

759
00:34:57,800 --> 00:35:00,240
How does it know the deadline? 
The application tells it. 

760
00:35:00,400 --> 00:35:03,360
The application says this socket
connection is for an operation 

761
00:35:03,360 --> 00:35:05,200
that must complete by this 
timestamp. 

762
00:35:05,800 --> 00:35:08,760
D2TCP then knows the time budget
for every packet. 

763
00:35:09,280 --> 00:35:10,800
And what does it do with that 
knowledge? 

764
00:35:11,080 --> 00:35:14,280
If a flow is very near its 
deadline, it's in danger of 

765
00:35:14,280 --> 00:35:17,640
being late. 
D2TCP's gamma correction tells it

766
00:35:18,120 --> 00:35:21,080
do not back off. 
Be aggressive, push through. 

767
00:35:21,280 --> 00:35:24,080
You need to finish now. 
And if a flow has a far off 

768
00:35:24,080 --> 00:35:27,080
deadline, plenty of time. 
It tells that flow to back off 

769
00:35:27,080 --> 00:35:29,080
very aggressively. 
You have time. Get out of the 

770
00:35:29,080 --> 00:35:30,720
way and let the urgent traffic 
pass. 
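
The backoff rule described here can be sketched in a few lines. This is a simplified model based on the published D2TCP description, not kernel code: `alpha` is the DCTCP-style estimate of how congested the path is (0 to 1), `d` is the deadline imminence factor, and the function names and the clamp range of 0.5 to 2.0 are assumptions made for illustration.

```python
def deadline_imminence(time_needed: float, time_remaining: float) -> float:
    """Return d: above 1 when the deadline is close, below 1 when it is far off.

    time_needed is how long the flow would take to finish at its current rate;
    time_remaining is how long until the deadline. Clamped to [0.5, 2.0].
    """
    if time_remaining <= 0:
        return 2.0  # already at or past the deadline: treat as maximally urgent
    return min(2.0, max(0.5, time_needed / time_remaining))


def next_window(cwnd: float, alpha: float, d: float) -> float:
    """Apply the gamma-corrected backoff to a congestion window.

    penalty = alpha ** d. Because alpha is between 0 and 1, a large d
    (urgent flow) shrinks the penalty and a small d (relaxed flow) grows it.
    """
    penalty = alpha ** d
    if penalty > 0:
        return max(1.0, cwnd * (1 - penalty / 2))
    return cwnd + 1  # no congestion signal: grow as usual


# Same congestion level, three different deadlines.
urgent = next_window(100, 0.5, deadline_imminence(20, 10))   # d = 2.0
neutral = next_window(100, 0.5, deadline_imminence(10, 10))  # d = 1.0
relaxed = next_window(100, 0.5, deadline_imminence(5, 10))   # d = 0.5
```

With the same congestion signal, the urgent flow keeps most of its window while the relaxed flow gives up the most, which is the "fairness to urgency" shift in miniature.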

771
00:35:30,720 --> 00:35:33,480
It's prioritizing traffic based 
on time remaining. 

772
00:35:33,480 --> 00:35:35,400
Exactly. 
It moves the entire network 

773
00:35:35,400 --> 00:35:38,320
stack's philosophy from fairness 
to urgency. 

774
00:35:38,480 --> 00:35:42,160
That is a profound shift in 
thinking, and for a system 

775
00:35:42,160 --> 00:35:45,680
design interview, mentioning 
that standard TCP itself might 

776
00:35:45,680 --> 00:35:48,480
be the bottleneck because it 
ignores deadlines, that's a 

777
00:35:48,480 --> 00:35:51,600
serious differentiator. 
It shows you understand the 

778
00:35:51,600 --> 00:35:54,400
stack all the way down to the 
transport layer. 

779
00:35:55,120 --> 00:35:57,640
Most engineers stop thinking at 
the load balancer. 

780
00:35:58,280 --> 00:36:01,520
This shows you think deeper. 
OK, we're rounding the final 

781
00:36:01,520 --> 00:36:04,000
corner here. 
We've talked about how to design

782
00:36:04,000 --> 00:36:06,680
these complex systems with 
circuit breakers and deadline 

783
00:36:06,680 --> 00:36:09,240
propagation. 
The last part is about 

784
00:36:09,360 --> 00:36:11,880
verification. 
How do we know it all actually 

785
00:36:11,880 --> 00:36:14,160
works? 
This brings us to a fascinating 

786
00:36:14,160 --> 00:36:17,720
paper called ZebraConf. 
The big headline here is that 

787
00:36:17,720 --> 00:36:20,800
most complex failures aren't 
caused by code bugs anymore, 

788
00:36:21,080 --> 00:36:23,040
they're caused by configuration 
errors. 

789
00:36:23,040 --> 00:36:26,240
Heterogeneous configurations. 
Right. In a massive distributed 

790
00:36:26,240 --> 00:36:28,600
system, not every server is 
running the exact same 

791
00:36:28,600 --> 00:36:31,000
configuration. 
You might be in the middle of 

792
00:36:31,000 --> 00:36:32,600
rolling out a change. 
You might have different 

793
00:36:32,600 --> 00:36:35,000
hardware types that need 
different tuning parameters. 

794
00:36:35,000 --> 00:36:37,120
And the paper points out this 
insidious problem where 

795
00:36:37,160 --> 00:36:40,280
parameter A masks a fault in 
parameter B. 

796
00:36:40,480 --> 00:36:43,360
Yes, it's a combinatorial 
nightmare. 

797
00:36:43,520 --> 00:36:47,280
Maybe you have a retry limit of 
three on one group of servers 

798
00:36:47,280 --> 00:36:50,320
and five on another, and that 
specific interaction 

799
00:36:50,520 --> 00:36:52,960
accidentally hides a bug in your
timeout logic. 

800
00:36:53,440 --> 00:36:57,280
System works until the day you 
decide to make all the retry 

801
00:36:57,280 --> 00:37:00,640
limits 5 and suddenly the whole 
system falls over because the 

802
00:37:00,640 --> 00:37:02,800
bug in the timeout logic is now 
exposed. 

803
00:37:02,800 --> 00:37:05,680
So for the interview, the point 
is not just to design the code, 

804
00:37:05,680 --> 00:37:08,520
but to design the configuration 
management process itself. 

805
00:37:08,520 --> 00:37:10,840
But don't just say we put 
configs in a YAML file. 

806
00:37:11,160 --> 00:37:13,880
That's not an answer. 
How do you test config changes? 

807
00:37:14,120 --> 00:37:16,480
How do you verify that a change 
doesn't violate your system's 

808
00:37:16,480 --> 00:37:19,160
core assumptions? 
How do you prevent config drift 

809
00:37:19,160 --> 00:37:21,000
from killing you silently over 
months? 

810
00:37:21,280 --> 00:37:24,120
And they mentioned a technique 
called Bayesian risk refinement.

811
00:37:24,280 --> 00:37:28,080
That's a very fancy mouthful, 
but the concept behind it is we 

812
00:37:28,080 --> 00:37:31,080
can't possibly test every 
combination of configurations. 

813
00:37:31,360 --> 00:37:34,800
The search space is infinite, so
we use probability, Bayesian 

814
00:37:34,800 --> 00:37:38,520
logic to guide our testing. 
We build a model of our system 

815
00:37:38,760 --> 00:37:41,360
and use it to find the riskiest 
configuration changes. 

816
00:37:41,720 --> 00:37:44,640
We focus our limited testing 
resources on the areas where we 

817
00:37:44,640 --> 00:37:47,440
have the least confidence or 
where an error would be most 

818
00:37:47,440 --> 00:37:50,240
catastrophic. 
It's intelligent, risk-based 

819
00:37:50,240 --> 00:37:52,840
testing rather than just brute 
force exhaustive testing. 

820
00:37:52,960 --> 00:37:55,960
Exactly, and it's how modern 
large scale systems are actually

821
00:37:55,960 --> 00:37:57,760
managed. 
All right, we have covered a 

822
00:37:57,760 --> 00:38:01,520
massive amount of ground today 
from Pac-Man to gamma-corrected 

823
00:38:01,520 --> 00:38:03,800
TCP. 
Let's bring it all together. 

824
00:38:03,800 --> 00:38:07,160
Let's create the senior 
engineer's interview checklist. 

825
00:38:07,160 --> 00:38:10,200
OK, if you were walking into 
that interview room or logging 

826
00:38:10,200 --> 00:38:13,280
onto that Zoom call for a system
design round, here is your 

827
00:38:13,280 --> 00:38:15,720
mental cheat sheet. 
Let's hear it. Item 1. 

828
00:38:16,120 --> 00:38:18,560
Assume failure. 
Don't design a happy path 

829
00:38:18,560 --> 00:38:21,840
first. Start with the failures. 
Ask yourself what happens if the

830
00:38:21,840 --> 00:38:24,720
network dies after the request 
is processed but before the 

831
00:38:24,720 --> 00:38:27,280
reply is received. 
If you can answer that, you're 

832
00:38:27,280 --> 00:38:30,360
on the right track. 
Always handle the unknown state.

833
00:38:30,560 --> 00:38:32,480
Item 2. 
Check assumptions. 

834
00:38:32,840 --> 00:38:35,600
Explicitly state your 
assumptions about time, load, and 

835
00:38:35,600 --> 00:38:37,600
data. 
Are your timeouts aligned? 

836
00:38:37,720 --> 00:38:40,880
Remember, upstream timeouts 
should generally be longer than 

837
00:38:40,880 --> 00:38:43,320
downstream. 
Or better yet, use deadline 

838
00:38:43,320 --> 00:38:46,160
propagation. 
Don't let your system's reality 

839
00:38:46,160 --> 00:38:47,560
drift apart. 
Item 3. 

840
00:38:47,560 --> 00:38:50,160
Control flow. 
Don't just let requests fly 

841
00:38:50,160 --> 00:38:52,640
blindly into the system. 
You are the engineer, you 

842
00:38:52,640 --> 00:38:55,520
control the flow. 
Use exponential backoff. 

843
00:38:55,960 --> 00:38:58,280
Use jitter to turn a metronome 
into rain. 

844
00:38:58,520 --> 00:39:01,120
Use circuit breakers to give 
services a chance to recover. 

845
00:39:01,240 --> 00:39:02,760
Control the storm before it 
starts. 
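
Item 3's backoff and jitter can be sketched as follows. This is one common variant, often called full jitter, where each wait is a random value between zero and an exponentially growing ceiling. The function name, the default values, and the injectable `rng` parameter are all hypothetical choices for this illustration.

```python
import random


def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    """Return one randomized wait (in seconds) per retry attempt.

    The ceiling doubles on every attempt (exponential backoff) up to cap.
    The actual wait is a random fraction of that ceiling (jitter), so
    clients that failed together do not all retry at the same instant.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng() * ceiling)
    return delays


# Two clients that failed at the same moment get different schedules.
client_a = backoff_delays()
client_b = backoff_delays()
```

Without the `rng()` factor every client would wait exactly 0.1, 0.2, 0.4 seconds and so on, which is the metronome; the random factor is what turns it into rain.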

846
00:39:02,880 --> 00:39:04,400
Item 4. 
Consistency. 

847
00:39:04,680 --> 00:39:08,560
If you retry, and you will, your
operations must be idempotent. 

848
00:39:09,240 --> 00:39:12,800
State how you will achieve this.
Use natural keys, business keys,

849
00:39:12,920 --> 00:39:16,320
or for the hardest problems, 
explicitly pass a unique 

850
00:39:16,320 --> 00:39:18,760
idempotency key. 
Do not design a system that can 

851
00:39:18,760 --> 00:39:21,480
charge a credit card twice. 
And finally item 5. 

852
00:39:21,480 --> 00:39:23,560
Metastability. 
This is the advanced one. 

853
00:39:23,560 --> 00:39:26,240
Ask the big question. 
Does my recovery mechanism act 

854
00:39:26,240 --> 00:39:28,240
as a denial of service attack on
myself? 

855
00:39:28,560 --> 00:39:30,480
Do my retries create a retry 
storm? 

856
00:39:30,720 --> 00:39:33,160
Does my cache warming logic keep 
my database down? 

857
00:39:33,280 --> 00:39:34,960
Look for those deadly feedback 
loops. 
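
To make item 4 of the checklist concrete, here is a minimal sketch of an idempotency key protecting a payment call. `PaymentService`, its in-memory dictionary, and the receipt fields are hypothetical stand-ins; a real service would keep the key-to-result mapping in durable storage shared by all replicas.

```python
import uuid


class PaymentService:
    """Toy server that remembers the result of each idempotency key."""

    def __init__(self):
        self._results = {}     # idempotency key -> stored receipt
        self.charges_made = 0  # how many times a card was really charged

    def charge(self, idempotency_key, card, amount_cents):
        # A retry with a key we have already seen returns the stored
        # receipt instead of charging the card again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = {
            "card": card,
            "amount_cents": amount_cents,
            "charge_no": self.charges_made,
        }
        self._results[idempotency_key] = receipt
        return receipt


service = PaymentService()
key = str(uuid.uuid4())  # generated once by the client, reused on every retry

first = service.charge(key, "card-123", 500)
retry = service.charge(key, "card-123", 500)  # e.g. the first reply was lost
```

The client cannot tell whether the first request was processed before the network failed, so it retries with the same key and lets the server decide; the card is charged once either way.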

858
00:39:35,080 --> 00:39:37,440
That is a ridiculously powerful 
list. 

859
00:39:37,600 --> 00:39:41,640
It's the difference between 
saying I hope this works and I 

860
00:39:41,640 --> 00:39:44,400
know how this will break and 
I've designed it to recover 

861
00:39:44,400 --> 00:39:46,600
gracefully. 
I want to leave our listeners 

862
00:39:46,600 --> 00:39:49,480
with one final thought. 
We started this whole deep dive 

863
00:39:49,480 --> 00:39:52,200
with the idea that perfect code 
can fail. 

864
00:39:52,320 --> 00:39:54,200
And I think we've seen exactly 
why that's true. 

865
00:39:54,400 --> 00:39:56,840
There's a quote I love that I 
think sums this all up. 

866
00:39:57,520 --> 00:40:01,560
Great software isn't built by 
eliminating bugs, it's built by 

867
00:40:01,560 --> 00:40:03,720
eliminating surprises. 
I love that. 

868
00:40:03,720 --> 00:40:06,200
That's perfect. 
The bugs will always happen. 

869
00:40:06,200 --> 00:40:08,280
The cosmic rays will flip the 
bits. 

870
00:40:08,600 --> 00:40:10,880
The backhoe will cut the fiber 
optic cable. 

871
00:40:11,240 --> 00:40:13,080
Those are just facts of life in 
this field. 

872
00:40:13,080 --> 00:40:16,320
They are not surprises. 
But the surprise? The system 

873
00:40:16,320 --> 00:40:19,480
collapsing into a heap because 
of a simple retry storm or 

874
00:40:19,480 --> 00:40:22,080
getting stuck in a metastable 
loop for hours. 

875
00:40:22,280 --> 00:40:26,360
That is what we as engineers can
and must eliminate through good 

876
00:40:26,360 --> 00:40:28,480
design. 
The code inside the brackets, 

877
00:40:28,920 --> 00:40:30,840
the if statements, the for 
loops. 

878
00:40:30,840 --> 00:40:32,320
In many ways, that's the easy 
part. 

879
00:40:32,640 --> 00:40:34,960
The void outside the brackets, 
the empty space where the 

880
00:40:34,960 --> 00:40:37,040
messages travel and the 
assumptions live. 

881
00:40:37,760 --> 00:40:39,200
That is where the senior 
engineer lives. 

882
00:40:39,200 --> 00:40:41,920
And that is where you need to 
live to ace that interview. 

883
00:40:41,960 --> 00:40:44,960
Good luck, you've got this. 
Thanks for listening to the Deep

884
00:40:44,960 --> 00:40:46,480
Dive. 
We'll see you next time.