1
00:00:00,000 --> 00:00:02,900
Chaos engineering is basically
this discipline of experimenting

2
00:00:02,900 --> 00:00:05,300
on the system and the system can
be anything. 

3
00:00:05,300 --> 00:00:08,900
It doesn't have to be massive or
Netflix scale in order to 

4
00:00:08,900 --> 00:00:13,000
increase your confidence that the
system will survive difficult

5
00:00:13,000 --> 00:00:17,100
conditions. 
So the experiment's real goal is

6
00:00:17,100 --> 00:00:22,200
to either confirm that your 
assumptions about a system are 

7
00:00:22,200 --> 00:00:26,900
correct, or you find a problem, or
you find a place where your

8
00:00:26,900 --> 00:00:31,300
assumptions and the reality 
don't necessarily add up.

9
00:00:35,800 --> 00:00:39,100
Hey everyone. 
My name is Henry Suryawirawan.

10
00:00:40,700 --> 00:00:44,500
And you're listening to the 
Tech Lead Journal, the show where I'll be

11
00:00:44,500 --> 00:00:47,900
bringing you the greatest 
technical leaders, practitioners,

12
00:00:48,100 --> 00:00:51,400
and thought leaders in the 
industry to discuss their

13
00:00:51,400 --> 00:00:56,200
journeys, ideas, and practices that
we all can learn and apply to 

14
00:00:56,200 --> 00:00:59,200
build a highly performing 
technical team and to make an 

15
00:00:59,200 --> 00:01:03,900
impact in your personal work. 
So let's dive into our journey.

16
00:01:09,000 --> 00:01:11,100
Hello everyone. 
I'm so happy to be back here 

17
00:01:11,100 --> 00:01:14,600
again with another new episode 
of the Tech Lead Journal podcast.

18
00:01:14,800 --> 00:01:16,800
Thanks for tuning in, and 
spending your time with me 

19
00:01:16,800 --> 00:01:18,800
today, listening to this 
episode. 

20
00:01:19,100 --> 00:01:20,800
If you haven't, please subscribe
to Tech

21
00:01:20,800 --> 00:01:22,900
Lead Journal on your favorite
podcast apps.

22
00:01:23,100 --> 00:01:25,700
And also follow Tech Lead Journal on
our social media channels on 

23
00:01:25,700 --> 00:01:29,100
LinkedIn, Twitter and Instagram.
And you can also make some 

24
00:01:29,100 --> 00:01:31,800
contribution to the show and 
support the creation of this 

25
00:01:31,800 --> 00:01:34,200
podcast by subscribing as a
patron

26
00:01:34,400 --> 00:01:36,100
at techleadjournal.dev/patron,

27
00:01:36,200 --> 00:01:39,800
and help me towards
producing great content every 

28
00:01:39,800 --> 00:01:43,800
week. For today's episode,
I am happy to share my 

29
00:01:43,800 --> 00:01:46,400
conversation with Mikolaj
Pawlikowski.

30
00:01:47,000 --> 00:01:49,600
Mikolaj is an engineering lead
at Bloomberg. 

31
00:01:49,900 --> 00:01:53,700
And the author of Chaos
Engineering: Site reliability

32
00:01:53,800 --> 00:01:57,300
through controlled disruption. 
I'm sure that many of you would 

33
00:01:57,300 --> 00:02:00,700
have heard about chaos
engineering before, popularized

34
00:02:00,700 --> 00:02:04,700
by Netflix along with its tools 
such as Chaos Monkey and Simian

35
00:02:04,700 --> 00:02:07,600
Army. 
I'm curious about chaos

36
00:02:07,600 --> 00:02:11,500
engineering and how to implement
and execute it properly, in 

37
00:02:11,500 --> 00:02:13,400
order to continuously improve
systems'

38
00:02:13,400 --> 00:02:17,600
reliability in the midst of
disruption and disaster. In this

39
00:02:17,600 --> 00:02:20,100
episode,
Miko shared in depth about what

40
00:02:20,100 --> 00:02:23,100
chaos engineering is
and, importantly, what chaos

41
00:02:23,100 --> 00:02:25,400
engineering is not, by
clarifying

42
00:02:25,400 --> 00:02:28,000
some of the common
misconceptions surrounding it.

43
00:02:28,500 --> 00:02:32,200
Miko explained the prerequisites
and steps required in order for 

44
00:02:32,200 --> 00:02:35,700
us to start doing chaos
engineering, and also mentioned

45
00:02:35,700 --> 00:02:38,600
some of the chaos engineering 
tools that we can use at 

46
00:02:38,600 --> 00:02:42,300
different layers of our system,
the skill set required of a

47
00:02:42,300 --> 00:02:45,800
chaos engineer and how we should
explain the rationale and 

48
00:02:45,800 --> 00:02:49,000
motivation behind chaos
engineering to get the 

49
00:02:49,000 --> 00:02:52,500
management buy-in,
using a fun analogy

50
00:02:52,500 --> 00:02:56,600
involving hamburgers and sharks.
Towards the end, Miko

51
00:02:56,600 --> 00:02:59,200
also shared about chaos
engineering for people, an

52
00:02:59,200 --> 00:03:02,800
interesting excerpt taken from
his book, and his ultimate

53
00:03:02,800 --> 00:03:05,700
mission over the last few years
to make chaos engineering 

54
00:03:05,700 --> 00:03:08,400
boring. 
I hope you will enjoy this 

55
00:03:08,400 --> 00:03:11,500
episode and if you like it, 
consider helping the show by 

56
00:03:11,500 --> 00:03:14,900
leaving it a rating, review, or
comment on your podcast app

57
00:03:15,000 --> 00:03:18,400
or social media channels. Those
reviews and comments are one of 

58
00:03:18,408 --> 00:03:21,500
the best ways to help me get 
this podcast to reach more 

59
00:03:21,500 --> 00:03:24,900
listeners and hopefully they can
also benefit from all the 

60
00:03:24,900 --> 00:03:28,200
contents in this podcast. 
So let's get this episode 

61
00:03:28,200 --> 00:03:30,700
started right after our sponsor 
message. 

62
00:03:30,900 --> 00:03:32,900
Are you looking for a new cool 
swag? 

63
00:03:33,100 --> 00:03:35,800
Tech Lead Journal
now offers you some swags that

64
00:03:35,800 --> 00:03:37,100
you can purchase
online.

65
00:03:37,500 --> 00:03:41,000
These swags are printed on
demand based on your preference 

66
00:03:41,200 --> 00:03:44,200
and will be delivered safely to 
you all over the world where 

67
00:03:44,200 --> 00:03:47,300
shipping is available. 
Check out all the cool swags

68
00:03:47,300 --> 00:03:49,800
available by visiting
techleadjournal.dev/shop,

69
00:03:49,800 --> 00:03:52,500
and don't forget to brag
yourself

70
00:03:52,600 --> 00:03:54,700
once you receive any of those
swags.

71
00:03:57,400 --> 00:04:00,000
Hey everyone, welcome back to 
another new show of the Tech Lead

72
00:04:00,000 --> 00:04:02,900
Journal. Today I have with me a
guy called Mikolaj

73
00:04:02,900 --> 00:04:05,800
Pawlikowski, or in short,
let's call him Miko.

74
00:04:06,000 --> 00:04:08,000
So Miko
is the author of a recent

75
00:04:08,000 --> 00:04:11,000
book titled Chaos Engineering.
So today,

76
00:04:11,300 --> 00:04:13,000
I'm sure we're going to be 
talking a lot about chaos

77
00:04:13,000 --> 00:04:16,500
engineering how to implement it 
correctly, and maybe some of the

78
00:04:16,500 --> 00:04:19,600
gotchas or misconceptions about 
chaos engineering.

79
00:04:19,800 --> 00:04:21,899
So, Miko, it's good to have
you here in the show. 

80
00:04:21,899 --> 00:04:24,800
Thanks for being here. 
Very glad to be here.

81
00:04:24,800 --> 00:04:26,700
Thanks for inviting me. 
Yeah. 

82
00:04:26,900 --> 00:04:29,700
So before we start, probably 
introduce yourself to the 

83
00:04:29,700 --> 00:04:32,200
audience here, can you tell us a
little bit more about your 

84
00:04:32,200 --> 00:04:34,100
career? 
Maybe some highlights or turning

85
00:04:34,100 --> 00:04:36,900
points? 
Sure, so I don't know how far

86
00:04:36,900 --> 00:04:38,400
back
you would like me to go.

87
00:04:38,700 --> 00:04:42,000
Yeah, obviously right now, I do 
a lot of chaos engineering.

88
00:04:42,300 --> 00:04:46,300
I run a small SRE
team that manages Kubernetes,

89
00:04:46,300 --> 00:04:49,800
and chaos engineering kind of
evolved as one of the very 

90
00:04:49,800 --> 00:04:54,300
important tools that we use to 
basically make our system more 

91
00:04:54,300 --> 00:04:56,700
reliable. 
And it plays nicely into this

92
00:04:56,700 --> 00:04:59,900
entire site reliability
engineering mindset.

93
00:05:00,100 --> 00:05:03,200
I've been at Bloomberg for a
while now. Before that,

94
00:05:03,200 --> 00:05:05,800
I attempted a couple of 
startups. 

95
00:05:06,600 --> 00:05:09,500
So a pretty, I think, common tech
background. 

96
00:05:09,700 --> 00:05:12,800
I've been hacking away at coding
since I was a kid. 

97
00:05:13,000 --> 00:05:15,400
So I think probably a lot of 
people can relate to that. 

98
00:05:15,900 --> 00:05:18,200
Okay, cool. 
I mean, let's maybe dive deeper 

99
00:05:18,200 --> 00:05:20,300
as we talk along about your
career. 

100
00:05:20,400 --> 00:05:22,200
You have this book not so
long ago, right? 

101
00:05:22,200 --> 00:05:25,000
Chaos engineering. 
I think the first time I heard 

102
00:05:25,000 --> 00:05:28,400
about this term was during the
time when Netflix popularized

103
00:05:28,400 --> 00:05:30,800
this idea of chaos engineering,
and they've been running it in

104
00:05:30,800 --> 00:05:33,700
their production, killing their 
servers in order to increase

105
00:05:33,700 --> 00:05:37,100
the reliability. But maybe for
all the audience here, can you

106
00:05:37,100 --> 00:05:39,700
maybe explain what exactly is
chaos

107
00:05:39,700 --> 00:05:42,500
engineering?
Sure thing, and that's probably

108
00:05:42,500 --> 00:05:46,200
a good idea because there's a 
lot of misconceptions and the 

109
00:05:46,200 --> 00:05:48,700
name itself doesn't help. 
So chaos engineering is

110
00:05:48,700 --> 00:05:51,600
basically this discipline of 
experimenting on the system and 

111
00:05:51,600 --> 00:05:55,000
the system can be anything. 
It doesn't have to be massive or

112
00:05:55,000 --> 00:05:58,900
Netflix scale in order to
increase your confidence that the

113
00:05:58,900 --> 00:06:01,500
system will survive difficult 
conditions. 

114
00:06:01,500 --> 00:06:04,700
So a unit of chaos
engineering, the way that I see

115
00:06:04,700 --> 00:06:06,200
it,
is this chaos

116
00:06:06,500 --> 00:06:08,800
experiment,
where you basically have a bunch

117
00:06:08,800 --> 00:06:12,400
of assumptions about the system.
Obviously we all design the 

118
00:06:12,400 --> 00:06:15,500
systems and we want them to 
behave certain ways. 

119
00:06:15,700 --> 00:06:18,200
But only in practice, it turns 
out that there's a lot of 

120
00:06:18,200 --> 00:06:21,800
behavior that we didn't account 
for, or the system is doing what we

121
00:06:21,800 --> 00:06:24,200
told it instead of what we 
intended. 

122
00:06:24,300 --> 00:06:31,200
So the experiment's real goal is
to either confirm that your 

123
00:06:31,200 --> 00:06:33,500
assumptions about a system are 
correct. 

124
00:06:33,600 --> 00:06:37,900
Basically, your hypothesis works.
Or you find a problem, or you

125
00:06:37,900 --> 00:06:42,000
find a place where your
assumptions and the reality 

126
00:06:42,000 --> 00:06:45,400
don't necessarily add up. 
So these chaos engineering

127
00:06:45,400 --> 00:06:49,400
experiments typically have four 
steps. One is defining

128
00:06:49,400 --> 00:06:52,900
observability because this is 
like a prerequisite for all of 

129
00:06:52,900 --> 00:06:56,100
that to happen. 
If you can't observe the system,

130
00:06:56,100 --> 00:06:59,800
any kind of variable, reliably,
then you can't really conduct a

131
00:06:59,808 --> 00:07:03,000
scientific experiment. 
Then you go for what

132
00:07:03,000 --> 00:07:06,600
we typically call steady state,
which is just a fancy way of

133
00:07:06,600 --> 00:07:09,200
saying this is the normal kind
of behavior.

134
00:07:09,200 --> 00:07:12,000
This is the normal range. 
So let's say that for 

135
00:07:12,000 --> 00:07:14,600
observability. 
we're looking at a variable like

136
00:07:14,600 --> 00:07:17,700
throughput or number of requests
per second that some kind of 

137
00:07:17,700 --> 00:07:19,900
server can handle. The normal
range,

138
00:07:19,900 --> 00:07:23,700
the steady state, might be this
number of thousands of requests 

139
00:07:23,700 --> 00:07:26,800
per second. And then we go and we
do the fun stuff. 

140
00:07:26,800 --> 00:07:30,500
So we try to turn our 
expectations of the system into 

141
00:07:30,500 --> 00:07:33,600
a hypothesis and we say, okay, 
so we designed the system to be 

142
00:07:33,600 --> 00:07:37,400
redundant, and that means that if we
take away one of the servers from

143
00:07:37,400 --> 00:07:40,100
the pool,
it should keep working within

144
00:07:40,100 --> 00:07:44,600
these parameters. And you go to
number four, and you implement

145
00:07:44,600 --> 00:07:46,200
that and you verify what 
happened. 

146
00:07:46,400 --> 00:07:48,500
The nice thing about that is 
that everybody wins.

147
00:07:48,500 --> 00:07:51,700
Because if your hypothesis was 
wrong, then you discover 

148
00:07:51,700 --> 00:07:54,200
something you can fix. 
And you can save yourself

149
00:07:54,200 --> 00:07:56,500
trouble and fix it before your
users

150
00:07:56,500 --> 00:07:59,800
notice. If your hypothesis was
right, was correct,

151
00:07:59,800 --> 00:08:02,200
that means that your system is
pretty good. 

152
00:08:02,300 --> 00:08:06,100
So you increase your confidence.
So, you know, it's the nice part

153
00:08:06,100 --> 00:08:09,100
of doing chaos engineering.
So this is really it. Like, it

154
00:08:09,100 --> 00:08:11,100
doesn't need to be more 
complicated. 

155
00:08:11,200 --> 00:08:15,100
And then obviously, it came out 
of Netflix and it made for some 

156
00:08:15,100 --> 00:08:17,900
really good headlines because 
they were already doing this in 

157
00:08:17,900 --> 00:08:22,000
production. Breaking things
in production is typically one

158
00:08:22,000 --> 00:08:24,600
of the internet memes.
So if someone comes out and says

159
00:08:24,600 --> 00:08:27,300
that with a straight face, it's 
a bit of a controversy. 

160
00:08:27,700 --> 00:08:30,600
So that's really what I see
as chaos engineering.

161
00:08:30,900 --> 00:08:34,200
I think it's simple enough that 
pretty much everybody can

162
00:08:34,200 --> 00:08:37,400
at least kick the tires and see
what value they get out of it.

163
00:08:37,799 --> 00:08:40,799
From your explanation just now,
so there's nothing mentioning

164
00:08:40,799 --> 00:08:43,600
about chaos,
so to speak. It's more about

165
00:08:43,600 --> 00:08:45,700
a scientific experiment.
You know what

166
00:08:45,700 --> 00:08:48,700
you're going to test.
You have a hypothesis, and

167
00:08:48,700 --> 00:08:51,300
you have the steady state you
want to test about and then you 

168
00:08:51,300 --> 00:08:53,600
introduce some kind of tests and
experiments. 

169
00:08:53,700 --> 00:08:57,500
Not necessarily all that chaotic
stuff, like just killing things

170
00:08:57,500 --> 00:09:00,200
and doing some like one data 
center takedown or something 

171
00:09:00,200 --> 00:09:02,800
like that, but it's not 
necessarily about the chaos

172
00:09:02,800 --> 00:09:04,300
itself. 
Although a lot of people 

173
00:09:04,300 --> 00:09:07,500
actually have this misconception:
okay, let's just introduce this

174
00:09:07,500 --> 00:09:09,100
chaos software, and then
everything

175
00:09:09,100 --> 00:09:11,500
just goes haywire.
And let's see how the system 

176
00:09:11,500 --> 00:09:14,300
goes here. 
So chaos is a little bit of a 

177
00:09:14,400 --> 00:09:18,700
double-edged sword because on 
one hand, it catches attention 

178
00:09:18,700 --> 00:09:20,600
in headlines.
On the other hand,

179
00:09:20,600 --> 00:09:22,900
it requires a little bit of
explanation. 

180
00:09:23,200 --> 00:09:25,700
So at least two things to touch 
upon here. 

181
00:09:25,900 --> 00:09:28,200
One is that the chaos that
we mention here,

182
00:09:28,200 --> 00:09:31,100
it's not about increasing the
amount of chaos in your system.

183
00:09:31,100 --> 00:09:34,200
It's about decreasing the amount
of chaos. My definition is

184
00:09:34,208 --> 00:09:38,600
trying to make you picture
a lab coat and protective

185
00:09:38,600 --> 00:09:43,500
gloves and eyewear and stuff,
because if we can't control as 

186
00:09:43,500 --> 00:09:46,000
many of these variables as 
possible,

187
00:09:46,100 --> 00:09:49,600
we can't reliably confirm or
deny our hypothesis. 

188
00:09:49,900 --> 00:09:53,900
But the other reason for the 
chaos in the name is that 

189
00:09:53,900 --> 00:09:57,000
there's an entire spectrum of 
things that you can do with chaos

190
00:09:57,000 --> 00:10:00,900
engineering. Chaos Monkey, the
kind of thing that this entire 

191
00:10:01,000 --> 00:10:04,600
discipline started with really 
was very chaotic in the sense of

192
00:10:04,600 --> 00:10:07,500
randomness of the word. 
So, if you don't know much

193
00:10:07,500 --> 00:10:10,400
about your system or you're 
looking for the emergent

194
00:10:10,400 --> 00:10:13,900
properties, randomness can really
give you a lot of value, kind of

195
00:10:13,900 --> 00:10:16,400
out of the box and requires very
little setup. 

196
00:10:16,600 --> 00:10:20,300
So you take the system, you
release the Chaos Monkey on it.

197
00:10:20,400 --> 00:10:23,900
You let it run and you probably 
find some things. You can

198
00:10:23,900 --> 00:10:26,400
parametrize it.
You can make sure that it breaks

199
00:10:26,400 --> 00:10:28,400
only a certain percentage or
whatnot. 

200
00:10:28,700 --> 00:10:33,000
And this already gives you value
and I see this side of the spectrum

201
00:10:33,000 --> 00:10:36,300
as similar to the discipline of 
fuzzing in testing. 

202
00:10:36,600 --> 00:10:41,300
You produce a lot of valid 
inputs that you would probably 

203
00:10:41,300 --> 00:10:43,600
not
think of if you were writing

204
00:10:43,600 --> 00:10:46,900
unit tests manually. You just
run it through whatever you're 

205
00:10:46,900 --> 00:10:49,400
testing. 
And you might discover things 

206
00:10:49,400 --> 00:10:52,200
that you didn't think of
if you just cram all these inputs

207
00:10:52,200 --> 00:10:55,400
and, like, brute-force this
approach, right? 

208
00:10:55,600 --> 00:10:59,000
And so, if you start with a 
system, it's a really nice way 

209
00:10:59,000 --> 00:11:01,900
to get into chaos
engineering, because it doesn't 

210
00:11:01,900 --> 00:11:04,900
require much of a setup. 
You need to have a vague 

211
00:11:04,900 --> 00:11:07,900
understanding of the system. 
You can release it and you can 

212
00:11:07,900 --> 00:11:09,800
find things.
And the other thing I mentioned 

213
00:11:09,800 --> 00:11:13,700
is the emergent properties. 
It's also a fancy way of saying 

214
00:11:13,700 --> 00:11:17,800
that, but the greatest rate that
for those of your audience who

215
00:11:17,900 --> 00:11:21,900
have not heard about that, think
about neurons in your brain and 

216
00:11:21,900 --> 00:11:25,700
a single neuron doesn't have the
property of human consciousness or

217
00:11:25,700 --> 00:11:29,300
doesn't have a property of 
thinking per se, but then when 

218
00:11:29,300 --> 00:11:33,100
you put them all together and 
they interact, from within these

219
00:11:33,100 --> 00:11:36,800
interactions, you have the 
emergent property of human

220
00:11:36,800 --> 00:11:39,500
consciousness.
Same thing for another popular 

221
00:11:39,500 --> 00:11:42,800
example of the cells in your
heart. 

222
00:11:43,000 --> 00:11:46,700
None of them have the property 
of pumping blood and oxygenating

223
00:11:46,700 --> 00:11:50,800
your body, but put together they
create a system that actually 

224
00:11:50,800 --> 00:11:54,600
has this property. 
These examples give you very nice

225
00:11:54,600 --> 00:11:58,300
properties, right? 
But in any complex enough system,

226
00:11:58,300 --> 00:11:59,900
you're going to have the 
interactions that you just 

227
00:11:59,900 --> 00:12:03,200
didn't predict and sometimes 
they're going to be pretty bad 

228
00:12:03,200 --> 00:12:05,400
for you. 
So this kind of fuzzing,

229
00:12:05,500 --> 00:12:08,200
randomness,
chaos engineering side of the

230
00:12:08,200 --> 00:12:11,500
spectrum is great, and it's 
useful. 

231
00:12:11,500 --> 00:12:14,600
And it's part of the discipline.
Over the last few years,

232
00:12:14,600 --> 00:12:18,000
we've been trying to push for
the other side of the spectrum 

233
00:12:18,000 --> 00:12:21,100
where you go into the system,
you already know the system

234
00:12:21,100 --> 00:12:24,600
well, and you're working on
particular properties, and you

235
00:12:24,600 --> 00:12:27,800
want to make sure that 
particular failure scenarios are

236
00:12:27,800 --> 00:12:30,300
covered. From the SRE point of
view,

237
00:12:30,300 --> 00:12:33,400
if you had an outage that you
didn't predict because the

238
00:12:33,400 --> 00:12:37,300
system worked differently than 
you expected, you would want to make

239
00:12:37,300 --> 00:12:38,800
sure that doesn't happen
again.

240
00:12:39,000 --> 00:12:41,700
One of the best kind of 
regression tests that you can 

241
00:12:41,700 --> 00:12:45,000
come up with is to simulate that
failure scenario. 

242
00:12:45,300 --> 00:12:48,400
Make sure that the system 
actually still survives that. 

243
00:12:48,600 --> 00:12:52,100
So, the other side of the
spectrum, of the sophisticated,

244
00:12:52,100 --> 00:12:54,800
very deliberate practice,
where there is very little

245
00:12:54,800 --> 00:12:59,800
randomness, is also available to
you. Right now, as a practitioner

246
00:12:59,800 --> 00:13:03,600
of chaos engineering, you can 
pick wherever is best on that

247
00:13:03,600 --> 00:13:06,500
spectrum for you.
And I think that's

248
00:13:06,500 --> 00:13:09,600
definitely a good thing.
So, as a techie, everyone 

249
00:13:09,600 --> 00:13:12,200
understands at least the 
concept, but when you introduce 

250
00:13:12,200 --> 00:13:14,900
this to the management or the
business, for example, that we are

251
00:13:14,900 --> 00:13:18,300
going to introduce Chaos Monkey
or chaos engineering in our

252
00:13:18,300 --> 00:13:20,500
system,
how do you actually explain to

253
00:13:20,500 --> 00:13:22,600
them
what is the motivation and

254
00:13:22,600 --> 00:13:25,000
rationale behind this?
Because I'm sure any business, 

255
00:13:25,000 --> 00:13:27,500
and any executives, they all
want stability. 

256
00:13:27,700 --> 00:13:30,500
They don't want some kind of a 
randomness or attack to the

257
00:13:30,500 --> 00:13:33,300
system that is working fine. 
So how do you explain the 

258
00:13:33,300 --> 00:13:36,100
rationale and motivation behind 
this to sell it to them? 

259
00:13:36,800 --> 00:13:39,300
Yeah, that's probably like the 
number one question, and that 

260
00:13:39,300 --> 00:13:42,700
goes back to the name, being a 
little bit misleading, and a 

261
00:13:42,700 --> 00:13:44,800
little bit of a double-edged 
sword. 

262
00:13:45,000 --> 00:13:49,300
So it depends who you talk to. 
And I, in general, tend to have

263
00:13:49,300 --> 00:13:51,200
two big groups.

264
00:13:51,200 --> 00:13:53,100
Either
it's your colleagues or people

265
00:13:53,100 --> 00:13:56,800
who have had the experience of 
actually being paged in the 

266
00:13:56,808 --> 00:13:59,700
middle of the night. 
And I think the only real 

267
00:13:59,800 --> 00:14:03,900
argument you need to get them on
board is to tell them that,

268
00:14:03,900 --> 00:14:08,000
listen, if we do it right, there
is a huge potential for you to 

269
00:14:08,000 --> 00:14:11,500
be called less at night. 
So, like, basically, the

270
00:14:11,500 --> 00:14:13,700
worst-case scenario is that you
just don't go some Junior and

271
00:14:13,700 --> 00:14:17,200
you break something. If you do
it well and you do it with 

272
00:14:17,200 --> 00:14:21,800
common sense and you apply all 
the same best principles for 

273
00:14:21,900 --> 00:14:24,800
deploying the code, because the
chaos experiments are code like

274
00:14:24,900 --> 00:14:28,700
any other, really, you're going to
only affect a certain blast 

275
00:14:28,700 --> 00:14:31,100
radius anyway. But the
worst-case scenario is that,

276
00:14:31,100 --> 00:14:32,900
okay,
so we did chaos engineering and we

277
00:14:32,900 --> 00:14:35,400
broke something. 
So what happens then? One of the

278
00:14:35,400 --> 00:14:38,200
things that's worth
mentioning is that you're doing it

279
00:14:38,200 --> 00:14:41,100
on purpose.
So if it breaks, you're typically

280
00:14:41,100 --> 00:14:45,500
in the office, you're typically 
ready to jump to fix that. 

281
00:14:45,500 --> 00:14:49,100
You don't have the context
switch. Waking up in the middle of

282
00:14:49,108 --> 00:14:51,300
the night
when you're being paged to

283
00:14:51,300 --> 00:14:53,600
figure out what's going on is
nothing pleasant.

284
00:14:53,600 --> 00:14:56,800
I'm sure a lot of your audience 
will have had that experience. 

285
00:14:56,900 --> 00:14:59,700
You have to wake up. 
You need to make the coffee, 

286
00:14:59,700 --> 00:15:02,900
read through the alerts that
may or may not be easy to

287
00:15:02,900 --> 00:15:06,800
understand, deal with people who
are panicking because they just

288
00:15:06,800 --> 00:15:09,500
got woken up and they don't know
what's going on. You context

289
00:15:09,500 --> 00:15:12,500
switch, because between all these
things, you log in, it takes

290
00:15:12,500 --> 00:15:15,300
time. 
It's typically not a very 

291
00:15:15,300 --> 00:15:17,200
pleasant experience. 
Let's just be honest. 

292
00:15:17,500 --> 00:15:20,800
So, if you can minimize the 
number of times that this 

293
00:15:20,800 --> 00:15:25,900
happens, and instead try to do 
it more purposefully, that's a 

294
00:15:25,908 --> 00:15:30,400
big step up. 
So if you are on rota for

295
00:15:30,400 --> 00:15:32,500
supporting your system, this is 
probably

296
00:15:32,500 --> 00:15:36,100
all you need to know.
The other group is people who 

297
00:15:36,900 --> 00:15:40,300
do decision-making on this kind of
stuff, your managers.

298
00:15:40,300 --> 00:15:45,200
And for this kind of person, from my
experience, the best argument

299
00:15:45,200 --> 00:15:50,200
to start with is to do some
back-of-the-napkin math.

300
00:15:50,400 --> 00:15:52,900
Now, I kind of have a phrase for
that. 

301
00:15:53,500 --> 00:15:57,300
I typically call it the
hamburger versus shark problem.

302
00:15:57,600 --> 00:16:00,100
It's about the perception of 
risk. 

303
00:16:00,200 --> 00:16:04,100
So like I mentioned just a 
second ago, if you do it right

304
00:16:04,100 --> 00:16:06,200
and you are careful about the
blast radius,

305
00:16:06,300 --> 00:16:08,600
it's pretty much like releasing

306
00:16:08,600 --> 00:16:11,700
any other change to
production. You test things through

307
00:16:11,700 --> 00:16:15,000
the stages and you apply all the
same principles you do for all 

308
00:16:15,000 --> 00:16:17,700
other bits of code. 
But if you ask a person 

309
00:16:17,700 --> 00:16:20,800
in the street how afraid should they
be of sharks? 

310
00:16:20,900 --> 00:16:25,800
They have been already primed by
Hollywood movies, Jaws, The

311
00:16:25,800 --> 00:16:28,700
Meg, and all of that, to be
very afraid of sharks. 

312
00:16:28,900 --> 00:16:31,500
But if they actually look at the
statistics of how many people 

313
00:16:31,500 --> 00:16:34,500
die of shark attacks,
it's a very minuscule number.

314
00:16:34,600 --> 00:16:38,200
There are more people dying of
coconuts landing on their heads

315
00:16:38,200 --> 00:16:42,000
every year than there are of
sharks. Whereas the things that

316
00:16:42,000 --> 00:16:46,700
are really statistically likely 
to kill them, things like heart

317
00:16:46,700 --> 00:16:49,600
attacks or heart disease
in general, that take more than

318
00:16:49,600 --> 00:16:53,100
half a million of Americans 
every single year compared to 

319
00:16:53,200 --> 00:16:56,500
probably a single digit number 
for shark attacks in any given 

320
00:16:56,500 --> 00:16:59,500
year,
you see that this is actually

321
00:16:59,500 --> 00:17:01,100
statistically much more
dangerous.

322
00:17:01,100 --> 00:17:04,400
So next time you see that
hamburger, with all this grease

323
00:17:04,400 --> 00:17:06,200
and all of that,
think about it, that this

324
00:17:06,300 --> 00:17:08,599
is actually more likely to kill
you than the shark.

325
00:17:08,700 --> 00:17:10,400
What was the point
I'm trying to make here?

326
00:17:10,500 --> 00:17:13,700
The point I'm trying to make is
that chaos engineering, when you

327
00:17:13,700 --> 00:17:17,700
first hear about that and about 
introducing failure on purpose. 

328
00:17:17,800 --> 00:17:21,800
It's a lot like end-to-end 
testing of the unhappy paths, the

329
00:17:21,800 --> 00:17:24,000
way I see it. 
It's like the evolution. 

330
00:17:24,000 --> 00:17:27,500
You do the unit tests.
You test a small little subset.

331
00:17:27,800 --> 00:17:31,800
Then you do maybe component
testing, some kind of integration

332
00:17:31,800 --> 00:17:33,800
testing. 
Then, at some point you get to 

333
00:17:33,800 --> 00:17:36,200
end-to-end testing, where you 
typically test the

334
00:17:36,300 --> 00:17:38,900
happy path or some popular paths
and stuff like that. 

335
00:17:38,900 --> 00:17:41,600
And then chaos engineering is a lot
like end-to-end testing of the

336
00:17:41,600 --> 00:17:43,300
system,
when you take it as a whole,

337
00:17:43,700 --> 00:17:47,300
during unhappy events, when 
things break, when machines go 

338
00:17:47,300 --> 00:17:50,300
down, when the network gets slow,
and this kind of thing. 

339
00:17:50,600 --> 00:17:54,800
So for someone who is hearing, 
okay, we're going to add failure

340
00:17:54,800 --> 00:17:58,800
to our systems, and we would
like to also eventually do it in

341
00:17:58,800 --> 00:18:01,100
production,
it sounds scary, but that's like

342
00:18:01,100 --> 00:18:02,900
the shark.
If you think about it,

343
00:18:03,200 --> 00:18:06,100
there are ways to manage the
blast radius, and there are

344
00:18:06,300 --> 00:18:09,600
ways to manage that risk, and the
goal of the entire exercise is 

345
00:18:09,600 --> 00:18:13,000
to decrease the amount of chaos
and decrease the amount of risk 

346
00:18:13,000 --> 00:18:15,400
that you have in your system, 
not increase it. 

347
00:18:15,600 --> 00:18:19,500
So just a napkin, a few numbers 
run through that and you can 

348
00:18:19,500 --> 00:18:23,200
typically explain to your 
engineers that it's really, they

349
00:18:23,200 --> 00:18:25,900
shouldn't pay too much attention
to this scary-sounding name

350
00:18:26,100 --> 00:18:29,300
because there's a lot of return 
on investment to harvest. 

351
00:18:29,800 --> 00:18:33,100
So, if I hear you correctly,
chaos engineering doesn't

352
00:18:33,100 --> 00:18:36,200
necessarily mean that you have 
to run this experiment in

353
00:18:36,300 --> 00:18:39,400
production. You can also run it in
maybe a pre-prod or some kind

354
00:18:39,400 --> 00:18:42,000
of environment where you can
actually simulate and do

355
00:18:42,000 --> 00:18:43,600
all these tests.
Is that correct? 

356
00:18:43,800 --> 00:18:46,400
Although, if you don't try it in
production, you're excluded from

357
00:18:46,400 --> 00:18:48,100
the chaos community.
Okay. 

358
00:18:48,200 --> 00:18:50,900
So, you know, this is one of the
things because of all the blog 

359
00:18:50,900 --> 00:18:53,500
posts. 
It's shiny. Doing stuff like this

360
00:18:53,500 --> 00:18:56,200
in production is unorthodox, and
it makes for a great 

361
00:18:56,200 --> 00:18:58,800
presentation, but that's the 
Holy Grail, right? 

362
00:18:58,800 --> 00:19:02,900
You want to be so comfortable
with the system, that you've done

363
00:19:02,900 --> 00:19:05,200
this for so long in the other 
stages, 

364
00:19:05,200 --> 00:19:07,600
that it is now part 
of your routine and you do 

365
00:19:07,600 --> 00:19:09,300
it in production, and that's 
great. 

366
00:19:09,300 --> 00:19:12,100
And if you can do that, that's 
obviously great. 

367
00:19:12,200 --> 00:19:14,900
Because if you think about that,
you're never fully testing 

368
00:19:14,900 --> 00:19:17,300
things until they get into 
production. 

369
00:19:17,500 --> 00:19:20,700
The data patterns will be 
different usage patterns will be

370
00:19:20,700 --> 00:19:22,900
different. 
So, by definition, you can never

371
00:19:22,900 --> 00:19:26,800
fully test things until 
the actual proverbial rubber 

372
00:19:26,800 --> 00:19:29,400
hits 
the proverbial road. But the 

373
00:19:29,400 --> 00:19:32,900
common sense still applies: sticking the 
chaos engineering sticker on 

374
00:19:32,900 --> 00:19:34,700
your laptop 
doesn't necessarily give you a 

375
00:19:34,708 --> 00:19:38,800
moral absolution for bypassing 
common sense. Behind the scenes,

376
00:19:38,800 --> 00:19:42,100
the less shiny bit is that you 
are progressing 

377
00:19:42,100 --> 00:19:45,200
that the same way that you 
progress all other software. 

378
00:19:45,400 --> 00:19:48,000
You write this code. 
You run it in your test 

379
00:19:48,000 --> 00:19:50,300
environments. 
You might increase it. 

380
00:19:50,300 --> 00:19:52,900
You're probably going to do the 
same thing that you do 

381
00:19:52,900 --> 00:19:55,800
when you release any other 
software and limit the 

382
00:19:56,100 --> 00:19:59,900
percentage of traffic that goes 
through the systems that have 

383
00:19:59,900 --> 00:20:03,500
that and then increase it. 
If anything goes wrong, you roll

384
00:20:03,500 --> 00:20:06,700
back. 
So the same best practices really 

385
00:20:06,700 --> 00:20:09,600
apply. 
Once you get to the holy grail 

386
00:20:09,600 --> 00:20:11,600
and you're running in production, 
that's great. 

387
00:20:11,600 --> 00:20:14,900
You can write a blog post and go
to a conference and talk about

388
00:20:14,900 --> 00:20:19,800
it as some of us do, but it's 
not just about that. 

389
00:20:19,800 --> 00:20:23,400
And also, it's probably worth 
mentioning that there are cases 

390
00:20:23,400 --> 00:20:26,400
that are use cases where it's 
probably never going to be 

391
00:20:26,400 --> 00:20:30,100
okay. If your choice is either to 
potentially introduce this 

392
00:20:30,100 --> 00:20:33,400
failure and you're not confident
whether it's going to work and 

393
00:20:33,400 --> 00:20:36,200
someone might die because of 
that, that's probably not 

394
00:20:36,300 --> 00:20:39,700
the right moral choice to 
do that. Or if you have very 

395
00:20:39,700 --> 00:20:43,500
heavy contractual obligations. 
Now from the legal point of 

396
00:20:43,500 --> 00:20:46,000
view, you might not be able to 
do that, but it doesn't mean 

397
00:20:46,000 --> 00:20:50,000
that you can't get 80% of the 
value of harvesting the lower 

398
00:20:50,000 --> 00:20:53,900
hanging fruit in the process. 
So, no, that's a myth. 

399
00:20:54,400 --> 00:20:57,500
Thanks for clarifying that. 
So, obviously, for us people who

400
00:20:57,500 --> 00:21:00,100
are not experienced in it, 
there are many more 

401
00:21:00,100 --> 00:21:03,600
misconceptions, probably like the one
that excludes you from the chaos 

402
00:21:03,600 --> 00:21:07,200
engineer group. So maybe you can
tell us some of the other common

403
00:21:07,200 --> 00:21:09,900
misconceptions that people have? 
Sure, sure. 

404
00:21:09,900 --> 00:21:15,400
There is an entire suite of these,
and I typically, on a regular

405
00:21:15,400 --> 00:21:17,800
basis, 
ask my LinkedIn network to 

406
00:21:17,800 --> 00:21:21,600
say what's the biggest blocker 
and it still seems to be getting

407
00:21:21,600 --> 00:21:25,000
the buy-in in big part because 
of this misconception. 

408
00:21:25,000 --> 00:21:28,000
So the production one is a big 
one and that's typically 

409
00:21:28,000 --> 00:21:31,400
something that people really 
think. The other thing is 

410
00:21:31,400 --> 00:21:35,000
randomness. 
So a lot of people basically 

411
00:21:35,100 --> 00:21:38,800
disregard chaos 
engineering because they feel 

412
00:21:38,800 --> 00:21:43,000
like it's a gimmick: Chaos 
Monkey randomly smashing 

413
00:21:43,000 --> 00:21:47,800
things. For bigger 
systems, like Netflix again, 

414
00:21:47,800 --> 00:21:51,200
that example works nicely 
because doing something as 

415
00:21:51,200 --> 00:21:53,900
simple as taking down 
VMs is already giving 

416
00:21:53,900 --> 00:21:56,500
you value. 
It might not work for a smaller 

417
00:21:56,500 --> 00:21:59,800
system. 
So explaining what we just 

418
00:21:59,800 --> 00:22:01,900
talked about, that 
it's not just about the 

419
00:22:01,900 --> 00:22:04,800
randomness, 
it's about the entire spectrum from 

420
00:22:04,800 --> 00:22:07,500
random to very deliberate, 
is also important. 

421
00:22:07,500 --> 00:22:10,700
Another thing 
I keep hearing is the "Oh, we 

422
00:22:10,700 --> 00:22:14,000
already have chaos enough. 
We don't need more." We've 

423
00:22:14,000 --> 00:22:17,400
touched upon this already too. 
But I really would like to drive

424
00:22:17,400 --> 00:22:19,700
the point home, that, if you 
still think that chaos 

425
00:22:19,700 --> 00:22:22,700
engineering is about adding 
chaos to your systems. 

426
00:22:23,000 --> 00:22:25,600
Then you still haven't gotten 
the memo. 

427
00:22:25,800 --> 00:22:30,700
We add the failure so that the 
amount of incertitude, the 

428
00:22:30,708 --> 00:22:33,200
amount of chaos in your system 
actually decreases. 

429
00:22:33,200 --> 00:22:36,100
So the net-net: we do all of 
that to 

430
00:22:36,200 --> 00:22:39,600
decrease the amount of chaos. 
So obviously under that 

431
00:22:39,600 --> 00:22:42,100
umbrella, 
a lot of people say that because

432
00:22:42,100 --> 00:22:45,800
they don't feel that confident 
in their systems, all of our 

433
00:22:45,800 --> 00:22:49,100
systems break. 
And if you are just hearing 

434
00:22:49,100 --> 00:22:52,900
about this and you had a massive
outage last week, you might feel

435
00:22:52,900 --> 00:22:55,600
okay, fine. 
I need more maturity or 

436
00:22:55,600 --> 00:22:58,400
something like that. 
I think that's a false premise 

437
00:22:58,400 --> 00:23:02,100
too, because regardless of where 
you are on your maturity 

438
00:23:02,100 --> 00:23:05,700
Spectrum, you can get some value
out of doing that. 

439
00:23:06,200 --> 00:23:09,800
Typically chaos experiments, 
unless you go really into the 

440
00:23:09,800 --> 00:23:12,700
deep weeds, 
are simple to set up. 

441
00:23:12,700 --> 00:23:16,200
They don't take that long to 
implement and especially as the 

442
00:23:16,200 --> 00:23:19,700
tooling around that gets better 
and better by the day. 

443
00:23:19,700 --> 00:23:23,800
It's easier and easier to do 
pretty sophisticated things with

444
00:23:23,800 --> 00:23:26,000
little effort. 
Worst case scenario, you don't 

445
00:23:26,000 --> 00:23:29,500
discover anything and you feel a
bit better because you know that

446
00:23:29,500 --> 00:23:32,600
this particular scenario is not 
going to take your system down 

447
00:23:32,800 --> 00:23:34,700
and best case scenario, 
you can discover something. 

448
00:23:34,700 --> 00:23:38,100
So it's really like small 
bets that can pay off, that 

449
00:23:38,100 --> 00:23:42,900
are pretty cheap to do. 
There's also the benefit of the 

450
00:23:42,900 --> 00:23:45,200
kind of mindset that goes with 
that. 

451
00:23:45,200 --> 00:23:49,300
If your engineers and your team 
think about the fact that this 

452
00:23:49,300 --> 00:23:53,000
is going to be tested, that this
is actually going to be put in 

453
00:23:53,000 --> 00:23:56,100
practice and they have to design
the systems with this 

454
00:23:56,100 --> 00:23:59,100
in mind, 
it just goes more naturally to 

455
00:23:59,100 --> 00:24:04,100
bake this reliability from day 
one rather than doing this as it

456
00:24:04,100 --> 00:24:06,600
breaks. 
So the mindset is also something 

457
00:24:06,600 --> 00:24:08,600
that we could probably talk 
about a bit more. 

458
00:24:08,700 --> 00:24:12,100
But this kind of thinking that "Oh, 
we're not that mature, 

459
00:24:12,100 --> 00:24:14,200
we still have outages and stuff 
like that," 

460
00:24:14,200 --> 00:24:17,800
it's really not helping, because 
at any stage in your maturity, 

461
00:24:17,900 --> 00:24:22,100
even if you're still getting 
outages on a regular basis, 

462
00:24:22,300 --> 00:24:26,600
you can get value out of that. 
So, no, it's not adding more 

463
00:24:26,600 --> 00:24:29,300
chaos. 
It's definitely not "we're not 

464
00:24:29,300 --> 00:24:32,700
mature enough." 
I think one more is about the 

465
00:24:32,700 --> 00:24:38,800
scale. People look at Netflix 
and they look at Google and they

466
00:24:38,800 --> 00:24:42,400
say, okay, 
yeah, these are massive companies

467
00:24:42,400 --> 00:24:46,300
and they're bleeding edge and 
they have all the scale and 

468
00:24:46,300 --> 00:24:49,500
that's great. 
But it's also worth noting that 

469
00:24:49,500 --> 00:24:52,500
chaos engineering doesn't 
require you to have all these 

470
00:24:52,500 --> 00:24:55,000
fancy and massive distributed 
systems. 

471
00:24:55,300 --> 00:24:58,500
It can really work. 
It's more of a methodology that 

472
00:24:58,500 --> 00:25:02,300
can really work on any kind of 
system whether you're working 

473
00:25:02,300 --> 00:25:05,800
with a single-process legacy 
server 

474
00:25:06,100 --> 00:25:09,500
and you would like to make sure 
that you understand how it 

475
00:25:09,500 --> 00:25:12,800
breaks and that the kind of things 
that you expect to happen 

476
00:25:13,100 --> 00:25:18,200
don't break it, retries or 
whatever recovery logic. 

477
00:25:18,300 --> 00:25:19,800
You can do that. 
Nothing stopping you. 

478
00:25:20,200 --> 00:25:25,000
So really don't be shy, kick the
tires and you'll see that with a

479
00:25:25,000 --> 00:25:27,400
little bit of investment. 
You can get a lot of value out 

480
00:25:27,400 --> 00:25:31,500
of that and you don't need to be
like, oh no, we're too small for

481
00:25:31,500 --> 00:25:33,900
that. 
So I think that's like my top 

482
00:25:33,900 --> 00:25:35,900
myths that 
I just keep hearing over and 

483
00:25:35,900 --> 00:25:38,000
over. 
It's probably too late though to

484
00:25:38,000 --> 00:25:40,600
change the name. 
So we're probably going to be 

485
00:25:40,600 --> 00:25:43,200
hearing them. 
Thanks for clarifying, all these

486
00:25:43,200 --> 00:25:46,100
misconceptions and myths about 
chaos engineering. 

487
00:25:46,200 --> 00:25:48,800
So in the beginning, I think you
have mentioned this as well for 

488
00:25:48,800 --> 00:25:50,400
people who have heard about 
this. 

489
00:25:50,400 --> 00:25:51,800
So they want to give it a try. 
Right. 

490
00:25:52,200 --> 00:25:55,400
Can you summarize again the four
steps that they need in order to

491
00:25:55,400 --> 00:25:59,200
start introducing this chaos 
engineering? Sure thing. 

492
00:25:59,400 --> 00:26:03,500
So first is the observability 
and I like the word because it's

493
00:26:03,500 --> 00:26:05,900
very technical. 
But what it means is being able 

494
00:26:05,900 --> 00:26:09,000
to measure something 
reliably. What I mean by 

495
00:26:09,000 --> 00:26:12,500
reliably is that in computers, 
you know, we're kind of spoiled 

496
00:26:12,500 --> 00:26:15,300
because we can measure some 
things reasonably 

497
00:26:15,300 --> 00:26:18,800
well. If we were quantum 
physicists, it would be a bigger 

498
00:26:18,800 --> 00:26:22,900
problem because the measurement 
could affect what we try to 

499
00:26:22,900 --> 00:26:24,800
measure. 
And then we have the uncertainty

500
00:26:24,800 --> 00:26:27,500
principle to begin with. 
But what I'm getting at here is 

501
00:26:27,500 --> 00:26:31,700
that it's important to be good 
at observing 

502
00:26:31,700 --> 00:26:34,700
this. It can be very simple. 
This variable can be anything. 

503
00:26:34,700 --> 00:26:37,400
It could be whether the server is 
up or down, that's 

504
00:26:37,400 --> 00:26:41,900
something you can observe, or 
throughput, or number of requests

505
00:26:41,900 --> 00:26:46,000
per second, or anything, really. 
And then, once you have that, I 

506
00:26:46,000 --> 00:26:49,100
mentioned, the steady state, the
normal range of that is how you 

507
00:26:49,100 --> 00:26:52,400
verify what's going on. 
So normally the server is up, 

508
00:26:53,300 --> 00:26:56,300
or normally I get this many 
requests per second. 

509
00:26:56,400 --> 00:27:01,800
And then we hypothesize. 
We say, OK, so we are disk-write 

510
00:27:01,800 --> 00:27:03,600
heavy. 
Let's see that 

511
00:27:03,600 --> 00:27:06,000
we're sure that we understand 
what's happening, 

512
00:27:06,100 --> 00:27:10,000
meaning if someone steals a 
little bit of the disk. So we 

513
00:27:10,000 --> 00:27:15,300
might introduce a hypothesis: 
if the disk becomes 50% slower,

514
00:27:15,400 --> 00:27:20,100
I still hit, let's say, 50% of 
my steady state in terms of 

515
00:27:20,100 --> 00:27:23,100
requests per second. 
And then the last step is to 

516
00:27:23,100 --> 00:27:25,800
implement: take one of 
the many tools that are 

517
00:27:25,800 --> 00:27:29,900
available right now to do that, 
and make sure that it doesn't 

518
00:27:29,900 --> 00:27:33,800
affect your observability. 
Go run it and see what happens. 

519
00:27:34,200 --> 00:27:35,600
More often 
than not, you're going to 

520
00:27:35,600 --> 00:27:37,500
discover 
things that you might not have 

521
00:27:37,500 --> 00:27:39,700
predicted. 
Thanks for summarizing that 

522
00:27:39,700 --> 00:27:41,600
again. 
So just to recap: first, 

523
00:27:41,600 --> 00:27:43,500
you need to have good 
observability. 

524
00:27:43,500 --> 00:27:44,700
So I think this is a must, 
right? 

525
00:27:44,700 --> 00:27:47,500
You cannot introduce something 
chaotic without actually knowing

526
00:27:47,500 --> 00:27:49,900
and observing how the system 
behaves, because that kind of 

527
00:27:49,900 --> 00:27:51,200
defeats the purpose, 
I guess. 

528
00:27:51,500 --> 00:27:54,100
Then the second one is you need 
to know your steady state. 

529
00:27:54,200 --> 00:27:57,900
So what exactly is your system at 
the normal state before you 

530
00:27:57,900 --> 00:28:01,400
introduce this chaos. And the 
third, you start to hypothesize 

531
00:28:01,400 --> 00:28:04,500
and make experiments, I guess, 
like what kind of things 

532
00:28:04,500 --> 00:28:05,900
you want to test against the 
steady 

533
00:28:06,100 --> 00:28:08,000
state. 
And then the fourth, the last 

534
00:28:08,000 --> 00:28:10,800
one will be to implement that. 
So I hope I summarized it 

535
00:28:10,800 --> 00:28:12,600
correctly. 
You mentioned as well 

536
00:28:12,600 --> 00:28:15,500
there's this chaos engineer 
community and things like that. 

537
00:28:15,600 --> 00:28:19,100
Are there specific skills 
actually required from a chaos 

538
00:28:19,100 --> 00:28:21,000
engineer compared to, 
I don't know, like a normal 

539
00:28:21,000 --> 00:28:24,900
engineer or an SRE? Or maybe 
you can clarify on that part: any 

540
00:28:24,900 --> 00:28:28,000
particular skills for a chaos 
engineer? Yeah. 

541
00:28:28,000 --> 00:28:31,100
So this is one of the things 
that I really like about chaos

542
00:28:31,100 --> 00:28:35,100
engineering is that it really 
cuts through different stacks 

543
00:28:35,100 --> 00:28:37,800
and different 
technologies and different kinds of 

544
00:28:37,800 --> 00:28:41,400
designing systems. 
It's really more of a meta skill

545
00:28:41,400 --> 00:28:44,600
the way that I see it. 
That also means that it's not 

546
00:28:44,600 --> 00:28:47,600
necessarily a job description. 
I mean, some people have this 

547
00:28:47,600 --> 00:28:52,100
job description, but it's more 
of a mindset / skill, that a lot

548
00:28:52,100 --> 00:28:54,300
of different people can have 
that. 

549
00:28:54,500 --> 00:28:58,600
So from what I do daily, you 
know, as a kind of an SRE-type 

550
00:28:58,600 --> 00:29:03,000
person, is that I care a lot 
about my system not blowing up 

551
00:29:03,000 --> 00:29:06,900
and my system working well. 
And as the scale increases and 

552
00:29:06,900 --> 00:29:10,600
everything, 
you realize that failure is 

553
00:29:10,700 --> 00:29:14,500
"when" rather than "if", and you 
start seeing more and more of 

554
00:29:14,500 --> 00:29:18,700
that as it grows. 
So the primary driver for me is 

555
00:29:18,700 --> 00:29:22,300
to have another tool that helps 
me sleep better at night and 

556
00:29:22,300 --> 00:29:26,900
get paged less, or called at night.
So this is something that works 

557
00:29:26,900 --> 00:29:30,200
well for the SREs, but if
you're an application team, you 

558
00:29:30,200 --> 00:29:32,000
can also leverage the same 
thing. 

559
00:29:32,200 --> 00:29:35,400
You design your system, your 
application, whatever you're working 

560
00:29:35,400 --> 00:29:38,000
on, in a way 
such that it's the most reliable 

561
00:29:38,000 --> 00:29:41,000
possible. 
So you apply the same skills and

562
00:29:41,000 --> 00:29:43,800
there's nothing stopping you 
from running your own chaos 

563
00:29:43,800 --> 00:29:47,100
experiments. 
The SRE kind of person might be

564
00:29:47,100 --> 00:29:51,300
running a platform that runs a 
lot of client code that they 

565
00:29:51,300 --> 00:29:55,000
know little about, potentially, 
and on the application side, 

566
00:29:55,000 --> 00:29:58,300
you run fewer of the 
applications, but you have much 

567
00:29:58,300 --> 00:30:00,500
more intimate knowledge about 
them. 

568
00:30:00,600 --> 00:30:03,800
So, to give you a more tangible 
example, let's say that the 

569
00:30:03,808 --> 00:30:05,900
chances are that you're using a 
database. 

570
00:30:06,300 --> 00:30:10,200
One of the things that you 
write, as you write your test 

571
00:30:10,200 --> 00:30:13,500
cases, your unit tests, and you 
verify what happens when the 

572
00:30:13,500 --> 00:30:17,300
database becomes unavailable. 
You verify that the right errors 

573
00:30:17,300 --> 00:30:20,100
are returned, for example, or 
something like that, or that 

574
00:30:20,100 --> 00:30:22,800
there is a retry, something 
like that, and that's great. 

575
00:30:23,100 --> 00:30:24,700
That's hopefully what everybody 
is doing. 

576
00:30:24,900 --> 00:30:28,300
But now the question is, do you 
know what happens when the 

577
00:30:28,300 --> 00:30:32,100
database is still available? 
But it becomes a bit slower. 

578
00:30:32,400 --> 00:30:37,200
Do you know what happens if
all the connections to your 

579
00:30:37,200 --> 00:30:41,500
database are now being throttled 
because there's some busy 

580
00:30:41,600 --> 00:30:45,200
neighbor or just networking is 
overloaded and you get much 

581
00:30:45,200 --> 00:30:47,500
slower. 
Do you understand how the 

582
00:30:47,500 --> 00:30:50,300
compounding effects on all of 
that work? 

583
00:30:50,500 --> 00:30:53,800
And do you have things in place 
to work with that? 

584
00:30:53,900 --> 00:30:56,800
Or is your application just 
going to hang in forever, accept 

585
00:30:56,800 --> 00:30:59,700
enough connections, 
eat up the pool, and you get 

586
00:30:59,700 --> 00:31:03,200
stuck forever? 
So at every level of the stack, 

587
00:31:03,200 --> 00:31:05,900
whatever you're working on, you can 
use the same 

588
00:31:06,000 --> 00:31:10,900
thinking. I really try to drive this 
point home in my book, that's 

589
00:31:10,900 --> 00:31:15,000
much more applied and technical 
than the other books that were 

590
00:31:15,000 --> 00:31:17,600
available at the time. 
is that wherever you are on the 

591
00:31:17,600 --> 00:31:20,700
stack, on the platform, whether 
you're working with the kernel and

592
00:31:20,700 --> 00:31:25,500
you want to verify what happens 
when syscalls are blocked or are 

593
00:31:25,500 --> 00:31:28,800
delayed or something like that, 
to the platform levels, 

594
00:31:28,800 --> 00:31:32,900
networking, maybe Kubernetes 
level, containers level, all the 

595
00:31:32,900 --> 00:31:36,300
way up to the browser. Things 
like: OK, 

596
00:31:36,600 --> 00:31:41,500
so JavaScript is everywhere, 
but do you really test out what 

597
00:31:41,500 --> 00:31:46,000
happens when the front end code 
is having trouble connecting 

598
00:31:46,000 --> 00:31:49,000
or is 
actually getting slow responses?

599
00:31:49,100 --> 00:31:52,800
Do you still display coherent 
data rather than loading a 

600
00:31:52,800 --> 00:31:55,500
little bit here, 
a little bit there, and giving 

601
00:31:55,500 --> 00:31:59,600
people a false impression? 
So it's really applicable to all

602
00:31:59,600 --> 00:32:02,900
the levels of the stack. 
This is something that's very 

603
00:32:02,900 --> 00:32:07,100
exciting for me because you get 
to look at the And things and 

604
00:32:07,100 --> 00:32:09,500
you know, just contains to this 
and go box. 

605
00:32:10,000 --> 00:32:13,100
So when talking about these 
layers, I'm very interested 

606
00:32:13,100 --> 00:32:16,200
because you can do this at any 
layer, not just platform 

607
00:32:16,200 --> 00:32:20,000
level, which is what we 
normally hear from Netflix, but

608
00:32:20,000 --> 00:32:22,600
you can also run it at the 
application level, database 

609
00:32:22,600 --> 00:32:25,800
level, even network level, or like
what you said, doing it on the 

610
00:32:25,800 --> 00:32:27,600
browser itself. 
I haven't heard of it, to be honest,

611
00:32:27,900 --> 00:32:29,900
but maybe can you give us some 
examples, 

612
00:32:29,900 --> 00:32:32,900
like the names of the tools and 
what exactly they are testing 

613
00:32:32,900 --> 00:32:35,900
for these layers? Maybe some 
examples would be great for the 

614
00:32:36,000 --> 00:32:37,700
listeners. 
Oh, sure thing. 

615
00:32:37,700 --> 00:32:42,900
So let's say that we've all had 
this moment when someone gave us

616
00:32:42,900 --> 00:32:47,900
some Legacy piece of software 
that has been compiled 10 years 

617
00:32:47,900 --> 00:32:52,500
ago and it runs but the 
documentation is a little bit 

618
00:32:52,600 --> 00:32:55,700
iffy at best. 
There is more of a tribal 

619
00:32:55,700 --> 00:32:59,000
knowledge and you don't 
necessarily want to go and read 

620
00:32:59,000 --> 00:33:02,400
the Fortran code or whatever C 
code that it's implemented in. 

621
00:33:02,700 --> 00:33:05,700
And so if you work in a startup 
that was started last year, 

622
00:33:05,700 --> 00:33:08,900
maybe you're going to be 
able to skip that part of your 

623
00:33:08,900 --> 00:33:10,600
life, but it's like a rite of 
passage. 

624
00:33:10,800 --> 00:33:14,700
So one thing that you could do 
about it, let's say that it's 

625
00:33:14,700 --> 00:33:18,300
some kind of server and that's 
an example from my book. 

626
00:33:18,300 --> 00:33:20,600
If you want to actually go and 
play with that. 

627
00:33:20,600 --> 00:33:23,700
There's a VM that lets you start
that and play. 

628
00:33:23,900 --> 00:33:27,700
is that not a lot of people know
that you can use strace not 

629
00:33:27,700 --> 00:33:31,800
only to trace the system calls, 
but you can also use it to 

630
00:33:31,800 --> 00:33:35,800
implement changes, so you can 
error system 

631
00:33:36,100 --> 00:33:39,300
calls. 
Also, if you have a modern enough 

632
00:33:39,300 --> 00:33:42,500
version of strace, you can 
actually implement patterns 

633
00:33:42,500 --> 00:33:45,700
where you can fail, for example,
every other request. 

634
00:33:45,700 --> 00:33:48,700
So one of the things that you 
could do at this very, very low 

635
00:33:48,700 --> 00:33:52,600
level is that the server is 
doing a write to send the request 

636
00:33:52,600 --> 00:33:56,200
to you, so you can go and you can
verify that, 

637
00:33:56,200 --> 00:33:57,700
for example, if it can't do the 
write, 

638
00:33:57,700 --> 00:34:00,400
if it gets an error, there is 
some kind of retry logic. 

639
00:34:00,700 --> 00:34:04,200
So if you fail a portion of 
that, you're going to see whether 

640
00:34:04,200 --> 00:34:07,400
the retry logic kicks in 
and you get the right response 

641
00:34:07,400 --> 00:34:12,199
or if it breaks completely. 
So even if you don't know much 

642
00:34:12,300 --> 00:34:16,000
and you don't even see the 
source code of this and all you 

643
00:34:16,000 --> 00:34:18,900
have to go by is the tribal 
knowledge that, 

644
00:34:18,900 --> 00:34:21,800
Oh, yeah, it's supposed to 
recover, always recovers. 

645
00:34:22,800 --> 00:34:26,699
You can actually go and get some
mileage out of the most basic 

646
00:34:26,699 --> 00:34:29,300
thing that thing does. 
Because every program on your 

647
00:34:29,300 --> 00:34:31,800
system is going to have to go 
and do a write. 

648
00:34:31,800 --> 00:34:35,400
So this is something: a lot 
of people use strace to see 

649
00:34:35,500 --> 00:34:37,600
the low-level 
things it can do, but not 

650
00:34:37,600 --> 00:34:40,199
everybody knows that 
it actually lets you implement 

651
00:34:40,199 --> 00:34:43,300
chaos experiments 
like that. strace is also a great

652
00:34:43,300 --> 00:34:47,400
example of the importance of 
observability because the 

653
00:34:47,400 --> 00:34:52,400
penalty of running a program 
while it's being straced is 

654
00:34:52,400 --> 00:34:56,300
pretty high. 
So the measurement actually 

655
00:34:56,300 --> 00:34:59,900
affects the program that you're
running, it slows things down 

656
00:34:59,900 --> 00:35:03,000
very significantly. 
So there are other ways that you

657
00:35:03,000 --> 00:35:05,800
can observe these things. 
There are new technologies 

658
00:35:05,900 --> 00:35:09,300
that let you do that, 
like eBPF, the extended Berkeley 

659
00:35:09,300 --> 00:35:13,100
Packet Filter, that are becoming
increasingly common and 

660
00:35:13,100 --> 00:35:16,500
powerful, because you can do a 
lot of similar measurements from 

661
00:35:16,500 --> 00:35:20,100
the observability point of view 
without actually affecting this.

662
00:35:20,400 --> 00:35:24,600
But in a lot of scenarios, it's 
going to be workable because you

663
00:35:24,600 --> 00:35:28,300
can measure. If all you're 
interested in is success rate, 

664
00:35:28,300 --> 00:35:31,200
it might be okay for you to 
accept the penalty for doing 

665
00:35:31,200 --> 00:35:33,200
your test. 
Something you probably wouldn't 

666
00:35:33,200 --> 00:35:36,300
do on a production system 
because, obviously, it slows 

667
00:35:36,300 --> 00:35:38,300
things down and affects the 
users. 

668
00:35:38,600 --> 00:35:43,000
So this is like the lowest level
that I could think of and that's

669
00:35:43,000 --> 00:35:46,000
why I went all the way there in 
the book to show you that 

670
00:35:46,000 --> 00:35:49,700
worst-case scenario, single 
process, and you can still apply 

671
00:35:49,700 --> 00:35:54,100
the same principles. And then I 
mentioned Kubernetes. When you go

672
00:35:54,100 --> 00:35:57,400
a level up, there is a lot of 
stuff that Kubernetes 

673
00:35:57,400 --> 00:36:01,900
solves for you, but it also 
introduces its own complexity, 

674
00:36:02,100 --> 00:36:05,100
and if you don't have a good 
understanding and just expect 

675
00:36:05,100 --> 00:36:07,300
Kubernetes to always do the 
right thing, 

676
00:36:07,400 --> 00:36:11,100
you might be in for a surprise. 
If you are working with 

677
00:36:11,100 --> 00:36:15,000
Kubernetes as a product, if 
you're running Kubernetes for 

678
00:36:15,008 --> 00:36:17,500
someone else, 
it's also very important to 

679
00:36:17,500 --> 00:36:20,800
understand how Kubernetes 
itself fails and what the 

680
00:36:20,800 --> 00:36:23,100
fragile points are. 
And this is something that's 

681
00:36:23,100 --> 00:36:26,400
very easy to do 
with chaos engineering, and every

682
00:36:26,400 --> 00:36:29,700
one of those experiments is 
helping. Then there are things 

683
00:36:29,700 --> 00:36:31,600
like, you can build chaos 
engineering into your 

684
00:36:31,600 --> 00:36:35,500
application directly. 
Obviously, this means that 

685
00:36:35,500 --> 00:36:38,700
your code is part of the 
application and it's prone to 

686
00:36:38,700 --> 00:36:42,900
introducing new bugs, but you 
can do things like activate the 

687
00:36:42,908 --> 00:36:46,100
chaos engineering code only 
behind some kind of flag and 

688
00:36:46,100 --> 00:36:49,100
activate that for some 
percentage of your traffic. 

689
00:36:49,300 --> 00:36:52,800
Make sure that there is no run 
time penalty for running that. 

690
00:36:52,800 --> 00:36:54,400
And the code is not actually 
hit. 

691
00:36:54,400 --> 00:36:57,600
So you can break things for the 
instances where it's 

692
00:36:57,600 --> 00:37:00,000
activated. 
Also, in the browser, like I 

693
00:37:00,000 --> 00:37:03,600
described, this is something 
that people react to with a smile 

694
00:37:03,600 --> 00:37:04,600
initially: 
"So, 

695
00:37:04,600 --> 00:37:06,500
okay, 
well, I'm going to chaos-test 

696
00:37:06,500 --> 00:37:10,800
my browser software?" But it 
doesn't take much to show you 

697
00:37:10,800 --> 00:37:13,100
that. 
actually, if you display some 

698
00:37:13,100 --> 00:37:16,100
data from the previous request 
and some data from the new 

699
00:37:16,100 --> 00:37:18,200
request, 
you might actually trick the 

700
00:37:18,200 --> 00:37:22,200
user into doing something silly 
because they have stale data or 

701
00:37:22,200 --> 00:37:25,100
inconsistent data. 
So doing things like this is 

702
00:37:25,100 --> 00:37:27,900
also important. 
Yeah, that's three examples 

703
00:37:27,900 --> 00:37:29,800
that 
hopefully give you the entire 

704
00:37:29,800 --> 00:37:33,200
spectrum of going all the way up
from the bottom. 

705
00:37:33,600 --> 00:37:35,800
So I think you mentioned a lot 
of these examples in 

706
00:37:35,900 --> 00:37:39,100
your book. For those of you who 
are interested to know more or 

707
00:37:39,100 --> 00:37:40,700
even play around with some of 
these tools, 

708
00:37:40,700 --> 00:37:44,300
I think Nico explains it clearly
in the book: for this layer, for 

709
00:37:44,300 --> 00:37:45,800
this tool, 
how do you do that. 

710
00:37:45,800 --> 00:37:48,400
So make sure to read the book if
you are interested in them. 

711
00:37:48,600 --> 00:37:50,400
So, I think in the book, there's 
another thing that you 

712
00:37:50,400 --> 00:37:52,700
mentioned, which I find very 
interesting, which is what you 

713
00:37:52,700 --> 00:37:56,100
called chaos engineering, 
but for people. Why did you write

714
00:37:56,100 --> 00:37:59,800
a specific section for that? 
Actually, it's a little bit 

715
00:37:59,800 --> 00:38:06,300
unappreciated, but if you think 
about your people systems, Also 

716
00:38:06,300 --> 00:38:09,900
known as teams. 
A lot of the same rules that 

717
00:38:09,900 --> 00:38:12,700
apply when you're building a 
reliable computer system. 

718
00:38:12,700 --> 00:38:16,300
Also, apply to building reliable
human systems. 

719
00:38:16,600 --> 00:38:18,800
Probably the easiest way to
illustrate that

720
00:38:18,800 --> 00:38:24,100
is that every team is always
going to have some bottlenecks, 

721
00:38:24,100 --> 00:38:29,400
and the bottlenecks might be in 
the form of a throughput bottleneck,

722
00:38:29,600 --> 00:38:33,100
or they might be in the form of a
knowledge

723
00:38:33,100 --> 00:38:36,800
bottleneck. If you have only one
person who's capable of

724
00:38:36,800 --> 00:38:40,800
debugging a particular system 
and that person is on holiday,

725
00:38:40,900 --> 00:38:45,200
you're going to have a problem.
So a lot of the job of having a 

726
00:38:45,200 --> 00:38:49,800
performant and good team is to 
continuously try to find these

727
00:38:49,800 --> 00:38:53,800
bottlenecks and resolve them. And
a lot of this stuff happens 

728
00:38:53,800 --> 00:38:57,600
naturally. And if the people on
the team are thinking in this 

729
00:38:57,600 --> 00:39:01,900
kind of way, a kind of chaos
engineering way, it's going to 

730
00:39:01,900 --> 00:39:03,400
be good for your team long
term. 

731
00:39:03,700 --> 00:39:06,800
So this is something that the 
last section in the book is actually

732
00:39:06,800 --> 00:39:10,400
talking about in detail, and
it's going into the human aspect

733
00:39:10,400 --> 00:39:13,400
of getting the buy-in. 
So it's touching upon some of 

734
00:39:13,400 --> 00:39:16,700
the bottlenecks, some of the
misconceptions that we just

735
00:39:16,700 --> 00:39:20,700
described, and it goes into the
chaos engineering mindset.

736
00:39:21,000 --> 00:39:24,300
Basically, the way that if you 
think of your system with the 

737
00:39:24,300 --> 00:39:27,500
expectation of failures 
happening rather than a 

738
00:39:27,500 --> 00:39:29,400
possibility of failures
happening,

739
00:39:29,500 --> 00:39:32,500
it's going to help you build
more reliable systems from the 

740
00:39:32,500 --> 00:39:34,800
ground up. 
It's kind of like, if you have a

741
00:39:34,800 --> 00:39:40,600
good CI system that always runs
your unit tests, and every time

742
00:39:40,600 --> 00:39:45,100
you push a change upstream,
you're going to automatically 

743
00:39:45,100 --> 00:39:47,100
get the feedback from the unit
tests.

744
00:39:47,100 --> 00:39:51,200
It becomes part of the culture.
You get this immediate feedback,

745
00:39:51,300 --> 00:39:53,000
a quick turnaround for the
feedback:

746
00:39:53,000 --> 00:39:54,300
it works,
it doesn't work,

747
00:39:54,300 --> 00:39:55,500
it detected
the problem.

748
00:39:55,700 --> 00:39:59,900
It's similar with the chaos
engineering mindset: you

749
00:39:59,900 --> 00:40:03,900
build this into the automatic
thinking about things, and it's 

750
00:40:03,900 --> 00:40:05,700
not something that you might 
address later. 

751
00:40:05,800 --> 00:40:08,600
It's something that you
expect to happen. 

752
00:40:08,900 --> 00:40:11,300
It goes back again to the back 
of the envelope,

753
00:40:11,300 --> 00:40:14,700
the napkin, if you think about it:
it's not a question of

754
00:40:14,700 --> 00:40:18,200
if, it's a question of when, and
it's fairly easy to calculate,

755
00:40:18,300 --> 00:40:20,400
depending on your scale,
how often you're going to see

756
00:40:20,400 --> 00:40:23,300
that kind of failure.
And then the final bit of that 

757
00:40:23,300 --> 00:40:26,900
is that it was inspired by Dave
Rensin's presentation that he

758
00:40:26,900 --> 00:40:30,700
gave at one of the conferences,
where he basically described

759
00:40:30,700 --> 00:40:32,900
teams, and I love the
description,

760
00:40:32,900 --> 00:40:38,200
I still remember that, as a set
of bio-robots executing a

761
00:40:38,200 --> 00:40:43,200
distributed algorithm to produce
some work and he actually went 

762
00:40:43,300 --> 00:40:47,400
much into the details of the 
games that they came up with to 

763
00:40:47,400 --> 00:40:52,700
surface and detect the problems.
I basically followed his

764
00:40:52,700 --> 00:40:56,200
lead to include that in the
book and make sure that it's

765
00:40:56,200 --> 00:40:58,800
spread through the community.
And some examples of that. 

766
00:40:58,900 --> 00:41:02,200
I talked about the bottleneck in
terms of knowledge. 

767
00:41:02,500 --> 00:41:05,700
So, one of the things that you 
can do is just tell somebody 

768
00:41:05,800 --> 00:41:09,800
on a particular day that
they're not allowed to give

769
00:41:09,800 --> 00:41:13,900
anybody else any help on this
particular subject, and that will

770
00:41:13,900 --> 00:41:17,300
tell you whether the rest of the
team can basically do without 

771
00:41:17,300 --> 00:41:19,400
them. 
And if they can't, you've detected

772
00:41:19,400 --> 00:41:23,800
the bottleneck. He goes into a
lot of more advanced games,

773
00:41:23,800 --> 00:41:28,100
all the way up to basically 
telling people a fake outage is 

774
00:41:28,100 --> 00:41:30,200
happening, and seeing what 
happens. 

775
00:41:30,300 --> 00:41:34,000
and what they do to address that.
This obviously is a little bit

776
00:41:34,000 --> 00:41:35,700
more tricky because you need to
get the

777
00:41:35,900 --> 00:41:40,200
buy-in from the higher-ups, and
people might get confused.

778
00:41:40,200 --> 00:41:42,700
They might not know what's 
actually happening when they try

779
00:41:42,700 --> 00:41:43,700
to debug
something

780
00:41:43,700 --> 00:41:47,400
that's not really happening. Or
you might go all the way in and 

781
00:41:47,400 --> 00:41:50,600
actually go and break something 
on purpose without telling your 

782
00:41:50,600 --> 00:41:55,800
team to debug your reliability 
in terms of responding to that. 

783
00:41:55,800 --> 00:42:00,400
So Dave really went and
created some very interesting

784
00:42:00,400 --> 00:42:03,800
insight on that in that
presentation, and I just couldn't

785
00:42:03,800 --> 00:42:07,500
not include that in the book.
Hopefully, he still enjoys

786
00:42:07,500 --> 00:42:09,100
it there. 
Thanks for sharing that. 

787
00:42:09,100 --> 00:42:12,500
I think it's a very interesting 
concept: teams as a distributed

788
00:42:12,500 --> 00:42:15,300
system, kind of a mindset. 
So Miko, I know that you are 

789
00:42:15,300 --> 00:42:17,200
very active in this community, 
right? 

790
00:42:17,200 --> 00:42:19,000
For me,
all the cool stories I know about

791
00:42:19,000 --> 00:42:22,600
chaos engineering are all about
Netflix: Chaos Monkey, Simian

792
00:42:22,600 --> 00:42:24,700
Army, and all that.
Are there any other cool 

793
00:42:24,700 --> 00:42:27,800
examples that you have heard 
people showcase, maybe in the 

794
00:42:27,800 --> 00:42:30,400
conferences, or things that are
publicly available?

795
00:42:30,700 --> 00:42:32,200
Are there some cool things like 
that? 

796
00:42:33,300 --> 00:42:36,500
So it's interesting that you're
asking that, because what

797
00:42:36,500 --> 00:42:38,800
I've been trying to do over
the last few years is actually

798
00:42:38,800 --> 00:42:40,800
to make chaos engineering
very boring.

799
00:42:41,100 --> 00:42:43,300
Let me explain why I think 
boring is good. 

800
00:42:43,300 --> 00:42:47,400
When you think about the kind of
adoption curve of different new 

801
00:42:47,400 --> 00:42:49,800
technologies,
it always goes through the same

802
00:42:50,000 --> 00:42:54,100
bell-shaped curve. Initially,
it's a novelty.

803
00:42:54,100 --> 00:42:57,600
You have a very small early 
innovators population

804
00:42:57,600 --> 00:43:01,200
that's happy to put up with all
the shortcomings of that. 

805
00:43:01,400 --> 00:43:03,700
Then you have like potential 
early majority. 

806
00:43:03,700 --> 00:43:07,500
Then you have the late majority and
people who drag a little bit. 

807
00:43:07,600 --> 00:43:13,100
And so for a technology or
methodology to reach a wide

808
00:43:13,100 --> 00:43:17,200
audience, it has to become
mainstream enough,

809
00:43:17,200 --> 00:43:21,000
basically boring enough, that it
can be adopted by a lot of 

810
00:43:21,000 --> 00:43:25,300
people, because not everybody is
happy to work around

811
00:43:25,300 --> 00:43:28,800
the rough edges. 
So the example I usually

812
00:43:28,800 --> 00:43:32,200
bring up is the SpaceX
rockets. Not that long ago,

813
00:43:32,300 --> 00:43:35,600
seeing these rockets go all the
way to orbit,

814
00:43:35,800 --> 00:43:39,100
the boosters go all the way to
orbit and then automatically

815
00:43:39,100 --> 00:43:41,200
land,
looked like something from science

816
00:43:41,200 --> 00:43:44,000
fiction. 
They were the first ones to pull

817
00:43:44,000 --> 00:43:47,700
it off and it was amazing. 
I remember staying up late 

818
00:43:47,800 --> 00:43:51,900
because of the London time zone, and
watching the nine minutes or 

819
00:43:51,900 --> 00:43:56,200
whatever of the flight, and then
just landing like something from

820
00:43:56,200 --> 00:44:00,000
sci-fi. But then, over time,
as they got better at it,

821
00:44:00,000 --> 00:44:03,000
it stopped being exciting
because they stopped blowing up.

822
00:44:03,200 --> 00:44:05,500
They just go up,
go down,

823
00:44:05,700 --> 00:44:08,600
land on the drone ship
Of Course I Still Love You.

824
00:44:09,100 --> 00:44:13,200
It's just become so mainstream
that I no longer find myself 

825
00:44:13,200 --> 00:44:16,800
staying up for that. 
So now maybe Starship is

826
00:44:16,800 --> 00:44:19,800
something that I'm going to want
to look at. 

827
00:44:19,800 --> 00:44:23,600
But if they start landing every
test and they start doing it 

828
00:44:23,600 --> 00:44:27,600
routinely it becomes boring and 
so boring is good. 

829
00:44:27,900 --> 00:44:32,100
The same way that a smartphone
was something that was

830
00:44:32,100 --> 00:44:36,000
outrageous not that long ago, and
now everybody has one. You can get it

831
00:44:36,000 --> 00:44:38,500
on the cheap, and it's outrageously
good and quick.

832
00:44:38,500 --> 00:44:41,400
And you have access to so many 
different apps. 

833
00:44:41,500 --> 00:44:44,500
You have a small supercomputer
in your pocket all the time, and you

834
00:44:44,500 --> 00:44:48,300
don't even notice that. 
So I would really like chaos

835
00:44:48,300 --> 00:44:52,300
engineering to stop being about 
the exciting stuff that you can 

836
00:44:52,300 --> 00:44:55,900
do and break things in 
production and not get fired for

837
00:44:55,900 --> 00:44:58,700
that,
and instead become this routine

838
00:44:58,700 --> 00:45:01,500
thing that we do, just because 
it creates a lot of value. 

839
00:45:01,700 --> 00:45:03,400
There are benefits to doing 
that. 

840
00:45:03,600 --> 00:45:05,600
Yeah. 
I'm going to go in the opposite

841
00:45:05,700 --> 00:45:08,700
direction of that and say
that it's probably more about

842
00:45:08,700 --> 00:45:11,900
making it boring. When you think
about it, most of the

843
00:45:11,900 --> 00:45:15,100
low-hanging fruit...
It's like cyber security in the 

844
00:45:15,100 --> 00:45:17,800
movies:
we see the hackers just randomly

845
00:45:17,800 --> 00:45:20,700
punching the keyboard and 
streams of data going through 

846
00:45:20,700 --> 00:45:24,500
and "I'm in," and probably some
nice graphic turned out. 

847
00:45:24,700 --> 00:45:28,400
But in reality most of the 
low-hanging fruit is so boring 

848
00:45:28,400 --> 00:45:30,900
because you need to check all 
the routine things. 

849
00:45:31,100 --> 00:45:34,200
You need to stay up to date, and
you need to patch the things

850
00:45:34,200 --> 00:45:38,000
that are already known, and
you need to make sure that your 

851
00:45:38,000 --> 00:45:40,500
S3 bucket is not on public 
setting. 

852
00:45:41,200 --> 00:45:43,600
So that's the boring stuff. 
It doesn't make it into the movies,

853
00:45:43,600 --> 00:45:45,400
but that's where most of the
work is done. 

854
00:45:45,700 --> 00:45:48,400
So, yeah, if you want the 
exciting stuff, there's 

855
00:45:48,400 --> 00:45:51,800
definitely things on the
internet, but that's not really

856
00:45:51,800 --> 00:45:54,900
where most of the value is coming
from. Most of the value is

857
00:45:54,900 --> 00:45:58,000
coming from boring. 
Thanks for that, valid points. 

858
00:45:58,200 --> 00:46:00,200
Hopefully one day
we'll see all these chaos

859
00:46:00,200 --> 00:46:03,500
engineers not as unicorns,
so it's not only the cool companies that

860
00:46:03,600 --> 00:46:06,400
are able to do that, but
hopefully all the engineering

861
00:46:06,400 --> 00:46:09,300
teams are able to introduce some
kind of experiment, a chaos

862
00:46:09,300 --> 00:46:12,800
experiment in order to test the 
reliability of their system. 

863
00:46:13,000 --> 00:46:15,200
So Miko, thanks for spending
your time today. 

864
00:46:15,500 --> 00:46:17,800
Eventually, we come to the end 
of this conversation. 

865
00:46:18,000 --> 00:46:20,900
But before I let you go, 
normally I would ask this one 

866
00:46:20,900 --> 00:46:23,600
question to all my guests,
which is about three technical 

867
00:46:23,600 --> 00:46:26,800
leadership wisdom. 
So, can you share maybe some of 

868
00:46:26,800 --> 00:46:29,500
the wisdom that you have, maybe from
your career, or maybe from your

869
00:46:29,500 --> 00:46:32,800
chaos engineering experiments
that you have, so that the audience

870
00:46:32,800 --> 00:46:35,600
can learn and benefit from you?
Wow. 

871
00:46:35,800 --> 00:46:40,000
Wisdom, that's a big word. 
Okay, that's right. 

872
00:46:40,300 --> 00:46:45,100
I think one of the things that I
learned that probably gave me a lot

873
00:46:45,100 --> 00:46:48,000
of mileage
is that a lot of what we do is

874
00:46:48,000 --> 00:46:52,600
about removing the BS from the 
equation. Engineers, apart from

875
00:46:52,600 --> 00:46:56,700
the big egos and everything,
are really very finely tuned to

876
00:46:56,700 --> 00:47:00,300
detect BS.
And so, a lot of the tech

877
00:47:00,300 --> 00:47:03,900
leadership, and leadership in
tech in general, in my opinion,

878
00:47:03,900 --> 00:47:06,900
is about just making sure
that there is as little

879
00:47:06,900 --> 00:47:09,900
BS as possible.
So if someone asks you a 

880
00:47:09,900 --> 00:47:12,400
question, you have a few options:
you can say

881
00:47:12,400 --> 00:47:14,800
yes
or no if you know for sure the

882
00:47:14,800 --> 00:47:19,300
answer, you can say "I don't know"
if you don't know the answer, or

883
00:47:19,300 --> 00:47:22,800
you can try to BS your way
through it and try to come up 

884
00:47:22,800 --> 00:47:27,000
with something on the spot and I
got a lot of mileage just by 

885
00:47:27,000 --> 00:47:30,600
removing that last option. 
If you just tell people

886
00:47:30,600 --> 00:47:35,600
honestly "I know," "I don't know,"
yes or no, it builds this

887
00:47:35,700 --> 00:47:38,100
relationship that's valuable,
where they also feel like they

888
00:47:38,100 --> 00:47:41,200
can also say "I don't know."
I'm supposed to be an expert in 

889
00:47:41,200 --> 00:47:44,600
the field and paid a lot of 
money for that, and senior and

890
00:47:44,600 --> 00:47:48,200
everything, but there is stuff
that I don't know, and that's

891
00:47:48,300 --> 00:47:50,700
completely fine. 
You work with a lot of people 

892
00:47:50,700 --> 00:47:53,500
who are smarter than you, who
have more experience than you,

893
00:47:53,500 --> 00:47:57,400
who by definition are going
to be much better at parts of

894
00:47:57,400 --> 00:48:00,100
your job.
If you are humble enough to just

895
00:48:00,100 --> 00:48:04,800
say, "I don't know.
What do you think?", it gets

896
00:48:04,800 --> 00:48:07,200
you a lot of mileage.
And I think that really helped 

897
00:48:07,200 --> 00:48:11,400
me stay out of trouble. And kind
of a corollary to that, another

898
00:48:11,400 --> 00:48:13,900
piece of advice
is that regardless of what you

899
00:48:13,900 --> 00:48:16,400
think about yourself,
there's always something that

900
00:48:16,400 --> 00:48:21,000
you can learn from everybody. And
we talk a lot in this industry

901
00:48:21,000 --> 00:48:25,400
about working hard on increasing
the diversity of thought. Coming

902
00:48:25,400 --> 00:48:29,100
from a mindset that, okay,
if that person disagrees with

903
00:48:29,100 --> 00:48:33,100
me, there is probably something 
that they think about or they 

904
00:48:33,100 --> 00:48:35,500
know about that
I don't know.

905
00:48:36,200 --> 00:48:40,200
So if I just try to convince 
them to my way of doing things, 

906
00:48:40,300 --> 00:48:43,200
I'm not going to come out of 
that conversation wiser. 

907
00:48:43,500 --> 00:48:47,100
But if I at least by default 
give it a shot,

908
00:48:47,200 --> 00:48:49,200
maybe they're wrong,
maybe there's something

909
00:48:49,200 --> 00:48:52,600
I know that they do not know, but if
I give it a shot, I'm going to

910
00:48:52,600 --> 00:48:55,600
get more value out of that.
And if you keep getting value 

911
00:48:55,600 --> 00:48:58,400
from every conversation, you're
really going to accumulate that 

912
00:48:58,400 --> 00:49:02,100
over time. 
So yeah, I think that basically 

913
00:49:02,200 --> 00:49:05,800
the kind of being humble and 
knowing that you can learn all 

914
00:49:05,800 --> 00:49:09,800
the things from everybody and 
not BSing is probably what

915
00:49:09,800 --> 00:49:13,100
really helped me get where I am
right now. Because you asked for

916
00:49:13,100 --> 00:49:15,800
three,
I'm going to also try to attempt

917
00:49:15,800 --> 00:49:18,100
the third one. 
I think you probably hear that a

918
00:49:18,100 --> 00:49:22,100
lot in your show, but the kind
of learning mindset, lifelong

919
00:49:22,100 --> 00:49:25,200
learning, is very important.
This is something that some 

920
00:49:25,200 --> 00:49:28,000
people get built in and they 
start with that, and they're just

921
00:49:28,000 --> 00:49:30,000
super excited about learning. By
learning,

922
00:49:30,000 --> 00:49:34,000
I mean, like, a lot of things.
There might be a point where you

923
00:49:34,000 --> 00:49:36,800
know everything about your niche,
and there isn't much more to

924
00:49:36,800 --> 00:49:38,600
learn. Instead of stopping
learning,

925
00:49:38,600 --> 00:49:42,400
you should probably start
exploring another niche or maybe

926
00:49:42,400 --> 00:49:45,800
do something completely out of 
your comfort zone and go learn a

927
00:49:45,800 --> 00:49:50,100
language or do some art or
sport and see how that affects

928
00:49:50,100 --> 00:49:51,700
your brain.
In recent years,

929
00:49:51,700 --> 00:49:54,400
I've been reading a lot about
the brain and how it works.

930
00:49:54,600 --> 00:49:58,700
I found myself being able to 
influence a lot of weird things

931
00:49:58,700 --> 00:50:01,800
in my life just by picking up 
new skills that seemed to be 

932
00:50:01,800 --> 00:50:05,100
completely unrelated. 
It's easy these days: you can get

933
00:50:05,100 --> 00:50:07,100
an app,
you go on Duolingo, and

934
00:50:07,100 --> 00:50:09,900
you can pick up a language that
would normally cost a lot of

935
00:50:09,900 --> 00:50:13,900
money or would be impractical, but
you can take the first few steps

936
00:50:14,000 --> 00:50:17,700
for free or very cheaply.
Obviously the pandemic made it a

937
00:50:17,700 --> 00:50:20,200
bit more difficult to pick up a
new sport.

938
00:50:20,400 --> 00:50:23,900
Anyway, the point is that 
learning things might have 

939
00:50:24,000 --> 00:50:27,500
unexpected benefits to you. 
And this is something that I 

940
00:50:27,500 --> 00:50:30,900
really recommend doing. 
Yeah, I agree with that. 

941
00:50:30,900 --> 00:50:34,000
Thanks again for reminding us of this
important learning mindset. 

942
00:50:34,200 --> 00:50:37,000
So Miko, for people
interested to learn more about

943
00:50:37,000 --> 00:50:39,800
you or maybe your recent book,
where can they find you online?

944
00:50:39,800 --> 00:50:42,900
Maybe sure, sir? 
Probably the easiest way to 

945
00:50:42,900 --> 00:50:46,900
reach out to me is on LinkedIn 
if you want to interact. 

946
00:50:47,000 --> 00:50:50,400
Otherwise, I do have a mailing
list for the book.

947
00:50:50,400 --> 00:50:53,000
If you go to chaos
engineering dot news,

948
00:50:53,100 --> 00:50:57,400
you can sign up and get updates.
If you have any particular 

949
00:50:57,400 --> 00:51:01,300
updates about the book or I messed
something up, do reach out. 

950
00:51:01,400 --> 00:51:04,400
There is a GitHub repo for
the book where you can download

951
00:51:04,400 --> 00:51:08,100
the VM, and that's where you
can put issues.

952
00:51:08,300 --> 00:51:10,000
I'm pretty sure I messed some things
up.

953
00:51:10,100 --> 00:51:13,400
So looking forward to that. 
Hopefully, we can meet at chaos

954
00:51:13,400 --> 00:51:15,700
conferences when they become in
person

955
00:51:15,700 --> 00:51:18,800
again. Kind of looking forward
to them. 

956
00:51:19,300 --> 00:51:20,700
Don't like the way I think of it
all. 

957
00:51:20,700 --> 00:51:23,900
So it's like a chaos experiment 
in our life where this pandemic 

958
00:51:23,900 --> 00:51:27,100
suddenly throws people into 
different kind of living and 

959
00:51:27,100 --> 00:51:30,300
mindset and routines. 
Hopefully, we will end this 

960
00:51:30,300 --> 00:51:33,400
pandemic soon enough so that we 
all can live through our 

961
00:51:33,400 --> 00:51:35,500
previous normal life. 
So, thanks again,

962
00:51:35,700 --> 00:51:37,700
Miko.
I hope your chaos engineering

963
00:51:37,700 --> 00:51:40,900
mindset and your being the
champion of it will have more people

964
00:51:40,900 --> 00:51:43,600
being able to implement that in
their teams,

965
00:51:43,600 --> 00:51:46,900
their systems, and also their
companies so that we can make it

966
00:51:46,900 --> 00:51:48,100
more boring,
so to speak,

967
00:51:48,100 --> 00:51:50,900
like what you said. And I wish
you good luck for your career as

968
00:51:50,900 --> 00:51:52,400
well. 
Thank you. 

969
00:51:52,600 --> 00:51:54,700
It was really fun being on the 
podcast.

970
00:51:57,400 --> 00:52:00,800
Thank you for listening to this 
episode and for staying right 

971
00:52:00,800 --> 00:52:03,600
till the end. 
If you highly enjoyed it, please

972
00:52:03,600 --> 00:52:06,500
share it with your friends and 
colleagues who you think would 

973
00:52:06,500 --> 00:52:09,200
also benefit from listening to 
this episode. 

974
00:52:09,500 --> 00:52:12,400
And if you're new to the 
podcast, make sure to subscribe 

975
00:52:12,400 --> 00:52:15,300
and leave me your valuable 
review and feedback. 

976
00:52:15,400 --> 00:52:19,100
It really, really helps me a lot
in order to grow this podcast

977
00:52:19,100 --> 00:52:21,700
better. 
You can also find the full show 

978
00:52:21,700 --> 00:52:25,200
notes of this conversation on 
the episode page on the techleadjournal.dev

979
00:52:25,200 --> 00:52:28,600
website, including
the full transcript,

980
00:52:28,600 --> 00:52:32,200
interesting quotes, and links to
the resources and mentions from 

981
00:52:32,200 --> 00:52:35,000
the conversation. 
And lastly make sure to 

982
00:52:35,000 --> 00:52:37,500
subscribe to the show's mailing 
list on techleadjournal.dev

983
00:52:37,500 --> 00:52:40,900
to get notified of
any future episodes. 

984
00:52:41,300 --> 00:52:44,000
Stay tuned for the next 
Tech Lead Journal episode,

985
00:52:44,100 --> 00:52:45,700
and until then,
goodbye.

