1
00:00:00,200 --> 00:00:02,700
DevTernity is the top
international software

2
00:00:02,700 --> 00:00:06,700
development conference, with an 
emphasis on coding, architecture,

3
00:00:06,700 --> 00:00:10,000
and tech leadership skills.
The lineup for this year is 

4
00:00:10,000 --> 00:00:13,700
truly stellar and features many 
legends in software development,

5
00:00:14,000 --> 00:00:18,400
names such as Robert "Uncle Bob"
Martin, Kent Beck, Scott Hanselman,

6
00:00:18,700 --> 00:00:21,600
Venkat Subramaniam, Kevlin Henney,
Allen

7
00:00:21,600 --> 00:00:25,500
Holub, Mary Poppendieck, and many
other prominent names including 

8
00:00:25,500 --> 00:00:28,800
some of those who have also 
appeared on this podcast before.

9
00:00:29,300 --> 00:00:31,600
The conference
takes place online, so you

10
00:00:31,600 --> 00:00:34,000
can enjoy it from the comfort of
your couch. 

11
00:00:34,400 --> 00:00:37,400
We spoke to the DevTernity
organizers, and I'm happy to 

12
00:00:37,400 --> 00:00:39,900
share that Tech Lead
Journal has got a 10%

13
00:00:39,900 --> 00:00:44,500
discount code for you. 
Enter the promo code AWSM

14
00:00:44,800 --> 00:00:47,900
underscore TLJ
when you purchase the ticket on

15
00:00:47,900 --> 00:00:50,800
devternity.com. Here's the promo
code. 

16
00:00:50,800 --> 00:00:54,900
One more time: AWSM underscore
TLJ.

17
00:00:55,700 --> 00:00:58,800
Depending on the time when you 
purchase a ticket, early price 

18
00:00:58,800 --> 00:01:01,200
may still be available.
See you there!

19
00:01:01,600 --> 00:01:03,800
And that's a totally reasonable 
way to get started. 

20
00:01:03,900 --> 00:01:07,000
It's all about just, again, 
embracing those service truths 

21
00:01:07,000 --> 00:01:08,600
that we talked about at first,
right? 

22
00:01:08,600 --> 00:01:10,300
Reliability's the most important
thing.

23
00:01:10,500 --> 00:01:12,700
Your users define your reliability,
not you.

24
00:01:12,700 --> 00:01:15,800
So make sure you're measuring 
the right thing. And 100% is out

25
00:01:15,800 --> 00:01:17,500
of the question. 
So pick the right target. 

26
00:01:17,600 --> 00:01:21,500
You can embrace those truths
without real-time monitoring and

27
00:01:21,500 --> 00:01:24,500
advanced statistics and all the 
stuff that comes along with it.

28
00:01:24,500 --> 00:01:27,400
Just get started, even if
it's in a spreadsheet, even if

29
00:01:27,408 --> 00:01:33,800
it's only just once a month. Hey,
everyone.

30
00:01:34,300 --> 00:01:36,200
My name is Henry
Suryawirawan,

31
00:01:37,900 --> 00:01:40,800
and you're listening to the
Tech Lead Journal podcast,

32
00:01:41,100 --> 00:01:43,500
the show where I'll be bringing 
you the greatest technical 

33
00:01:43,500 --> 00:01:47,300
leaders, practitioners, and
thought leaders in the industry 

34
00:01:47,700 --> 00:01:52,000
to discuss their journeys,
ideas, and practices that we all

35
00:01:52,000 --> 00:01:55,600
can learn and apply to build a 
highly performing technical team

36
00:01:56,100 --> 00:01:58,400
and to make an impact in your 
personal work. 

37
00:01:58,900 --> 00:02:07,100
So let's dive into our journey.
Hello to all of you, my friends 

38
00:02:07,100 --> 00:02:09,500
and my listeners. Welcome to the
Tech Lead

39
00:02:09,500 --> 00:02:12,300
Journal podcast, the show where you
can learn about technical 

40
00:02:12,300 --> 00:02:15,600
leadership and excellence from
my conversations with great

41
00:02:15,600 --> 00:02:18,700
thought leaders out there.
And today is episode number

42
00:02:18,700 --> 00:02:21,100
96. 
Thank you for tuning in and 

43
00:02:21,100 --> 00:02:23,800
listening to this episode. 
If this is your first time 

44
00:02:23,800 --> 00:02:26,800
listening to Tech Lead
Journal, make sure to subscribe and

45
00:02:26,800 --> 00:02:30,200
follow the show on your podcast 
app and social media on 

46
00:02:30,200 --> 00:02:32,200
LinkedIn,
Twitter, and Instagram.

47
00:02:32,500 --> 00:02:35,500
And for those of you who enjoy 
this podcast and want to

48
00:02:35,500 --> 00:02:39,000
contribute to the creation of 
future episodes, support me

49
00:02:39,100 --> 00:02:43,400
by subscribing as a patron at 
techleadjournal.dev/patron.

50
00:02:44,500 --> 00:02:47,400
Implementing SRE concepts
and best practices can be 

51
00:02:47,400 --> 00:02:50,000
daunting. 
Although Google released a few

52
00:02:50,000 --> 00:02:52,800
SRE books, including the
famous Site Reliability

53
00:02:52,800 --> 00:02:56,100
Engineering book,
many of us still have some gaps

54
00:02:56,100 --> 00:02:59,600
in terms of really understanding
the essence of the concepts and 

55
00:02:59,600 --> 00:03:03,000
practices, such as
service level indicators or

56
00:03:03,000 --> 00:03:06,800
SLIs, service level objectives or
SLOs, and error budgets.

57
00:03:07,100 --> 00:03:10,200
And on top of that, how can we 
start building a good SRE 

58
00:03:10,200 --> 00:03:13,900
culture and avoid some common
pitfalls, especially when 

59
00:03:13,900 --> 00:03:17,100
communicating the benefits of 
this set of practices, for

60
00:03:17,100 --> 00:03:19,400
example, to the business or 
stakeholders. 

61
00:03:19,900 --> 00:03:23,500
Also, do tools matter in 
implementing SRE? How

62
00:03:23,500 --> 00:03:27,300
reliable should our service be 
and how should we measure it? 

63
00:03:27,700 --> 00:03:29,800
These are some of the common 
questions that I know

64
00:03:29,800 --> 00:03:32,700
people usually ask
when introduced to SRE

65
00:03:32,700 --> 00:03:36,100
concepts. And if you do have the
same questions and thoughts in 

66
00:03:36,100 --> 00:03:39,300
your mind, then today's episode 
is definitely for you. 

67
00:03:39,800 --> 00:03:43,900
My guest for today's episode is 
Alex Hidalgo. Alex is the

68
00:03:43,900 --> 00:03:48,200
Principal Reliability Advocate
at Nobl9 and the author of

69
00:03:48,200 --> 00:03:50,700
the Implementing Service Level
Objectives book.

70
00:03:51,000 --> 00:03:54,800
Alex previously worked at Google
as a site reliability engineer

71
00:03:55,000 --> 00:03:58,900
and customer reliability
engineer, and also contributed

72
00:03:58,900 --> 00:04:01,100
multiple chapters to The Site
Reliability

73
00:04:01,300 --> 00:04:04,100
Workbook.
In this episode, we discuss the 

74
00:04:04,100 --> 00:04:07,500
practical guide on how to
implement SRE practices and

75
00:04:07,500 --> 00:04:12,000
service level objectives, or
SLOs. Alex started by explaining

76
00:04:12,000 --> 00:04:15,700
the basic concept of service 
reliability and the three service

77
00:04:15,700 --> 00:04:19,600
truths. He then explained the
concept of the reliability stack,

78
00:04:19,600 --> 00:04:22,900
which includes the famous SRE
concepts: SLIs, SLOs,

79
00:04:22,900 --> 00:04:26,000
and error budgets.
Alex then shared his insights

80
00:04:26,000 --> 00:04:29,100
on how we can define a service 
reliability target,

81
00:04:29,200 --> 00:04:33,300
why a higher reliability target
is expensive, and the risk of a

82
00:04:33,300 --> 00:04:38,100
service being too reliable.
Towards the end, Alex shared his

83
00:04:38,100 --> 00:04:41,300
tips on how we can start 
building an SRE culture and

84
00:04:41,300 --> 00:04:44,600
how we can use the error budget 
as a communication tool within 

85
00:04:44,600 --> 00:04:47,700
the organization. 
I very much enjoyed my 

86
00:04:47,700 --> 00:04:50,800
conversation with Alex and even 
though I have been learning the 

87
00:04:50,800 --> 00:04:52,500
SRE concepts for quite some
time

88
00:04:52,500 --> 00:04:55,600
now, there are still a number of
insights that I learned from

89
00:04:55,600 --> 00:04:58,500
Alex in this episode. 
And if you also find this 

90
00:04:58,500 --> 00:05:01,100
episode useful, please share it 
with your friends and

91
00:05:01,300 --> 00:05:04,700
colleagues who can also benefit
from listening to this episode.

92
00:05:05,100 --> 00:05:08,400
Leave a rating and review on 
your podcast app and share your 

93
00:05:08,400 --> 00:05:11,700
comments or feedback about this 
episode on social media. 

94
00:05:12,000 --> 00:05:15,100
It is my ultimate mission to 
make this podcast available to 

95
00:05:15,100 --> 00:05:17,500
more people. 
And I need your help to support 

96
00:05:17,500 --> 00:05:19,500
me towards fulfilling my 
mission. 

97
00:05:19,900 --> 00:05:21,700
Before we continue to the 
conversation,

98
00:05:21,900 --> 00:05:24,000
let's hear some words from our
sponsor. 

99
00:05:24,400 --> 00:05:27,600
Today's episode is proudly 
sponsored by Skills Matter,

100
00:05:27,900 --> 00:05:31,700
the global community and events
platform with more than 1000

101
00:05:31,700 --> 00:05:35,400
software professionals. Here,
members can organize their

102
00:05:35,400 --> 00:05:38,200
learning experiences around the 
technology topics

103
00:05:38,200 --> 00:05:42,000
they care about most. You get
on-demand access to their latest

104
00:05:42,000 --> 00:05:45,300
content, thought leadership
insights, as well as the 

105
00:05:45,300 --> 00:05:49,500
exciting schedule of tech events
running across all time zones. 

106
00:05:49,900 --> 00:05:53,400
So whether DevOps or data
science is your buzz, or you're a

107
00:05:53,400 --> 00:05:57,300
fan of functional programming or
all things cloud, you can make

108
00:05:57,300 --> 00:06:01,000
real connections with people who
share your interests. Head on

109
00:06:01,200 --> 00:06:04,100
over to skillsmatter.com to
become part of the tech 

110
00:06:04,100 --> 00:06:06,200
community that matters most to 
you. 

111
00:06:06,500 --> 00:06:09,700
It's free to join and you will 
find it easy to keep up with the

112
00:06:09,700 --> 00:06:12,800
latest tech trends.
Are you looking for a new cool 

113
00:06:12,800 --> 00:06:15,700
swag? Tech Lead
Journal now offers you some

114
00:06:15,700 --> 00:06:19,600
swags that you can purchase
online. These swags are printed on

115
00:06:19,600 --> 00:06:23,200
demand based on your preference 
and will be delivered safely to 

116
00:06:23,200 --> 00:06:27,000
you all over the world where 
shipping is available. Check out

117
00:06:27,000 --> 00:06:29,900
all the cool swags available by
visiting

118
00:06:30,000 --> 00:06:33,000
techleadjournal.dev/shop. Oh, and don't
forget to break yourself. 

119
00:06:33,100 --> 00:06:35,200
Once you receive any of those 
swags.

120
00:06:38,400 --> 00:06:40,300
Hi everybody. 
Welcome back to the Tech Lead

121
00:06:40,300 --> 00:06:44,000
Journal podcast. Today I have a
guest with me named Alex 

122
00:06:44,000 --> 00:06:46,900
Hidalgo. 
He's the principal reliability 

123
00:06:46,900 --> 00:06:51,100
Advocate at Nobl9, and he's
the author of a book titled

124
00:06:51,200 --> 00:06:54,400
Implementing Service Level
Objectives, and contributed to

125
00:06:54,400 --> 00:06:57,400
another book which is part of 
the SRE books from Google,

126
00:06:57,400 --> 00:07:00,100
which is titled The Site Reliability
Workbook.

127
00:07:00,200 --> 00:07:03,800
So as you can tell, today we are
going to talk a lot about SRE 

128
00:07:04,000 --> 00:07:07,500
and SLOs and things like that.
So if you are wondering about 

129
00:07:07,500 --> 00:07:10,500
these practices, SLOs, how to set
them right,

130
00:07:10,500 --> 00:07:12,700
and things like that, we're
going to cover it today. 

131
00:07:13,000 --> 00:07:15,100
So Alex, really,
thank you so much for your time

132
00:07:15,100 --> 00:07:17,300
today. Looking forward to this
conversation.

133
00:07:18,100 --> 00:07:21,200
Thanks so much for having me. 
So Alex, I like to actually 

134
00:07:21,200 --> 00:07:23,900
start with my guests to tell his
or her story. 

135
00:07:24,100 --> 00:07:26,300
Any career turning points, any
highlights? 

136
00:07:26,400 --> 00:07:29,200
So maybe if you can share yours.
Sure. 

137
00:07:29,500 --> 00:07:33,100
My route to where I am today
has been an interesting one. 

138
00:07:33,300 --> 00:07:37,100
I grew up as a computer nerd. 
My dad helped teach me how to 

139
00:07:37,100 --> 00:07:39,600
program when I was eight or nine
years old. 

140
00:07:39,600 --> 00:07:43,200
I was on the internet before the
web existed, back when it was all

141
00:07:43,200 --> 00:07:45,600
text.
So through most of my

142
00:07:45,600 --> 00:07:48,200
education,
I always assumed I wanted to 

143
00:07:48,200 --> 00:07:52,200
work with computers, so I didn't
go to college after high school.

144
00:07:52,700 --> 00:07:56,700
I went and I got a job in the 
tech industry, but that first 

145
00:07:56,700 --> 00:08:00,400
job was doing network security 
work for the government of the 

146
00:08:00,407 --> 00:08:02,600
US. 
And I hated it. 

147
00:08:02,800 --> 00:08:05,600
I was pretty good at it. 
Actually got a promotion within 

148
00:08:05,600 --> 00:08:09,800
a year but it made me miserable.
And so I thought I didn't want 

149
00:08:09,800 --> 00:08:12,400
to work with computers. 
I thought computers were just a 

150
00:08:12,400 --> 00:08:16,900
hobby for me, so I quit my job
and realized it was still time

151
00:08:16,900 --> 00:08:20,200
for me to go back to school. 
So I started college a few 

152
00:08:20,200 --> 00:08:22,300
years
after most people. I ended up

153
00:08:22,300 --> 00:08:26,300
studying philosophy and history 
and I took a bunch of creative 

154
00:08:26,300 --> 00:08:28,800
writing courses. 
All the while, computers were

155
00:08:28,800 --> 00:08:32,200
still a hobby, but I kind of
decided computers weren't

156
00:08:32,200 --> 00:08:35,200
for me as a career. 
Then I decided to move to New 

157
00:08:35,200 --> 00:08:37,799
York at the height of the 2008 
recession. 

158
00:08:38,100 --> 00:08:41,200
So right when the global economy,
and especially here in the US,

159
00:08:41,200 --> 00:08:44,300
things are just absolutely
tanking, suddenly I'm in one of

160
00:08:44,300 --> 00:08:47,900
the most expensive cities in the
world and I can't find a job. 

161
00:08:48,200 --> 00:08:52,100
My money's running out, whatever
I had of savings at the time 

162
00:08:52,300 --> 00:08:54,700
because in my 20s I worked in 
the service industry. 

163
00:08:54,700 --> 00:08:56,700
I was a server. 
I was a cook, I worked in a 

164
00:08:56,700 --> 00:08:59,800
warehouse. My money was
starting to run out.

165
00:08:59,800 --> 00:09:04,000
I kind of thought I can probably still
do this computer thing even if 

166
00:09:04,000 --> 00:09:07,500
it's just for now, even if it's 
just to make some money and get 

167
00:09:07,500 --> 00:09:12,000
back on my feet. The very first
tech job I applied for, which was

168
00:09:12,000 --> 00:09:14,700
an IT desktop support kind of
position,

169
00:09:14,900 --> 00:09:17,800
I was hired right away. 
I actually had so much fun. 

170
00:09:18,000 --> 00:09:20,500
It was at this small company, only
about 10 people.

171
00:09:20,700 --> 00:09:23,600
We were the IT department for 
companies that didn't have their

172
00:09:23,600 --> 00:09:25,700
own. 
So, every day I'd travel all 

173
00:09:25,700 --> 00:09:29,000
over New York City, every 
borough, all over the place, just

174
00:09:29,000 --> 00:09:31,000
helping people with whatever 
they

175
00:09:31,100 --> 00:09:33,100
needed help with, and I really
loved it. 

176
00:09:33,100 --> 00:09:36,600
I loved that human part, that
interacting with other people 

177
00:09:36,600 --> 00:09:39,500
and helping them learn how to 
use their computers. Fast

178
00:09:39,500 --> 00:09:41,000
forward
a few years, I'm working for

179
00:09:41,000 --> 00:09:44,700
this company called AdMeld now,
and AdMeld is purchased by

180
00:09:44,700 --> 00:09:49,300
Google, and suddenly I'm at
Google as a site reliability

181
00:09:49,300 --> 00:09:50,900
engineer. 
I didn't even know what that

182
00:09:50,900 --> 00:09:54,200
meant at first. 
But as I learned more and more 

183
00:09:54,200 --> 00:09:56,300
about it, it spoke to me so 
much. 

184
00:09:56,400 --> 00:09:59,200
It was really what I had always 
meant to do. 

185
00:09:59,500 --> 00:10:02,700
It was every bit of my
personality, like the things I 

186
00:10:02,708 --> 00:10:05,500
truly love, it was all kind of 
reflected in it. 

187
00:10:05,600 --> 00:10:09,300
The human aspects, thinking about
users, putting humans

188
00:10:09,300 --> 00:10:13,800
first, blameless culture for
incident response and incident 

189
00:10:13,800 --> 00:10:16,600
retrospectives, just everything 
about it. 

190
00:10:16,600 --> 00:10:19,500
I absolutely loved. 
So, I really fell into the role 

191
00:10:19,500 --> 00:10:22,100
really well. Fast forward
a few more years,

192
00:10:22,300 --> 00:10:25,500
my last role at Google before I
left was on the customer 

193
00:10:25,500 --> 00:10:29,600
reliability engineering team, or
CRE. The CRE team was a

194
00:10:29,600 --> 00:10:34,100
group of fairly veteran SREs.
We were tasked with teaching

195
00:10:34,100 --> 00:10:38,000
Google's largest cloud
customers how to SRE: how

196
00:10:38,000 --> 00:10:39,900
can people make their services 
more reliable? 

197
00:10:40,200 --> 00:10:44,300
We realized that the biggest
thing that we needed, because the

198
00:10:44,300 --> 00:10:47,500
customers we were engaging with
were not other tech companies,

199
00:10:47,500 --> 00:10:49,900
they were not subdivisions of
Google, right? 

200
00:10:49,900 --> 00:10:53,700
They were retailers and they 
were industry manufacturing 

201
00:10:53,700 --> 00:10:55,900
companies and they were all over
the place. 

202
00:10:56,100 --> 00:10:59,000
So we realized we needed a 
common vernacular, we needed a 

203
00:10:59,000 --> 00:11:03,000
shared language. 
And what we decided was SLOs,

204
00:11:03,200 --> 00:11:07,500
service level objectives, would
be that shared language. So I

205
00:11:07,500 --> 00:11:10,800
spent a good year or so traveling
all over and teaching all sorts 

206
00:11:10,800 --> 00:11:14,100
of people what SLOs are and
why they're so great and how to 

207
00:11:14,100 --> 00:11:15,800
use them. 
Because that was really the 

208
00:11:15,800 --> 00:11:19,200
building block that the CRE
team wanted in order for us to 

209
00:11:19,200 --> 00:11:21,200
engage with people in the best 
possible way. 

210
00:11:21,700 --> 00:11:24,500
I had some fun doing that but it
was my time to move on from 

211
00:11:24,500 --> 00:11:26,100
Google. 
I'd been there for a long time. 

212
00:11:26,400 --> 00:11:30,000
Everyone needs a change at some 
point so I moved over to 

213
00:11:30,000 --> 00:11:33,000
Squarespace. 
When I started at Squarespace, I

214
00:11:33,000 --> 00:11:37,900
was asked, hey, we want to do
SLOs. You know how to do SLOs.

215
00:11:38,100 --> 00:11:40,600
Can you help us?
And I was like, sure, no 

216
00:11:40,600 --> 00:11:43,900
problem. 
I set this up with my manager: I

217
00:11:43,900 --> 00:11:47,800
was going to spend like 60% of my
time on team work and about 40

218
00:11:47,800 --> 00:11:50,800
percent of my time teaching the
entire organization, including

219
00:11:50,800 --> 00:11:54,700
my own team, how to do SLOs.
I didn't realize how much work 

220
00:11:54,700 --> 00:11:56,700
that really was. 
When you're getting started from

221
00:11:56,700 --> 00:12:00,900
scratch, when you are starting 
from just the absolute bottom, 

222
00:12:01,100 --> 00:12:04,000
people barely even understand
what all the different terms 

223
00:12:04,000 --> 00:12:06,300
mean. 
It's a lot of work. 

224
00:12:06,700 --> 00:12:08,800
You need to build the right
tooling.

225
00:12:08,800 --> 00:12:10,600
Chances are your monitoring
systems

226
00:12:10,600 --> 00:12:14,200
don't even do SLO math
in the first place. You need to

227
00:12:14,200 --> 00:12:19,800
do education and workshops, need
to build document repositories. 

228
00:12:20,000 --> 00:12:23,000
It was a lot of work. 
We didn't define our first SLO

229
00:12:23,000 --> 00:12:24,700
at Squarespace for about six
months

230
00:12:24,700 --> 00:12:27,700
after I started. After about a
year and a half of that,

231
00:12:27,900 --> 00:12:31,800
I was running a workshop every 
Friday afternoon for four hours,

232
00:12:32,100 --> 00:12:35,900
from noon until 4 p.m., and we
had a break, but still, it was a

233
00:12:35,908 --> 00:12:39,900
four-hour workshop every single
Friday, and people really liked 

234
00:12:39,900 --> 00:12:42,200
them. 
They were popular, but I was 

235
00:12:42,200 --> 00:12:45,400
just tired of saying the same 
thing over and over again. 

236
00:12:45,800 --> 00:12:49,200
So, at some point, I was 
complaining to a friend of mine,

237
00:12:49,200 --> 00:12:51,800
there, like a good co-worker.
And I was just like, I wish 

238
00:12:51,800 --> 00:12:55,100
there was a book about this, an 
entire book, not just a chapter 

239
00:12:55,100 --> 00:12:58,800
in the SRE workbook, but a
whole book because that way I 

240
00:12:58,808 --> 00:13:02,000
could point people at it instead
of just doing workshops over and

241
00:13:02,000 --> 00:13:04,400
over again. 
My friend said, well, you should

242
00:13:04,400 --> 00:13:08,300
write it. And I said, no, an
expert should write it. And he said,

243
00:13:08,400 --> 00:13:12,100
you are the expert. 
And I cursed, like I straight up

244
00:13:12,100 --> 00:13:15,500
just like said curse words
because I knew he was right.

245
00:13:15,600 --> 00:13:19,100
I just didn't realize it until 
he had said it. I suddenly knew

246
00:13:19,100 --> 00:13:21,700
I was writing a book, and I'd
always heard how difficult that

247
00:13:21,700 --> 00:13:25,200
is. A few months later, I was
working on a book for O'Reilly,

248
00:13:25,300 --> 00:13:27,000
Implementing Service Level
Objectives.

249
00:13:27,200 --> 00:13:29,900
That directly led me to where
I am today. 

250
00:13:30,300 --> 00:13:34,400
I'm now at Nobl9, which is a
start-up based entirely around 

251
00:13:34,500 --> 00:13:37,500
how to do service level 
objectives, how to measure them.

252
00:13:37,500 --> 00:13:39,500
We do all the tooling for you 
etc.,

253
00:13:39,500 --> 00:13:42,300
etc.
So I came to fall in love with

254
00:13:42,300 --> 00:13:45,900
SLOs at Google because I
realized they made my life 

255
00:13:45,900 --> 00:13:48,200
better and they made my users'
lives better. 

256
00:13:48,400 --> 00:13:50,700
They made humans happier and 
that's always been the most 

257
00:13:50,700 --> 00:13:54,100
important thing to me. 
I've been very lucky to have 

258
00:13:54,100 --> 00:13:57,400
been able to focus primarily on 
SLOs for the last six,

259
00:13:57,400 --> 00:14:00,200
almost seven years of my career
now. It's a great place to be.

260
00:14:00,900 --> 00:14:02,600
Thank you so much for sharing 
your story. 

261
00:14:02,700 --> 00:14:06,500
It is a very beautiful story, a 
lot of ups and downs, including 

262
00:14:06,500 --> 00:14:08,400
how you got to find your 
passion. 

263
00:14:08,500 --> 00:14:12,000
And hopefully today, you are not
too tired to speak about SLOs one

264
00:14:12,000 --> 00:14:14,100
more time. 
So at least today, we're going 

265
00:14:14,100 --> 00:14:16,700
to cover some of the basics. 
Thank you so much for sharing 

266
00:14:16,700 --> 00:14:19,000
this story. 
So Alex, you wrote this book,

267
00:14:19,000 --> 00:14:22,400
Implementing SLOs. There are a
number of books, not many, but a

268
00:14:22,408 --> 00:14:25,200
number of books about SRE, SLOs,
and all that. 

269
00:14:25,500 --> 00:14:28,400
But I still find that people find
it difficult to grasp the

270
00:14:28,400 --> 00:14:30,400
concepts.
What do you think are

271
00:14:30,600 --> 00:14:33,900
some of the challenges that
people face in understanding 

272
00:14:33,900 --> 00:14:36,400
these practices?
So I think some of the biggest 

273
00:14:36,400 --> 00:14:39,900
problems with both 
understanding what SRE is,

274
00:14:40,100 --> 00:14:43,100
site reliability engineering, as
well as SLOs, what service level

275
00:14:43,100 --> 00:14:46,600
objectives are, is that no one's
really ever fully defined

276
00:14:46,600 --> 00:14:50,300
it. There is, of course, the first
Google SRE book.

277
00:14:50,500 --> 00:14:53,600
But you know what? 
It's like 30-something chapters 

278
00:14:53,800 --> 00:14:57,400
and not a single team at Google 
actually does all those things

279
00:14:57,400 --> 00:15:02,400
anyway. They're best practices.
And sure, there's the

280
00:15:02,400 --> 00:15:06,000
original definition from Ben
Treynor Sloss, basically the

281
00:15:06,000 --> 00:15:08,300
inventor of site reliability
engineering.

282
00:15:08,300 --> 00:15:12,600
He said SRE is what happens when
you ask software engineers to

283
00:15:12,600 --> 00:15:15,500
solve operational problems. 
That's great. 

284
00:15:15,500 --> 00:15:19,600
That's also very vague. 
And then, what is the difference

285
00:15:19,600 --> 00:15:22,800
between SRE and DevOps? Is a
DevOps engineer

286
00:15:22,800 --> 00:15:25,900
even a thing, or is DevOps an
approach?

287
00:15:26,100 --> 00:15:28,500
And then marketing teams got a
hold of it, right? 

288
00:15:28,500 --> 00:15:30,400
Suddenly the SRE book was 
selling really well. 

289
00:15:30,500 --> 00:15:34,300
So now suddenly, oh, we're
gonna have SRE companies,

290
00:15:34,300 --> 00:15:36,800
we're gonna have tooling
that is SRE tooling.

291
00:15:37,100 --> 00:15:39,900
There isn't one definition, and I
would actually say it doesn't really

292
00:15:39,900 --> 00:15:43,200
matter as long as our end goals
are kind of the same. And I see

293
00:15:43,200 --> 00:15:46,600
the same with SLOs.
I have my own definitions. 

294
00:15:46,900 --> 00:15:50,500
I do think there's a true kind
of definition that everyone can

295
00:15:50,500 --> 00:15:54,900
agree upon, but you know, in the
first and second Google SRE

296
00:15:54,900 --> 00:15:57,900
books, they don't even define
what an SLI is in

297
00:15:57,900 --> 00:16:00,000
the same way.
A service level indicator, which

298
00:16:00,000 --> 00:16:02,800
we can talk more about, is of
course a very important part of

299
00:16:02,800 --> 00:16:05,200
how to do SLOs. The two
different books don't even

300
00:16:05,200 --> 00:16:08,100
agree. 
So I think part of the confusion

301
00:16:08,200 --> 00:16:11,900
is just that there often
aren't single resources to point

302
00:16:11,900 --> 00:16:14,200
people at.
We don't have degrees for this.

303
00:16:14,500 --> 00:16:17,300
If you're purely a software
developer, you can get a degree 

304
00:16:17,300 --> 00:16:20,800
in computer science. Certain
algorithms have certain names,

305
00:16:20,800 --> 00:16:24,200
have very strict definitions. If
you're writing code in a

306
00:16:24,208 --> 00:16:27,300
certain language, you have to
adhere to the syntax of that

307
00:16:27,300 --> 00:16:31,300
language, for certain languages
very strictly. Now we can very

308
00:16:31,300 --> 00:16:35,600
clearly say this is a Java 
program versus this is a Python 

309
00:16:35,600 --> 00:16:39,400
program, but when it comes to 
more philosophical things like 

310
00:16:39,400 --> 00:16:43,000
site reliability engineering,
like service level objectives,

311
00:16:43,200 --> 00:16:45,500
I think one of the reasons 
people sometimes struggle is 

312
00:16:45,500 --> 00:16:47,800
because it can be more difficult
to start from scratch. 

313
00:16:48,100 --> 00:16:50,900
Like my story about Squarespace,
where I realized, even though I

314
00:16:50,900 --> 00:16:54,200
knew how to do these things, I
had this 1,200-person

315
00:16:54,200 --> 00:16:57,900
organization that I had to teach
from the ground up because there

316
00:16:57,900 --> 00:17:01,000
aren't these strictly defined 
resources that people can just

317
00:17:01,000 --> 00:17:05,300
learn. I think what you said
speaks truth to my experience as

318
00:17:05,300 --> 00:17:07,300
well. 
It's really hard to read those 

319
00:17:07,300 --> 00:17:09,700
SRE books,
by the way. It's very dense.

320
00:17:09,700 --> 00:17:11,900
Sometimes could be dry, 
sometimes could be Google 

321
00:17:11,900 --> 00:17:14,099
related. 
And there are not many available

322
00:17:14,099 --> 00:17:16,700
experts out there that can be
said to be certified

323
00:17:16,700 --> 00:17:20,099
SREs to actually tell you, this
is how the better practice

324
00:17:20,099 --> 00:17:23,599
should be. And many tool vendors
just came out and probably, like

325
00:17:23,599 --> 00:17:26,599
what you said, somehow polluted
the term. They may have defined

326
00:17:26,599 --> 00:17:28,300
their own definitions, and things
like that. 

327
00:17:28,500 --> 00:17:30,300
So I think that's one of the 
challenges.

328
00:17:30,600 --> 00:17:34,300
Today, let's try to also discuss
these concepts from the

329
00:17:34,300 --> 00:17:36,200
basics, since you have a lot of
experience.

330
00:17:36,200 --> 00:17:38,300
You wrote this book,
Implementing SLOs.

331
00:17:38,500 --> 00:17:42,000
But first of all, what is the 
definition of service and 

332
00:17:42,100 --> 00:17:44,700
reliability? 
Because some people actually use

333
00:17:44,700 --> 00:17:48,200
these terms interchangeably. 
So many different variations.

334
00:17:48,400 --> 00:17:52,100
Maybe let's start from there.
Yeah, absolutely. Service is

335
00:17:52,100 --> 00:17:55,200
actually not difficult to define,
because you probably already

336
00:17:55,200 --> 00:17:58,100
know it. It's what the word service
just means. 

337
00:17:58,400 --> 00:18:01,200
One of my favorite examples is 
that so much of my

338
00:18:01,200 --> 00:18:05,600
experience and so much of why I
love SLOs is because I used

339
00:18:05,600 --> 00:18:08,600
to be a server, like a server in a
restaurant. 

340
00:18:08,600 --> 00:18:13,600
I provided a service for people 
which was to take their orders 

341
00:18:13,600 --> 00:18:17,000
and bring them food. 
A computer server is not very 

342
00:18:17,000 --> 00:18:19,900
much different. 
It is a thing that takes your 

343
00:18:19,900 --> 00:18:24,400
requests and responds to them
correctly. It's very similar in

344
00:18:24,400 --> 00:18:27,000
concept. 
That's what a computer service 

345
00:18:27,000 --> 00:18:30,700
is. A computer service is something
that listens to a request

346
00:18:30,700 --> 00:18:34,800
from something and responds to
it appropriately. That, I think, is

347
00:18:34,800 --> 00:18:38,500
the best way to think about it,
instead of trying to define

348
00:18:38,600 --> 00:18:41,600
exactly what it means at a
technological level, because I

349
00:18:41,600 --> 00:18:44,400
don't think that's important. To
some teams,

350
00:18:44,800 --> 00:18:48,300
it is important to think of
their service as a pod running 

351
00:18:48,300 --> 00:18:52,000
in Kubernetes, or a series of
pods, or a Docker container

352
00:18:52,000 --> 00:18:57,100
somewhere, or a binary running on
a virtual machine, or a piece of

353
00:18:57,100 --> 00:19:02,400
hardware. Even networking gear:
those are services as well, and

354
00:19:02,400 --> 00:19:05,100
those don't fit any of those 
previous definitions. 

355
00:19:05,300 --> 00:19:08,700
For some people, the service 
they care about is the

356
00:19:08,700 --> 00:19:13,000
retail website. You go to a place
to buy a pair of socks. 

357
00:19:13,400 --> 00:19:16,700
Some people at that company
just care about that entire 

358
00:19:16,700 --> 00:19:21,800
website, even though it might be
composed of Ed individual tiny 

359
00:19:21,800 --> 00:19:25,300
micro Services can really be 
defined as anything that does 

360
00:19:25,300 --> 00:19:28,200
something for someone else. I
don't think we have to say,

361
00:19:28,200 --> 00:19:32,800
okay, that's a service because
it is a set of pods deployed via

362
00:19:32,800 --> 00:19:34,700
a single deployment on
Kubernetes, or

363
00:19:34,800 --> 00:19:37,600
whatever; this isn't a service
because it's a user journey.

364
00:19:37,700 --> 00:19:40,800
No, no, no. A service,
you can think of it

365
00:19:40,800 --> 00:19:43,100
holistically, you can think of 
it philosophically. 

366
00:19:43,300 --> 00:19:46,800
It's something that does 
something for someone. And that leads us

367
00:19:46,800 --> 00:19:48,900
right into what I think the 
correct definition of 

368
00:19:48,900 --> 00:19:52,200
reliability is, because people
often conflate it with

369
00:19:52,300 --> 00:19:54,900
availability, but they're very
different things, because a

370
00:19:54,908 --> 00:19:58,100
service can be available and not
be reliable. Reliability

371
00:19:58,100 --> 00:20:01,700
is an old term. Reliability
engineering goes back to the

372
00:20:01,700 --> 00:20:05,800
1940s.
It's not a concept unique to SRE

373
00:20:05,800 --> 00:20:09,500
or Google or tech or computers.
What reliability really means 

374
00:20:09,500 --> 00:20:14,600
is: is the system performing
how it was defined to perform?

375
00:20:14,900 --> 00:20:17,500
For computer services,
that basically means: is it doing

376
00:20:17,500 --> 00:20:21,100
what it needs to be doing?
If we are cool with the concept 

377
00:20:21,100 --> 00:20:25,000
of a service being a thing that
does something for someone else,

378
00:20:25,200 --> 00:20:28,400
then reliability is: is that
thing doing the thing it's

379
00:20:28,400 --> 00:20:31,200
supposed to be doing? The reason
I always try to steer

380
00:20:31,200 --> 00:20:33,700
people away from just thinking
about it as being all about

381
00:20:33,700 --> 00:20:37,000
availability is because you can
be very available and still be

382
00:20:37,000 --> 00:20:40,900
doing a bad job. Take that example
of the retail website, where you

383
00:20:40,900 --> 00:20:42,500
just need to buy a pair of 
socks. 

384
00:20:42,700 --> 00:20:45,600
Well, maybe you can log into the
website and it's very quick and 

385
00:20:45,600 --> 00:20:48,700
you can search and you get 
10,000 results for socks and 

386
00:20:48,700 --> 00:20:51,400
they have every color and every 
size and everything you ever 

387
00:20:51,400 --> 00:20:55,500
wanted, but then when you go to 
check out, you can't. That's not

388
00:20:55,500 --> 00:20:57,500
being reliable,
even if that service is being

389
00:20:57,500 --> 00:21:00,800
available to you at the time.
So a service is anything

390
00:21:00,800 --> 00:21:02,600
that's doing something for
something else,

391
00:21:02,800 --> 00:21:05,500
and reliability means:
are you doing that

392
00:21:05,500 --> 00:21:07,600
well enough?
Thanks for explaining.

393
00:21:07,600 --> 00:21:10,800
This is very, very simple. 
I think everyone here could 

394
00:21:10,800 --> 00:21:13,800
understand that really as part 
of the foundation, right? 

395
00:21:14,000 --> 00:21:16,400
When you understand what a
service is and what reliability is,

396
00:21:16,400 --> 00:21:18,400
the next few things will become
easier.

397
00:21:18,700 --> 00:21:22,500
But before we go into more SRE,
reliability and all that, I saw 

398
00:21:22,500 --> 00:21:25,100
one section in your book where 
you mentioned about service 

399
00:21:25,100 --> 00:21:27,200
truths. 
I think this is an interesting 

400
00:21:27,200 --> 00:21:29,700
concept for me. 
Would you be able to share

401
00:21:29,700 --> 00:21:32,800
what you mean by
these service truths, and what

402
00:21:32,800 --> 00:21:36,500
they are? Sure.
So I personally believe that

403
00:21:36,500 --> 00:21:40,300
there are three things that are
true about any service. One

404
00:21:40,300 --> 00:21:44,000
is that reliability is its most
important feature. If your

405
00:21:44,000 --> 00:21:47,300
service is not being reliable, 
it's not doing much. 

406
00:21:47,500 --> 00:21:50,200
As we just said, if we're 
defining reliability as: are you

407
00:21:50,200 --> 00:21:51,800
doing what
you're supposed to be doing?

408
00:21:52,100 --> 00:21:54,700
Well, then it kind of follows:
if you're not being reliable,

409
00:21:54,700 --> 00:21:56,300
you're not doing what you're
supposed to be doing.

410
00:21:56,600 --> 00:22:00,200
So you can always be thinking
about reliability first. The

411
00:22:00,500 --> 00:22:03,800
second truth is that you don't get
to decide what your reliability 

412
00:22:03,800 --> 00:22:05,700
is. 
I don't care what your 

413
00:22:05,700 --> 00:22:08,400
measurements say, I don't care 
what your logs say, what your

414
00:22:08,400 --> 00:22:11,800
metrics say. I don't care if you
have a million healthy pods all

415
00:22:11,800 --> 00:22:16,100
reporting up. If your customers,
your users, anyone that

416
00:22:16,100 --> 00:22:18,200
depends on you thinks you're
being unreliable,

417
00:22:18,200 --> 00:22:22,400
you are. If you're not meeting
the needs of your users, you're 

418
00:22:22,400 --> 00:22:25,200
not being reliable. 
So you need to take their 

419
00:22:25,200 --> 00:22:28,900
perspective into account. And
then the third service truth is

420
00:22:28,900 --> 00:22:31,600
that nothing is ever 100
percent, so don't aim for it.

421
00:22:31,700 --> 00:22:34,500
This is just a truth of the
world. Outside of pure

422
00:22:34,500 --> 00:22:38,400
mathematical constructs,
nothing's ever 100%. Things fail,

423
00:22:38,500 --> 00:22:40,900
failures occur. 
It turns out that people are 

424
00:22:40,900 --> 00:22:43,700
actually fine with failure,
as long as failure doesn't

425
00:22:43,700 --> 00:22:46,900
happen too often.
The third truth is, just don't

426
00:22:46,900 --> 00:22:49,900
aim for 100 percent, because
that's a fool's errand.

427
00:22:50,200 --> 00:22:52,600
So instead, pick a more
reasonable target.

428
00:22:52,700 --> 00:22:55,600
Understand what
reliable means. Understand that

429
00:22:55,600 --> 00:22:58,100
someone else determines what 
reliable means. 

430
00:22:58,300 --> 00:23:01,300
Understand that reliability is the
most important part of your service.

431
00:23:01,700 --> 00:23:04,800
And then make sure you're only
trying to be reliable enough,

432
00:23:05,000 --> 00:23:07,600
like an achievable amount,
something that works for both

433
00:23:07,600 --> 00:23:11,400
you and the people that depend
on you. As you can tell, these

434
00:23:11,400 --> 00:23:13,000
are the fundamental
understandings.

435
00:23:13,000 --> 00:23:16,800
So let me try to reiterate. The
first is that reliability is the

436
00:23:16,800 --> 00:23:19,900
most important attribute or
characteristic of your system.

437
00:23:20,100 --> 00:23:23,400
Your system may have so
many features or functional

438
00:23:23,400 --> 00:23:27,200
requirements, but if it doesn't
perform reliably, that's probably

439
00:23:27,200 --> 00:23:30,300
not a good service. And the other
one is you don't define your

440
00:23:30,400 --> 00:23:34,200
reliability, but users do it on
your behalf. So when they use

441
00:23:34,200 --> 00:23:36,600
your system, they will tell you
if your system is reliable or

442
00:23:36,600 --> 00:23:39,200
not. And the last one:
nothing is 100%.

443
00:23:39,300 --> 00:23:42,900
Don't ever try to achieve that 
because people are okay with 

444
00:23:42,900 --> 00:23:45,500
failures as long as it's not 
failing frequently. 

445
00:23:45,800 --> 00:23:49,900
Let's go back to the more
deep-dive definition of SLIs,

446
00:23:49,900 --> 00:23:52,300
SLOs and all that.
So you have this concept of

447
00:23:52,300 --> 00:23:54,700
the reliability stack. Maybe let's
start from there.

448
00:23:54,700 --> 00:23:57,600
What is in the stack?
How do you define them and how

449
00:23:57,600 --> 00:23:59,900
do they work with each other?
Sure. 

450
00:23:59,900 --> 00:24:03,000
So you have three primary 
components of what I call the 

451
00:24:03,000 --> 00:24:05,700
reliability stack. 
This is what people really often

452
00:24:05,700 --> 00:24:08,000
refer to
when they're saying SLOs.

453
00:24:08,300 --> 00:24:11,500
When they use the term
SLOs, they really mean a few

454
00:24:11,500 --> 00:24:14,600
different things.
First is SLIs, or service

455
00:24:14,600 --> 00:24:18,400
level indicators. Service level
indicators are measurements.

456
00:24:18,600 --> 00:24:21,800
They are bits of telemetry
about your system that tell you:

457
00:24:21,800 --> 00:24:23,600
is it doing what it's supposed
to be doing?

458
00:24:24,000 --> 00:24:27,200
And again, this should be from
your users' perspective, as close

459
00:24:27,200 --> 00:24:30,600
as you can get at least. Then
next you have SLOs, or service

460
00:24:30,600 --> 00:24:32,200
level objectives, and service
level

461
00:24:32,200 --> 00:24:35,900
objectives are just targets for
how often you want your SLI to

462
00:24:35,900 --> 00:24:38,900
be true.
So if your SLI is able to tell

463
00:24:38,900 --> 00:24:42,000
you, yes, we're currently
meeting the expectations of our

464
00:24:42,000 --> 00:24:45,400
users, or is able to tell you,
oh, we're currently not meeting

465
00:24:45,400 --> 00:24:48,900
the expectations of our users,
you're now able to form a

466
00:24:48,900 --> 00:24:53,600
ratio: good events over total
events equals a ratio, equals

467
00:24:53,600 --> 00:24:56,500
some kind of percentage. 
Your SLO is just a target for 

468
00:24:56,500 --> 00:24:58,000
what you want that percentage to
be. 
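
As a rough illustration of the ratio described here, the following sketch computes an SLI as good events over total events and compares it with an SLO target. The function names, the event counts, and the 99% target are assumptions chosen for illustration; they do not come from the conversation.

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """Return the SLI as the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return good_events / total_events


def meets_slo(good_events: int, total_events: int, slo_target: float) -> bool:
    """Return True when the measured ratio is at or above the SLO target."""
    return sli_ratio(good_events, total_events) >= slo_target


if __name__ == "__main__":
    # Hypothetical numbers: 9,950 successful sock checkouts out of 10,000.
    ratio = sli_ratio(9_950, 10_000)
    print(f"SLI: {ratio:.2%}, meets 99% SLO: {meets_slo(9_950, 10_000, 0.99)}")
```

The SLO here is only the threshold passed as `slo_target`; the SLI is whatever measurement decides which events count as good.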

469
00:24:58,000 --> 00:25:00,200
So you can pick something more
reasonable. 

470
00:25:00,600 --> 00:25:02,500
So we just talked about a 
hundred percent is impossible. 

471
00:25:02,500 --> 00:25:06,000
No one ever hits that anyway. 
So you can say something like, 

472
00:25:06,200 --> 00:25:09,200
well, going back to the sock
buying scenario: we measure user

473
00:25:09,200 --> 00:25:10,200
all night. 
Why not? 

474
00:25:10,400 --> 00:25:14,000
If you can't check out when you 
try to buy socks, that's not 

475
00:25:14,000 --> 00:25:17,700
good, but it's okay
if that only happens one in 100

476
00:25:17,700 --> 00:25:21,000
times, because if you try to check
out and it doesn't work, what are

477
00:25:21,000 --> 00:25:23,100
you going to do?
You're probably going to click again,

478
00:25:23,400 --> 00:25:26,200
and as long as it works the next
time, fine.

479
00:25:26,500 --> 00:25:29,900
So maybe you only have to make
sure that sock checkout works

480
00:25:29,900 --> 00:25:32,500
99% of the time.
It lets you pick a target

481
00:25:32,500 --> 00:25:34,800
that's realistic, that
accommodates failure,

482
00:25:34,800 --> 00:25:37,400
that makes sure you're not
spending too much money and

483
00:25:37,400 --> 00:25:38,800
trying to aim for something you
can't.

484
00:25:39,300 --> 00:25:42,700
And then finally at the top of 
the reliability stack, you have error

485
00:25:42,700 --> 00:25:46,600
budgets. Error budgets
are just a way of thinking about

486
00:25:46,600 --> 00:25:50,000
how your SLO has performed over
a period of time. An error

487
00:25:50,000 --> 00:25:52,900
budget often takes into account
a time window

488
00:25:52,900 --> 00:25:56,200
that's often fairly large,
generally anywhere from a week

489
00:25:56,200 --> 00:25:59,800
to 28 days to 30 days, sometimes
even a quarter.

490
00:26:00,100 --> 00:26:03,100
The idea is your error budget
is the other side of the

491
00:26:03,100 --> 00:26:06,700
percentage.
So if you say sock checkout

492
00:26:06,700 --> 00:26:08,900
should work 99 percent of the
time,

493
00:26:09,200 --> 00:26:12,000
what you're also saying is 1% of
the time, sock

494
00:26:12,000 --> 00:26:15,700
checkout is allowed to fail.
Your error budget is that 1%.

495
00:26:15,800 --> 00:26:19,000
Your error budget
measures: are we failing more

496
00:26:19,000 --> 00:26:22,300
often, or the right amount?
And so your error budget

497
00:26:22,300 --> 00:26:26,100
lets you think about things over
periods of time. Over the last 30

498
00:26:26,100 --> 00:26:27,100
days,
what

499
00:26:27,100 --> 00:26:28,900
percentage of the time have we
failed?

500
00:26:29,300 --> 00:26:32,200
And is that helping us meet our
SLO targets?

501
00:26:32,200 --> 00:26:35,900
Are we exceeding it? Your error
budget, it's worded that way

502
00:26:35,900 --> 00:26:38,900
because a budget is something you
can spend, and an error budget is

503
00:26:38,900 --> 00:26:42,000
exactly that.
We're allowing ourselves this one

504
00:26:42,000 --> 00:26:44,900
percent, or whatever you've
defined for your own service, of

505
00:26:44,900 --> 00:26:47,700
course.
But in our continuing example,

506
00:26:47,800 --> 00:26:51,100
you have this one percent of
checkouts that are allowed to not

507
00:26:51,100 --> 00:26:52,800
work. 
What amount of that have you 

508
00:26:52,800 --> 00:26:55,500
spent over time? 
And that then helps you make 

509
00:26:55,500 --> 00:26:58,600
better decisions about where you
need to focus your attention. 
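
To make the idea of spending an error budget concrete, here is a minimal sketch, assuming an event-based budget over a single window. The function name, the dictionary keys, and the example counts are hypothetical; real tooling may define the budget by time rather than by events.

```python
def error_budget_report(good_events: int, total_events: int, slo_target: float) -> dict:
    """Summarize how much of an event-based error budget has been spent.

    The budget is the share of events allowed to fail: (1 - slo_target).
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    spent = actual_failures / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "actual_failures": actual_failures,
        "budget_spent": spent,          # 1.0 means the budget is used up
        "budget_remaining": 1 - spent,  # negative means the SLO was missed
    }


if __name__ == "__main__":
    # Hypothetical 30-day window: 100,000 checkouts, 60 of them failed.
    report = error_budget_report(99_940, 100_000, slo_target=0.99)
    print(f"Budget spent: {report['budget_spent']:.0%}")
```

A remaining budget close to 1 suggests room to take risks; a negative one suggests attention should shift to reliability work.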

510
00:26:59,300 --> 00:27:02,600
Thanks for mentioning about this
budget term, because it is

511
00:27:02,600 --> 00:27:05,100
consciously defined.
It's not something that someone

512
00:27:05,100 --> 00:27:07,900
just takes, but it's actually
something that can be spent.

513
00:27:07,900 --> 00:27:11,300
Errors are expected sometimes,
but they also can be spent.

514
00:27:11,500 --> 00:27:14,100
So you mentioned about defining
reliability.

515
00:27:14,300 --> 00:27:17,900
Many people get stuck even here,
just like, how reliable

516
00:27:17,900 --> 00:27:20,600
your service should be.
You said it's not 100%. Some

517
00:27:20,600 --> 00:27:23,200
business people say
it should be 100%. How can we

518
00:27:23,200 --> 00:27:26,100
actually define the reliability
of our service?

519
00:27:26,500 --> 00:27:29,000
How should we go about it? 
What's the approach? 

520
00:27:29,600 --> 00:27:33,400
Unfortunately, this is one of 
those times where I invoke my 

521
00:27:33,400 --> 00:27:37,100
very senior engineer who's been
doing this too long card, and I

522
00:27:37,100 --> 00:27:39,400
say: it depends.
It's unique.

523
00:27:39,600 --> 00:27:42,900
It depends on your service.
The important thing is to be

524
00:27:42,900 --> 00:27:46,700
meaningful and to be thoughtful.
The important thing is to think

525
00:27:46,700 --> 00:27:51,900
about what is my service, who are
my users. Real quick tangent:

526
00:27:51,900 --> 00:27:55,200
I say user a lot, but I don't 
just mean customers. 

527
00:27:55,500 --> 00:27:58,200
I mean anyone that relies on 
your service. It might be

528
00:27:58,200 --> 00:28:00,600
another team
down the hallway. It might be internal

529
00:28:00,600 --> 00:28:03,500
people at your company.
It might be another service.

530
00:28:03,700 --> 00:28:06,900
Your service might be six layers
removed from an actual human,

531
00:28:06,900 --> 00:28:09,600
but that other service is still a
user of your service.

532
00:28:09,900 --> 00:28:12,700
So that's why I say user a lot,
because it can mean any of those

533
00:28:12,700 --> 00:28:15,100
things.
There isn't a single answer. Be

534
00:28:15,100 --> 00:28:17,200
thoughtful about it, take your 
time. 

535
00:28:17,400 --> 00:28:20,600
Think: what does my service do?
What is it supposed to do?

536
00:28:20,700 --> 00:28:24,500
What do the users expect?
Can I go talk to those users? 

537
00:28:24,600 --> 00:28:26,700
If I can,
let's go talk to them.

538
00:28:26,800 --> 00:28:29,000
Let's ask them directly.
Is there a product

539
00:28:29,100 --> 00:28:31,900
management team?
Do they have a user journeys

540
00:28:31,900 --> 00:28:33,700
document?
Maybe we should go look at what

541
00:28:33,700 --> 00:28:37,000
the defined user journeys
are. Why was this software written

542
00:28:37,000 --> 00:28:40,000
in the first place? 
Why was it spun up at this 

543
00:28:40,000 --> 00:28:41,700
company? 
What does it do? 

544
00:28:41,900 --> 00:28:44,500
It's not the most satisfying 
answer because I don't have like

545
00:28:44,500 --> 00:28:47,200
an answer. 
I don't have a special formula

546
00:28:47,200 --> 00:28:50,700
people can use to instantly
understand what it is that they 

547
00:28:50,700 --> 00:28:53,600
need to be thinking about when 
they pick how reliable they need

548
00:28:53,600 --> 00:28:56,400
to be. 
But what I can say is in every 

549
00:28:56,400 --> 00:28:59,900
single case, if you're careful 
and you're thoughtful, that will

550
00:28:59,900 --> 00:29:03,500
lead you to the best answers. 
One thing that I really love 

551
00:29:03,500 --> 00:29:06,700
about the SRE concepts and
SLOs and all that: they actually

552
00:29:06,700 --> 00:29:09,400
always put users in the first
place.

553
00:29:09,600 --> 00:29:12,400
It's not some random technical
decision: okay,

554
00:29:12,400 --> 00:29:15,000
this is how the reliability
should be. It actually

555
00:29:15,000 --> 00:29:18,000
always comes from the users'
perspective. Like you mentioned in

556
00:29:18,000 --> 00:29:21,300
the beginning, 100% is not
possible. Sometimes, even

557
00:29:21,300 --> 00:29:23,700
things that rely on the Internet,
by default,

558
00:29:23,700 --> 00:29:26,600
they are not reliable, because
your packets could be lost or

559
00:29:26,600 --> 00:29:29,500
you need to retry. Now, that's
reliability. Another thing very

560
00:29:29,500 --> 00:29:32,600
important to understand, maybe
you can help to explain, is that

561
00:29:32,600 --> 00:29:36,300
as you go higher, if you aspire
to go higher, it becomes more

562
00:29:36,300 --> 00:29:39,600
difficult, complex, and also
expensive. Tell us more why

563
00:29:39,600 --> 00:29:41,700
this is the case.
Sure.

564
00:29:41,700 --> 00:29:46,900
I mean, if you want to ensure 
that you are being more 

565
00:29:46,900 --> 00:29:51,000
reliable, you need to ensure 
you're having fewer failures.

566
00:29:51,300 --> 00:29:54,000
Individual components fail more
often. 

567
00:29:54,200 --> 00:29:57,800
So, if you add more and more, 
you're just going to have more 

568
00:29:57,800 --> 00:30:00,100
and more failures, right? 
You have something that works

569
00:30:00,100 --> 00:30:03,000
99% of the time and you have
just that one thing.

570
00:30:03,300 --> 00:30:05,700
But if you need to be redundant,
if you want to make sure that,

571
00:30:05,700 --> 00:30:09,000
if this thing fails, thing A,
then you also now need to have

572
00:30:09,000 --> 00:30:12,600
thing B in place, because
now thing B needs to take over in

573
00:30:12,600 --> 00:30:16,400
case thing A fails. That's all
good and well, but thing B might

574
00:30:16,400 --> 00:30:18,100
also only be 99 percent
reliable.

575
00:30:18,400 --> 00:30:22,200
So now you need something to
determine whether or not thing A

576
00:30:22,200 --> 00:30:25,400
or thing B is currently being
reliable or not and where to

577
00:30:25,408 --> 00:30:28,300
send that traffic.
So now you have thing C. We

578
00:30:28,300 --> 00:30:29,200
can go on and on and
on.

579
00:30:29,200 --> 00:30:32,300
But, you know, you just end up 
having to build more and more 

580
00:30:32,300 --> 00:30:36,200
complex systems to ensure that 
you are, in fact, being more and

581
00:30:36,200 --> 00:30:38,700
more reliable over time.
This is sometimes totally

582
00:30:38,700 --> 00:30:41,400
reasonable. For some systems and
some services,

583
00:30:41,400 --> 00:30:43,300
you do need to hit very high
targets.

584
00:30:43,500 --> 00:30:45,700
This means you need to be 
distributed. 

585
00:30:45,700 --> 00:30:49,800
You need to ensure you're built
for high availability and

586
00:30:49,800 --> 00:30:53,100
quick failovers and
redundancy. 

587
00:30:53,300 --> 00:30:56,900
There's all sorts of engineering
for reliability Concepts that we

588
00:30:56,900 --> 00:31:00,000
could spend hours and hours 
talking about. But once you

589
00:31:00,008 --> 00:31:03,200
introduce so many things, you're
also spending a lot more money,

590
00:31:03,700 --> 00:31:07,100
because, no matter if this is
running on your own hardware, or

591
00:31:07,100 --> 00:31:09,700
your own data centers, or
everything is in the cloud,

592
00:31:09,700 --> 00:31:12,900
you're now running more things,
so it costs more money, both

593
00:31:12,900 --> 00:31:17,100
just in a per-month billing
situation, as well as the fact

594
00:31:17,100 --> 00:31:19,600
that now you need more
engineers to take care of it.

595
00:31:19,700 --> 00:31:23,100
If you have 100 components instead of
one, you need more engineers

596
00:31:23,100 --> 00:31:24,500
to take care of those other
components.

597
00:31:24,800 --> 00:31:27,700
So now we have to hire a whole
bunch of engineers, and now we're

598
00:31:27,700 --> 00:31:29,000
trying to hit a really high
reliability

599
00:31:29,100 --> 00:31:31,300
target.
Well, what a lot of people don't

600
00:31:31,300 --> 00:31:34,100
always do the math about is what
that really means in terms of 

601
00:31:34,100 --> 00:31:38,000
like time. If you want four
nines of reliability, that gives you

602
00:31:38,000 --> 00:31:41,800
only seconds per month to
respond. You can barely have any

603
00:31:41,800 --> 00:31:45,100
downtime at all, ever.
So now, what do the on-call 

604
00:31:45,100 --> 00:31:47,400
rotations look like? 
So, you now have like a hundred 

605
00:31:47,400 --> 00:31:50,200
different components of your 
service because you have to have

606
00:31:50,200 --> 00:31:53,400
incredible redundancy and all
sorts of extra availability 

607
00:31:53,400 --> 00:31:54,800
stuff. 
And now you have a whole bunch 

608
00:31:54,800 --> 00:31:57,300
of engineering teams to take care
of all those components. 
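
To put numbers on the four-nines target mentioned above, this sketch converts an availability target into allowed downtime per window. It assumes a purely time-based target and a 30-day window; the function name and the list of targets are illustrative choices, not taken from the conversation.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime, in seconds, that a time-based target permits."""
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be in (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return (1 - slo_target) * window_days * SECONDS_PER_DAY


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999, 0.99999):
        seconds = allowed_downtime_seconds(target)
        print(f"{target:.3%} over 30 days: {seconds:8.1f} s ({seconds / 60:.2f} min)")
```

Under these assumptions, 99.99% over 30 days leaves roughly 259 seconds, a little over four minutes, which is why detection and response have to be close to immediate.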

609
00:31:57,500 --> 00:32:00,100
Now, they have to be on call. 
But they also have to be on call

610
00:32:00,100 --> 00:32:03,700
and respond immediately.
That means no longer can those

611
00:32:03,700 --> 00:32:07,000
teams just be singly homed.
You need a follow-the-sun

612
00:32:07,000 --> 00:32:09,000
rotation.
So you and I are exactly 12

613
00:32:09,000 --> 00:32:10,700
hours apart.
So we could have a team in

614
00:32:10,700 --> 00:32:12,300
Singapore and a team in
New York. 

615
00:32:12,500 --> 00:32:14,400
Great. 
But now you need offices in 

616
00:32:14,400 --> 00:32:17,100
Singapore and New York.
Now, you don't just have X

617
00:32:17,100 --> 00:32:19,400
number of teams.
You have 2X number of teams,

618
00:32:19,600 --> 00:32:22,800
because every team now needs
teams in Singapore and teams in

619
00:32:22,800 --> 00:32:25,000
New York to ensure that 
someone's only being on call 

620
00:32:25,000 --> 00:32:28,500
during the day, because no one 
could always wake up at 3 a.m. 

621
00:32:28,500 --> 00:32:33,000
and respond in time to try and
defend a target as stringent as

622
00:32:33,000 --> 00:32:38,400
99.99%. This goes on and on
and on. The closer you try to

623
00:32:38,400 --> 00:32:42,600
get to 100%, the more and more that
grows. It's exponential. If we can

624
00:32:42,600 --> 00:32:46,100
all agree that just logically
you cannot ever hit 100%

625
00:32:46,100 --> 00:32:49,600
anyway, it turns into a limit.
The limit approaches infinity

626
00:32:49,600 --> 00:32:51,400
forever.
You'll never actually hit a hundred

627
00:32:51,400 --> 00:32:53,900
percent anyway, and trying to get
there

628
00:32:53,900 --> 00:32:56,800
will just cost you more and
more money in terms of how many

629
00:32:56,800 --> 00:33:00,000
services you need, how much
you're paying for your providers.

630
00:33:00,300 --> 00:33:02,100
God, maybe you can't even be on
one cloud.

631
00:33:02,100 --> 00:33:04,900
Maybe you've got to go multi-cloud,
because what if AWS goes down?

632
00:33:04,900 --> 00:33:06,900
We'd better be running in
GCP as well.

633
00:33:07,200 --> 00:33:09,900
Do you know how difficult it is
to write stuff that lives on top

634
00:33:09,900 --> 00:33:12,900
of both of those things and
ensures that it can route to GCP

635
00:33:12,900 --> 00:33:16,700
or AWS at the right moments in
time? How to even detect whether

636
00:33:16,700 --> 00:33:19,200
or not your AWS or GCP
instances are running

637
00:33:19,200 --> 00:33:22,500
correctly at that point in time? It
gets so complex to try to hit

638
00:33:22,500 --> 00:33:25,100
these high targets that you're
just gonna kill yourself trying

639
00:33:25,100 --> 00:33:27,500
to spend too much money. Turns
out, you don't have to hit those

640
00:33:27,500 --> 00:33:31,100
targets anyway.
Yeah, I was about to say,

641
00:33:31,100 --> 00:33:33,800
in the end, after all these
complexities and effort, your

642
00:33:33,800 --> 00:33:36,200
users actually don't need that
kind of reliability.

643
00:33:36,200 --> 00:33:39,000
So I think the best way is again
to check with your users.

644
00:33:39,000 --> 00:33:42,100
What are their expectations?
And yes, sure, some systems will

645
00:33:42,100 --> 00:33:44,900
require this kind of high 
availability. But yeah, hopefully

646
00:33:44,900 --> 00:33:47,200
people who listen to Alex's
explanation

647
00:33:47,200 --> 00:33:50,300
can actually understand
you need to really define your

648
00:33:50,300 --> 00:33:52,800
reliability.
It's not just some nines that

649
00:33:52,800 --> 00:33:55,500
you see: okay, four nines, so let's just
put it that way.

650
00:33:56,200 --> 00:34:01,500
Yeah, so often leadership just says
we're going to hit this target.

651
00:34:01,500 --> 00:34:04,500
They pick a whole bunch of nines
all in a row because they think 

652
00:34:04,500 --> 00:34:07,800
that's reasonable or because 
they know that some of Google's 

653
00:34:07,800 --> 00:34:09,500
services hit that.
You know what?

654
00:34:09,500 --> 00:34:12,400
You're probably not Google.
It's like Google has all those 

655
00:34:12,400 --> 00:34:14,500
things. 
Google does have teams all over 

656
00:34:14,500 --> 00:34:16,000
the world. 
Google does have to eat, it 

657
00:34:16,000 --> 00:34:18,500
starts off with for LT valerii, 
right? 

658
00:34:18,500 --> 00:34:21,900
And you probably don't. As you
said, just be reasonable, be

659
00:34:21,900 --> 00:34:23,800
thoughtful.
Think about the targets you're

660
00:34:23,800 --> 00:34:25,699
picking and make sure that they 
make sense for you. 

661
00:34:26,500 --> 00:34:29,400
So let's start with the 
fundamentals of the reliability

662
00:34:29,400 --> 00:34:32,000
stack, which is the SLI.
So your measure is like a metric

663
00:34:32,000 --> 00:34:35,100
that defines your users'
experience in terms of

664
00:34:35,100 --> 00:34:37,800
reliability.
Probably if we can relate to the

665
00:34:37,800 --> 00:34:41,300
SRE best practice, commonly it
advocates for different golden

666
00:34:41,300 --> 00:34:44,800
signals for monitoring, which probably
define the metrics of your

667
00:34:44,800 --> 00:34:47,000
system. 
Is this the right place to 

668
00:34:47,000 --> 00:34:49,300
start? 
How do you define your SLI? 

669
00:34:50,199 --> 00:34:53,100
I have mixed feelings about the 
golden signals. 

670
00:34:53,400 --> 00:34:56,100
I'm on record; this is not going to
blow anyone's mind

671
00:34:56,300 --> 00:35:00,200
who has known me for a while.
I kind of wish we hadn't in the 

672
00:35:00,200 --> 00:35:02,900
first SRE book written about 
them the way we did. 

673
00:35:03,200 --> 00:35:07,300
Because I feel that too many 
people think that, by our having

674
00:35:07,300 --> 00:35:10,000
labeled them the golden signals,
they're the only things

675
00:35:10,000 --> 00:35:12,700
that matter. 
And I see a lot of people both 

676
00:35:12,700 --> 00:35:16,300
start and stop there. 
And there's nothing wrong with 

677
00:35:16,300 --> 00:35:18,500
measuring availability. 
There's nothing wrong with 

678
00:35:18,500 --> 00:35:20,600
measuring latency. 
There's nothing wrong with 

679
00:35:20,600 --> 00:35:23,300
measuring throughput. 
That's not my issue. 

680
00:35:23,500 --> 00:35:27,500
My issue is that people often 
start and stop there. They are,

681
00:35:27,500 --> 00:35:30,200
in fact, good starting points.
Absolutely

682
00:35:30,200 --> 00:35:32,000
they are.
So, if you're asking me, are

683
00:35:32,000 --> 00:35:33,000
they good
starting points?

684
00:35:33,100 --> 00:35:34,500
Yes. 
If that's where you need to 

685
00:35:34,508 --> 00:35:36,000
start. 
Start there. 

686
00:35:36,300 --> 00:35:38,200
Almost everyone can measure 
those things. 

687
00:35:38,400 --> 00:35:41,300
It's a great place to start, but
those things rarely tell the

688
00:35:41,300 --> 00:35:45,400
whole story. They rarely tell
you what your users are actually

689
00:35:45,400 --> 00:35:49,000
experiencing. As we said earlier,
reliability is what your users 

690
00:35:49,000 --> 00:35:51,000
need from you. 
And we can just tell a quick 

691
00:35:51,000 --> 00:35:53,700
little story, right? 
So, is your service available? 

692
00:35:53,800 --> 00:35:55,500
Yes, it's responding to 
requests. 

693
00:35:55,800 --> 00:35:57,500
Is its
latency low? Sure.

694
00:35:57,600 --> 00:36:00,800
It's responding to requests very
quickly. Is it experiencing many

695
00:36:00,800 --> 00:36:03,500
errors?
No. It's available, and it's

696
00:36:03,500 --> 00:36:04,800
responding to things very
quickly.

697
00:36:04,800 --> 00:36:08,700
And every response is HTTP 200.
Everything's great. But if

698
00:36:08,700 --> 00:36:12,100
you're sending them the wrong
data, if the data you're sending

699
00:36:12,100 --> 00:36:14,800
them is not what they're asking
for, you're not being reliable at

700
00:36:14,800 --> 00:36:17,300
all, and that's not covered
by the golden signals at all.
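
The gap described here can be sketched as two different definitions of a "good" event. In this illustration, the `Response` type, the `payload_correct` flag, and the 300 ms latency threshold are all hypothetical; how correctness is actually checked depends entirely on the service.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int            # HTTP status code returned to the caller
    latency_ms: float      # time taken to respond
    payload_correct: bool  # hypothetical result of some correctness check


def is_good_by_basic_signals(r: Response, max_latency_ms: float = 300.0) -> bool:
    """Good if the request succeeded and was fast enough."""
    return r.status == 200 and r.latency_ms <= max_latency_ms


def is_good_for_user(r: Response, max_latency_ms: float = 300.0) -> bool:
    """Good only if it also returned the data the caller asked for."""
    return is_good_by_basic_signals(r, max_latency_ms) and r.payload_correct


if __name__ == "__main__":
    fast_but_wrong = Response(status=200, latency_ms=40.0, payload_correct=False)
    print(is_good_by_basic_signals(fast_but_wrong))  # counted as good
    print(is_good_for_user(fast_but_wrong))          # counted as bad
```

An SLI built on the second predicate counts a fast HTTP 200 with the wrong data as a failure, which is closer to what the user experiences.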

701
00:36:17,600 --> 00:36:22,300
So I think people can graduate. 
They can upgrade to a more user 

702
00:36:22,300 --> 00:36:24,900
journey-based focus, which is,
again:

703
00:36:24,900 --> 00:36:27,200
is this doing what
users need it to do? Which

704
00:36:27,200 --> 00:36:31,200
includes many levels beyond just
what the golden signals tell 

705
00:36:31,200 --> 00:36:33,000
you. 
So yeah, I have some mixed

706
00:36:33,000 --> 00:36:35,600
feelings.
I think they were accidentally

707
00:36:35,600 --> 00:36:39,900
taken too seriously or as too 
much of an end goal the way they

708
00:36:39,900 --> 00:36:42,300
were written about. 
But I do think they are good 

709
00:36:42,300 --> 00:36:45,400
starting points because almost 
everyone has that data already. 

710
00:36:45,600 --> 00:36:48,000
If you don't, it can be,
generally speaking,

711
00:36:48,000 --> 00:36:51,500
pretty easy to instrument.
It can be much more difficult to

712
00:36:51,500 --> 00:36:53,900
measure:
did we send the data that the

713
00:36:53,900 --> 00:36:56,100
client asked for? That
can be a lot more.

714
00:36:56,400 --> 00:36:59,400
It took a long time to get there
but yeah this is my kind of 

715
00:36:59,400 --> 00:37:02,100
feelings. 
They're a really good SLI is 

716
00:37:02,100 --> 00:37:04,500
ever changing its ever adopting 
because the world is going to 

717
00:37:04,500 --> 00:37:07,800
change, your user expectations
are going to change, what your

718
00:37:07,800 --> 00:37:11,200
service is defined as even doing
is going to change over time. 

719
00:37:11,500 --> 00:37:13,800
So make sure that you're looking
at your SLIs.

720
00:37:13,800 --> 00:37:16,200
You're looking at those
measurements, you're saying: is

721
00:37:16,200 --> 00:37:19,000
what we're measuring
still telling us enough? Is

722
00:37:19,000 --> 00:37:21,300
it
still telling us what our users are

723
00:37:21,300 --> 00:37:23,400
experiencing?
Is it still looking at things

724
00:37:23,400 --> 00:37:26,400
from their perspective?
So start with errors, start with

725
00:37:26,400 --> 00:37:29,100
latency start with availability,
but move from there. 

726
00:37:29,200 --> 00:37:31,100
Think about what else your users
actually need. 

727
00:37:31,700 --> 00:37:34,100
You mentioned about checking 
like the correctness of data, 

728
00:37:34,100 --> 00:37:35,600
right? 
So this is sometimes where I am 

729
00:37:35,600 --> 00:37:38,200
confused as well. 
Personally, if you are always 

730
00:37:38,200 --> 00:37:41,500
checking whether you send the
correct data, then it gets like an

731
00:37:41,500 --> 00:37:43,900
infinite loop. How do you check
in the first place

732
00:37:43,900 --> 00:37:47,000
that the data is correct?
What about the testing phase of 

733
00:37:47,000 --> 00:37:49,700
your product? 
How do you actually cover this 

734
00:37:49,700 --> 00:37:52,500
correctness thing? 
Because do we really need to 

735
00:37:52,500 --> 00:37:54,700
measure the correctness of all 
requests? 

736
00:37:54,800 --> 00:37:58,700
Or is it only partially, or maybe
only some service types that 

737
00:37:58,700 --> 00:38:01,100
need to have this correctness 
attribute measured. 

738
00:38:01,500 --> 00:38:03,200
So maybe, tell us a little bit 
more about that. 

739
00:38:03,700 --> 00:38:05,700
I think it's all the things that
you just mentioned.

740
00:38:05,900 --> 00:38:09,500
It depends, again, on your
service. A favorite of mine is

741
00:38:09,500 --> 00:38:13,500
using synthetic checks. A good
example is: I was once

742
00:38:13,500 --> 00:38:17,400
responsible for a very 
large-scale log system, hundreds

743
00:38:17,400 --> 00:38:20,900
of thousands of incoming logs 
every minute, or maybe even per

744
00:38:20,900 --> 00:38:24,400
second. Very, very high volume.
We wanted to make sure that we 

745
00:38:24,400 --> 00:38:28,300
were doing things correctly. 
What we realized is that if we

746
00:38:28,300 --> 00:38:33,100
could craft a special log with a
special tag on it and we insert 

747
00:38:33,100 --> 00:38:36,700
it, and then we waited and found out
how long it took for it to go

748
00:38:36,700 --> 00:38:40,400
through the entire pipeline, to
get indexed, and for us to be

749
00:38:40,408 --> 00:38:43,500
able to retrieve it on the other
side and verify that the signature

750
00:38:43,500 --> 00:38:48,700
we put in was there, that gave
us availability, latency, ensuring

751
00:38:48,700 --> 00:38:50,800
there
weren't errors, and data

752
00:38:50,800 --> 00:38:52,300
correctness and even data 
freshness. 

753
00:38:52,300 --> 00:38:54,900
But the thing is, it told a
pretty complete story. 

754
00:38:54,900 --> 00:38:58,600
This is what the users of this
log service actually needed to 

755
00:38:58,600 --> 00:39:01,200
happen. 
They needed to insert logs and 

756
00:39:01,200 --> 00:39:03,800
then they needed those logs to be
indexed, and then they needed to

757
00:39:03,800 --> 00:39:06,400
be able to retrieve those logs. 
So we just wrote a job that 

758
00:39:06,400 --> 00:39:09,500
would just continually do this 
and, luckily, this only

759
00:39:09,500 --> 00:39:12,500
needed to happen
a few times per minute. 

760
00:39:12,600 --> 00:39:15,100
Even at a few times per minute 
that gave us more than enough 

761
00:39:15,100 --> 00:39:17,100
data to set an SLO on top
of. 

762
00:39:17,400 --> 00:39:20,500
So we're no longer dealing with 
the 100,000 requests per second 

763
00:39:20,500 --> 00:39:22,100
that the service actually dealt
with. 

764
00:39:22,300 --> 00:39:26,000
It was only a few a minute now, 
but we were reasonably sure. 

765
00:39:26,100 --> 00:39:30,300
Or at least close enough to 
being sure that if any step 

766
00:39:30,300 --> 00:39:33,700
along the way broke for most 
people, we would find out 

767
00:39:33,800 --> 00:39:37,300
because our specially crafted log,
which was specially crafted but

768
00:39:37,300 --> 00:39:40,100
otherwise went through the same
workflow, the same pipelines

769
00:39:40,100 --> 00:39:43,400
as everything else, that gives a
pretty good signal. And that one

770
00:39:43,400 --> 00:39:46,300
measurement told us four or five
different things because it 

771
00:39:46,300 --> 00:39:50,400
covered five different services:
the log-producing services, the

772
00:39:50,400 --> 00:39:54,800
Kafka instance that sat in between, the
Logstash service that pulled

773
00:39:54,900 --> 00:39:58,500
off the Kafka topic,
the Elasticsearch cluster that

774
00:39:58,500 --> 00:40:01,600
Logstash inserted the logs
into, which were then indexed by

775
00:40:01,600 --> 00:40:04,100
Elasticsearch.
And then the check talked to

776
00:40:04,100 --> 00:40:06,800
Kibana, which sat in front of
Elasticsearch, to actually

777
00:40:06,800 --> 00:40:09,700
retrieve the log. 
So all those services and the 

778
00:40:09,700 --> 00:40:12,600
availability of all of them and 
the latency of all of them and 

779
00:40:12,600 --> 00:40:15,700
the error rates of all of them were
all being covered by this one

780
00:40:15,700 --> 00:40:18,600
synthetic that just ran over and
over and over again. 

781
00:40:19,100 --> 00:40:22,000
So that's one thing. 
You don't always need as much 

782
00:40:22,000 --> 00:40:24,600
resolution as you think you do. 
Because even if we only had a 

783
00:40:24,600 --> 00:40:27,400
few of these per minute, as
opposed to hundreds of thousands

784
00:40:27,400 --> 00:40:30,500
per second, we were covering
so many components of the

785
00:40:30,508 --> 00:40:33,900
system.
It was just as good data. But

786
00:40:33,900 --> 00:40:38,200
you can also set SLOs off of
actual 100% user data. 

787
00:40:38,500 --> 00:40:42,400
Now, this requires, again,
tracing solutions. This requires

788
00:40:42,400 --> 00:40:46,700
a lot of work. We do not have
time really to get into all the 

789
00:40:46,700 --> 00:40:50,500
technology involved for you to 
be able to trace from your human

790
00:40:50,500 --> 00:40:55,000
end-user clients or browser 
across the entire internet 

791
00:40:55,200 --> 00:40:59,200
through your provider and CDN
and load balancers, into your

792
00:40:59,200 --> 00:41:02,800
app, all the way back to the databases
and whatever resources they need

793
00:41:02,800 --> 00:41:05,700
to talk to, and all the way back up to, like,
render time at the client or 

794
00:41:05,707 --> 00:41:07,700
whatever. 
But the fact is I've seen that 

795
00:41:07,700 --> 00:41:10,800
and it is absolutely possible 
and that is a lot of work but 

796
00:41:10,800 --> 00:41:15,400
you could, in fact, also... your SLI
could be set off of actual human

797
00:41:15,400 --> 00:41:18,800
traffic. Your actual user
requests could, in fact, be used

798
00:41:18,800 --> 00:41:21,000
for that as well or anything in 
between. 

799
00:41:21,400 --> 00:41:24,000
If there's anything, I want to 
drive home, it's be thoughtful, 

800
00:41:24,000 --> 00:41:26,000
be meaningful. 
Do what works for you. 

801
00:41:26,100 --> 00:41:30,100
Don't overspend resources
trying to do the real user

802
00:41:30,200 --> 00:41:32,300
journey tracing if that's not
reasonable.

803
00:41:32,300 --> 00:41:35,100
You very well may not have the
resources or the knowledge or

804
00:41:35,100 --> 00:41:39,400
the time to do that, so maybe 
set up a synthetic instead or 

805
00:41:39,800 --> 00:41:41,500
you know what? 
Let's take another step back, 

806
00:41:41,500 --> 00:41:45,200
maybe for you right now, just 
latency and error rate. 

807
00:41:45,400 --> 00:41:49,100
Just the golden signals, maybe
that's just fine. 

808
00:41:49,900 --> 00:41:51,900
Thank you for clarifying
that. Again, it's very

809
00:41:51,900 --> 00:41:54,300
insightful for me.
Thanks for explaining that. 

810
00:41:54,500 --> 00:41:57,000
So you mentioned a couple of 
times about user journey.

811
00:41:57,300 --> 00:41:59,900
Maybe some people are also confused
about this term. 

812
00:42:00,100 --> 00:42:01,600
So what do you mean by user 
journey? 

813
00:42:01,600 --> 00:42:04,000
Is it like the entire experience
I want to buy socks? 

814
00:42:04,200 --> 00:42:07,300
Or is it like when you check out
or is it when you load a page? 

815
00:42:07,600 --> 00:42:10,200
There are so many different ways
to define this. Maybe if you can

816
00:42:10,200 --> 00:42:12,100
help
also to define what is a user

817
00:42:12,100 --> 00:42:15,000
journey, or some people call it
critical user journey. Sure.

818
00:42:15,000 --> 00:42:18,300
So a critical user journey is 
generally something that's 

819
00:42:18,300 --> 00:42:23,100
defined by the product aspect of
your organization telling you 

820
00:42:23,100 --> 00:42:25,900
what needs to happen with your 
product

821
00:42:26,100 --> 00:42:28,100
for you to be a successful
company. 

822
00:42:28,400 --> 00:42:31,500
Generally, in this case, meaning
making money. That's the most

823
00:42:31,500 --> 00:42:36,300
general product-management,
product-owner definition of what

824
00:42:36,300 --> 00:42:40,000
a user journey is, right? 
It is the expected way that a 

825
00:42:40,000 --> 00:42:45,000
feature will work. And a feature
very rarely connects directly to

826
00:42:45,000 --> 00:42:48,600
a single service, in terms of a
microservice, in terms of a

827
00:42:48,600 --> 00:42:51,100
single team owning it. 
So user journeys

828
00:42:51,100 --> 00:42:54,000
generally span multiple
different components of your

829
00:42:54,000 --> 00:42:57,700
system. That generally means
that there are many different 

830
00:42:57,700 --> 00:43:01,500
teams, not just one, responsible
for all the components that a user

831
00:43:01,500 --> 00:43:05,200
journey travels over. That's kind
of a high-level, product-manager

832
00:43:05,200 --> 00:43:09,100
explanation of the user journey.
But I think they're just a good 

833
00:43:09,100 --> 00:43:12,900
analog for what a good SLI is. A
good service level indicator

834
00:43:12,900 --> 00:43:16,500
basically is a user journey,
and a user journey basically is a

835
00:43:16,500 --> 00:43:20,300
KPI, or key performance
indicator, which is even one

836
00:43:20,300 --> 00:43:22,800
more step up, right? 
This is now the business side. 

837
00:43:23,000 --> 00:43:25,900
What does the business say? 
Your business operations team,

838
00:43:26,100 --> 00:43:28,200
not your computer
operations team: what do they

839
00:43:28,200 --> 00:43:29,300
say
we need to be measuring?

840
00:43:29,300 --> 00:43:31,600
What do they say
is important for our revenue, for

841
00:43:31,600 --> 00:43:34,000
our bottom line?
What does the Chief Revenue

842
00:43:34,000 --> 00:43:36,800
Officer or the CFO,
what do they care about?

843
00:43:36,900 --> 00:43:39,200
But they're all similar in the 
sense that they're all 

844
00:43:39,300 --> 00:43:43,000
measurements, likely having to 
do with your customers or users.

845
00:43:43,200 --> 00:43:45,200
And none of them are ever going 
to be 100%. 

846
00:43:45,200 --> 00:43:46,600
So you can set targets on all
of them. 

847
00:43:46,900 --> 00:43:50,500
I kind of like to say that an 
SLI is a user journey is a KPI.

848
00:43:50,500 --> 00:43:53,300
They're all kind of the same 
thing; just different business

849
00:43:53,300 --> 00:43:54,800
units
have slightly different ways of

850
00:43:54,808 --> 00:43:57,900
talking about them. I think the
most important thing also, again,

851
00:43:57,900 --> 00:44:00,800
is it's coming from the product, right,
not randomly from some

852
00:44:00,800 --> 00:44:03,000
engineers saying: okay,
this is our critical user

853
00:44:03,000 --> 00:44:04,800
journey.
You mentioned that a user 

854
00:44:04,800 --> 00:44:07,900
journey could typically involve
a number of services. It could

855
00:44:07,900 --> 00:44:10,800
be components, databases, load
balancers, and all that.

856
00:44:10,900 --> 00:44:14,300
It could also be multiple
microservices that span across

857
00:44:14,300 --> 00:44:17,100
multiple teams.
One part of the confusion here:

858
00:44:17,100 --> 00:44:20,000
how do you define the SLIs?
Is it per team? 

859
00:44:20,000 --> 00:44:22,800
Is it for the whole user journey
itself? 

860
00:44:22,900 --> 00:44:25,500
How do you advise people to 
think about this? 

861
00:44:26,300 --> 00:44:31,100
I think you need both. 
I think individual teams need SLIs

862
00:44:31,100 --> 00:44:35,700
set on their own services.
They need SLOs set on their own

863
00:44:35,700 --> 00:44:38,600
services. They need to
understand how their services 

864
00:44:38,600 --> 00:44:41,700
are operating for the things 
that depend on them, again,

865
00:44:41,700 --> 00:44:45,500
even if those things are only
other services as well. But then

866
00:44:45,500 --> 00:44:49,500
perhaps the director of your 
organization needs to be setting

867
00:44:49,500 --> 00:44:51,700
SLIs.
That's not to say the director has

868
00:44:51,700 --> 00:44:55,400
to be necessarily implementing
them, but needs to own SLIs and

869
00:44:55,400 --> 00:44:58,100
SLOs
for the kind of user journey

870
00:44:58,100 --> 00:45:01,700
stories that go across many
different services. And perhaps

871
00:45:01,700 --> 00:45:05,900
the VP of engineering needs to 
be owning the concept of, like,

872
00:45:05,900 --> 00:45:09,200
can you check out.
So, the checkout microservice 

873
00:45:09,200 --> 00:45:13,000
team has their own SLO that 
measures how often their

874
00:45:13,000 --> 00:45:14,700
microservice is not throwing an
error. 

875
00:45:15,000 --> 00:45:19,300
The commerce department, the
director of that department, owns

876
00:45:19,300 --> 00:45:23,700
an SLO that says: does the
checkout workflow work? The VP

877
00:45:23,700 --> 00:45:27,100
of engineering owns the SLO
that says something like: are

878
00:45:27,100 --> 00:45:29,900
users able to use your website
and send us money?

879
00:45:30,100 --> 00:45:32,600
Something along those lines. 
But, yeah, I think these things 

880
00:45:32,600 --> 00:45:36,100
build on top of each other. 
I think every step of that way 

881
00:45:36,100 --> 00:45:39,600
needs its own SLO, and you can
use those individual SLOs to

882
00:45:39,600 --> 00:45:43,300
inform an overarching
SLO. The SLI for an

883
00:45:43,300 --> 00:45:47,400
SLO could be the SLO status of
different SLOs. You know, like,

884
00:45:47,400 --> 00:45:49,600
again, there are no super hard-and-fast
rules here.

885
00:45:49,900 --> 00:45:51,700
It's just are you measuring 
things? 

886
00:45:51,700 --> 00:45:54,700
Are you taking your users into
account, and are you trying to make

887
00:45:54,700 --> 00:45:55,700
sure you're trying to be
honest?

888
00:45:56,700 --> 00:45:58,300
You mentioned about different 
departments

889
00:45:58,300 --> 00:46:02,400
owning SLOs, right? Now let's move
to the SLO part. So you have defined

890
00:46:02,400 --> 00:46:05,500
your SLIs, so you have maybe
latency, error rate, availability,

891
00:46:05,500 --> 00:46:07,700
and all that, and then you set
your SLOs.

892
00:46:08,000 --> 00:46:10,800
First of all, how many SLOs should
a service have?

893
00:46:10,800 --> 00:46:14,200
Is it in the hundreds, or how many is
good enough? 

894
00:46:14,300 --> 00:46:16,000
So I think there's some art 
here. 

895
00:46:16,300 --> 00:46:18,800
I think it's mentioned in the 
SRE book as well, in your book as

896
00:46:18,800 --> 00:46:20,100
well.
What's the art?

897
00:46:20,100 --> 00:46:24,300
How to define the number of SLOs
for your service? I'm going to be so boring

898
00:46:24,300 --> 00:46:26,400
to the listeners, because
it's just: it depends.

899
00:46:26,600 --> 00:46:28,600
Again, right? It really is,
though.

900
00:46:28,600 --> 00:46:30,100
It solely depends on your
service. 

901
00:46:30,400 --> 00:46:34,300
Make the right decisions, make 
sure it is enough that you are 

902
00:46:34,300 --> 00:46:36,600
covering all important aspects 
of your service. 

903
00:46:36,900 --> 00:46:40,000
So don't choose too few.
Because if it's too few, you 

904
00:46:40,000 --> 00:46:41,100
can't understand what's going 
on. 

905
00:46:41,400 --> 00:46:44,600
And also make sure it's not too 
many because if you have too 

906
00:46:44,600 --> 00:46:47,300
many, you run into the multiple
comparisons problem, or you have

907
00:46:47,300 --> 00:46:49,800
too much data and you don't know
what to look at and you don't 

908
00:46:49,800 --> 00:46:53,100
know what's telling you what and
you no longer have a good idea 

909
00:46:53,100 --> 00:46:54,900
of what these signals are 
anymore. 

910
00:46:55,200 --> 00:46:58,500
I think that the Google SRE
book said five or six per service.

911
00:46:58,600 --> 00:47:00,900
That seems reasonable to me. It
really does.

912
00:47:01,000 --> 00:47:02,800
Yeah. 
Sure why not. 

913
00:47:03,000 --> 00:47:06,000
But I also wouldn't... Again,
like so many things,

914
00:47:06,000 --> 00:47:07,800
I don't like hard-and-fast
rules. 

915
00:47:08,100 --> 00:47:11,000
I like approaches. 
I like philosophies. 

916
00:47:11,400 --> 00:47:15,100
So don't feel bad if you don't 
have five. Don't feel bad

917
00:47:15,100 --> 00:47:17,500
if you have many more than five.
Just make sure it's the right

918
00:47:17,500 --> 00:47:20,400
amount for you. 
There's also another thing that 

919
00:47:20,400 --> 00:47:23,700
is covered in the book. So when
you set an SLO target, make sure

920
00:47:23,700 --> 00:47:26,300
you are not setting it too much
beyond what users

921
00:47:26,500 --> 00:47:30,100
expect, because it could cause a
little bit of complications. 

922
00:47:30,400 --> 00:47:33,400
Maybe you can explain, first of 
all, how do we know that we are 

923
00:47:33,400 --> 00:47:36,200
exceeding users' expectations?
Maybe we ask the users, like

924
00:47:36,200 --> 00:47:39,400
what you mentioned. And secondly,
then what should we do? 

925
00:47:39,500 --> 00:47:42,400
Do we lower the definition, the 
target itself? 

926
00:47:42,600 --> 00:47:45,100
Maybe if you can explain a
bit the problem of being too

927
00:47:45,100 --> 00:47:47,800
reliable here. 
Yeah, so the main problem you 

928
00:47:47,800 --> 00:47:50,900
run into by being too reliable is
that people will end up 

929
00:47:50,900 --> 00:47:53,300
expecting you to continue to run
that way. 

930
00:47:53,600 --> 00:47:57,200
So if your users were 
previously okay with you being

931
00:47:57,200 --> 00:48:00,700
only reliable 99% of the time, 
but then you proceed to spend 

932
00:48:00,700 --> 00:48:05,000
like a year being 99.9%
reliable, their expectations

933
00:48:05,000 --> 00:48:07,700
may have now changed, and now
they're going to hold you to 

934
00:48:07,700 --> 00:48:10,800
that 99.9%. 
Even though they were totally 

935
00:48:10,800 --> 00:48:14,500
happy with only 99% before. It's
a lot of nines, but I think

936
00:48:14,500 --> 00:48:16,400
everyone hopefully was able to 
follow what I was saying there. 

937
00:48:16,700 --> 00:48:18,700
So you paint yourself into a 
corner. 

938
00:48:19,100 --> 00:48:23,000
If you are too reliable too
often because user expectations 

939
00:48:23,000 --> 00:48:26,400
will change, what you want to do
is you want to make sure that 

940
00:48:26,500 --> 00:48:30,600
you aren't being too reliable.
You may accidentally do this;

941
00:48:30,900 --> 00:48:33,500
you may just accidentally have a
few months where everything is 

942
00:48:33,500 --> 00:48:38,200
like super lucky, or maybe your 
whole team went on vacation.

943
00:48:38,200 --> 00:48:42,400
They just didn't touch anything.
So nothing broke for a while. 

944
00:48:42,500 --> 00:48:45,400
We can come up with a bunch of
different examples, but you 

945
00:48:45,400 --> 00:48:47,700
know, then you run into the 
problem of people are now going 

946
00:48:47,700 --> 00:48:51,400
to expect this moving forward. 
My favorite example of this is the

947
00:48:51,400 --> 00:48:54,200
Chubby team at Google.
This is written about in the

948
00:48:54,200 --> 00:48:56,300
first Google SRE book.
You can read the whole story

949
00:48:56,400 --> 00:49:00,300
there. The quick version is:
Chubby is a global lock service

950
00:49:00,300 --> 00:49:03,800
so it holds tiny bits of data 
that are useful for various 

951
00:49:03,800 --> 00:49:07,500
highly distributed services, to 
be able to read and understand 

952
00:49:07,500 --> 00:49:09,200
at certain points in their 
operation. 

953
00:49:09,600 --> 00:49:11,900
Chubby just generally ran
pretty well. 

954
00:49:12,200 --> 00:49:15,800
When I tell the story and I say 
that, my old Chubby SRE friends

955
00:49:15,800 --> 00:49:18,300
often give me a dirty look,
because apparently their

956
00:49:18,300 --> 00:49:22,500
lives were not always that easy. But
from a user perspective, Global

957
00:49:22,500 --> 00:49:25,400
Chubby, which was the global
version of this that was globally

958
00:49:25,400 --> 00:49:28,300
available,
ran very well. I believe they

959
00:49:28,300 --> 00:49:32,900
had four nines, so a 99.99% target, every
quarter. 

960
00:49:32,900 --> 00:49:36,000
Generally speaking, Chubby still
had that because it ran pretty 

961
00:49:36,000 --> 00:49:37,400
well. 
So what they would do at the

962
00:49:37,400 --> 00:49:39,600
end of every quarter is they
would just shut Chubby off.

963
00:49:40,100 --> 00:49:42,900
They would just burn whatever 
budget they had remaining. 

964
00:49:43,200 --> 00:49:46,000
This would be communicated to teams;
emails would be sent

965
00:49:46,000 --> 00:49:48,600
out: this Thursday afternoon at
3:00 p.m., we're going to be

966
00:49:48,600 --> 00:49:51,500
shutting Chubby off for exactly
2 minutes and 17 seconds because

967
00:49:51,500 --> 00:49:53,000
that's how much error budget we 
have left. 

968
00:49:53,100 --> 00:49:56,300
Even though people were told 
about this, other services would

969
00:49:56,400 --> 00:49:59,500
always crash, because someone
would have a dependency on

970
00:49:59,500 --> 00:50:02,600
Chubby that they didn't know
about. Chubby, being good

971
00:50:02,600 --> 00:50:04,900
citizens,
being run by an excellent SRE

972
00:50:04,900 --> 00:50:08,600
team, would say: we're going to
make sure you find out, because

973
00:50:08,600 --> 00:50:10,500
we're only promising you four
nines.

974
00:50:10,800 --> 00:50:13,400
If you're expecting anything 
more than four nines, you're 

975
00:50:13,400 --> 00:50:15,300
going to have trouble. 
So we're going to ensure you 

976
00:50:15,300 --> 00:50:17,400
find out if you're gonna have 
trouble because we're giving you

977
00:50:17,400 --> 00:50:21,200
exactly four nines per quarter.
That's kind of a humorous story,

978
00:50:21,200 --> 00:50:23,900
and it's kind of out there and 
there's not a lot of teams that 

979
00:50:23,900 --> 00:50:25,500
will ever get to that point.
There's not a lot of

980
00:50:25,500 --> 00:50:28,500
organizations that will get to that
point, but it's a really good 

981
00:50:28,500 --> 00:50:32,100
example of ensuring that your 
users aren't getting too

982
00:50:32,100 --> 00:50:35,500
used to you being too reliable,
because once you paint yourself 

983
00:50:35,500 --> 00:50:39,700
into that corner, now you might 
be stuck with being held to a 

984
00:50:39,707 --> 00:50:42,300
level of reliability
that's otherwise too expensive

985
00:50:42,300 --> 00:50:45,700
and too difficult. Another
probable assumption that people

986
00:50:45,700 --> 00:50:48,800
think of with a service being
too reliable is like Google

987
00:50:48,800 --> 00:50:50,900
search.
When the internet is down,

988
00:50:50,900 --> 00:50:53,400
the first thing that they will
test is: is Google down as well?

989
00:50:53,600 --> 00:50:56,900
So I think Google is deemed
that reliable. And so, I

990
00:50:56,900 --> 00:51:00,000
think people expect that Google
is like a benchmark for the

991
00:51:00,000 --> 00:51:02,200
internet, even the reliability
of the internet.

992
00:51:02,500 --> 00:51:04,000
So you mentioned about error 
budget. 

993
00:51:04,200 --> 00:51:06,100
This chubby story is really 
interesting. 

994
00:51:06,300 --> 00:51:09,300
How they use error budget to 
actually make sure that the 

995
00:51:09,300 --> 00:51:12,600
service doesn't perform too
reliably. In the error budget

996
00:51:12,600 --> 00:51:14,800
section of the book,
one thing I find really

997
00:51:14,800 --> 00:51:17,900
interesting: you mentioned an
error budget is actually not a 

998
00:51:17,900 --> 00:51:21,100
technical term, right? 
It's a communications framework,

999
00:51:21,200 --> 00:51:23,400
and this is probably between
engineers and business and

1000
00:51:23,400 --> 00:51:26,000
maybe other departments.
So tell us more about this

1001
00:51:26,000 --> 00:51:28,000
communication
aspect of error budget.

1002
00:51:28,800 --> 00:51:30,600
Yeah. 
So really, what error budgets let

1003
00:51:30,600 --> 00:51:33,400
you do
is they let you tell others:

1004
00:51:33,500 --> 00:51:35,200
here's how reliable we have 
been. 

1005
00:51:35,300 --> 00:51:36,800
So your SLO target tells
others:

1006
00:51:36,800 --> 00:51:39,900
here's how reliable we want to be.
And your error budget

1007
00:51:39,900 --> 00:51:41,900
lets you tell other people:
here's how reliable we've

1008
00:51:41,900 --> 00:51:44,000
actually been. 
And the reason that's such a 

1009
00:51:44,000 --> 00:51:46,800
good communication tool is, 
because it helps other people 

1010
00:51:46,800 --> 00:51:48,700
figure out what their own SLO 
should be. 

1011
00:51:48,900 --> 00:51:51,900
And it also helps other people 
just basically understand what 

1012
00:51:51,900 --> 00:51:55,900
all this actually boils down to,
because we've said the number nine

1013
00:51:55,900 --> 00:51:58,900
and nines like probably a
hundred times already. But what

1014
00:51:58,900 --> 00:52:03,300
does 99.9% even mean? 
Well when you translate it to 

1015
00:52:03,300 --> 00:52:06,700
meaning 40 minutes per month, 
well that's something that 

1016
00:52:06,700 --> 00:52:09,500
humans understand. 
So when you're able to go to 

1017
00:52:09,500 --> 00:52:13,200
someone and say, all right, we 
were unreliable for 17 minutes 

1018
00:52:13,200 --> 00:52:17,100
last quarter but that means we 
still had a budget left of 24 

1019
00:52:17,100 --> 00:52:19,700
minutes. 
Although unreliability periods 

1020
00:52:19,700 --> 00:52:21,800
are not always all in a row, 
they're not always about 

1021
00:52:21,800 --> 00:52:23,600
downtime, not always about
outages.

1022
00:52:24,100 --> 00:52:28,400
But the point is, it gives you a
more human-friendly way of

1023
00:52:28,400 --> 00:52:30,500
communicating
these kinds of historical

1024
00:52:30,500 --> 00:52:33,200
reports.
What did Q1 look like?

1025
00:52:33,500 --> 00:52:38,100
What did 2021 look like from our
user perspective, from this 

1026
00:52:38,100 --> 00:52:42,200
hypothetical all-seeing user who
never stopped watching us? 

1027
00:52:42,500 --> 00:52:45,900
What did it look like in time,
for example? It's just an

1028
00:52:45,900 --> 00:52:49,500
eventual output from an SLO-
based approach that helps you

1029
00:52:49,508 --> 00:52:52,900
then go have conversations with
people, whether it's via the

1030
00:52:52,900 --> 00:52:56,100
time-based error budget
definitions, which again make

1031
00:52:56,100 --> 00:52:57,900
it
easier for some human heads to

1032
00:52:57,900 --> 00:53:00,500
wrap themselves around it, or
just by being able to say, like:

1033
00:53:00,500 --> 00:53:04,300
look, we exceeded our error 
budget every quarter last year. 

1034
00:53:04,600 --> 00:53:07,600
We believe we're aiming for the 
right target, but we can't. 

1035
00:53:07,900 --> 00:53:12,300
So we need more resources or we 
need more head count or it could

1036
00:53:12,300 --> 00:53:14,300
be the opposite. 
It could be like: hey, we've been

1037
00:53:14,300 --> 00:53:16,900
exceeding
our budget a ton; maybe we move

1038
00:53:16,900 --> 00:53:19,200
some of the staff over to this 
other project that's having some

1039
00:53:19,200 --> 00:53:22,100
problems or maybe we should be 
moving quicker. 

1040
00:53:22,400 --> 00:53:24,200
Maybe we should be shipping 
features more often. 

1041
00:53:24,300 --> 00:53:26,200
Because you know what? 
We're really awesome. 

1042
00:53:26,600 --> 00:53:29,900
We're almost 100% all the time. 
Let's spend that budget. 

1043
00:53:29,900 --> 00:53:32,000
Let's ship more features.
Let's experiment. 

1044
00:53:32,000 --> 00:53:34,200
Let's try things. 
Let's do chaos engineering.

1045
00:53:34,400 --> 00:53:37,000
There's so many different things
that error budgets let you do, but

1046
00:53:37,000 --> 00:53:39,800
they almost all involve 
communicating to other people 

1047
00:53:39,800 --> 00:53:42,000
saying:
hey, this data has told us this

1048
00:53:42,000 --> 00:53:44,700
thing over time. 
What can we do with that data? 

1049
00:53:45,600 --> 00:53:48,900
I really like all these concepts.
They kind of, like, build on top

1050
00:53:48,900 --> 00:53:50,900
of each other, like you
mentioned, the reliability stack.

1051
00:53:51,200 --> 00:53:53,900
So, the error budget,
specifically, if you use it

1052
00:53:53,900 --> 00:53:56,200
right,
and if the people are aligned in the

1053
00:53:56,300 --> 00:53:58,900
organization that error
budget is an important concept 

1054
00:53:58,900 --> 00:54:01,900
for them, you can use this for 
communicating priorities as 

1055
00:54:01,900 --> 00:54:03,500
well. 
Like you mentioned: should we

1056
00:54:03,500 --> 00:54:05,500
ship more features?
Should we hire more people? 

1057
00:54:05,500 --> 00:54:08,200
Should we even fix the 
reliability, or should we do 

1058
00:54:08,200 --> 00:54:09,900
experiments? 
And even like chubby, right? 

1059
00:54:09,900 --> 00:54:11,900
Should we just spend it because 
we are doing good. 

1060
00:54:12,300 --> 00:54:13,700
I think that's really 
interesting. 

1061
00:54:14,100 --> 00:54:16,700
So I think many people are 
interested in this SRE,

1062
00:54:16,700 --> 00:54:19,700
SLOs, SLIs, and all that,
but implementing it,

1063
00:54:19,700 --> 00:54:22,400
like you mentioned in your
Squarespace experience, is hard.

1064
00:54:22,700 --> 00:54:25,400
Maybe tell us a little bit more
tips: how should people start

1065
00:54:25,400 --> 00:54:29,000
building this as a culture, or
even defining how to get the 

1066
00:54:29,000 --> 00:54:31,100
buy-in from the people within 
the company. 

1067
00:54:31,800 --> 00:54:35,600
The best advice I can give or at
least my favorite advice because

1068
00:54:35,600 --> 00:54:37,300
it's not always possible for 
everyone. 

1069
00:54:37,500 --> 00:54:40,900
I want to be very upfront that I
understand that not everyone is 

1070
00:54:40,900 --> 00:54:44,700
in a situation where they can do
what I'm about to say but if 

1071
00:54:44,700 --> 00:54:47,500
possible just get started just 
do it. 

1072
00:54:47,700 --> 00:54:51,000
Just pick a service and pick an 
SLO and measure it and maybe 

1073
00:54:51,000 --> 00:54:53,700
it's the totally wrong target 
and maybe your SLI is a bad one.

1074
00:54:53,700 --> 00:54:55,700
That's fine. 
Pick a new one. They're not

1075
00:54:55,700 --> 00:54:57,800
agreements.
Like, they're objectives. Just get

1076
00:54:57,800 --> 00:55:00,900
started with it and start 
gathering the data and see what

1077
00:55:00,900 --> 00:55:03,000
you can start doing with the
data. And maybe at first

1078
00:55:03,000 --> 00:55:06,300
it's just you, and then maybe
it's your team, or maybe you can

1079
00:55:06,308 --> 00:55:07,700
get your whole team on board 
right away. 

1080
00:55:07,900 --> 00:55:10,500
And then, you can start showing 
the teams that you work closely 

1081
00:55:10,500 --> 00:55:11,700
with. 
Hey, look at this

1082
00:55:11,700 --> 00:55:14,000
cool data we're getting.
Look at the great

1083
00:55:14,000 --> 00:55:16,700
decision-making we've been able
to make and how we've been able 

1084
00:55:16,700 --> 00:55:20,100
to more effectively plan our
sprints because of our error

1085
00:55:20,100 --> 00:55:22,200
budget data. 
And then these other teams are like,

1086
00:55:22,200 --> 00:55:24,400
that seems kind of cool. 
Maybe we should try that. 

1087
00:55:24,600 --> 00:55:28,600
I really think SLOs are a
from-the-bottom-up thing, really.

1088
00:55:28,600 --> 00:55:31,700
Honestly, I think they're very 
often organically grown. 

1089
00:55:32,100 --> 00:55:35,700
I think that people can be sold 
on them philosophically, but 

1090
00:55:35,700 --> 00:55:38,400
they don't understand why they 
need to spend time implementing 

1091
00:55:38,400 --> 00:55:41,000
them until someone's kind of 
given them the hard data.

1092
00:55:41,200 --> 00:55:45,500
So if possible just start, just 
go just start doing these 

1093
00:55:45,500 --> 00:55:47,300
things. 
See what it gives you. 

1094
00:55:47,600 --> 00:55:50,600
If it doesn't give you what you 
want, maybe pick different 

1095
00:55:50,600 --> 00:55:53,000
targets, maybe pick different 
measurements or maybe you're not

1096
00:55:53,000 --> 00:55:55,900
quite ready for them. 
That's also totally possible but

1097
00:55:55,900 --> 00:56:00,000
every industry out there already
in some way understands failure 

1098
00:56:00,000 --> 00:56:02,500
happens. 
It's not just computer services.

1099
00:56:02,800 --> 00:56:05,700
Embracing failure is always a
good thing,

1100
00:56:05,700 --> 00:56:08,000
by ensuring you're not failing
too often.

1101
00:56:08,200 --> 00:56:12,000
This is always going to lead 
eventually to happier engineers 

1102
00:56:12,000 --> 00:56:15,500
and happier business and happier
users and happier customers. 

1103
00:56:15,500 --> 00:56:18,000
Therefore, so that's my favorite
advice. 

1104
00:56:18,000 --> 00:56:20,000
I don't know if it's the best 
advice because like I said, I 

1105
00:56:20,000 --> 00:56:23,600
know some people are not in a 
situation where they can just go

1106
00:56:23,600 --> 00:56:26,100
do it. 
I'm sympathetic to that but 

1107
00:56:26,100 --> 00:56:28,900
that's generally what I tell
people: if possible, just give it

1108
00:56:28,900 --> 00:56:32,600
a shot and see what happens. 
Yeah, sometimes people get stuck

1109
00:56:32,600 --> 00:56:35,500
on the tooling.
You know, can we get the data? 

1110
00:56:35,500 --> 00:56:38,000
I know that but I think we can 
always start simple, right? 

1111
00:56:38,500 --> 00:56:42,300
Yeah, I've seen plenty of teams 
get started with SLOs by

1112
00:56:42,300 --> 00:56:44,800
manually calculating them at the
end of each quarter. 

1113
00:56:45,300 --> 00:56:48,200
Like no joke, they didn't have 
real-time alerting on their

1114
00:56:48,200 --> 00:56:50,600
SLOs or real-time error budget
status. 

1115
00:56:50,600 --> 00:56:53,700
They didn't even have error budgets.
They got started by at the end 

1116
00:56:53,700 --> 00:56:56,600
of every quarter when they were 
getting ready to plan, "What

1117
00:56:56,600 --> 00:56:57,900
should our priorities for next
quarter

1118
00:56:57,900 --> 00:57:01,000
be?", they would go run some
queries from their monitoring 

1119
00:57:01,000 --> 00:57:04,400
system, do math against it,
put it into a spreadsheet and

1120
00:57:04,400 --> 00:57:07,700
then calculate: okay,
last quarter, we were X percent

1121
00:57:07,700 --> 00:57:09,900
reliable. 
What does that mean for our next

1122
00:57:09,900 --> 00:57:11,900
quarter? 
And that's a totally reasonable 

1123
00:57:11,900 --> 00:57:14,500
way to get started. 
It's all about just, again 

1124
00:57:14,500 --> 00:57:17,200
embracing those service truths 
that we talked about at first, 

1125
00:57:17,200 --> 00:57:20,300
right? Reliability is the most
important thing, your users

1126
00:57:20,300 --> 00:57:21,800
define your reliability, not
you. 

1127
00:57:21,800 --> 00:57:25,000
So make sure you're measuring 
the right thing and 100% is out 

1128
00:57:25,000 --> 00:57:26,200
of the question. 
So pick the right target.

1129
00:57:26,200 --> 00:57:30,600
You can embrace those truths
without real time monitoring and

1130
00:57:30,600 --> 00:57:33,600
advanced statistics and all the 
stuff that comes along with

1131
00:57:33,600 --> 00:57:36,600
it. Just get started, even if it's
in a spreadsheet, you know, if

1132
00:57:36,600 --> 00:57:39,200
it's only just once a month. I
like that you mentioned this, even

1133
00:57:39,200 --> 00:57:41,600
though you don't have the 
tooling to start because some 

1134
00:57:41,600 --> 00:57:44,100
people think after reading the 
book again, it's philosophical, 

1135
00:57:44,100 --> 00:57:45,600
right? 
We are not Google, we don't have

1136
00:57:45,600 --> 00:57:48,700
all the tools, so we are stuck.
So I think that's the key 

1137
00:57:48,700 --> 00:57:50,500
message here. 
Just start. And I think these

1138
00:57:50,500 --> 00:57:53,000
three service truths, like we
discussed in the beginning, are

1139
00:57:53,000 --> 00:57:55,400
really important. 
Once you get it right, you will 

1140
00:57:55,400 --> 00:57:58,100
find ways to actually measure
your user happiness. 

1141
00:57:58,400 --> 00:58:00,900
So Alex, thank you so much for 
spending your time. This is like a

1142
00:58:00,900 --> 00:58:03,900
crash course on SRE and SLOs,
definitely. 

1143
00:58:04,200 --> 00:58:06,100
But unfortunately, we need to 
wrap up pretty soon. 

1144
00:58:06,300 --> 00:58:08,900
But before I let you go, I 
normally ask one last question 

1145
00:58:08,900 --> 00:58:11,800
for all my guests which is to 
share your three technical

1146
00:58:11,800 --> 00:58:12,700
leadership
wisdom.

1147
00:58:12,900 --> 00:58:15,800
So this is maybe some kind of 
advice for you to give us. 

1148
00:58:16,000 --> 00:58:19,200
Also maybe based on your career
journey experience, or maybe hard

1149
00:58:19,200 --> 00:58:23,900
lessons sometimes. Sure, there are
three things I have to share. Be

1150
00:58:23,900 --> 00:58:26,100
kind. 
It goes a very long way. 

1151
00:58:26,300 --> 00:58:28,500
You're dealing with other 
humans. No matter what you

1152
00:58:28,500 --> 00:58:31,500
think your job is, no matter
what your computer service is, it

1153
00:58:31,500 --> 00:58:34,700
exists at some level for other 
humans, whether they are your 

1154
00:58:34,700 --> 00:58:39,100
customers or your end users or 
your co-workers or whatever. 

1155
00:58:39,300 --> 00:58:42,100
But be kind just be nice to each
other. 

1156
00:58:42,200 --> 00:58:44,900
Don't be pompous. 
Try to always remember that

1157
00:58:44,900 --> 00:58:48,600
every decision you make impacts
other people. Initiate those

1158
00:58:48,600 --> 00:58:52,200
decisions with kindness. Be
thoughtful. I've said that a lot

1159
00:58:52,200 --> 00:58:55,600
tonight, but it really is,
I think, maybe my most meaningful

1160
00:58:55,600 --> 00:58:59,700
mantra at this stage in my
career: think things over.

1161
00:59:00,000 --> 00:59:02,800
Sometimes, you need to react, 
sometimes you're in an 

1162
00:59:02,808 --> 00:59:04,900
emergency. 
Sometimes you deal with an 

1163
00:59:04,900 --> 00:59:09,000
incident or immense business
pressures, or your company is

1164
00:59:09,000 --> 00:59:13,600
under immense financial strain.
I've been in all those places, but

1165
00:59:13,600 --> 00:59:16,400
always, you always have time to
be thoughtful. You always have time

1166
00:59:16,400 --> 00:59:19,100
to take at least a few seconds and
be like, okay:

1167
00:59:19,400 --> 00:59:22,200
Is this the right thing I'm
about to say? Is this the right

1168
00:59:22,200 --> 00:59:24,900
thing I'm about to do? Is this the
correct action

1169
00:59:24,900 --> 00:59:28,700
I'm about to take? And then
finally, adopt blamelessness, 

1170
00:59:29,100 --> 00:59:32,800
make sure that your organization
is building an appropriate 

1171
00:59:32,800 --> 00:59:37,600
culture, where we understand 
that humans don't make mistakes 

1172
00:59:37,600 --> 00:59:40,000
on purpose, right? 
That's not the... humans all make

1173
00:59:40,000 --> 00:59:41,400
mistakes. 
Of course we do.

1174
00:59:41,600 --> 00:59:44,700
Of course every single one of us
does every single day, but 

1175
00:59:44,700 --> 00:59:47,500
generally speaking, unless 
you're a bad actor unless you're

1176
00:59:47,500 --> 00:59:50,100
literally trying to bring down 
the company from the inside, 

1177
00:59:50,300 --> 00:59:53,300
unless that's the case people 
aren't doing it on purpose, and 

1178
00:59:53,300 --> 00:59:56,100
always remember that. 
So adopt blamelessness.

1179
00:59:56,200 --> 00:59:59,300
Combine that with the thoughtfulness,
combine that with the kindness 

1180
00:59:59,300 --> 01:00:02,500
and be better to each other. I
love all the wisdom because it

1181
01:00:02,500 --> 01:00:04,500
all touches the human aspect,
nothing

1182
01:00:04,500 --> 01:00:06,900
here about technology or SLOs.
Sorry. 

1183
01:00:07,000 --> 01:00:09,300
So thank you so much for this 
beautiful message. 

1184
01:00:09,400 --> 01:00:12,700
So Alex for people who want to 
follow you, or maybe look for 

1185
01:00:12,700 --> 01:00:15,200
your product, Nobl9, where
can they find you online?

1186
01:00:15,800 --> 01:00:19,700
So you can find me on Twitter 
primarily @ahidalgosre.

1187
01:00:19,700 --> 01:00:23,800
That's A-H-I-D-A-L-G-O-S-R-E on
Twitter. 

1188
01:00:24,000 --> 01:00:28,000
Also, my website,
alex-hidalgo.com, and definitely

1189
01:00:28,000 --> 01:00:30,900
go check out Nobl9.
I believe anyone can get started

1190
01:00:30,900 --> 01:00:33,800
with SLOs. We exist to help
you do that.

1191
01:00:33,800 --> 01:00:36,900
We exist to help you measure
SLOs and calculate your error

1192
01:00:36,900 --> 01:00:38,500
budgets
the best possible way, no

1193
01:00:38,500 --> 01:00:41,200
matter where your data lives. 
So come check us out at

1194
01:00:41,200 --> 01:00:43,700
nobl9.com, that's N-O-B-L
nine dot com.

1195
01:00:45,200 --> 01:00:47,700
Thank you so much again. 
I really enjoyed this conversation,

1196
01:00:47,700 --> 01:00:50,800
so it was a pleasure. 
Thanks Alex, thanks Henry. 

1197
01:00:50,800 --> 01:00:52,300
I had a blast. 
Thanks so much for having me. 

1198
01:00:55,400 --> 01:00:57,700
Thank you for
listening to this episode and

1199
01:00:57,700 --> 01:01:00,400
for staying right
until the end. If you highly

1200
01:01:00,400 --> 01:01:03,100
enjoyed it, I would appreciate 
if you share it with your 

1201
01:01:03,100 --> 01:01:06,100
friends and colleagues who you 
think would also benefit from 

1202
01:01:06,100 --> 01:01:08,700
listening to this episode. 
And if you are new to the 

1203
01:01:08,700 --> 01:01:11,700
podcast, make sure to subscribe 
and leave me your valuable 

1204
01:01:11,700 --> 01:01:14,600
review and feedback. 
It helps me a lot in order to 

1205
01:01:14,600 --> 01:01:17,900
grow this podcast better. 
You can also find the full show 

1206
01:01:17,900 --> 01:01:21,300
notes of this conversation on 
the episode page at the

1207
01:01:21,300 --> 01:01:25,100
techleadjournal.dev website, including
the full transcript, interesting

1208
01:01:25,100 --> 01:01:28,100
quotes, and links to the
resources mentioned from the

1209
01:01:28,100 --> 01:01:31,000
conversation. 
And lastly, make sure to 

1210
01:01:31,000 --> 01:01:33,300
subscribe to the show's mailing 
list on techleadjournal.dev

1211
01:01:33,300 --> 01:01:36,900
to get notified for any
future episodes. 

1212
01:01:37,300 --> 01:01:38,800
Stay tuned for the next 
Tech Lead Journal

1213
01:01:38,800 --> 01:01:41,700
episode.
And until then, goodbye.

