1
00:00:00,280 --> 00:00:04,360
The Better Business Analysis 
Institute presents the Better 

2
00:00:04,360 --> 00:00:07,680
Business Analysis Podcast with 
Benjamin Walsh. 

3
00:00:12,360 --> 00:00:15,000
Hi everybody and welcome back to
the Better Business Analysis 

4
00:00:15,000 --> 00:00:17,360
podcast. 
And today we talk about don't 

5
00:00:17,360 --> 00:00:20,960
release on a Friday. 
We're going to be talking about 

6
00:00:20,960 --> 00:00:25,880
the CrowdStrike problem that 
caused wide outages across the 

7
00:00:25,880 --> 00:00:30,880
world to critical infrastructure
which affected Windows systems. 

8
00:00:31,000 --> 00:00:35,560
So I'm going to explain what 
happened and what was the cause 

9
00:00:35,560 --> 00:00:41,840
of that, what the resolution 
was, and some thoughts on what 

10
00:00:41,840 --> 00:00:45,520
we might be able to learn from 
this experience going forward. 

11
00:00:45,840 --> 00:00:50,840
So in effect, the recent 
CrowdStrike issue involved a faulty 

12
00:00:50,840 --> 00:00:58,000
update to their Falcon sensor. 
Okay, so the Falcon sensor is a 

13
00:00:58,920 --> 00:01:04,360
cloud controlled sensor which is
installed on Windows devices, 

14
00:01:04,360 --> 00:01:08,160
but it's managed through the cloud 
and this is what caused the 

15
00:01:08,160 --> 00:01:11,400
widespread disruption on Windows
systems. 

16
00:01:11,960 --> 00:01:16,880
The issue was identified as a 
defect in a single content 

17
00:01:16,880 --> 00:01:22,920
update for this Falcon sensor 
and this is CrowdStrike's endpoint

18
00:01:22,920 --> 00:01:27,320
detection and response, as it's 
called, for which the acronym is

19
00:01:27,320 --> 00:01:33,080
EDR software, and it actually 
focuses on, in simple terms, 

20
00:01:33,680 --> 00:01:40,200
modern age, you know, modern age
virus attacks, OK. 

21
00:01:40,200 --> 00:01:45,280
And this piece of software is 
very, it's actually quite a good

22
00:01:45,280 --> 00:01:49,080
piece of software which focuses 
on looking for these types of 

23
00:01:49,080 --> 00:01:51,400
attacks. 
This is not like your general 

24
00:01:52,280 --> 00:01:54,680
run-of-the-mill antivirus 
software, just to be clear. 

25
00:01:54,680 --> 00:02:01,000
So it, it, it has to have quite 
important system kind of access 

26
00:02:01,000 --> 00:02:04,280
to the underlying running of 
your Windows system. 

27
00:02:04,560 --> 00:02:09,440
So the software itself looks in 
there for viruses effectively 

28
00:02:09,440 --> 00:02:12,600
and notifies you. 
But in order to do that, it 

29
00:02:12,600 --> 00:02:16,280
needs access deep within the 
Windows system in order to do 

30
00:02:16,280 --> 00:02:19,400
its job properly. 
Now, what that opens up is that 

31
00:02:19,400 --> 00:02:24,200
opens up to a massive risk here 
that if the software itself is 

32
00:02:24,200 --> 00:02:26,680
faulty, then you're going to 
have a problem. 

33
00:02:27,400 --> 00:02:32,200
So what actually happened, or 
what the impact was of that 

34
00:02:32,440 --> 00:02:37,440
defect in that update for that 
piece of software: what 

35
00:02:37,480 --> 00:02:41,400
it did is it caused Windows 
computers to crash, displaying 

36
00:02:41,400 --> 00:02:45,280
what we call the blue screen of 
death, and it prevented them 

37
00:02:45,280 --> 00:02:49,280
from just rebooting properly. 
OK, So when your computer fails,

38
00:02:49,280 --> 00:02:51,520
this used to happen. 
We used to talk about it often 

39
00:02:51,520 --> 00:02:53,920
with Windows. 
If you end up with this blue 

40
00:02:53,920 --> 00:02:56,720
screen, it means basically 
Windows is saying something's 

41
00:02:56,720 --> 00:02:59,600
wrong, I'm broken, I can't 
continue, right? 

42
00:02:59,960 --> 00:03:04,160
And yes, you've got your own 
laptop or your own computer, but

43
00:03:04,160 --> 00:03:07,880
this is actually, remember that 
a cloud computer is exactly the 

44
00:03:07,880 --> 00:03:10,080
same, right? 
So these could be both a 

45
00:03:10,080 --> 00:03:12,080
physical computer or a cloud 
computer. 

46
00:03:12,240 --> 00:03:16,760
And a lot of the software and 
applications and infrastructure 

47
00:03:16,760 --> 00:03:19,760
that the world runs on is 
running on Windows servers. 

48
00:03:20,120 --> 00:03:22,760
So this affected them as well. 
OK. 

49
00:03:23,480 --> 00:03:26,120
Or mainly them because it was 
rolled out automatically via the

50
00:03:26,120 --> 00:03:29,320
cloud. 
And what that meant was that 

51
00:03:29,320 --> 00:03:31,960
this blue screen, so they've 
failed if you like, the computer

52
00:03:31,960 --> 00:03:36,560
just shed itself, if you like, 
into this blue screen of death. 

53
00:03:36,960 --> 00:03:41,760
And then when it tried to 
reboot, it actually ended up in 

54
00:03:41,760 --> 00:03:44,760
a loop. 
It ended up in a loop of 

55
00:03:44,760 --> 00:03:46,080
booting. 
It was just stuck there. 

56
00:03:46,360 --> 00:03:50,000
And of course with a blue screen
of death, it never got to Windows 

57
00:03:50,440 --> 00:03:52,560
again. 
So you couldn't use your mouse and 

58
00:03:52,560 --> 00:03:56,160
keyboard and it just rebooted 
itself again and again without 

59
00:03:56,200 --> 00:04:01,520
really understanding low level 
IT to be able to stop it and go into 

60
00:04:01,520 --> 00:04:04,720
safe mode, which doesn't load 
all of Windows up. 

61
00:04:04,720 --> 00:04:06,320
It just loads critical factors 
up. 

62
00:04:06,600 --> 00:04:09,800
You know, go into the BIOS and 
knowing how to do that, which 

63
00:04:10,000 --> 00:04:12,040
people don't have to do that 
anymore, right. 

64
00:04:12,040 --> 00:04:14,920
You know, only if you're an IT 
professional would you need to 

65
00:04:14,920 --> 00:04:17,959
do that. 
So to fix it straight away, 

66
00:04:18,120 --> 00:04:20,959
you actually had to have some 
knowledge of IT to fix this. 

67
00:04:20,959 --> 00:04:25,600
This is why it just took so long
for there to be a resolution. 

68
00:04:25,600 --> 00:04:28,960
So straight away CrowdStrike, 
you know, told people how to fix

69
00:04:28,960 --> 00:04:31,040
it. 
But I looked at the instructions

70
00:04:31,040 --> 00:04:33,840
at the time and those 
instructions as well, even 

71
00:04:33,840 --> 00:04:37,560
though they're easy to read, 
they would not be simple for any

72
00:04:38,280 --> 00:04:41,440
standard person to be able to do
without IT knowledge and even 

73
00:04:41,440 --> 00:04:43,840
first line support. 
Some of the IT people in your 

74
00:04:43,840 --> 00:04:45,120
company wouldn't know what to 
do. 

75
00:04:45,240 --> 00:04:48,760
It would involve them going to 
someone who had some IT 

76
00:04:48,760 --> 00:04:51,000
administration knowledge. 
And remember, of course, 

77
00:04:51,000 --> 00:04:52,960
there's, you know, only a 
limited number of people who 

78
00:04:52,960 --> 00:04:56,240
do that around the world.
And sometimes that's outsourced 

79
00:04:56,240 --> 00:04:59,840
to smaller companies, sorry to 
bigger companies who do that. 

80
00:05:00,520 --> 00:05:02,200
So you, you know, you have no 
control over your 

81
00:05:02,200 --> 00:05:04,280
infrastructure. 
You had to wait effectively. 

82
00:05:04,560 --> 00:05:08,840
Now in the meantime, they rolled
out some fixes to fix this 

83
00:05:08,840 --> 00:05:10,480
problem, identified what the 
problem was. 

84
00:05:11,160 --> 00:05:14,480
Both CrowdStrike and Microsoft 
identified what the problem was.

85
00:05:14,680 --> 00:05:17,760
And so they're working on code 
to make sure that this update 

86
00:05:17,760 --> 00:05:20,720
wouldn't affect their code. 
So writing some kind of catching

87
00:05:20,800 --> 00:05:25,200
statements in there to stop 
Windows from being so affected 

88
00:05:25,200 --> 00:05:28,680
by it, and also CrowdStrike 
had to roll out an update. 

89
00:05:28,920 --> 00:05:33,280
And they had to get all that 
stuff updated, fixed, tested and

90
00:05:33,280 --> 00:05:36,280
then rolled out, of course, to 
all these servers across the 

91
00:05:36,280 --> 00:05:38,400
world. 
And then once the servers were up

92
00:05:38,400 --> 00:05:40,680
and running, then people were 
checking things like, they're 

93
00:05:40,680 --> 00:05:43,040
like, were we hacked? 
Banks were worried about, you 

94
00:05:43,040 --> 00:05:45,360
know, was this used as an 
opportunity? 

95
00:05:45,480 --> 00:05:48,160
Would have been a perfect time 
to kind of attack some of these 

96
00:05:48,160 --> 00:05:51,160
organizations, which is ironic 
because the software is there to

97
00:05:51,160 --> 00:05:54,400
protect them from that stuff. 
If you're technical: 

98
00:05:55,120 --> 00:05:58,000
The defect was traced 
specifically to a file within 

99
00:05:58,000 --> 00:05:59,920
the update which caused the 
crash. 

100
00:06:00,160 --> 00:06:07,160
And the file was identified as a
file called 

101
00:06:07,680 --> 00:06:10,640
C-00000291*.sys. 
So there was a 

102
00:06:10,640 --> 00:06:12,120
specific file that was 
corrupted. 

103
00:06:12,120 --> 00:06:16,520
OK, so it wasn't a huge, you 
know, it wasn't anything major. 

104
00:06:16,520 --> 00:06:21,080
And this is what happens, right?
And a small 

105
00:06:21,080 --> 00:06:23,680
change or a small defect in code
can affect things. 

106
00:06:24,000 --> 00:06:26,920
And the resolution was that the 
engineering team isolated the 

107
00:06:26,920 --> 00:06:28,440
issue. 
They put in a fix. 

108
00:06:28,440 --> 00:06:29,960
So another file to replace that 
file. 

109
00:06:29,960 --> 00:06:32,800
That's all, really. And for immediate 
relief, 

110
00:06:32,800 --> 00:06:35,400
they said how to, like I said, 
go into safe mode. 

111
00:06:35,400 --> 00:06:40,360
or Windows recovery, 
navigate to drivers, delete the 

112
00:06:40,360 --> 00:06:44,120
problem file from drivers. 
To be clear, drivers are 

113
00:06:44,320 --> 00:06:46,960
software that interact with 
hardware on your computer. 

114
00:06:47,120 --> 00:06:49,200
If you've ever heard the word 
drivers, you need to update your

115
00:06:49,200 --> 00:06:51,280
drivers. 
And of course that's where the 

116
00:06:51,280 --> 00:06:53,880
problem was. 
So it's quite low level in terms

117
00:06:53,880 --> 00:06:56,640
of code, the code, especially 
these days with low code. 

118
00:06:57,320 --> 00:07:00,440
And so that's why it had such an
impact to the Windows machines. 

119
00:07:01,200 --> 00:07:04,560
CrowdStrike confirmed it wasn't
a security incident or a cyber 

120
00:07:04,560 --> 00:07:07,720
attack, which people thought it 
was and they were working 

121
00:07:07,720 --> 00:07:12,320
closely with affected customers.
And the CEO of 

122
00:07:12,320 --> 00:07:16,480
CrowdStrike has talked about it. 
He's one of the highest paid 

123
00:07:16,560 --> 00:07:18,720
CEOs in the world. 
I don't think he's going to be 

124
00:07:18,720 --> 00:07:20,840
around much longer, to be 
honest. 

125
00:07:21,000 --> 00:07:25,880
And what I'll get onto just really 
quickly, seeing as this is just an 

126
00:07:26,080 --> 00:07:29,680
update is the fact that what do 
we, what can we learn? 

127
00:07:29,680 --> 00:07:34,040
What can we learn about this 
right now so that we don't ever 

128
00:07:34,080 --> 00:07:38,400
do this again? 
Well, we talk about something in

129
00:07:38,400 --> 00:07:41,000
software, in the world of 
software, if you work in that 

130
00:07:41,000 --> 00:07:44,760
space and we, we call it don't 
release on a Friday, never 

131
00:07:44,760 --> 00:07:48,160
release on a Friday, or there's 
different variations. 

132
00:07:48,480 --> 00:07:53,800
And what that means is there's 
no one, or limited support, in the 

133
00:07:53,800 --> 00:07:56,040
weekend to fix things. 
If you roll something out on a 

134
00:07:56,040 --> 00:07:58,800
Friday and something goes wrong,
there's just not many people 

135
00:07:58,800 --> 00:08:00,600
around. 
They've gone out of the office, 

136
00:08:00,600 --> 00:08:03,760
they're not there to fix. 
There's no time to resolve 

137
00:08:03,760 --> 00:08:05,320
something if something critical 
happens. 

138
00:08:05,600 --> 00:08:08,320
And really that's 
what's happened in this 

139
00:08:08,320 --> 00:08:11,120
situation. 
And we'll find out more as I'm 

140
00:08:11,120 --> 00:08:16,160
going to make a prediction, which is 
that this is around really bad 

141
00:08:16,280 --> 00:08:17,960
software development and 
release. 

142
00:08:18,040 --> 00:08:21,440
OK, So the fact that there's a 
couple of things here, one, that

143
00:08:21,440 --> 00:08:25,440
CrowdStrike could roll this 
out and hadn't done appropriate 

144
00:08:25,440 --> 00:08:26,840
testing. 
At the end of the day, they 

145
00:08:26,840 --> 00:08:32,120
hadn't done appropriate testing 
on enough different devices or 

146
00:08:32,159 --> 00:08:35,159
enough devices in the same 
conditions, which were running

147
00:08:35,159 --> 00:08:39,120
critical Windows Server updates.
They just hadn't done that 

148
00:08:39,120 --> 00:08:40,840
because if they had, they would 
have caught the bug. 

149
00:08:41,080 --> 00:08:43,000
So that means they hadn't done 
appropriate testing. 

150
00:08:43,679 --> 00:08:48,240
And equally, it was the fact 
that the software in which and 

151
00:08:48,240 --> 00:08:51,720
the hardware in which 
CrowdStrike is allowed to 

152
00:08:51,720 --> 00:08:56,160
operate on Windows servers seems
a bit extreme that a third 

153
00:08:56,160 --> 00:08:59,640
party, that a third party which 
is CrowdStrike, nothing to do 

154
00:08:59,640 --> 00:09:04,480
with Microsoft, just to be 
clear, their software, their 

155
00:09:04,480 --> 00:09:08,640
mistake was able to cause such 
an impact to Windows servers, 

156
00:09:08,640 --> 00:09:11,000
which is supposed to also be 
safe, right? 

157
00:09:11,200 --> 00:09:14,000
And that can that can open up a 
whole other can of worms, in that 

158
00:09:14,000 --> 00:09:16,120
this is just one piece of 
software running on Windows 

159
00:09:16,120 --> 00:09:18,000
servers, but there's actually 
thousands of them. 

160
00:09:18,000 --> 00:09:24,040
So Microsoft need to 
really look at what power are 

161
00:09:24,040 --> 00:09:27,600
they allowing or divesting to 
these third parties and their 

162
00:09:27,600 --> 00:09:30,840
software needs to be better. 
So if they do identify software 

163
00:09:30,840 --> 00:09:36,040
that's bad, that could have a 
catastrophic impact on their

164
00:09:36,280 --> 00:09:41,360
product to Windows Server, then 
they need to write code which is

165
00:09:41,360 --> 00:09:44,520
looking for this and doesn't 
allow an update in this 

166
00:09:44,520 --> 00:09:47,280
situation. 
Or it doesn't at least if there 

167
00:09:47,280 --> 00:09:50,800
is an update and it doesn't work
properly, it doesn't cause the 

168
00:09:50,800 --> 00:09:53,800
effect that we've just seen. 
So this is a hard look at 

169
00:09:53,800 --> 00:09:58,120
software engineering 101. 
It's, it's a good opportunity 

170
00:09:58,120 --> 00:10:02,120
for us to just look at our own 
code and our own practices and 

171
00:10:02,120 --> 00:10:04,920
software deployment. 
And it's really ironic because 

172
00:10:04,920 --> 00:10:09,160
we talked about that Y2K last 
week with business rules and how

173
00:10:09,320 --> 00:10:12,160
business rules are so important.
Well, this is exactly, this is a

174
00:10:12,160 --> 00:10:14,800
business rule, right? 
This is a business rule in terms

175
00:10:14,800 --> 00:10:18,760
of not allowing third parties to
release code that affects our 

176
00:10:18,760 --> 00:10:20,760
software. 
I mean that should be a number 

177
00:10:20,760 --> 00:10:22,600
one business rule. 
And so therefore there should 

178
00:10:22,600 --> 00:10:26,120
have been system rules in place 
to stop that from happening. 

179
00:10:26,360 --> 00:10:28,560
And so Microsoft needs to look 
at their business rules from a 

180
00:10:28,560 --> 00:10:30,440
business model point of view 
because they are going to be 

181
00:10:30,440 --> 00:10:32,880
affected from here. 
They could lose a huge amount of

182
00:10:32,880 --> 00:10:35,880
business. 
So people go to Amazon or go to 

183
00:10:35,880 --> 00:10:40,920
Google as a result, and people 
just gonna be really worried 

184
00:10:40,920 --> 00:10:44,160
about this and they're going to 
really have to explain to some 

185
00:10:44,160 --> 00:10:47,440
of the largest clients in the 
world why this happened. 

186
00:10:47,640 --> 00:10:51,320
This is probably the end 
of CrowdStrike in terms of a 

187
00:10:51,320 --> 00:10:54,840
company or of trust. 
And I imagine the CEO will have 

188
00:10:54,840 --> 00:10:58,120
to step down. 
Anyway, that's my kind of 

189
00:10:58,480 --> 00:11:02,120
update, critical update, just 
around CrowdStrike. 

190
00:11:02,120 --> 00:11:05,240
I hope you learned what happened
and why that affected everything

191
00:11:05,240 --> 00:11:09,040
so much. 
And it's pretty amazing in 2024 

192
00:11:09,480 --> 00:11:15,240
that, you know, our biggest IT 
outage ever was caused by 

193
00:11:15,240 --> 00:11:19,640
someone writing a bad piece of 
code in one file which affected 

194
00:11:19,640 --> 00:11:23,000
everything, as opposed to a 
cyber attack or anything like 

195
00:11:23,000 --> 00:11:24,200
that. 
OK, guys. 

196
00:11:24,200 --> 00:11:24,880
See you later.
