1
00:00:00,040 --> 00:00:03,240
Welcome to the Deep Dive.
Today we're digging into 

2
00:00:03,240 --> 00:00:06,520
something pretty interesting 
from software engineering. It

3
00:00:06,520 --> 00:00:09,040
really flips how you might think
about your tests. 

4
00:00:09,560 --> 00:00:11,160
We're talking about mutation 
testing. 

5
00:00:11,560 --> 00:00:15,120
Our mission basically is to 
unpack this technique that 

6
00:00:15,120 --> 00:00:18,320
doesn't hunt for bugs in your 
code exactly, but looks at how 

7
00:00:18,320 --> 00:00:21,080
good your tests are. 
And get this, the core idea 

8
00:00:21,080 --> 00:00:23,000
isn't new. 
It actually goes way back to 

9
00:00:23,000 --> 00:00:26,040
like 1978. 
Surprising, right? 

10
00:00:26,040 --> 00:00:27,840
It really is. 
And the whole thing starts from 

11
00:00:27,840 --> 00:00:30,400
one key assumption. 
You've already got a set of

12
00:00:30,400 --> 00:00:33,120
automated tests running. 
Mutation testing isn't your 

13
00:00:33,120 --> 00:00:35,040
first line of defense for 
finding bugs. 

14
00:00:35,280 --> 00:00:38,480
Your regular tests do that. 
No, this is about asking, OK, 

15
00:00:38,480 --> 00:00:40,360
are these tests I've written 
actually any good? 

16
00:00:40,600 --> 00:00:43,120
Can they really catch problems, 
especially subtle ones? It's,

17
00:00:43,560 --> 00:00:45,680
you know, putting your tests 
themselves to the test.

18
00:00:45,680 --> 00:00:46,760
OK. 
Let's unpack that. 

19
00:00:46,760 --> 00:00:49,320
How does it actually work? 
You mentioned this idea of a 

20
00:00:49,320 --> 00:00:50,400
mutant. 
Right. 

21
00:00:50,720 --> 00:00:54,160
So a mutant is just a slightly 
tweaked version of your original

22
00:00:54,160 --> 00:00:56,800
code. 
A special tool, a mutation 

23
00:00:56,800 --> 00:00:59,760
testing tool, goes through your 
code and generates these mutants

24
00:00:59,760 --> 00:01:01,360
automatically. 
It makes small changes. 

25
00:01:02,240 --> 00:01:04,959
Think of things like maybe 
removing a line of code or 

26
00:01:04,959 --> 00:01:07,640
duplicating one. 
Or it might swap operators, you 

27
00:01:07,640 --> 00:01:11,240
know, change a plus sign to a 
minus sign or a greater than to 

28
00:01:11,240 --> 00:01:13,160
a less than. 
Sometimes it flips conditions, 

29
00:01:13,160 --> 00:01:17,600
so if X becomes if not X, or changes
a constant like true to false. 

30
00:01:17,880 --> 00:01:21,560
The point is to deliberately 
introduce small, plausible bugs 

31
00:01:21,800 --> 00:01:23,680
like the kind a developer might
accidentally make. 

32
00:01:24,360 --> 00:01:25,920
OK, and here's the kicker, 
right? 

33
00:01:26,400 --> 00:01:29,120
These changes, these mutations 
should break things. 

34
00:01:29,360 --> 00:01:31,800
They should introduce bugs. 
So your existing tests, the ones

35
00:01:31,800 --> 00:01:34,920
you already have, they should 
fail when run against these 

36
00:01:34,920 --> 00:01:36,320
mutated versions. 
Exactly. 

37
00:01:36,360 --> 00:01:37,640
That's the whole idea. 
That's the test. 

38
00:01:37,920 --> 00:01:40,880
If a test runs against a mutant 
and passes, well, that tells you

39
00:01:40,880 --> 00:01:43,120
something important. 
It tells you that your test 

40
00:01:43,120 --> 00:01:45,640
isn't sensitive enough to detect
that specific change. 

41
00:01:45,960 --> 00:01:48,040
It missed the bug the mutation 
introduced. 

42
00:01:48,240 --> 00:01:51,560
Right, so it's a gap in your 
test coverage, but a specific 

43
00:01:51,560 --> 00:01:54,720
kind of gap. 
Not just did I run code, but did

44
00:01:54,720 --> 00:01:57,160
I check this behavior properly. 
Precisely. 

45
00:01:57,400 --> 00:01:59,440
It goes beyond simple line 
coverage. 

46
00:01:59,760 --> 00:02:01,400
It's about the quality of the 
test. 

47
00:02:01,760 --> 00:02:03,680
Does it actually assert the 
right things? 

48
00:02:03,960 --> 00:02:07,400
Does it fail when the logic 
changes even slightly? 

49
00:02:07,520 --> 00:02:10,520
OK, so this sounds like it needs
to really get into the weeds of 

50
00:02:10,520 --> 00:02:12,560
the code then. 
It's not like black box testing 

51
00:02:12,560 --> 00:02:14,640
where you just poke the outside.
You nailed it. 

52
00:02:14,840 --> 00:02:17,080
It's definitely what we call 
white box testing. 

53
00:02:17,320 --> 00:02:20,400
It absolutely has to have access
to the source code, the internal

54
00:02:20,400 --> 00:02:23,240
workings of the system, because 
the tools need to literally 

55
00:02:23,240 --> 00:02:27,040
rewrite parts of your code to 
create those mutants, run them, 

56
00:02:27,360 --> 00:02:29,000
and then see how the tests 
react. 

57
00:02:29,160 --> 00:02:32,000
You need to see inside. 
I saw this really interesting 

58
00:02:32,000 --> 00:02:34,200
example from Google. 
They talked about a situation. 

59
00:02:34,440 --> 00:02:37,880
Imagine you have an if 
statement, maybe checking A == B

60
00:02:38,000 --> 00:02:41,000
and the mutation tool changes it
to A != B, just flips the

61
00:02:41,000 --> 00:02:43,280
comparison. 
And the striking thing was, in 

62
00:02:43,280 --> 00:02:45,760
their case, none of their 
existing tests failed. 

63
00:02:46,000 --> 00:02:49,760
Yeah, that's a fantastic, almost
scary example of its power. 

64
00:02:50,280 --> 00:02:53,440
When that happens, it's like a 
big red flag for the developers.

65
00:02:53,840 --> 00:02:56,160
It points directly to a weakness
in the test suite. 

66
00:02:56,680 --> 00:02:59,440
Even if that specific mutation 
isn't something a developer 

67
00:02:59,440 --> 00:03:03,760
would likely do by accident, the
fact no test caught it, well, it

68
00:03:03,760 --> 00:03:05,840
means there's a blind spot for 
that logic path. 

69
00:03:05,840 --> 00:03:08,240
So if you're the developer and 
you see that, you don't just 

70
00:03:08,240 --> 00:03:10,200
ignore it. 
You'd look at that mutation, 

71
00:03:10,640 --> 00:03:14,320
figure out OK, what behavior 
change did this actually cause,

72
00:03:14,400 --> 00:03:18,000
even if it's subtle. 
And then you write a new test, one

73
00:03:18,000 --> 00:03:22,080
specifically designed to catch 
that exact problem, and that new

74
00:03:22,080 --> 00:03:23,600
test would fail against the 
mutant. 

75
00:03:23,600 --> 00:03:25,680
You'd kill the mutant, as they 
say. 

76
00:03:25,760 --> 00:03:27,840
Exactly right. 
You're forced to analyze the 

77
00:03:27,840 --> 00:03:31,400
weakness and actively strengthen
your test suite in a very 

78
00:03:31,400 --> 00:03:34,120
targeted way. 
It creates this really valuable 

79
00:03:34,120 --> 00:03:36,840
feedback loop that you just 
don't get from looking at code 

80
00:03:36,840 --> 00:03:39,680
coverage numbers alone. 
OK, so when we talk about how 

81
00:03:39,680 --> 00:03:42,480
well the tests are doing at 
catching these mutants, there's 

82
00:03:42,480 --> 00:03:44,640
a specific metric, right? The
mutation score? 

83
00:03:44,920 --> 00:03:47,360
Yes, the mutation score. 
That's the key metric here. 

84
00:03:47,800 --> 00:03:50,760
It's pretty simple actually. 
It's the number of mutants your 

85
00:03:50,760 --> 00:03:54,120
tests managed to kill divided by
the total number of mutants the 

86
00:03:54,120 --> 00:03:56,560
tool generated. 
And a killed mutant just means 

87
00:03:56,560 --> 00:03:58,280
at least one of your tests 
failed

88
00:03:58,280 --> 00:04:01,200
when run against that mutated
code. The test detected the

89
00:04:01,200 --> 00:04:03,040
change. 
And ideally you want that score 

90
00:04:03,040 --> 00:04:06,000
to be perfect, 100%.
That's the goal. 

91
00:04:06,000 --> 00:04:09,640
Yep, a 100% score would mean
every single deliberate bug 

92
00:04:09,640 --> 00:04:12,440
introduced by the mutations was 
caught by your tests. 

93
00:04:13,000 --> 00:04:15,440
That suggests a really, really 
robust test suite. 

94
00:04:15,640 --> 00:04:19,320
It's the ideal you're shooting 
for, but you know, reaching 100%

95
00:04:19,320 --> 00:04:22,520
on a big complex system? 
That's tough, often not 

96
00:04:22,520 --> 00:04:24,720
practical. 
Still, a high score tells you 

97
00:04:24,720 --> 00:04:27,640
your tests have real depth. 
They can spot subtle problems. 

98
00:04:27,880 --> 00:04:31,240
It's quality over just quantity.
So doing this manually sounds 

99
00:04:31,240 --> 00:04:33,320
impossible. 
What kind of tools actually 

100
00:04:33,320 --> 00:04:35,240
exist for this? 
So you definitely need tools.

101
00:04:35,240 --> 00:04:37,240
For Java, for example,
a very popular one

102
00:04:37,240 --> 00:04:40,840
is called PIT, or Pitest,
and it's pretty smart. 

103
00:04:40,840 --> 00:04:43,080
It uses some clever tricks to 
make this whole process 

104
00:04:43,080 --> 00:04:45,440
feasible, because otherwise it 
can take forever. 

105
00:04:45,680 --> 00:04:48,360
Right, because generating maybe 
thousands of mutants and running

106
00:04:48,360 --> 00:04:51,080
tests for each one sounds 
incredibly slow. 

107
00:04:51,080 --> 00:04:53,840
Exactly. 
So one thing PIT does is it

108
00:04:53,840 --> 00:04:57,480
works directly on the compiled 
Java code, the bytecode. 

109
00:04:57,680 --> 00:05:00,320
It doesn't have to recompile 
your entire application every 

110
00:05:00,320 --> 00:05:02,120
single time it creates a new 
mutant version. 

111
00:05:02,600 --> 00:05:04,400
That saves a huge amount of 
time. 

112
00:05:05,080 --> 00:05:08,880
OK, that makes sense. 
Avoids that whole build step 

113
00:05:09,000 --> 00:05:10,000
each time. 
Right. 

114
00:05:10,240 --> 00:05:12,920
And another really clever thing 
it does is test selection. 

115
00:05:13,360 --> 00:05:16,320
It figures out which of your 
tests are actually relevant to 

116
00:05:16,320 --> 00:05:18,320
the piece of code that just got 
mutated. 

117
00:05:18,880 --> 00:05:21,160
So instead of running maybe 
thousands of tests for one 

118
00:05:21,160 --> 00:05:23,640
mutant, it might only need to 
run a handful, the ones that 

119
00:05:23,640 --> 00:05:25,280
actually cover that specific 
area. 

120
00:05:25,440 --> 00:05:28,280
That cuts down the execution 
time massively, makes it 

121
00:05:28,280 --> 00:05:30,280
practical, or at least more 
practical. 

122
00:05:30,600 --> 00:05:33,160
But even with those smart 
optimizations, the scale can 

123
00:05:33,160 --> 00:05:36,120
still be daunting, can't it? 
Let's look at that JFreeChart

124
00:05:36,120 --> 00:05:38,640
example you mentioned. 
That's a real world Java 

125
00:05:38,640 --> 00:05:40,520
library, right? 
For charts. 

126
00:05:40,840 --> 00:05:43,640
Yeah, exactly. 
Open source, widely used. 

127
00:05:44,000 --> 00:05:46,880
The study looked at version 
1.0.19. 

128
00:05:47,000 --> 00:05:52,280
It had about 47,000 lines of 
code, which is, you know, decent

129
00:05:52,280 --> 00:05:54,960
size but not enormous by today's
standards, and it already had 

130
00:05:54,960 --> 00:05:57,880
over 1300 tests, so it wasn't 
like it was untested. 

131
00:05:57,880 --> 00:06:00,400
OK, So what happened when they 
ran PIT on it?

132
00:06:00,400 --> 00:06:02,720
Well, this is where it gets 
really interesting. PIT

133
00:06:02,720 --> 00:06:07,120
generated around 256,000 mutants
for that code base. A quarter

134
00:06:07,120 --> 00:06:08,440
of a million mutants.
Wow. 

135
00:06:08,440 --> 00:06:10,200
Yeah. 
And running the tests against 

136
00:06:10,200 --> 00:06:14,160
all of them, even with PIT's
optimizations, took 109 minutes,

137
00:06:14,480 --> 00:06:17,280
so almost two hours. 
Nearly two hours for, as the 

138
00:06:17,280 --> 00:06:20,240
source called it, a relatively 
small system that really drives 

139
00:06:20,240 --> 00:06:24,000
home the computational cost. 
It absolutely does. Two hours is a

140
00:06:24,000 --> 00:06:25,920
significant chunk of time, 
especially if you want to 

141
00:06:25,920 --> 00:06:27,840
integrate this into a regular 
development workflow. 

142
00:06:28,200 --> 00:06:30,520
And here's the other kicker, the
mutation score. 

143
00:06:30,720 --> 00:06:35,760
It was only 19%. 19% after all 
that with over 1000 tests 

144
00:06:35,760 --> 00:06:41,320
already there. 19%. So despite
having 1320 tests, they only 

145
00:06:41,320 --> 00:06:44,880
caught less than a fifth of the
potential subtle bugs introduced

146
00:06:44,880 --> 00:06:48,520
by the mutations. 
Well, it's a bit of a wake-up

147
00:06:48,520 --> 00:06:51,160
call, isn't it? 
It shows that even in a mature, 

148
00:06:51,160 --> 00:06:54,760
seemingly well-tested library,
the effectiveness of those tests

149
00:06:54,760 --> 00:06:56,160
might be much lower than you 
think. 

150
00:06:56,360 --> 00:06:58,800
Yeah, that 19% really tells the 
story. 

151
00:06:58,800 --> 00:07:01,600
Huge room for improvement in 
making those tests more 

152
00:07:01,600 --> 00:07:03,640
sensitive, more robust. 
Exactly. 

153
00:07:03,640 --> 00:07:06,440
It highlights potential blind 
spots you'd never see just by 

154
00:07:06,440 --> 00:07:08,280
looking at, say, line coverage 
numbers. 

155
00:07:08,320 --> 00:07:11,120
That 19% points to untested 
behavior.

156
00:07:11,680 --> 00:07:13,520
Now, you mentioned things can 
get a bit tricky. 

157
00:07:13,520 --> 00:07:16,000
There's this concept of 
equivalent mutants. 

158
00:07:16,000 --> 00:07:19,480
Sounds a bit philosophical. 
Yeah, it can feel that way

159
00:07:19,480 --> 00:07:21,800
sometimes. 
Equivalent mutants are, well, 

160
00:07:21,800 --> 00:07:23,800
they're a real challenge in 
mutation testing. 

161
00:07:23,960 --> 00:07:26,800
Basically, an equivalent mutant 
is a change made by the tool 

162
00:07:26,800 --> 00:07:29,400
that, although the code is 
different, doesn't actually 

163
00:07:29,400 --> 00:07:31,800
change the program's behavior. 
Doesn't introduce a bug. 

164
00:07:31,840 --> 00:07:34,320
The code changes but the outcome
is identical. 

165
00:07:34,320 --> 00:07:36,040
How does it happen? 
Can you give an example? 

166
00:07:36,240 --> 00:07:38,840
Sure. 
The classic one is a mutation in

167
00:07:38,840 --> 00:07:40,800
dead code. 
Code that just never gets run 

168
00:07:40,800 --> 00:07:42,520
anymore. 
Maybe it's leftover from an old 

169
00:07:42,520 --> 00:07:44,520
feature. 
If the tool mutates something 

170
00:07:44,520 --> 00:07:46,920
inside that dead code, well, it 
doesn't matter, does it? 

171
00:07:47,160 --> 00:07:49,760
Because that code never 
executes, so the program's 

172
00:07:49,760 --> 00:07:52,840
behavior is unchanged. 
Or another example, imagine a 

173
00:07:52,840 --> 00:07:56,480
line of code that just logs a 
message to a file. If the

174
00:07:56,480 --> 00:07:59,600
mutation removes that line,
OK, the log file changes, but 

175
00:07:59,600 --> 00:08:02,520
the actual function of the 
program, what the user sees,

176
00:08:02,520 --> 00:08:04,040
stays the same. 
Exactly. 

177
00:08:04,040 --> 00:08:07,760
The core functionality isn't 
affected, so no test that checks

178
00:08:07,760 --> 00:08:09,160
functionality will fail. 
Right. 

179
00:08:09,240 --> 00:08:13,040
And the problem then is these 
equivalent mutants, they can't 

180
00:08:13,040 --> 00:08:16,880
be killed by tests by 
definition, because there's no 

181
00:08:16,880 --> 00:08:19,560
bug for the test to find. 
They just sit there dragging 

182
00:08:19,560 --> 00:08:22,640
your mutation score down, making
it look like your tests are 

183
00:08:22,640 --> 00:08:25,360
worse than they might be for 
actual behavioral changes. 

184
00:08:25,520 --> 00:08:28,400
Precisely. 
You can't write a test to fail 

185
00:08:28,400 --> 00:08:31,240
against something that isn't 
broken, so the best way to 

186
00:08:31,240 --> 00:08:33,720
handle them isn't trying to 
write impossible tests. 

187
00:08:33,919 --> 00:08:35,640
It's usually about refactoring 
the code. 

188
00:08:35,840 --> 00:08:39,080
If it's dead code, just delete 
it, clean it up, get rid of the 

189
00:08:39,080 --> 00:08:41,480
situation that allowed the 
equivalent mutant in the first

190
00:08:41,480 --> 00:08:43,799
place. 
Some modern tools are also 

191
00:08:43,799 --> 00:08:47,600
getting better at automatically 
identifying and ignoring common 

192
00:08:47,600 --> 00:08:49,920
patterns of equivalent mutants, 
which helps. 

193
00:08:50,040 --> 00:08:53,400
OK, so wrapping this up, what's 
the takeaway for someone

194
00:08:53,400 --> 00:08:55,240
listening, thinking about their 
own projects? 

195
00:08:55,240 --> 00:08:57,440
We called it the test of the 
tests. 

196
00:08:57,600 --> 00:09:00,600
I think that's a good summary. 
It's a powerful diagnostic. 

197
00:09:00,840 --> 00:09:03,600
Its real value comes in
situations where you absolutely 

198
00:09:03,600 --> 00:09:06,080
need the highest possible 
confidence in your tests. 

199
00:09:06,560 --> 00:09:10,720
Think safety-critical systems:
planes, medical gear, or maybe 

200
00:09:10,720 --> 00:09:13,160
core financial systems, high 
security stuff. 

201
00:09:13,440 --> 00:09:15,840
Places where even a tiny 
regression caught late could be 

202
00:09:15,840 --> 00:09:18,680
disastrous. 
There, the high cost might be 

203
00:09:18,680 --> 00:09:20,880
worth the deeper assurance it
provides about your test 

204
00:09:20,880 --> 00:09:23,040
quality. 
Right, but it's not necessarily 

205
00:09:23,040 --> 00:09:25,760
for everyone or every project. 
Probably not. 

206
00:09:26,120 --> 00:09:28,760
Let's be honest, in many 
projects, developers kind of 

207
00:09:28,760 --> 00:09:31,920
know their tests aren't perfect.
They often have a good sense of 

208
00:09:31,920 --> 00:09:34,800
where the weak spots are, where 
coverage is thin. 

209
00:09:35,280 --> 00:09:38,480
Yeah, you might know that the 
whole checkout process needs way

210
00:09:38,480 --> 00:09:41,120
more integration tests, for
example. You don't need

211
00:09:41,120 --> 00:09:42,720
mutation testing to tell you 
that. 

212
00:09:42,800 --> 00:09:45,440
Exactly. 
If you have those obvious bigger

213
00:09:45,440 --> 00:09:49,680
gaps, tackling them directly, 
writing more unit tests, adding 

214
00:09:49,680 --> 00:09:52,920
integration tests is probably a 
much better use of your time and

215
00:09:52,920 --> 00:09:55,560
resources first.
Mutation testing is more

216
00:09:55,560 --> 00:09:59,520
like a fine-tuning tool for
when you've already got good 

217
00:09:59,600 --> 00:10:02,440
basic coverage and now you want 
to really harden it, make it 

218
00:10:02,440 --> 00:10:04,600
incredibly robust against subtle
errors. 

219
00:10:04,600 --> 00:10:07,800
So it's powerful, but you need 
to be strategic about when and 

220
00:10:07,800 --> 00:10:10,320
where you deploy it because of 
the cost involved. 

221
00:10:10,400 --> 00:10:12,520
That sums it up perfectly. 
Know what you're trying to 

222
00:10:12,520 --> 00:10:14,080
achieve. 
Well, thank you for joining us 

223
00:10:14,080 --> 00:10:17,440
for this deep dive into the 
fascinating world of mutation 

224
00:10:17,440 --> 00:10:19,640
testing. 
We really appreciate you tuning 

225
00:10:19,640 --> 00:10:19,840
in.