1
00:00:00,040 --> 00:00:04,680
All data scientists and all 
analysts should spend more time 

2
00:00:04,680 --> 00:00:08,200
in the business outside of the 
data sets, just in the actual 

3
00:00:08,200 --> 00:00:11,840
business to see how it works. 
They should be shadowing their 

4
00:00:11,840 --> 00:00:15,680
colleagues who are in charge of 
either entering the data or just

5
00:00:15,760 --> 00:00:18,360
doing business operations 
because then you have the 

6
00:00:18,360 --> 00:00:21,720
context and then you understand 
the columns that you're seeing 

7
00:00:21,720 --> 00:00:24,120
in the data. 
So just understanding that data 

8
00:00:24,120 --> 00:00:26,400
generating process is really 
important. 

9
00:00:31,600 --> 00:00:36,800
Hey everyone, my name is Henry 
Suryawan and you're

10
00:00:36,800 --> 00:00:40,360
listening to the Tech Lead
Journal Podcast, the show where 

11
00:00:40,360 --> 00:00:42,600
I'll be bringing you the 
greatest technical leaders, 

12
00:00:42,880 --> 00:00:46,440
practitioners and thought 
leaders in the industry to 

13
00:00:46,440 --> 00:00:50,720
discuss their journey,
ideas and practices that we all 

14
00:00:50,720 --> 00:00:54,200
can learn and apply to build a 
highly performing technical team

15
00:00:54,720 --> 00:00:56,920
and to make an impact in your 
personal work. 

16
00:00:57,560 --> 00:01:04,760
So let's dive into our journal. 
Hello guys. 

17
00:01:04,760 --> 00:01:07,600
Welcome to another episode of 
the Tech Lead Journal podcast.

18
00:01:07,600 --> 00:01:11,960
Today I have David Asboth here.
He is the author of Solve Any

19
00:01:11,960 --> 00:01:15,560
Data Analysis Problem.
So any data analysis, I'm sure 

20
00:01:15,560 --> 00:01:18,720
today you know that the topic we
are going to cover is about data

21
00:01:18,720 --> 00:01:22,520
analytics or maybe data science 
or whatever data problems that 

22
00:01:22,520 --> 00:01:25,440
you're facing right now, right. 
So I think I hope today David 

23
00:01:25,440 --> 00:01:28,200
will be able to give some 
insights how we can actually 

24
00:01:28,440 --> 00:01:32,280
learn from his experience. 
So David also has a podcast,

25
00:01:32,440 --> 00:01:35,080
probably we will touch on a 
little bit about his podcast. 

26
00:01:35,280 --> 00:01:37,440
So welcome to the show, David. 
Hi Henry. 

27
00:01:37,440 --> 00:01:38,920
It's great to be here. 
Right. 

28
00:01:38,920 --> 00:01:41,400
David, in the beginning, I 
always love to ask my guests to 

29
00:01:41,400 --> 00:01:43,320
maybe share a little bit more 
about yourself, right. 

30
00:01:43,320 --> 00:01:45,840
So if you can mention any 
highlights or turning points 

31
00:01:45,840 --> 00:01:47,520
that you think we all can learn 
from you. 

32
00:01:48,160 --> 00:01:51,960
Sure. 
So I've changed careers a little

33
00:01:51,960 --> 00:01:55,120
bit a few times. 
I mean, it's always been in the 

34
00:01:55,120 --> 00:01:57,640
tech space. 
I started off as a software 

35
00:01:57,640 --> 00:01:59,960
developer. 
My undergraduate degree was 

36
00:01:59,960 --> 00:02:02,760
actually in video games 
programming because I thought 

37
00:02:02,760 --> 00:02:05,200
that's what I want to do. 
I like video games, so I 

38
00:02:05,200 --> 00:02:06,960
thought, well, obviously I'd 
love to make them. 

39
00:02:07,360 --> 00:02:08,919
And it turns out it's really 
difficult. 

40
00:02:08,919 --> 00:02:11,400
You have to program some really 
difficult things around, like 

41
00:02:11,400 --> 00:02:14,920
graphics and there's actually a 
lot more maths involved. 

42
00:02:14,920 --> 00:02:17,520
And so you know, that was, it 
was a very interesting degree to

43
00:02:17,520 --> 00:02:18,920
do. 
And the thing I definitely 

44
00:02:18,920 --> 00:02:20,920
learned from it is that I really
like coding. 

45
00:02:21,120 --> 00:02:24,040
That's something I wasn't
really exposed to before that 

46
00:02:24,040 --> 00:02:25,600
degree. 
And so I became a software 

47
00:02:25,600 --> 00:02:27,960
developer and I did that for a 
few years. 

48
00:02:27,960 --> 00:02:31,360
I was really enjoying it writing
sort of enterprise software as 

49
00:02:31,360 --> 00:02:33,960
it sort of happens if you're in 
a small team.

50
00:02:34,200 --> 00:02:37,240
I was also in charge of the 
reporting. 

51
00:02:37,880 --> 00:02:41,320
Eventually that became one of my
roles as well is that the sort 

52
00:02:41,320 --> 00:02:43,840
of we didn't have a data team as
such in the company. 

53
00:02:43,840 --> 00:02:47,560
And so I took on a lot of that 
responsibility and ended up 

54
00:02:47,800 --> 00:02:52,480
delivering answers to internal 
customers that they had by 

55
00:02:52,480 --> 00:02:56,880
pulling data from our database. 
And over time, I sort of started

56
00:02:56,880 --> 00:03:00,960
to prefer doing that part of the
job because I found that I was 

57
00:03:00,960 --> 00:03:05,680
closer to the value generating 
aspect of the business. 

58
00:03:05,680 --> 00:03:08,720
I was closer to real business 
problems where software 

59
00:03:08,720 --> 00:03:11,680
development in a lot of cases is
a little bit divorced somehow

60
00:03:11,680 --> 00:03:13,200
from the business. 
Like if you're a software 

61
00:03:13,200 --> 00:03:15,760
developer, you don't necessarily
have to understand how the 

62
00:03:15,760 --> 00:03:18,080
business runs, right? 
You don't necessarily have to 

63
00:03:18,080 --> 00:03:20,800
understand how does the business
make profit? 

64
00:03:21,000 --> 00:03:23,120
Who are the customers? 
How did the customers make 

65
00:03:23,120 --> 00:03:24,800
profit? 
You know, what is the operating 

66
00:03:24,800 --> 00:03:26,760
model? 
Those things are not that 

67
00:03:26,760 --> 00:03:29,200
relevant to software developers 
a lot of the time. 

68
00:03:29,520 --> 00:03:32,560
Whereas if you're a data person,
I mean, you can't provide any 

69
00:03:32,560 --> 00:03:34,440
value unless you know how the 
business works. 

70
00:03:34,800 --> 00:03:37,200
And that's something I learned 
in that role. 

71
00:03:37,200 --> 00:03:39,720
And so I thought, Oh well, maybe
I should make a career of 

72
00:03:39,720 --> 00:03:42,840
working with data instead. 
So that was my, I guess that was

73
00:03:42,840 --> 00:03:45,960
my second pivot from games into 
software and then from software 

74
00:03:45,960 --> 00:03:48,600
into data. 
And then I did a masters in data

75
00:03:48,600 --> 00:03:51,200
science because turns out it's 
called data science. 

76
00:03:51,200 --> 00:03:54,160
That's what I found at the time.
So that's something that I 

77
00:03:54,160 --> 00:03:56,120
should do. 
I was promised it was the 

78
00:03:56,120 --> 00:03:59,200
sexiest job of the 21st century 
and all that kind of stuff. 

79
00:03:59,200 --> 00:04:01,280
And so I thought, OK, I'll study
that. 

80
00:04:01,600 --> 00:04:04,000
And so I sort of did a master's 
degree and then transitioned 

81
00:04:04,000 --> 00:04:06,480
into being a data scientist in 
industry. 

82
00:04:06,480 --> 00:04:09,560
And I did that for a few years, 
learned some very important, 

83
00:04:09,880 --> 00:04:12,880
very interesting things about 
the difference between data 

84
00:04:12,880 --> 00:04:15,880
science education and data 
science in practice, which I'm 

85
00:04:15,880 --> 00:04:19,200
sure we can talk about. 
And then again, the sort of 

86
00:04:19,440 --> 00:04:23,240
undercurrent this whole time in 
my career changes was that I 

87
00:04:23,240 --> 00:04:26,000
always wanted to teach. 
Like, education is one of the 

88
00:04:26,000 --> 00:04:27,280
things I'm really passionate 
about. 

89
00:04:27,360 --> 00:04:30,200
And I was trying to find
various opportunities over the 

90
00:04:30,200 --> 00:04:32,360
years. 
But it wasn't until I landed in 

91
00:04:32,360 --> 00:04:36,080
the data world that I found my 
niche of data science and 

92
00:04:36,080 --> 00:04:38,760
education. 
And when the pandemic hit, it 

93
00:04:38,760 --> 00:04:41,040
was the part of it that was 
fortuitous for me. 

94
00:04:41,040 --> 00:04:43,720
It was that a lot of teaching,
well, all teaching, became online

95
00:04:44,000 --> 00:04:46,440
and so I suddenly had all these 
teaching opportunities that 

96
00:04:46,440 --> 00:04:47,840
meant I didn't have to leave the
house. 

97
00:04:48,360 --> 00:04:51,000
It just made things
logistically a lot easier. 

98
00:04:51,400 --> 00:04:55,280
And so in late 2020 I quit my 
job and started teaching full 

99
00:04:55,280 --> 00:04:58,240
time as a consultant. 
And so that's sort of what I do

100
00:04:58,240 --> 00:05:01,160
these days. I call myself a data
generalist because I've done all

101
00:05:01,160 --> 00:05:03,720
these different things and I 
haven't really pigeonholed

102
00:05:03,720 --> 00:05:07,360
myself into any particular role.
Mostly these days I do 

103
00:05:07,360 --> 00:05:10,960
educational work, so designing 
and delivering workshops, 

104
00:05:11,120 --> 00:05:14,280
anything from half a day 
lectures to a 10-week

105
00:05:14,320 --> 00:05:17,640
accelerators in data science and
Python and things like that.

106
00:05:18,080 --> 00:05:20,560
And I'm really enjoying it 
because there's a variety of

107
00:05:20,560 --> 00:05:23,040
clients to work with, a variety
of problems that people are

108
00:05:23,040 --> 00:05:24,400
trying to solve that I can help 
with. 

109
00:05:24,800 --> 00:05:26,000
That's where I've sort of 
landed. 

110
00:05:26,000 --> 00:05:28,200
I mean I don't know if anybody 
can learn from that. 

111
00:05:28,200 --> 00:05:31,800
I mean what I've learned is
that just follow your interests 

112
00:05:31,800 --> 00:05:34,800
like whatever has interested me.
I just put everything else

113
00:05:34,800 --> 00:05:37,320
down and just went towards that.
And that's how I've sort of 

114
00:05:37,320 --> 00:05:42,040
ended up in my like fourth job. 
Hey, thank you for being part of

115
00:05:42,040 --> 00:05:45,360
the Tech Lead Journal community.
This show wouldn't be the same 

116
00:05:45,360 --> 00:05:48,800
without your ears, and you are 
the reason this show exists. 

117
00:05:49,560 --> 00:05:52,560
If you're loving TLJ and want to
see it keep on growing,

118
00:05:52,960 --> 00:05:57,400
consider becoming a patron at
techleadjournal.dev/patron or

119
00:05:57,400 --> 00:06:00,880
buying me a coffee at
techleadjournal.dev/coffee.

120
00:06:01,720 --> 00:06:05,520
Every little bit helps fuel the
research, editing, and sleepless

121
00:06:05,520 --> 00:06:08,520
nights that go into making this 
show the best it can be. 

122
00:06:09,320 --> 00:06:12,120
Thanks for being the best 
listeners any podcast could ask 

123
00:06:12,120 --> 00:06:14,200
for. 
And now let's get back to our 

124
00:06:14,200 --> 00:06:16,160
episode. 
Thank you for sharing your 

125
00:06:16,160 --> 00:06:18,080
story. 
I think it is very interesting, 

126
00:06:18,080 --> 00:06:19,800
right? 
I think many people started 

127
00:06:19,800 --> 00:06:23,240
their computer science study 
because of the interest in 

128
00:06:23,240 --> 00:06:25,520
gaming, right. 
So maybe people love playing 

129
00:06:25,520 --> 00:06:28,160
games. 
I also did my computer graphics 

130
00:06:28,160 --> 00:06:30,760
course back then. 
I think it was, yeah, difficult 

131
00:06:30,760 --> 00:06:32,720
if you didn't get the math, I 
guess. 

132
00:06:33,320 --> 00:06:35,800
So I think the career that you 
took probably is also quite 

133
00:06:35,960 --> 00:06:38,680
common for some people, right? 
So they started by being a 

134
00:06:38,680 --> 00:06:41,880
generalist software developer, 
but found a specific area,

135
00:06:41,880 --> 00:06:44,080
dive deep into that and become a
specialist. 

136
00:06:44,520 --> 00:06:47,520
And you also host a podcast 
called Half Stack Data Science.

137
00:06:47,520 --> 00:06:49,440
So maybe tell us a little bit 
more about that. 

138
00:06:49,440 --> 00:06:51,440
What can we learn from that 
podcast? 

139
00:06:52,120 --> 00:06:55,600
Yeah, so that podcast
grew from my first real data

140
00:06:55,600 --> 00:06:59,520
science job after finishing my
degree and my co-host Sean was

141
00:06:59,520 --> 00:07:01,960
actually the guy who hired me. 
He was the hiring manager who 

142
00:07:02,040 --> 00:07:03,720
hired me into that role at the 
time. 

143
00:07:04,040 --> 00:07:08,800
And you know, very early on in 
that job I realized and Sean was

144
00:07:08,800 --> 00:07:10,600
the same. 
He sort of came from academia 

145
00:07:10,600 --> 00:07:14,000
into this kind of job. 
And we both quickly realized 

146
00:07:14,000 --> 00:07:17,680
that what we thought the job was
going to be is not at all what 

147
00:07:17,680 --> 00:07:21,000
the job is like in reality. 
You know, a lot of data science 

148
00:07:21,000 --> 00:07:24,120
education is focused on tools, 
techniques, algorithms. 

149
00:07:24,120 --> 00:07:26,480
And so you get this picture 
that, OK, well, I'm going to be 

150
00:07:26,880 --> 00:07:29,560
deriving formulas and doing all 
this complicated machine 

151
00:07:29,560 --> 00:07:32,080
learning at work. 
And then you go in and it turns 

152
00:07:32,080 --> 00:07:35,080
out, you know, a lot of the job 
is navigating the complexities 

153
00:07:35,080 --> 00:07:39,000
of an enterprise environment, 
working with office politics

154
00:07:39,000 --> 00:07:42,920
and things like, oh, the data's 
not actually available and no 

155
00:07:42,920 --> 00:07:45,080
one actually has any solid 
research questions. 

156
00:07:45,080 --> 00:07:46,720
So we have to find those as 
well. 

157
00:07:46,720 --> 00:07:51,080
And so we were very, very 
quickly hit on this difference 

158
00:07:51,240 --> 00:07:54,120
between education and reality. 
And so we we started having 

159
00:07:54,120 --> 00:07:57,120
these conversations internally 
about OK, what are the things 

160
00:07:57,120 --> 00:07:59,920
we've learned about how industry
is different, What are the 

161
00:07:59,920 --> 00:08:03,280
skills that people should 
actually be trained on or at 

162
00:08:03,280 --> 00:08:06,520
least be warned about upfront so
people have a better picture of 

163
00:08:06,520 --> 00:08:09,840
what the job looks like. 
And one day we just said, well 

164
00:08:09,840 --> 00:08:11,960
we've had these conversations 
quite a lot and every time we 

165
00:08:11,960 --> 00:08:14,840
went to a meetup, we would talk
to like-minded people and have

166
00:08:14,840 --> 00:08:17,080
the same conversations and we 
thought well we might as well 

167
00:08:17,080 --> 00:08:20,080
just put them on the Internet 
for other people to learn from. 

168
00:08:20,400 --> 00:08:22,680
And so initially the podcast 
started off with the two of us 

169
00:08:22,680 --> 00:08:25,800
booking a meeting room at work 
and taking our work Samsung 

170
00:08:25,800 --> 00:08:28,520
phone, putting it on the table 
and just having a chat. 

171
00:08:28,880 --> 00:08:31,440
And then eventually we became a 
little more professional and got

172
00:08:31,440 --> 00:08:33,960
some proper tooling and proper 
microphones and things. 

173
00:08:34,159 --> 00:08:36,840
But that's how it started. 
And so currently we're running a

174
00:08:36,840 --> 00:08:39,919
season where we're talking to 
educators in the data space, 

175
00:08:39,919 --> 00:08:44,720
because I think at this point in
time, getting the education of 

176
00:08:44,720 --> 00:08:48,200
future analysts and future data 
scientists right is really 

177
00:08:48,200 --> 00:08:50,560
important. 
So we've spoken to people who 

178
00:08:50,560 --> 00:08:54,080
are Python trainers, but also 
people who are spreading the 

179
00:08:54,080 --> 00:08:56,800
idea of data literacy. 
You know, we've talked to a 

180
00:08:56,800 --> 00:08:59,760
variety of people and it's been 
very interesting to see what 

181
00:08:59,760 --> 00:09:01,720
their perspective is. 
And there is a lot of 

182
00:09:01,720 --> 00:09:05,600
commonality with our philosophy,
which is trying to morph 

183
00:09:05,600 --> 00:09:08,760
education into something that is
much more applied and much more 

184
00:09:09,000 --> 00:09:12,440
ready for the real world. 
I'm quite interested in the name

185
00:09:12,440 --> 00:09:15,280
itself, like half stack. 
Why are you calling it half 

186
00:09:15,280 --> 00:09:17,680
stack? 
I mean, is that the opposite of 

187
00:09:17,680 --> 00:09:20,000
full stack, right? 
Why data science is half stack? 

188
00:09:20,000 --> 00:09:21,440
So probably you can explain a 
little

189
00:09:21,560 --> 00:09:23,000
bit. No,
that's a good question.

190
00:09:23,000 --> 00:09:25,280
It's a response to the full
stack idea. 

191
00:09:25,640 --> 00:09:29,440
I mean, one thing we never liked
was this idea that a single data

192
00:09:29,440 --> 00:09:32,080
scientist has to be this unicorn
who does everything in a 

193
00:09:32,080 --> 00:09:34,320
company. 
I mean that's just that's not 

194
00:09:34,320 --> 00:09:37,560
the reality that would never 
function in a business, 

195
00:09:37,560 --> 00:09:41,320
especially a solid decades old 
enterprise. 

196
00:09:41,560 --> 00:09:44,040
You can't just drop a single 
data science unicorn in there

197
00:09:44,040 --> 00:09:47,360
and hope that they'll make the 
company lots of money with a 

198
00:09:47,360 --> 00:09:48,840
machine learning model
immediately. I

199
00:09:49,160 --> 00:09:50,760
just remember there was one
presentation

200
00:09:50,760 --> 00:09:54,400
we gave somewhere, and Sean, we
talked about things like the 

201
00:09:54,400 --> 00:09:57,000
difference between academic data
science and business data 

202
00:09:57,000 --> 00:09:58,800
science. 
And just somewhere on those 

203
00:09:58,800 --> 00:10:01,320
slides he coined the phrase half
stack data science. 

204
00:10:01,320 --> 00:10:03,800
He just put it on the slide and
it's sort of it just sort of 

205
00:10:03,800 --> 00:10:06,600
stuck. 
And the idea is that because 

206
00:10:06,800 --> 00:10:10,240
data science is so different in 
the real world as opposed to in 

207
00:10:10,240 --> 00:10:14,200
education, then you need people 
with a sort of this hybrid skill

208
00:10:14,200 --> 00:10:15,920
set and more generalist skill 
set. 

209
00:10:16,280 --> 00:10:20,200
And we don't really think that
having someone be full stack is 

210
00:10:20,200 --> 00:10:23,240
realistic, at least you know, 
being an expert in everything 

211
00:10:23,240 --> 00:10:28,000
from data cleaning to statistics
to software development to 

212
00:10:28,160 --> 00:10:31,880
business strategy development. 
I think these days it's pretty

213
00:10:31,880 --> 00:10:35,280
hard to be full stack and even 
the stack gets deeper and deeper

214
00:10:35,280 --> 00:10:36,800
right? 
So I think there's so many 

215
00:10:36,800 --> 00:10:39,000
technologies that these days 
people need to learn. 

216
00:10:39,000 --> 00:10:42,640
So I think, yeah, hence probably
half stack, quarter stack.

217
00:10:42,720 --> 00:10:43,840
Probably makes. 
Yeah, exactly. 

218
00:10:44,040 --> 00:10:48,040
Fractional stack. 
So let's go into the topic of 

219
00:10:48,120 --> 00:10:51,000
today's conversation which is 
about the data analysis. 

220
00:10:51,120 --> 00:10:53,880
So I think in the beginning you 
mentioned you realized there is 

221
00:10:53,880 --> 00:10:57,960
a big gap between what normally 
data analysts or data scientists

222
00:10:57,960 --> 00:11:01,720
learn throughout their education
or maybe boot camp courses, 

223
00:11:01,720 --> 00:11:05,240
whatever that is compared with 
the real life problems, right. 

224
00:11:05,560 --> 00:11:09,440
So what are the typical gaps 
that you see challenges for 

225
00:11:09,440 --> 00:11:13,480
people from academia or maybe
learning from their study and 

226
00:11:13,480 --> 00:11:16,000
thrown into the deep end into 
real world problems. 

227
00:11:16,000 --> 00:11:19,600
So what are typical gaps or 
challenges that we have to think

228
00:11:19,600 --> 00:11:22,000
about? 
Yeah, so I've taught 

229
00:11:22,200 --> 00:11:26,160
accelerators and boot camps and 
I was faced with this problem as

230
00:11:26,160 --> 00:11:29,360
well of trying to teach the 
right skills, but also within 

231
00:11:29,360 --> 00:11:30,360
the framework
of what I was

232
00:11:30,360 --> 00:11:32,560
expected to teach.
And so you know there's a list 

233
00:11:32,560 --> 00:11:35,520
of technical topics that you 
absolutely have to teach in 

234
00:11:35,520 --> 00:11:37,160
order to make someone a data 
analyst, right? 

235
00:11:37,160 --> 00:11:40,600
You need to be able to read 
data, combine data, clean it, 

236
00:11:40,840 --> 00:11:43,640
identify missing values, 
outliers, all this kind of 

237
00:11:43,640 --> 00:11:46,080
technical stuff. 
You know, we usually teach like 

238
00:11:46,080 --> 00:11:49,480
SQL, some kind of business
intelligence tool like Power BI 

239
00:11:49,480 --> 00:11:52,840
or Tableau, maybe Python, 
ideally have some kind of 

240
00:11:52,840 --> 00:11:55,520
programming language in there, 
or maybe R. So these are sort

241
00:11:55,520 --> 00:11:57,960
of technical skills that you 
just have to teach in the 

242
00:11:57,960 --> 00:12:00,520
foundational training, because 
otherwise you can't do the job 

243
00:12:00,520 --> 00:12:02,560
right. 
Excel counts as well, so all

244
00:12:02,560 --> 00:12:04,080
the different things you can do 
in Excel. 

245
00:12:04,640 --> 00:12:07,000
And then sometimes, depending on
the course, you'd also teach 

246
00:12:07,000 --> 00:12:09,280
machine learning. 
You know how to build a machine 

247
00:12:09,280 --> 00:12:11,800
learning model, how to do some 
of the, sort of,

248
00:12:11,800 --> 00:12:15,080
practitioner things, like cross
validation, other things like 

249
00:12:15,080 --> 00:12:17,680
that. 
But technical skills are not the

250
00:12:17,680 --> 00:12:20,840
whole job, right? 
And I'm sure multiple guests on 

251
00:12:20,840 --> 00:12:23,000
your podcast probably said the 
same thing, that software 

252
00:12:23,000 --> 00:12:26,080
engineers don't spend most of 
their time writing code and data

253
00:12:26,080 --> 00:12:29,160
analysts don't necessarily spend
most of their time analysing 

254
00:12:29,160 --> 00:12:31,040
data. 
There's all sorts of other 

255
00:12:31,040 --> 00:12:33,840
things to do, like identifying 
problems to solve in the first 

256
00:12:33,840 --> 00:12:36,440
place. 
That's not something we teach in

257
00:12:36,440 --> 00:12:39,120
foundational training. 
How do you have a conversation 

258
00:12:39,120 --> 00:12:42,400
with another human and read 
between the lines of what 

259
00:12:42,400 --> 00:12:44,160
problems they're actually trying
to solve? 

260
00:12:44,480 --> 00:12:48,280
And you don't even necessarily 
realize that that's a skill 

261
00:12:48,280 --> 00:12:51,200
you're going to need. 
And my problem with that kind of

262
00:12:51,200 --> 00:12:54,000
thing is that we often hand wave
it away and say, Oh well, of 

263
00:12:54,000 --> 00:12:57,440
course people would just learn 
that on the job, but how, right?

264
00:12:57,440 --> 00:12:59,840
There's no one to actually teach
them in any sort of formal

265
00:12:59,840 --> 00:13:03,160
way. It's just sort of, oh,
you just pick up these skills as

266
00:13:03,160 --> 00:13:05,640
you go. 
So anything about navigating 

267
00:13:05,640 --> 00:13:09,360
like priorities in a company 
where five different 

268
00:13:09,360 --> 00:13:12,520
stakeholders ask you for seven 
different projects, How do you 

269
00:13:12,600 --> 00:13:15,280
know what to work on? 
How do you generate a value 

270
00:13:15,280 --> 00:13:17,320
statement for any analytical 
work? 

271
00:13:17,680 --> 00:13:22,520
So how do you think about the 
actual quantifiable value that 

272
00:13:22,520 --> 00:13:25,320
this project is going to have, 
what kind of impact it's going 

273
00:13:25,320 --> 00:13:27,200
to have? 
And then some of the other 

274
00:13:27,200 --> 00:13:29,280
things is, so you've got 
building all these machine 

275
00:13:29,280 --> 00:13:31,880
learning models, but then how 
are these models going to be 

276
00:13:31,880 --> 00:13:35,600
used by the company? 
And so sort of planning after 

277
00:13:35,600 --> 00:13:39,040
the first model component as 
well, that's something that you 

278
00:13:39,040 --> 00:13:41,640
just sort of have to learn on 
the job rather than being 

279
00:13:41,640 --> 00:13:43,480
prepared for it. 
And then finally, the other 

280
00:13:43,480 --> 00:13:47,680
thing that I try and teach as 
much as possible is like the 

281
00:13:47,680 --> 00:13:51,400
most realistic data sets that 
you can work on rather than the 

282
00:13:51,400 --> 00:13:54,520
toy examples that we teach. 
Now almost everyone who's taken 

283
00:13:54,520 --> 00:13:57,200
some kind of data science course
has predicted the survivors of 

284
00:13:57,200 --> 00:14:00,200
the Titanic, right? 
That's one of the classic data 

285
00:14:00,320 --> 00:14:02,080
science machine learning 
examples. 

286
00:14:02,600 --> 00:14:05,600
And I think the real world
applicability of that problem is

287
00:14:05,600 --> 00:14:08,560
not that high. 
So the fact that the data sets 

288
00:14:08,720 --> 00:14:11,000
in education are not that 
realistic is one of the 

289
00:14:11,000 --> 00:14:12,200
problems. 
And the other one is often 

290
00:14:12,200 --> 00:14:13,600
they're actually quite
clean. 

291
00:14:14,080 --> 00:14:17,200
You know, there's a meme in data
science which is 80% of data 

292
00:14:17,200 --> 00:14:20,680
science is spent cleaning data and
the other 20% is spent 

293
00:14:20,680 --> 00:14:22,920
complaining about the fact that 
we have to clean data. 

294
00:14:23,360 --> 00:14:26,480
But it's true that often, you 
know, we're the ones who have to

295
00:14:26,480 --> 00:14:29,680
either collect the data or find 
where it is in the first place,

296
00:14:29,680 --> 00:14:32,560
combine it, document it for the 
first time. 

297
00:14:32,760 --> 00:14:34,320
And these are all things that 
take time. 

298
00:14:34,320 --> 00:14:37,040
And these are all things that we
don't think about much in the 

299
00:14:37,040 --> 00:14:39,120
classroom. 
And you know, one reason for 

300
00:14:39,120 --> 00:14:41,320
that is we just don't have the 
time we have to teach these 

301
00:14:41,320 --> 00:14:44,400
technical skills. 
But I think there is room to 

302
00:14:44,400 --> 00:14:47,120
make education better at the 
foundational level by 

303
00:14:47,120 --> 00:14:49,040
incorporating more of these real
world elements. 

304
00:14:49,760 --> 00:14:52,600
Yeah, I didn't study data 
science, data analytics, but I 

305
00:14:52,600 --> 00:14:56,880
did some kind of data projects. 
People write data reporting, 

306
00:14:56,880 --> 00:14:58,560
real time analytics and things 
like that. 

307
00:14:58,880 --> 00:15:03,040
I think the level of complexity 
and ambiguity I guess increases 

308
00:15:03,040 --> 00:15:05,000
as the amount of data sources, 
right. 

309
00:15:05,120 --> 00:15:07,920
And also depending on the 
cleanliness of the data, right. 

310
00:15:08,280 --> 00:15:11,760
So I think when you study in 
maybe, I don't know, Bootcamp or

311
00:15:11,840 --> 00:15:14,440
a course, right, typically 
you're given a data set which

312
00:15:14,440 --> 00:15:17,040
is kind of like clean enough and
you know, well defined and 

313
00:15:17,040 --> 00:15:19,240
things like that. 
But I think as soon as you hit 

314
00:15:19,240 --> 00:15:21,880
the real industry, right, so you
realize that it's actually not 

315
00:15:21,920 --> 00:15:25,160
as simple as that. 
Sometimes also like identifying 

316
00:15:25,160 --> 00:15:28,600
the problems to solve, right, 
the question becomes much more 

317
00:15:28,720 --> 00:15:31,240
abstract probably, right? 
There's no, like, real formula 

318
00:15:31,520 --> 00:15:34,400
and maybe there's no even like 
100% solution that you can 

319
00:15:34,480 --> 00:15:37,560
actually come up with. 
And hence the first thing that 

320
00:15:37,560 --> 00:15:39,640
you mentioned is actually 
identifying the problem. 

321
00:15:39,920 --> 00:15:44,160
So how can people who learn more
about technical stuff, right? 

322
00:15:44,280 --> 00:15:46,440
Because typically it's quite 
straightforward in the boot 

323
00:15:46,440 --> 00:15:49,400
camp, like, OK, here's the data 
set, here's what I want you to 

324
00:15:49,400 --> 00:15:50,680
find. 
And it's kind of like 

325
00:15:50,680 --> 00:15:52,360
straightforward. 
But actually in the real 

326
00:15:52,360 --> 00:15:54,480
business world, sometimes 
identifying the problem is a 

327
00:15:54,480 --> 00:15:56,440
challenge. 
Not to mention as well that you 

328
00:15:56,440 --> 00:15:58,280
don't understand the domain of 
the business. 

329
00:15:58,520 --> 00:16:01,440
So maybe from your experience 
some tips that you can teach us 

330
00:16:01,440 --> 00:16:03,720
here. 
Yeah, I mean I was lucky in the 

331
00:16:03,760 --> 00:16:06,120
company I worked for.
So we worked in the used car 

332
00:16:06,120 --> 00:16:11,120
industry and like on I think my 
first day was at an actual used 

333
00:16:11,120 --> 00:16:13,320
car auction that the company was
holding. 

334
00:16:13,360 --> 00:16:16,800
So I actually got to see the 
company in operation. 

335
00:16:17,040 --> 00:16:18,840
There's no mention of data on 
that day. 

336
00:16:18,840 --> 00:16:21,000
It was all about like this is 
how the business operates. 

337
00:16:21,200 --> 00:16:22,680
And I think that's a really good
model. 

338
00:16:22,680 --> 00:16:27,360
I think all data scientists and 
all analysts should spend more 

339
00:16:27,360 --> 00:16:31,040
time in the business outside of 
the data sets, just in the 

340
00:16:31,040 --> 00:16:33,160
actual business to see how it 
works. 

341
00:16:33,480 --> 00:16:36,960
They should be shadowing their 
colleagues who are in charge of 

342
00:16:36,960 --> 00:16:40,480
either entering the data or just
doing business operations, the 

343
00:16:40,480 --> 00:16:44,240
sales people, the customer 
engagement people, everyone 

344
00:16:44,280 --> 00:16:46,120
who's contributing some way to 
the company. 

345
00:16:46,120 --> 00:16:49,320
I think data scientists should
understand all those functions 

346
00:16:49,720 --> 00:16:52,960
because then you have the 
context and then you understand 

347
00:16:52,960 --> 00:16:54,760
the columns that you're seeing 
in the data. 

348
00:16:55,160 --> 00:16:57,880
Again, in education, we normally
say, OK, here's the data set and

349
00:16:57,880 --> 00:17:00,400
then here's something called the
data dictionary where each 

350
00:17:00,400 --> 00:17:02,840
column is labeled and then we
tell you what each of the

351
00:17:02,840 --> 00:17:05,359
columns means, which is
great, right?

352
00:17:05,359 --> 00:17:07,040
For educational purposes,
great. 

353
00:17:07,319 --> 00:17:09,119
That thing doesn't exist in the 
real world. 

354
00:17:09,119 --> 00:17:12,200
Usually we have to write our own
data dictionary, but even then 

355
00:17:12,200 --> 00:17:15,520
that's a very narrow view of 
looking at it just as a table of

356
00:17:15,520 --> 00:17:17,800
numbers. 
What we should look at it as is 

357
00:17:17,800 --> 00:17:20,240
the wider context in the 
business, right. 

358
00:17:20,240 --> 00:17:21,640
So where does this data come 
from? 

359
00:17:21,640 --> 00:17:24,359
Who enters it, when do these 
records get generated? 

360
00:17:24,359 --> 00:17:27,520
So just understanding that data 
generating process is really 

361
00:17:27,520 --> 00:17:29,280
important. 
And yeah, you said about 

362
00:17:29,280 --> 00:17:31,800
understanding the domain, it's 
really important for a data 

363
00:17:31,800 --> 00:17:34,160
scientist to understand the 
domain they're working in. 

364
00:17:34,440 --> 00:17:36,800
And that's the kind of thing you
can learn on the job for

365
00:17:36,800 --> 00:17:39,080
sure, like you don't need to 
spend a

366
00:17:39,080 --> 00:17:43,040
year
as a used car salesman before

367
00:17:43,040 --> 00:17:46,160
you go into a data science job 
in the used car industry. 

368
00:17:46,520 --> 00:17:49,760
But once you're in there, you 
know having that additional 

369
00:17:50,120 --> 00:17:53,040
curiosity and context about the 
domain is definitely important 

370
00:17:53,040 --> 00:17:56,240
and not something that you can 
delegate to other colleagues. 

371
00:17:56,920 --> 00:17:59,480
Yeah, sometimes. 
I think also like data 

372
00:17:59,480 --> 00:18:01,320
scientists or data analysts 
typically, right? 

373
00:18:01,320 --> 00:18:04,960
They love playing with data or 
their tools, their SQL or BI

374
00:18:04,960 --> 00:18:06,920
tools, whatever. 
Don't forget that you should 

375
00:18:06,920 --> 00:18:09,360
also collaborate, right? 
You're good with crunching the 

376
00:18:09,360 --> 00:18:11,840
data, cleaning the data. 
But if you collaborate with the 

377
00:18:11,840 --> 00:18:14,920
domain expert, maybe sitting 
side by side, show the data and 

378
00:18:14,920 --> 00:18:18,840
ask questions why this matters 
or what kind of data that you 

379
00:18:18,840 --> 00:18:21,000
are dealing with, right? 
So maybe if you collaborate 

380
00:18:21,000 --> 00:18:24,600
more, you'll get to learn much 
better, because otherwise you'll

381
00:18:24,600 --> 00:18:27,600
probably be stuck
identifying the problem and also

382
00:18:27,600 --> 00:18:30,280
giving solutions.
Which brings me to the next 

383
00:18:30,280 --> 00:18:32,680
topic of discussion, right?
In your book you mentioned

384
00:18:32,680 --> 00:18:35,960
this result driven approach 
being pragmatic when you come up

385
00:18:35,960 --> 00:18:39,280
with data analysis because 
sometimes I can see we crunch 

386
00:18:39,280 --> 00:18:43,120
data, we use sophisticated tools
and techniques, right, but 

387
00:18:43,120 --> 00:18:44,920
doesn't necessarily bring 
results. 

388
00:18:44,920 --> 00:18:47,680
You know after a few weeks of 
time you come up with the 

389
00:18:47,680 --> 00:18:51,000
result, maybe the business or 
the stakeholders don't really 

390
00:18:51,160 --> 00:18:54,000
get it, right, or don't feel
satisfied with the answers. 

391
00:18:54,320 --> 00:18:57,280
So tell us more a little bit 
about your result driven 

392
00:18:57,280 --> 00:19:00,680
approach so that we can be more 
pragmatic in our data analysis. 

393
00:19:01,600 --> 00:19:04,240
Yeah, I'm glad you mentioned 
collaboration because that's 

394
00:19:04,240 --> 00:19:08,000
partly what it comes down to is 
as data people, we need to 

395
00:19:08,000 --> 00:19:13,840
remember that our primary goal 
in any company is to provide 

396
00:19:13,840 --> 00:19:17,840
value of some sort, whether 
that's clearly monetary or 

397
00:19:18,040 --> 00:19:21,280
saving time through automation. 
Whatever it is, that is our 

398
00:19:21,280 --> 00:19:24,120
primary goal. 
And something I say to students 

399
00:19:24,120 --> 00:19:28,480
is if you can solve a 
stakeholder's problem, answer the

400
00:19:28,480 --> 00:19:31,720
question with a single bar 
chart, then fine. 

401
00:19:32,000 --> 00:19:34,160
It's the problem solving that 
matters. 

402
00:19:34,160 --> 00:19:36,760
It's not the level of 
sophistication in your tools. 

403
00:19:37,160 --> 00:19:39,880
And that's what I'm trying to 
get across in the book, is that 

404
00:19:40,120 --> 00:19:43,480
the key thing is to have an end 
goal to start with. 

405
00:19:43,480 --> 00:19:46,000
Like it's one of the first 
things you need to do is define 

406
00:19:46,000 --> 00:19:47,640
an end goal that you want to 
reach. 

407
00:19:47,880 --> 00:19:50,440
And it might be like the 
simplest version of the problem.

408
00:19:50,440 --> 00:19:52,920
It might be the smallest 
possible answer. 

409
00:19:52,920 --> 00:19:55,920
I call it the minimum viable 
answer in the book, which is you

410
00:19:55,920 --> 00:19:58,680
know, what is the absolute 
minimum amount of work that you 

411
00:19:58,680 --> 00:20:02,120
can do to get something,
a result that will take you to

412
00:20:02,120 --> 00:20:03,920
the next step. 
And then you know, it's usually 

413
00:20:03,920 --> 00:20:06,200
then a conversation, as you 
said, a collaboration with 

414
00:20:06,200 --> 00:20:08,760
stakeholders to say, look, this 
is what I did based on our 

415
00:20:08,760 --> 00:20:11,880
conversation, what do you think,
what direction should we take 

416
00:20:11,880 --> 00:20:13,920
this in? 
And just having that at the 

417
00:20:13,920 --> 00:20:16,480
forefront of your mind 
throughout the analysis I think 

418
00:20:16,480 --> 00:20:20,440
is really helpful because as 
data people, we can go down the 

419
00:20:20,440 --> 00:20:22,720
rabbit holes. 
If you're working on the data 

420
00:20:22,720 --> 00:20:25,720
set, usually you get more 
questions generated than 

421
00:20:25,720 --> 00:20:27,640
answers. 
Every time you look at a new 

422
00:20:27,640 --> 00:20:29,000
column,
you're like, oh, there's missing

423
00:20:29,000 --> 00:20:31,040
values here, oh, there's a 
relationship between these 

424
00:20:31,040 --> 00:20:34,080
columns and so you can keep 
exploring it forever and as you 

425
00:20:34,080 --> 00:20:36,840
say, never get to a result. 
So if you already know upfront 

426
00:20:36,840 --> 00:20:40,320
what your results should be, the
whole exploration has a goal 

427
00:20:40,320 --> 00:20:43,320
that you're directing towards, 
which you know makes it quicker 

428
00:20:43,320 --> 00:20:46,160
to get to an answer, and it will
also make the answer more useful

429
00:20:46,240 --> 00:20:49,120
and contextual.
So very interesting that you 

430
00:20:49,120 --> 00:20:51,920
mentioned we should start with 
an end goal in mind, right. 

431
00:20:51,920 --> 00:20:55,840
So I think typically maybe from 
my experience I didn't see

432
00:20:55,840 --> 00:21:00,000
many data analysis work that 
way, you know, giving an end

433
00:21:00,000 --> 00:21:02,760
goal first, and typically it's like
stages, right. 

434
00:21:02,760 --> 00:21:05,200
So you start with the first 
crunching of the data, the 

435
00:21:05,200 --> 00:21:07,800
milestone and then you give the 
preliminary result, right. 

436
00:21:08,160 --> 00:21:10,920
But I think, understanding the 
end goal, how should probably 

437
00:21:10,920 --> 00:21:13,600
the result look like? 
Confirming with stakeholder. 

438
00:21:13,600 --> 00:21:15,640
That's what you advise in your 
book actually, right? 

439
00:21:16,000 --> 00:21:19,480
So you do the first iteration, 
do as little as possible, which

440
00:21:19,480 --> 00:21:22,240
is the minimum viable answer 
that you have, and ask for 

441
00:21:22,240 --> 00:21:24,480
feedback and then shape from 
there, right? 

442
00:21:24,480 --> 00:21:26,000
Rather than going down the rabbit
hole. 

443
00:21:26,200 --> 00:21:29,520
I think many people, especially 
for me when I work with data, 

444
00:21:29,520 --> 00:21:32,960
it's always fun to crunch data, 
you know, firing different 

445
00:21:33,160 --> 00:21:37,000
SQL statements and wait for
the results and storing it to 

446
00:21:37,000 --> 00:21:39,000
somewhere, right? 
It's always fun and maybe 

447
00:21:39,000 --> 00:21:42,600
generating reports or charts 
which are fancy, but I think 

448
00:21:42,600 --> 00:21:44,240
sometimes it doesn't solve the 
problem. 

449
00:21:44,240 --> 00:21:45,800
So I think it's very very 
important. 

450
00:21:46,120 --> 00:21:48,040
Have a goal in mind, the end 
goal in mind. 

451
00:21:48,080 --> 00:21:50,280
Clarify that with stakeholders 
and iterate, right?

452
00:21:50,520 --> 00:21:51,840
The iteration here. 
How? 

453
00:21:52,080 --> 00:21:54,080
How would you suggest people
do it, right?

454
00:21:54,080 --> 00:21:56,000
How short should the iteration 
be? 

455
00:21:56,280 --> 00:22:00,040
Or what is too long for you so 
that you have to be wary about? 

456
00:22:00,200 --> 00:22:02,320
So maybe a little bit of tips on
the iteration. 

457
00:22:03,000 --> 00:22:05,960
Yeah, that's a great question 
and something that we often

458
00:22:06,120 --> 00:22:09,560
wrangle with on our podcast. 
We had a whole episode dedicated

459
00:22:09,560 --> 00:22:14,560
to estimating time because in 
software world what I was used 

460
00:22:14,560 --> 00:22:17,480
to is that we got pretty good at
estimating how long a task would

461
00:22:17,480 --> 00:22:19,680
take, right? 
It's like, oh, we need to add 2 

462
00:22:19,680 --> 00:22:22,680
buttons to a web page and you're
like, OK, I need to write some 

463
00:22:22,680 --> 00:22:24,760
functions in the background, 
maybe I need to like create a 

464
00:22:24,760 --> 00:22:27,080
new database column, I need to 
write this code. 

465
00:22:27,080 --> 00:22:29,920
It's probably going to take me a
day, and we were pretty close 

466
00:22:29,920 --> 00:22:32,880
most of the time for these 
little chunks of work. 

467
00:22:33,320 --> 00:22:36,400
In the world of data analysis, 
there's so much uncertainty, 

468
00:22:36,720 --> 00:22:39,160
partly because the problem is 
ill defined, partly because we 

469
00:22:39,160 --> 00:22:42,200
don't know the data very well, 
partly because we could find 

470
00:22:42,200 --> 00:22:45,400
anything of interest and go down
all these rabbit holes that it 

471
00:22:45,400 --> 00:22:48,280
becomes very difficult to 
estimate how long something will

472
00:22:48,280 --> 00:22:50,480
take. 
But you can't just say that to a

473
00:22:50,480 --> 00:22:52,200
stakeholder, right? 
You can't just say, I have 

474
00:22:52,200 --> 00:22:54,320
absolutely no idea when I'm 
going to get back to you on 

475
00:22:54,320 --> 00:22:56,400
this. 
It's, unfortunately, not

476
00:22:56,400 --> 00:23:00,760
viable in the real world. 
So what we usually said was, 

477
00:23:01,040 --> 00:23:03,720
again, we didn't call it a 
minimum viable answer at the 

478
00:23:03,720 --> 00:23:05,720
time. 
We just said we'll do some work 

479
00:23:05,920 --> 00:23:09,520
to get towards this particular 
answer, which we've all agreed 

480
00:23:09,520 --> 00:23:11,360
on. 
Looks like a plausible first 

481
00:23:11,360 --> 00:23:13,640
step. 
So in a week's time we'll report

482
00:23:13,640 --> 00:23:15,520
back or in a few days we'll 
report back. 

483
00:23:15,920 --> 00:23:18,840
And then rather than saying when
the work will be finished, we 

484
00:23:18,840 --> 00:23:20,760
would just check in at 
intervals. 

485
00:23:20,880 --> 00:23:23,240
That's one way to do it is to 
just say we're going to tackle 

486
00:23:23,240 --> 00:23:24,960
this problem, we're going to 
work on it for a while. 

487
00:23:25,040 --> 00:23:27,320
This is the end goal we have in 
mind and we'll check in in a few

488
00:23:27,320 --> 00:23:29,560
days and the check in might be 
here it is. 

489
00:23:29,560 --> 00:23:32,600
Here's your minimum viable 
answer or the check in is this 

490
00:23:32,600 --> 00:23:35,200
question that you had actually 
is very difficult to answer with

491
00:23:35,200 --> 00:23:37,800
the data we have. 
Here's the other kind of data 

492
00:23:37,800 --> 00:23:40,520
that we need to collect before 
we can give you an answer. 

493
00:23:40,800 --> 00:23:43,400
Again, it's about collaborating 
and keeping your stakeholders in

494
00:23:43,400 --> 00:23:46,560
the loop, so shorter
iteration cycles basically are 

495
00:23:46,560 --> 00:23:49,680
what you're after. 
So I think in the typical 

496
00:23:49,680 --> 00:23:53,040
software development project or 
maybe product development, 

497
00:23:53,040 --> 00:23:57,280
right, many teams actually don't
start with a good data design so

498
00:23:57,280 --> 00:23:58,720
to speak, right. 
So they come from the 

499
00:23:58,720 --> 00:24:02,080
operational point of view, you 
know, they just do transactions,

500
00:24:02,080 --> 00:24:03,960
store the data and that's it, 
right. 

501
00:24:03,960 --> 00:24:07,600
So there will be tables, mostly 
relational tables probably, and 

502
00:24:07,600 --> 00:24:11,160
then these tables will be given 
to the data analyst to derive 

503
00:24:11,160 --> 00:24:14,440
some insights, right. 
So I think in your book, in all 

504
00:24:14,440 --> 00:24:17,640
the problems that you have in 
each chapter, right, you always 

505
00:24:17,640 --> 00:24:19,360
come up with the data 
dictionary. 

506
00:24:19,600 --> 00:24:21,960
I know probably it's a bit of a
luxury in the real world to 

507
00:24:21,960 --> 00:24:25,080
actually see a good data 
dictionary, but tell us really 

508
00:24:25,080 --> 00:24:28,120
the importance of this data 
dictionary and is it just 

509
00:24:28,320 --> 00:24:32,400
defining, you know, table column
types and column description or 

510
00:24:32,400 --> 00:24:34,760
is there something else beyond 
just that? 

511
00:24:35,520 --> 00:24:38,920
Yeah, in the real world it is 
rare to have a document like 

512
00:24:38,920 --> 00:24:41,240
that. 
I mean, you might have pieces of

513
00:24:41,240 --> 00:24:45,240
it scattered around but not 
collected together in something 

514
00:24:45,240 --> 00:24:46,880
as coherent as a data 
dictionary. 

515
00:24:47,240 --> 00:24:50,040
I mean, the purpose of the data 
dictionary is at a surface 

516
00:24:50,040 --> 00:24:53,120
level, to record all the 
different columns in the data 

517
00:24:53,120 --> 00:24:55,720
and what they mean, what data 
type they should be and then 

518
00:24:55,720 --> 00:24:58,120
what they represent. 
Sometimes it's because the 

519
00:24:58,120 --> 00:25:01,280
column names are abbreviated, so
it just tells you sort of what 

520
00:25:01,400 --> 00:25:04,160
the column actually means or 
what the abbreviation

521
00:25:04,160 --> 00:25:06,960
stands for. 
But deeper than that, what you 

522
00:25:06,960 --> 00:25:09,920
want a data dictionary to tell 
you as well is the process that 

523
00:25:09,920 --> 00:25:13,640
generates those columns. 
So for example, just to give you

524
00:25:13,640 --> 00:25:16,880
an example, we had sales data
from the used car industry. 

525
00:25:16,880 --> 00:25:20,840
Every time there was a sale at 
an auction that was recorded and

526
00:25:20,840 --> 00:25:22,240
we had a couple of different 
columns. 

527
00:25:22,240 --> 00:25:24,320
One of them was called the sold 
date and the other one was 

528
00:25:24,320 --> 00:25:27,840
called the date sold, which just
sounds like we have just this 

529
00:25:27,840 --> 00:25:31,000
redundancy of two columns that 
measure the same thing, but it 

530
00:25:31,000 --> 00:25:32,760
turned out they don't measure 
the same thing. 

531
00:25:33,120 --> 00:25:36,840
What it turned out was one of 
them was the date that the sale 

532
00:25:37,080 --> 00:25:40,120
happened, so when the auction 
happened, but there was another 

533
00:25:40,120 --> 00:25:42,400
date that could be in the future
because sometimes there's a 

534
00:25:42,400 --> 00:25:45,160
dispute around a used car. 
You know, like they buy it and 

535
00:25:45,160 --> 00:25:47,200
they look at it and there's a 
scratch and they didn't see it 

536
00:25:47,200 --> 00:25:48,720
before. 
And so they dispute with the 

537
00:25:48,720 --> 00:25:50,200
vendor. 
Maybe they'll take the 

538
00:25:50,200 --> 00:25:53,480
negotiation offline and agree on
a different price in a couple of

539
00:25:53,480 --> 00:25:56,560
days and then that date gets
stamped separately. 

540
00:25:57,040 --> 00:26:00,800
And so if you have a very high 
level surface level data 

541
00:26:00,800 --> 00:26:03,080
dictionary that just says this 
is the date sold. 

542
00:26:03,120 --> 00:26:05,400
And then for the second one it 
also says something like this is

543
00:26:05,400 --> 00:26:07,040
the sold date, that's not 
useful. 

544
00:26:07,040 --> 00:26:10,280
It needs to give that deeper 
context of why we have these 

545
00:26:10,280 --> 00:26:13,320
columns in the first place and
what are the possibilities of 

546
00:26:14,040 --> 00:26:16,080
the different ways that the 
values could be filled in. 

547
00:26:16,360 --> 00:26:19,520
So ideally the data dictionary 
also talks about the data 

548
00:26:19,520 --> 00:26:22,880
generating process and how the 
sort of the business operations 

549
00:26:22,880 --> 00:26:24,960
translated to this particular 
data set. 

550
00:26:25,640 --> 00:26:28,720
And I think in particular it's 
becoming more important if let's

551
00:26:28,720 --> 00:26:31,680
say you have a multi-stage kind
of process that derives the 

552
00:26:31,680 --> 00:26:34,040
data, right? 
Hence probably these days people

553
00:26:34,200 --> 00:26:35,960
refer to it as data lineage, 
right? 

554
00:26:35,960 --> 00:26:38,640
Where you start from a typical 
business process that generates 

555
00:26:38,640 --> 00:26:40,320
the data, but then it goes 
through different 

556
00:26:40,320 --> 00:26:43,480
transformations, maybe different
systems, different processes 

557
00:26:43,800 --> 00:26:47,320
until it gets to the final sink
or data, the last place where 

558
00:26:47,320 --> 00:26:50,240
the data gets stored, right. 
So I think the data dictionary 

559
00:26:50,240 --> 00:26:53,160
will be much more important 
because you don't just see you 

560
00:26:53,160 --> 00:26:55,240
know the column names and the 
values right? 

561
00:26:55,480 --> 00:26:57,600
Which is sometimes misleading 
just like what you mentioned, 

562
00:26:57,600 --> 00:26:58,680
right? 
I think date. 

563
00:26:58,680 --> 00:27:01,200
So date is not just a common
thing, right? 

564
00:27:01,200 --> 00:27:04,040
But it's also like probably you 
can see it all over different 

565
00:27:04,040 --> 00:27:07,040
data sources because different 
people maybe creating the 

566
00:27:07,040 --> 00:27:09,680
column, different teams creating
a column and also maybe 

567
00:27:09,680 --> 00:27:12,440
different departments using the
same term, but it actually means

568
00:27:12,440 --> 00:27:14,920
different things, right?
Hence, probably the Domain 

569
00:27:14,920 --> 00:27:18,960
Driven Design kind of practice
makes more sense and I think not

570
00:27:18,960 --> 00:27:21,480
just data dictionary. 
In your book you also mentioned 

571
00:27:21,480 --> 00:27:24,920
this thing called data modeling.
So you mentioned this is 

572
00:27:25,040 --> 00:27:26,560
probably the most important step
as well. 

573
00:27:26,560 --> 00:27:29,520
Before you start data analysis, 
tell us a bit more about data 

574
00:27:29,520 --> 00:27:31,880
modeling. 
What is this step and what 

575
00:27:31,880 --> 00:27:34,040
should people do in the data 
modeling exercise? 

576
00:27:34,760 --> 00:27:37,600
So one of the ways I think about
this is that there's a good 

577
00:27:37,600 --> 00:27:40,400
definition of data science. 
It's been around for a while, 

578
00:27:40,400 --> 00:27:43,840
which is the process of turning 
data into information and 

579
00:27:43,840 --> 00:27:47,040
information into insights. 
And it's this first step of 

580
00:27:47,040 --> 00:27:50,720
turning data into information. 
And again the terminology is 

581
00:27:50,720 --> 00:27:53,880
very sort of blurred. 
But when I think of data turning

582
00:27:53,880 --> 00:27:57,520
into information, it means the 
data is whatever you have lying 

583
00:27:57,520 --> 00:28:00,080
around. 
As you said earlier, you alluded

584
00:28:00,080 --> 00:28:04,120
to operational transactions that
happen and have to be stored in a

585
00:28:04,120 --> 00:28:06,280
database. 
They weren't collected with 

586
00:28:06,280 --> 00:28:09,160
analytics in mind necessarily. 
They're just database records 

587
00:28:09,160 --> 00:28:12,000
that power some kind of customer
facing application. 

588
00:28:12,400 --> 00:28:15,200
And as analysts we want to come 
in and analyse that data. 

589
00:28:15,200 --> 00:28:18,000
But in its raw form, it's
almost never usable. 

590
00:28:18,000 --> 00:28:21,400
We need to do something to it. 
And when we do things like what 

591
00:28:21,400 --> 00:28:25,040
we might call data cleaning as 
part of an analysis, what we 

592
00:28:25,040 --> 00:28:27,960
would like to do is to do that 
data cleaning once and have a 

593
00:28:27,960 --> 00:28:30,080
cleaned version of that data 
stored somewhere. 

594
00:28:30,520 --> 00:28:33,320
Part of data modelling is doing 
that data cleaning once, so that

595
00:28:33,320 --> 00:28:36,640
the logic of the cleaning is 
encoded already in the data that

596
00:28:36,640 --> 00:28:38,440
we use. 
I mean, lots of companies will 

597
00:28:38,440 --> 00:28:40,480
have this problem where for 
example you have a bunch of 

598
00:28:40,480 --> 00:28:44,040
Tableau dashboards and in every 
dashboard there's a formula that

599
00:28:44,040 --> 00:28:47,200
calculates some relevant metric 
but that

600
00:28:47,600 --> 00:28:50,600
calculation is duplicated across
every dashboard, so if you ever 

601
00:28:50,600 --> 00:28:52,800
make a change to it,
you need to remember which

602
00:28:52,800 --> 00:28:55,520
dashboard it's also in and it 
becomes the sort of mess that 

603
00:28:55,520 --> 00:28:57,360
you don't necessarily have a
handle on. 

604
00:28:57,800 --> 00:29:00,640
If you have a clean data model 
where that metric is already

605
00:29:00,640 --> 00:29:03,600
precalculated and the dashboards
just read from that clean data 

606
00:29:03,600 --> 00:29:06,200
model, then you've got that 
problem in one place. 

607
00:29:06,280 --> 00:29:09,000
So if you need to change the 
metric, all the dashboards will 

608
00:29:09,000 --> 00:29:11,880
update and that's the very sort 
of simplistic way to look at it.

609
00:29:12,240 --> 00:29:15,480
And the other reason to do data 
modelling is to sort of capture

610
00:29:15,880 --> 00:29:17,880
business entities in the right
way. 

611
00:29:18,360 --> 00:29:23,440
So one of the problems we had to
work with was what is a customer

612
00:29:23,480 --> 00:29:26,600
in our business and that sounds 
like a very simple question like

613
00:29:26,600 --> 00:29:29,600
obviously a business knows what
their customer is, but we had 

614
00:29:29,600 --> 00:29:32,400
different business areas that 
worked with individuals. 

615
00:29:32,480 --> 00:29:34,320
So individual used car
dealers.

616
00:29:34,680 --> 00:29:37,240
So they were people. 
But then we also had customers 

617
00:29:37,240 --> 00:29:39,480
who were entities. 
And they might be again, they 

618
00:29:39,480 --> 00:29:42,120
might be a single dealership, or
they might be like a parent

619
00:29:42,120 --> 00:29:44,280
group.
And so of those different 

620
00:29:44,280 --> 00:29:46,080
entities, which one is a 
customer? 

621
00:29:46,560 --> 00:29:49,480
Well, that depends on who you 
ask, and it depends on the 

622
00:29:49,480 --> 00:29:52,920
purpose that you want to use the
data for. And so what you need

623
00:29:52,920 --> 00:29:55,680
to do then is have some sort of
customer data model where

624
00:29:55,680 --> 00:29:58,640
everybody agrees on the 
definition of a customer and 

625
00:29:58,640 --> 00:30:01,640
then if somebody needs to know 
how many customers do we have, 

626
00:30:01,640 --> 00:30:04,960
ideally all they have to do is 
just COUNT(*) from that table.

627
00:30:05,360 --> 00:30:07,880
That's the sort of dream 
scenario where you've done all 

628
00:30:07,880 --> 00:30:11,040
the business logic and all the 
work up front to have a clean 

629
00:30:11,040 --> 00:30:13,600
data model that can then be 
analyzed much more easily. 

630
00:30:13,880 --> 00:30:16,840
And this again, is something 
that we don't really talk about 

631
00:30:16,840 --> 00:30:19,120
in foundational training. 
We say, yes, you need to clean 

632
00:30:19,120 --> 00:30:21,840
your data, but we don't say once
you've cleaned your data, you 

633
00:30:21,840 --> 00:30:24,480
should probably have a clean 
version of it somewhere and stop

634
00:30:24,480 --> 00:30:27,280
cleaning your data multiple 
times, stop repeating yourself. 

635
00:30:27,680 --> 00:30:30,600
And so that's why I dedicated a 
whole project in the book to 

636
00:30:30,600 --> 00:30:34,360
data modelling to sort of 
practice this idea of taking raw

637
00:30:34,360 --> 00:30:37,400
data and turning it into a 
specific structure which is 

638
00:30:37,400 --> 00:30:40,560
tailored, again specifically 
tailored to the questions you're

639
00:30:40,560 --> 00:30:44,280
going to ask in the business. 
Specifically about data 

640
00:30:44,280 --> 00:30:46,680
cleaning, actually this is 
probably like what you said, the

641
00:30:46,680 --> 00:30:49,440
meme, right? 
80% of your effort probably is 

642
00:30:49,560 --> 00:30:53,720
spent first understanding how 
dirty your data is and then like

643
00:30:53,720 --> 00:30:55,440
because there are many 
variations, right? 

644
00:30:55,440 --> 00:30:57,960
Sometimes it could be the user 
input that is probably not 

645
00:30:57,960 --> 00:30:59,720
clean. 
Second thing is there's no 

646
00:30:59,720 --> 00:31:01,880
validation in the software that 
captures it. 

647
00:31:02,400 --> 00:31:04,400
And the third thing, for 
whatever reason, right, people 

648
00:31:04,400 --> 00:31:06,960
put different formats, you know,
like from different systems, 

649
00:31:06,960 --> 00:31:10,080
probably they don't have a 
uniform format, so they just use

650
00:31:10,080 --> 00:31:11,760
whatever format that makes sense
for them. 

651
00:31:12,160 --> 00:31:14,720
So data cleaning probably is 
something that is really, really

652
00:31:14,720 --> 00:31:16,240
hard. 
And first of all, right, if you 

653
00:31:16,240 --> 00:31:19,400
have millions of records, for 
example, you probably won't 

654
00:31:19,400 --> 00:31:22,800
understand how clean it is because you
might look at the first, I don't

655
00:31:22,800 --> 00:31:26,000
know, 100 rows and you just 
deduce, OK, this is typically 

656
00:31:26,000 --> 00:31:28,960
the data, but actually there are
many other columns or many other

657
00:31:28,960 --> 00:31:31,880
data that you don't see, right? 
So maybe a little bit of tips, 

658
00:31:32,240 --> 00:31:36,040
how can we actually do this data
cleaning much, much more efficiently

659
00:31:36,320 --> 00:31:40,400
so that we don't fall into the 
gotcha where actually you clean 

660
00:31:40,760 --> 00:31:44,160
maybe 50% of the data, but the 
other 50% is something else, you

661
00:31:44,160 --> 00:31:46,120
know, like a different rubbish 
altogether. 

662
00:31:46,320 --> 00:31:48,800
So maybe from your practical 
world example. 

663
00:31:49,600 --> 00:31:52,520
Yeah, I think what's funny is we
used to have this term, big 

664
00:31:52,520 --> 00:31:56,080
data, right, to describe data 
that cannot be processed on your

665
00:31:56,080 --> 00:31:58,160
laptop. 
And you don't see that term 

666
00:31:58,160 --> 00:32:00,960
around very much. 
And that's because processing 

667
00:32:00,960 --> 00:32:04,160
power, even on individual 
laptops, has grown so much. 

668
00:32:04,160 --> 00:32:08,240
And even getting access to a 
remote cluster that has a lot 

669
00:32:08,240 --> 00:32:11,200
more resources than your laptop
is pretty easy these days.

670
00:32:11,600 --> 00:32:15,080
So I don't think we often have 
that problem where you can only 

671
00:32:15,080 --> 00:32:18,800
clean half of your data and the 
other half you can only revisit 

672
00:32:19,040 --> 00:32:21,280
when you run some code later or 
something. 

673
00:32:21,760 --> 00:32:24,440
I mean, some companies obviously
have data that's so huge that 

674
00:32:24,440 --> 00:32:27,560
they need special methods, but I
think in most cases that's not the case

675
00:32:27,560 --> 00:32:29,120
anymore. 
But when it comes to data 

676
00:32:29,120 --> 00:32:33,240
cleaning, I mean one thing I 
would tell students is don't try

677
00:32:33,240 --> 00:32:36,080
to clean the whole thing at once
before you do anything with it 

678
00:32:36,600 --> 00:32:38,640
because you're going to find 
some issues down the line 

679
00:32:38,640 --> 00:32:40,280
anyway. 
So again, it's just this 

680
00:32:40,320 --> 00:32:43,760
pragmatism of figure out what 
part of the data you need right 

681
00:32:43,760 --> 00:32:45,960
now and have a look at it. 
And there are some checks you 

682
00:32:45,960 --> 00:32:48,520
can do, right? 
There are some surface level 

683
00:32:48,520 --> 00:32:51,640
checks you can check, like what 
are the unique values in this 

684
00:32:51,640 --> 00:32:54,440
column, are there missing 
values, are there outliers? 

685
00:32:54,440 --> 00:32:59,280
You can do those things, but 
some of the more complex 

686
00:32:59,640 --> 00:33:03,600
problems or patterns in the data
that will either invalidate your

687
00:33:03,600 --> 00:33:06,160
analysis or require you to
redo some of your work,

688
00:33:06,160 --> 00:33:09,440
you won't notice them until you
start working, and so again, 

689
00:33:09,440 --> 00:33:11,720
don't be wedded to this idea 
that you have to make a 

690
00:33:11,720 --> 00:33:13,600
perfectly clean data set before 
you start

691
00:33:13,600 --> 00:33:15,640
working.
Do again the minimum that you 

692
00:33:15,640 --> 00:33:19,400
need and just start doing the 
analysis with the knowledge that

693
00:33:19,400 --> 00:33:22,080
you're probably going to have to
go back to step one again and 

694
00:33:22,080 --> 00:33:24,160
again. 
So if you would ever watch me 

695
00:33:24,160 --> 00:33:27,440
doing an analysis, it's never 
like I have the analysis in my 

696
00:33:27,440 --> 00:33:28,920
head and I just have to type it 
out. 

697
00:33:29,200 --> 00:33:33,560
It's an active process where you
do some stuff and you go, oh no,

698
00:33:33,560 --> 00:33:36,000
this doesn't make sense at all 
because I found something in the

699
00:33:36,000 --> 00:33:37,360
column. 
I have to go right the way back 

700
00:33:37,360 --> 00:33:40,440
to the top and start again. 
And so you might have to rewrite

701
00:33:40,440 --> 00:33:42,120
some of your code. 
You might just have to rewrite a

702
00:33:42,120 --> 00:33:45,480
bit at the top and keep going. 
And there are cases in the book 

703
00:33:45,480 --> 00:33:48,560
where you know, I have example 
solutions for each project and I

704
00:33:48,560 --> 00:33:52,320
go down the particular
rabbit hole that I followed to 

705
00:33:52,320 --> 00:33:55,680
get my particular answer. 
And you'll see things in there 

706
00:33:55,680 --> 00:33:58,600
where I say, oh, it turns out 
this is the case. 

707
00:33:59,000 --> 00:34:01,360
There's an e-commerce data set 
in there and then we have some 

708
00:34:01,360 --> 00:34:03,000
products that are 
miscategorized. 

709
00:34:03,360 --> 00:34:06,680
And in the way that I wrote the 
example solution, it's trying to

710
00:34:06,680 --> 00:34:09,880
be as realistic as possible. 
So I haven't done the analysis 

711
00:34:09,880 --> 00:34:13,280
and then written it up cleanly. 
I sort of write it up as the 

712
00:34:13,280 --> 00:34:14,840
real process. 
And so halfway through, you're 

713
00:34:14,840 --> 00:34:16,679
like, oh, these labels don't 
make sense. 

714
00:34:16,679 --> 00:34:19,520
We need to go back and fix this 
quality issue in the data 

715
00:34:19,520 --> 00:34:22,679
before we can carry on. 
That's the realistic way to 

716
00:34:22,679 --> 00:34:25,480
think about it is you will find 
data issues throughout the 

717
00:34:25,480 --> 00:34:28,760
process, so don't worry about 
getting it perfect the first 

718
00:34:28,760 --> 00:34:30,760
time. 
Yeah, I think it's worth to 

719
00:34:30,760 --> 00:34:34,320
emphasize, right, probably 100% 
accuracy, sometimes it's not 

720
00:34:34,320 --> 00:34:36,679
possible, right, especially if 
you're dealing with really large

721
00:34:36,679 --> 00:34:39,320
data, right. 
So maybe some kind of percentage

722
00:34:39,320 --> 00:34:41,880
where you would accept, OK, 
maybe these are the normal 

723
00:34:41,880 --> 00:34:45,159
kind of data, maybe state that
assumption or maybe state that 

724
00:34:45,320 --> 00:34:46,880
signal that you can see from the
data. 

725
00:34:47,000 --> 00:34:49,440
And I think there are plenty of 
useful tools these days that can

726
00:34:49,440 --> 00:34:51,840
actually give you a sense of 
like a distribution, for 

727
00:34:51,840 --> 00:34:54,800
example, in a column, how 
different or how is the variance

728
00:34:54,800 --> 00:34:57,760
of the data inside, right. 
It can give you a statistical 

729
00:34:57,760 --> 00:35:00,520
distribution or it can give you 
some kind of patterns that you 

730
00:35:00,520 --> 00:35:03,200
can probably deduce what kind of
data is inside. 

731
00:35:03,480 --> 00:35:06,960
So use those kinds of tools. 
I think speaking about big data,

732
00:35:07,120 --> 00:35:10,840
I think these days people want 
to build like a data lake in a 

733
00:35:10,840 --> 00:35:14,320
company where you put everything
together into a data lake. 

734
00:35:14,760 --> 00:35:17,680
Maybe in your practical 
experience, is there some kind 

735
00:35:17,680 --> 00:35:20,320
of different challenges that 
people have to deal with dealing

736
00:35:20,320 --> 00:35:23,560
with a data lake or maybe big data
in general as well? 

737
00:35:24,440 --> 00:35:27,280
Yeah. 
I think it's very tempting to 

738
00:35:27,280 --> 00:35:30,880
take absolutely any data that 
you have lying around and 

739
00:35:30,880 --> 00:35:34,440
dump it somewhere and say,
oh, we'll come back to it when 

740
00:35:34,440 --> 00:35:37,440
we need it. 
And so the technology is there 

741
00:35:37,440 --> 00:35:39,400
to allow people to do that quite
easily. 

742
00:35:39,760 --> 00:35:43,600
And the problem with that is 
that there's no thought given 

743
00:35:43,600 --> 00:35:46,320
again to the end product of like
what are we going to use this 

744
00:35:46,320 --> 00:35:49,000
data for? 
And dumping stuff into a data 

745
00:35:49,000 --> 00:35:51,840
lake because at some point in 
the future we might need it, is 

746
00:35:51,920 --> 00:35:53,440
not necessarily the best 
approach. 

747
00:35:53,720 --> 00:35:55,360
It can create a lot of problems 
down the line. 

748
00:35:55,800 --> 00:35:58,560
And if you think about the 
phrase data science, half of it 

749
00:35:58,560 --> 00:36:01,680
is the word science. 
So that's not how science works,

750
00:36:01,680 --> 00:36:02,800
right? 
In science, 

751
00:36:02,800 --> 00:36:06,240
when you need to collect data, 
you have a hypothesis, you set 

752
00:36:06,240 --> 00:36:09,480
up an experiment, you actually 
have a sort of theoretical 

753
00:36:09,480 --> 00:36:12,840
framework to build around before
you even think about the data 

754
00:36:12,840 --> 00:36:14,400
part. 
And I think some of that could 

755
00:36:14,400 --> 00:36:15,840
be applied 
to sort of 

756
00:36:15,960 --> 00:36:19,400
the business world, where we 
don't dump stuff into a data 

757
00:36:19,400 --> 00:36:22,560
lake for the sake of it, we 
think a bit more about, you 

758
00:36:22,560 --> 00:36:24,480
know, what is the problem we're 
actually trying to solve. 

759
00:36:24,720 --> 00:36:26,640
Therefore, what is the data that
we need? 

760
00:36:26,760 --> 00:36:29,800
Therefore, where should we store
what information? 

761
00:36:30,640 --> 00:36:32,560
Thanks for the tips. 
I think, yeah, because of these 

762
00:36:32,560 --> 00:36:35,480
cloud technologies and 
potentially storage cost is 

763
00:36:35,480 --> 00:36:36,920
cheap, right. 
So they will just dump 

764
00:36:36,920 --> 00:36:39,800
everything and maybe think about
it later how we can use the 

765
00:36:39,800 --> 00:36:41,600
data. 
But I think sometimes it's not 

766
00:36:41,600 --> 00:36:44,160
wise, simply because, yeah, the 
amount of data is just large, 

767
00:36:44,160 --> 00:36:45,720
right. 
And how to deal with it, what 

768
00:36:45,720 --> 00:36:48,520
kind of insights probably is 
just difficult if you start with

769
00:36:48,520 --> 00:36:51,560
that big amount of data. 
And I think these kind of 

770
00:36:51,560 --> 00:36:54,320
challenges are quite typical in 
a day-to-day world. 

771
00:36:54,360 --> 00:36:57,800
But maybe from your experience, 
what are the typical business 

772
00:36:57,800 --> 00:37:01,640
problems that a data analyst 
should know about, should equip 

773
00:37:01,640 --> 00:37:04,240
themselves with? 
Maybe in your book you mentioned

774
00:37:04,240 --> 00:37:07,040
things like categorization, 
dealing with time series, or 

775
00:37:07,040 --> 00:37:10,480
maybe what are some of the 
favorite typical problems that 

776
00:37:10,480 --> 00:37:14,800
people should be aware of? 
Yeah, I picked the projects in 

777
00:37:14,800 --> 00:37:18,520
the book specifically to address
topics that I thought were 

778
00:37:18,520 --> 00:37:22,760
missing from foundational data 
training but that actually come 

779
00:37:22,760 --> 00:37:26,280
up a lot in the business world. 
You mentioned time series 

780
00:37:26,280 --> 00:37:28,200
forecasting. 
That's one of the things I talk 

781
00:37:28,200 --> 00:37:30,920
about a lot with students is 
that, you know, we usually have 

782
00:37:30,920 --> 00:37:33,920
like maybe one session on time 
series forecasting and we'll 

783
00:37:33,920 --> 00:37:37,240
teach them a little bit about 
how to reshape time data, how to

784
00:37:37,240 --> 00:37:40,320
think about time data 
differently from tabular data 

785
00:37:40,640 --> 00:37:42,400
and how some of the methods are 
different. 

786
00:37:42,600 --> 00:37:45,640
We don't spend a lot of time 
talking about like econometrics 

787
00:37:45,640 --> 00:37:47,360
or anything, which is where 
there's a lot of 

788
00:37:47,520 --> 00:37:49,120
time series forecasting 
problems. 

789
00:37:49,400 --> 00:37:53,280
But I think the opportunity to 
forecast things in the real 

790
00:37:53,280 --> 00:37:56,280
world is actually that there's a
lot of those opportunities and 

791
00:37:56,280 --> 00:37:58,880
it's actually much bigger than 
we let on in basic training. 

792
00:37:58,880 --> 00:38:02,280
So that's why I have a project 
dedicated to time series data. 

793
00:38:02,680 --> 00:38:05,400
And then there's also this other
idea of working with categorical

794
00:38:05,400 --> 00:38:07,640
data. 
Now that's something that we 

795
00:38:07,640 --> 00:38:10,600
mentioned as an aside in 
foundational training, we'll 

796
00:38:10,600 --> 00:38:12,560
say, yes, sometimes your data is
categorical. 

797
00:38:12,560 --> 00:38:15,480
And here are a couple of methods
that you can use to transform 

798
00:38:15,480 --> 00:38:19,960
that data into something else. 
But if you work with operational

799
00:38:19,960 --> 00:38:23,320
data like people filling in 
forms and entering things in 

800
00:38:23,360 --> 00:38:26,840
records into a system, anytime 
there's a drop down, you've got 

801
00:38:26,840 --> 00:38:29,160
categorical data. 
And so it's actually a lot 

802
00:38:29,160 --> 00:38:32,040
more prevalent than we let on. 
We spend a lot of time talking 

803
00:38:32,040 --> 00:38:35,440
about correlation. 
We spend a lot of time looking 

804
00:38:35,440 --> 00:38:39,200
at distributions and things for 
continuous data, but we don't 

805
00:38:39,200 --> 00:38:41,880
talk about methods for 
categorical data enough. 

806
00:38:42,320 --> 00:38:45,120
And one problem with that is 
then you're not equipped to deal

807
00:38:45,120 --> 00:38:47,080
with all these columns that 
you'll actually see in the real 

808
00:38:47,080 --> 00:38:48,800
world. 
But the other problem is that 

809
00:38:49,040 --> 00:38:52,920
people accidentally shoehorn 
continuous methods into 

810
00:38:53,400 --> 00:38:56,640
categorical data. 
So I even break down an example 

811
00:38:56,800 --> 00:39:00,800
in that chapter in the book. 
There's a famous heart disease 

812
00:39:00,800 --> 00:39:03,360
data set where you're trying to 
predict whether someone has 

813
00:39:03,360 --> 00:39:06,160
heart disease based on various 
different measurements. 

814
00:39:06,640 --> 00:39:09,000
And like the entire data set is 
numeric. 

815
00:39:09,400 --> 00:39:11,880
So it looks like, oh great, we 
have all this continuous data, 

816
00:39:11,880 --> 00:39:13,520
we can just throw correlation at
it. 

817
00:39:13,520 --> 00:39:16,160
We can throw all these 
continuous methods at it. 

818
00:39:16,520 --> 00:39:19,280
But if you actually read the 
data dictionary, going back 

819
00:39:19,280 --> 00:39:21,840
to what we said before, you 
actually see that most of those 

820
00:39:21,840 --> 00:39:24,920
values are categories. 
And they're like one of them is 

821
00:39:24,920 --> 00:39:28,440
something like the slope of, I 
guess, like a table or 

822
00:39:28,440 --> 00:39:30,360
treadmill or something that was 
during the test. 

823
00:39:30,360 --> 00:39:32,960
And it's not a measurement, it's
not an angle of the slope, it's 

824
00:39:32,960 --> 00:39:36,680
just one of some values. 
So they're not on a continuous 

825
00:39:36,680 --> 00:39:38,800
scale. 
So if you start applying methods

826
00:39:38,800 --> 00:39:41,360
that are meant for continuous 
data on that column, you're 

827
00:39:41,360 --> 00:39:43,880
going to make incorrect 
inferences from it. 

828
00:39:44,320 --> 00:39:47,720
And the really difficult thing 
about this is that you don't get

829
00:39:47,720 --> 00:39:50,880
an error message if you do that.
The data analysis tools are not 

830
00:39:50,880 --> 00:39:53,600
going to tell you, 
are you sure your methodology is

831
00:39:53,600 --> 00:39:55,400
correct? 
No, because you've just said I 

832
00:39:55,400 --> 00:39:58,120
want an average of this column, 
but it doesn't make sense. 

833
00:39:58,480 --> 00:40:01,000
It doesn't make sense in that 
context to average that column. 

834
00:40:01,320 --> 00:40:05,120
And so the difficulty here is to
remember that you need to think 

835
00:40:05,120 --> 00:40:07,920
through your methodology harder,
because the computer is not 

836
00:40:07,920 --> 00:40:09,960
going to tell you otherwise. 
Yeah. 

837
00:40:09,960 --> 00:40:12,040
Hence, I think the data 
dictionary again that you 

838
00:40:12,040 --> 00:40:13,600
mentioned is very, very 
important, right. 

839
00:40:13,680 --> 00:40:17,000
Understand where the data gets 
generated, right, which business

840
00:40:17,000 --> 00:40:21,080
process, which system, what kind
of inputs that can be possible, 

841
00:40:21,240 --> 00:40:23,880
not just looking at the data and
create your own assumption. 

842
00:40:23,880 --> 00:40:25,360
So I think that's pretty 
dangerous. 

843
00:40:25,760 --> 00:40:28,080
And I think when you mentioned 
about prediction, there are a 

844
00:40:28,080 --> 00:40:30,720
lot of problems that a data 
analyst has to come up with, 

845
00:40:30,720 --> 00:40:34,560
which is to actually derive 
predictions or maybe models to 

846
00:40:34,560 --> 00:40:36,640
actually predict a result, 
right. 

847
00:40:37,000 --> 00:40:40,200
And this is typically an unknown 
problem where you don't actually

848
00:40:40,200 --> 00:40:42,560
know the accuracy of what you 
come up with. 

849
00:40:43,000 --> 00:40:46,240
So how do you deal with that 
kind of ambiguity, first of all?

850
00:40:46,440 --> 00:40:49,560
And the second thing is how you
can come up with a much better 

851
00:40:49,560 --> 00:40:51,560
prediction. 
So maybe something in the 

852
00:40:51,560 --> 00:40:55,040
typical real world, do you do 
much more rapid iteration and 

853
00:40:55,040 --> 00:40:58,440
test it in the production before
you actually come back and 

854
00:40:58,520 --> 00:41:00,920
derive a second derivation of 
what you did? 

855
00:41:01,200 --> 00:41:05,120
So maybe some tips here as well.
Yeah, that's a great question 

856
00:41:05,120 --> 00:41:07,720
because prediction is obviously 
something everybody says they 

857
00:41:07,720 --> 00:41:09,840
want. 
The question is, you know, what 

858
00:41:09,840 --> 00:41:13,320
is the output of that work? 
That's something we found out 

859
00:41:13,320 --> 00:41:15,680
the hard way on a project. 
You know, I built a 

860
00:41:15,680 --> 00:41:19,240
predictive model for something 
that was from a technical point 

861
00:41:19,240 --> 00:41:20,960
of view, it was accurate enough 
to use. 

862
00:41:20,960 --> 00:41:23,840
And when we tried to put it into
production, we found various 

863
00:41:23,840 --> 00:41:27,920
organizational barriers to it, 
like the data that the 

864
00:41:27,920 --> 00:41:31,360
predictive model requires 
doesn't arrive in time, so we 

865
00:41:31,360 --> 00:41:32,480
can only make the prediction 
when 

866
00:41:32,480 --> 00:41:35,080
it's too late. 
And then the clients that we 

867
00:41:35,080 --> 00:41:38,480
would use this with didn't 
actually have the levers in 

868
00:41:38,480 --> 00:41:40,840
their business to change 
anything based on our 

869
00:41:40,840 --> 00:41:43,440
predictions. 
So it was a sort of twofold 

870
00:41:43,520 --> 00:41:45,840
failure from an organizational 
point of view. 

871
00:41:45,840 --> 00:41:48,680
And from then on, we were much 
more strict about, again, 

872
00:41:48,680 --> 00:41:51,840
starting with the end of like 
why do you want us to make these

873
00:41:51,840 --> 00:41:54,600
predictions. 
And I think as a data person, 

874
00:41:54,840 --> 00:41:57,920
that's a question you should ask
immediately when somebody says, 

875
00:41:58,080 --> 00:42:00,720
I want you to build a predictive
model or we should be able or we

876
00:42:00,720 --> 00:42:03,680
should be predicting this thing 
is, OK, 

877
00:42:03,840 --> 00:42:05,680
but what are you going to do 
with the predictions? 

878
00:42:05,960 --> 00:42:08,400
What is going to change in the 
business? 

879
00:42:08,400 --> 00:42:11,080
How are you going to 
respond to these predictions? 

880
00:42:11,280 --> 00:42:14,160
And so it's nice to have that 
conversation up front because 

881
00:42:14,160 --> 00:42:17,320
then you know, your stakeholders
are forced to think about, OK, if we

882
00:42:17,320 --> 00:42:20,000
had this predictive model, what 
would we actually do with it. 

883
00:42:20,400 --> 00:42:23,120
So again, it's not a technical 
challenge because I think the 

884
00:42:23,120 --> 00:42:25,960
technical challenge of 
prediction is pretty well 

885
00:42:25,960 --> 00:42:28,600
catered for. 
There's lots of libraries to do 

886
00:42:28,840 --> 00:42:31,880
machine learning. 
There's lots of tips and tricks 

887
00:42:31,880 --> 00:42:34,920
out there, but the 
organizational side of it is 

888
00:42:34,920 --> 00:42:37,400
really where these projects are 
won or lost. 

889
00:42:37,600 --> 00:42:39,920
So my biggest advice would be 
again to have that human 

890
00:42:39,920 --> 00:42:42,520
conversation of what are you 
actually going to do with your 

891
00:42:42,640 --> 00:42:46,200
predictions first and foremost? 
Yeah, you mentioned something 

892
00:42:46,200 --> 00:42:48,800
very interesting, right. 
So organizational challenge. 

893
00:42:48,800 --> 00:42:52,240
So not necessarily all the time 
is a technical problem or data 

894
00:42:52,240 --> 00:42:54,600
problem, but actually 
organizational challenge. 

895
00:42:55,000 --> 00:42:57,920
And I think what you mentioned 
is also very, very insightful in my 

896
00:42:57,920 --> 00:42:59,920
opinion, right? 
Don't just build any predictive 

897
00:42:59,920 --> 00:43:03,600
model as if like you just want 
to learn different algorithms 

898
00:43:03,600 --> 00:43:07,080
and tools, right, and use 
whatever fancy techniques. 

899
00:43:07,360 --> 00:43:10,000
But actually thinking about how 
is the model going to be used in

900
00:43:10,000 --> 00:43:12,920
the real life scenario, what 
kind of value can be derived 

901
00:43:12,920 --> 00:43:14,920
from there? 
Is it even possible to be used 

902
00:43:14,960 --> 00:43:17,680
by the business? 
And speaking about predictive 

903
00:43:17,840 --> 00:43:20,280
models, machine learning, I mean 
the topic of AI. 

904
00:43:20,280 --> 00:43:24,640
These days there are so many 
discussions about using AI to do

905
00:43:24,640 --> 00:43:28,840
some kind of mundane analysis. 
Can AI be used also for data 

906
00:43:28,840 --> 00:43:30,640
analysis? 
What have you seen in the 

907
00:43:30,640 --> 00:43:35,120
industry typically? How is AI 
going to change the landscape of

908
00:43:35,120 --> 00:43:37,760
data analysis? 
That's a very interesting 

909
00:43:37,760 --> 00:43:39,880
question. 
I played around with the data 

910
00:43:39,880 --> 00:43:42,960
analysis capabilities of these 
various tools. 

911
00:43:43,280 --> 00:43:46,000
Some of them are more 
sophisticated at this point in 

912
00:43:46,000 --> 00:43:48,000
time. 
For anyone listening in a few months'

913
00:43:48,000 --> 00:43:50,600
time, it's going to change anyway, 
so there's no point naming tools

914
00:43:50,600 --> 00:43:52,680
specifically. 
But, you know, some tools are 

915
00:43:52,680 --> 00:43:55,600
more advanced in data analysis 
than others. 

916
00:43:55,960 --> 00:43:59,720
And on the one hand, it's great 
to democratize the ability to 

917
00:43:59,720 --> 00:44:03,200
say, Here's a somewhat messy 
spreadsheet, give me some 

918
00:44:03,200 --> 00:44:07,040
information about it, give me 
some insights, give me the 

919
00:44:07,120 --> 00:44:11,520
biggest drivers to success or to
a sale or something to, you 

920
00:44:11,520 --> 00:44:14,120
know, what drives property 
prices based on the spreadsheet 

921
00:44:14,120 --> 00:44:16,320
of retail transactions, that 
kind of thing. 

922
00:44:16,600 --> 00:44:19,480
On the one hand, that's great 
because you don't have to put 

923
00:44:19,480 --> 00:44:22,320
people through technical 
training to get there. 

924
00:44:22,720 --> 00:44:27,280
But I think what it does create 
is the necessity that everybody 

925
00:44:27,280 --> 00:44:31,320
understands how data analysis is
done from a more sort of 

926
00:44:31,320 --> 00:44:35,000
theoretical point of view, to 
understand what is possible, 

927
00:44:35,120 --> 00:44:38,680
what are the limitations, what 
are the biases to look out for? 

928
00:44:38,880 --> 00:44:42,720
What are the biases, societal 
biases that will be baked into 

929
00:44:42,720 --> 00:44:45,080
the data. 
And this is true for any output 

930
00:44:45,440 --> 00:44:48,880
these AI tools generate, but 
also for any analysis that comes

931
00:44:48,880 --> 00:44:52,880
out, and also any analysis you 
do, regardless of AI or not, as 

932
00:44:52,880 --> 00:44:54,680
you know, there's going to be 
these biases in there. 

933
00:44:55,040 --> 00:44:59,560
One very, very basic example 
that I showed on a course was 

934
00:44:59,720 --> 00:45:02,840
uploading survey results, right.
So imagine you've done some kind

935
00:45:02,840 --> 00:45:07,520
of online survey and you've got 
a CSV or a spreadsheet of some 

936
00:45:07,520 --> 00:45:10,520
sort of the responses that you 
can download from this tool. 

937
00:45:10,800 --> 00:45:13,840
And so yes, you can upload it to
one of these AI tools and say, 

938
00:45:14,000 --> 00:45:15,320
you know, give me some 
information. 

939
00:45:15,800 --> 00:45:19,440
And one of the things I demoed 
was telling the AI to give me 

940
00:45:19,440 --> 00:45:23,040
the average response time. 
So what is the average time that

941
00:45:23,040 --> 00:45:24,720
people took to fill in this 
survey? 

942
00:45:25,080 --> 00:45:27,840
And so what the AI tool did is 
it identified that there is a 

943
00:45:28,080 --> 00:45:31,160
start time and end time column. 
It understood that those are 

944
00:45:31,160 --> 00:45:33,400
supposed to be dates. 
It understood that they should 

945
00:45:33,400 --> 00:45:36,040
be differenced so you can tell 
what the difference is and then 

946
00:45:36,040 --> 00:45:37,480
that difference should be 
averaged. 

947
00:45:38,000 --> 00:45:40,760
And so the answer we got from 
this, it was a realistic, it was

948
00:45:40,760 --> 00:45:43,280
a real survey data set. 
And the answer, it was like, oh,

949
00:45:43,280 --> 00:45:46,760
the average response time was 
something like 49 minutes from 

950
00:45:46,760 --> 00:45:49,200
this survey data. 
And you know, I said to the 

951
00:45:49,200 --> 00:45:52,920
participants, you should never 
believe what comes out from the 

952
00:45:52,920 --> 00:45:56,240
data analysis because you should
think about, does that answer 

953
00:45:56,240 --> 00:45:58,800
make sense in context? 
We were all very sceptical. 

954
00:45:58,800 --> 00:46:01,080
It shouldn't be 49 minutes. 
It was a very short survey. 

955
00:46:01,360 --> 00:46:03,480
And you go into the data and 
sure enough, there's one 

956
00:46:03,480 --> 00:46:05,840
outlier, somebody who left the 
computer on all day. 

957
00:46:05,840 --> 00:46:09,040
And so their particular response
time was 8 hours. Everybody 

958
00:46:09,040 --> 00:46:12,200
else's was like 5 to 10 minutes.
But when you said to the 

959
00:46:12,200 --> 00:46:14,920
computer, you know, I want the 
average, it just took the mean 

960
00:46:15,080 --> 00:46:17,040
of that column. 
And so that was heavily skewed 

961
00:46:17,040 --> 00:46:19,640
by that one outlier. 
And obviously as an analyst you 

962
00:46:19,640 --> 00:46:21,720
should have maybe taken the 
median, something that's more 

963
00:46:21,720 --> 00:46:24,040
robust to outliers and got a 
more realistic 

964
00:46:24,280 --> 00:46:26,320
value. 
On the one hand, it's good that 

965
00:46:26,320 --> 00:46:28,040
there is transparency in 
these tools. 

966
00:46:28,040 --> 00:46:30,800
It actually gives you the 
code it ran to get to that 

967
00:46:31,240 --> 00:46:33,760
analytical answer. 
So you can double check and you 

968
00:46:33,760 --> 00:46:36,600
can check its homework. 
So in that sense it's less of a 

969
00:46:36,600 --> 00:46:39,600
black box. 
But if you don't have the 

970
00:46:39,600 --> 00:46:43,040
required data literacy or 
statistical training, even that 

971
00:46:43,040 --> 00:46:45,600
little bit of statistical 
training to understand what does

972
00:46:45,600 --> 00:46:47,760
an outlier mean? 
What is the difference between a

973
00:46:47,760 --> 00:46:50,080
mean and a median? 
You wouldn't necessarily see 

974
00:46:50,080 --> 00:46:53,680
what was wrong, and you might be
trapped into just believing the 

975
00:46:53,680 --> 00:46:56,560
answer that comes out. 
So just like with any response 

976
00:46:56,560 --> 00:47:00,840
from an AI tool, people just
need the healthy skepticism of 

977
00:47:01,160 --> 00:47:04,640
just reviewing the answer and 
checking whether it makes sense.

978
00:47:05,440 --> 00:47:07,520
Very interesting example that 
you gave, right? 

979
00:47:07,520 --> 00:47:11,480
So I typically use AI these days
to generate code, you know, like

980
00:47:11,480 --> 00:47:13,880
a coding assistant. 
I think it's a much more well 

981
00:47:13,880 --> 00:47:15,760
defined problem, right? 
You can actually test it 

982
00:47:15,760 --> 00:47:17,880
straight away. 
Given an input you can see the 

983
00:47:17,880 --> 00:47:19,840
output. 
But dealing with data, I think 

984
00:47:19,840 --> 00:47:22,360
it's a different kind of a 
problem because maybe you don't 

985
00:47:22,360 --> 00:47:25,520
even know the answer, right? 
So if you just believe what AI 

986
00:47:25,520 --> 00:47:27,320
is giving you, I think that's a 
danger. 

987
00:47:27,320 --> 00:47:29,560
That's the first thing. 
Second thing, I think all these 

988
00:47:29,560 --> 00:47:33,560
LLM tools are not deterministic. 
So maybe you ask the first time 

989
00:47:33,560 --> 00:47:35,640
it gives you this answer, maybe 
second time is a different 

990
00:47:35,640 --> 00:47:37,680
thing. 
How can you actually work with 

991
00:47:37,680 --> 00:47:40,720
that kind of request response 
model, which is probably 

992
00:47:40,720 --> 00:47:43,400
different every time you ask? 
And the third thing is the 

993
00:47:43,400 --> 00:47:47,480
assumption that is baked into
how AI actually processes the 

994
00:47:47,480 --> 00:47:49,600
data, right? 
Which is why the critical 

995
00:47:49,600 --> 00:47:51,400
thinking aspect is very, very 
important. 

996
00:47:51,800 --> 00:47:54,920
Do you think that data analysts 
should feel that their job is 

997
00:47:54,920 --> 00:47:57,360
safer because of this? 
Or how should they equip 

998
00:47:57,360 --> 00:48:01,200
themselves with the AI so that 
they can have an AI assistant that

999
00:48:01,200 --> 00:48:04,120
can power their data analysis 
process much better? 

1000
00:48:04,920 --> 00:48:08,040
I like the way you phrased it, 
which is, is the job safer? 

1001
00:48:08,040 --> 00:48:10,640
Most of the time people 
will ask whether the jobs are 

1002
00:48:10,640 --> 00:48:15,960
going to be replaced by AI. I 
think it's just the same as 

1003
00:48:15,960 --> 00:48:18,480
with 
programming: for very basic 

1004
00:48:18,480 --> 00:48:21,360
tasks, 
I can see automation happening 

1005
00:48:21,360 --> 00:48:24,200
and I can see some of these 
tools taking some of the work 

1006
00:48:24,360 --> 00:48:28,640
away, potentially generating 
boilerplate code, generating like

1007
00:48:28,880 --> 00:48:31,480
I want to create a chart to do 
this. 

1008
00:48:31,840 --> 00:48:34,200
I'm not familiar with this 
particular visualization 

1009
00:48:34,200 --> 00:48:37,080
library, getting up to speed 
with that and getting the 

1010
00:48:37,080 --> 00:48:38,720
right chart out of the other 
end. 

1011
00:48:39,040 --> 00:48:42,240
You can definitely see AI 
accelerating that process, but 

1012
00:48:42,240 --> 00:48:43,960
just like with writing code, 
if 

1013
00:48:44,080 --> 00:48:47,360
you don't understand 
fundamentally how that task is 

1014
00:48:47,360 --> 00:48:51,040
done, 
you can't check the outputs and 

1015
00:48:51,040 --> 00:48:54,120
you can't debug any problems. 
And the problems with data 

1016
00:48:54,120 --> 00:48:57,280
analysis, as we said earlier, 
are even trickier than with 

1017
00:48:57,280 --> 00:49:00,280
programming because you don't 
get an error message; you'll

1018
00:49:00,280 --> 00:49:02,640
get some kind of answer. 
You can average a numeric column

1019
00:49:02,640 --> 00:49:05,480
and still get an answer, whether
or not that makes sense from a 

1020
00:49:05,480 --> 00:49:09,400
methodological point of view. 
So I think the future might be 

1021
00:49:09,400 --> 00:49:13,120
that we have AI built in to help
us accelerate those little bits 

1022
00:49:13,120 --> 00:49:16,360
of code and little bits of 
manual tasks that we might want 

1023
00:49:16,360 --> 00:49:20,040
to automate away. 
But I don't see an analyst's job

1024
00:49:20,440 --> 00:49:25,240
changing fundamentally because 
we're still supposed to be 

1025
00:49:25,400 --> 00:49:28,160
trusted business advisors, we're
still supposed to be generating 

1026
00:49:28,160 --> 00:49:30,880
value for the business, and AI 
is just going to be another tool

1027
00:49:30,880 --> 00:49:32,200
in our 
toolbox to do that. 

1028
00:49:33,000 --> 00:49:36,360
Yeah, I think in the, I don't 
know like in the non tech world 

1029
00:49:36,360 --> 00:49:39,920
some people predict that they 
can replace all people by AI 

1030
00:49:40,120 --> 00:49:43,000
maybe like a question and answer
model right, where you can just 

1031
00:49:43,000 --> 00:49:47,080
give a data and then you start 
questioning them and they just 

1032
00:49:47,080 --> 00:49:48,560
give you an answer. 
But I think this is quite 

1033
00:49:48,560 --> 00:49:51,320
dangerous if you actually don't 
really understand the analysis 

1034
00:49:51,320 --> 00:49:54,360
and like for example it's a very
simple example, you know survey 

1035
00:49:54,560 --> 00:49:58,000
average response time, right. 
When you have an outlier, the 

1036
00:49:58,000 --> 00:49:59,680
result can be really really 
different. 

1037
00:50:00,040 --> 00:50:02,960
So I think here the critical 
thinking is very very important,

1038
00:50:02,960 --> 00:50:04,400
right? 
Don't just assume that 

1039
00:50:04,400 --> 00:50:06,720
everything AI generates is 
actually valid. 

1040
00:50:07,120 --> 00:50:09,920
So maybe from your experience, I
don't know how much you have 

1041
00:50:09,920 --> 00:50:13,000
applied AI. 
So any kind of problems, like 

1042
00:50:13,000 --> 00:50:15,680
your favorite problems that can 
be solved by AI more 

1043
00:50:15,680 --> 00:50:19,240
effectively? Maybe you can share
some of your power AI user 

1044
00:50:19,240 --> 00:50:23,520
experience I guess. 
Yeah, I do have some AI examples

1045
00:50:23,560 --> 00:50:27,880
in the book as well. 
I try to identify places where 

1046
00:50:28,000 --> 00:50:30,800
AI will actually accelerate the 
process. 

1047
00:50:31,080 --> 00:50:34,560
One of the examples there's a 
chapter where the data set 

1048
00:50:34,560 --> 00:50:38,640
is actually a bunch of PDF files
and so the task is to extract 

1049
00:50:38,640 --> 00:50:41,040
data from these PDFs and then do
the analysis. 

1050
00:50:41,400 --> 00:50:43,960
And that's not something that we
usually train people on. 

1051
00:50:44,160 --> 00:50:47,440
It's quite a niche thing. 
Although PDFs are everywhere, it

1052
00:50:47,440 --> 00:50:50,080
is quite a niche thing to have 
to analyse data from PDFs. 

1053
00:50:50,360 --> 00:50:53,760
Coming into that problem with an
AI assistant means that you can 

1054
00:50:53,760 --> 00:50:56,160
accelerate the process of 
finding, for example, the right 

1055
00:50:56,160 --> 00:50:58,640
Python library. 
That's one of the examples I 

1056
00:50:58,640 --> 00:51:01,680
have in the book is I want to 
extract data from PDFs. 

1057
00:51:01,680 --> 00:51:04,560
What are my options in Python? 
Like I could go away and Google 

1058
00:51:04,560 --> 00:51:07,960
it, and it'd take me a lot more 
time, so I do like using it

1059
00:51:07,960 --> 00:51:10,040
for that if it's a domain I'm 
not familiar with. 

1060
00:51:10,560 --> 00:51:12,520
If there's a problem I'm trying 
to solve where I think there 

1061
00:51:12,520 --> 00:51:15,760
must be a Python library for 
this so somebody else must have 

1062
00:51:15,760 --> 00:51:17,600
solved this problem in a way 
that I can use it. 

1063
00:51:17,920 --> 00:51:22,120
AI acts as like a super-powered 
search because it's not just

1064
00:51:22,120 --> 00:51:24,760
a search query, you can actually
give it some context about what 

1065
00:51:24,760 --> 00:51:27,400
you're trying to do. 
That's definitely one aspect I 

1066
00:51:27,400 --> 00:51:29,240
see. 
And then just, you know, 

1067
00:51:29,240 --> 00:51:33,080
accelerating smaller tasks. 
I mentioned creating charts. 

1068
00:51:33,440 --> 00:51:36,720
If you want to figure out how to
do a specific kind of data 

1069
00:51:36,720 --> 00:51:39,640
visualization, and again, you're
unfamiliar with the library, you

1070
00:51:39,640 --> 00:51:41,920
need a bit of help. 
You know, AI can give you the 

1071
00:51:41,920 --> 00:51:44,640
starter code for it. 
Although one problem, and this 

1072
00:51:44,640 --> 00:51:47,400
is I think true for Copilot and 
other kinds of tools that 

1073
00:51:47,400 --> 00:51:50,440
generate code, is that it's only
trained on what it's found on 

1074
00:51:50,440 --> 00:51:54,400
the Internet and there's a lot 
of sort of not even incorrect 

1075
00:51:54,400 --> 00:51:57,040
code, but maybe inefficient code
or maybe not quite the right way

1076
00:51:57,040 --> 00:51:59,400
to do it. 
Again, a very specific example 

1077
00:51:59,400 --> 00:52:03,040
is there's a Python plotting 
library called matplotlib, and 

1078
00:52:03,040 --> 00:52:05,280
there's sort of two different 
ways to use it. 

1079
00:52:05,280 --> 00:52:07,280
One of them is the old MATLAB 
style. 

1080
00:52:07,280 --> 00:52:10,280
If anybody listening has ever 
used MATLAB, there's a specific 

1081
00:52:10,280 --> 00:52:13,480
way to create plots in MATLAB, 
which is how matplotlib was 

1082
00:52:13,480 --> 00:52:15,800
originally written, and that's 
how you create plots in it. 

1083
00:52:16,160 --> 00:52:19,880
But now there is a much more 
modern object oriented interface

1084
00:52:19,880 --> 00:52:22,240
for building with matplotlib 
where you create your chart 

1085
00:52:22,240 --> 00:52:24,960
object and you assign the 
properties and stuff, more like 

1086
00:52:24,960 --> 00:52:26,240
how you 
would sort of write software in 

1087
00:52:26,240 --> 00:52:29,560
general, but unfortunately a lot
of the code on the Internet uses

1088
00:52:29,560 --> 00:52:32,960
the old style, and so the AI 
might perpetuate the use of the 

1089
00:52:33,280 --> 00:52:36,080
outdated, less modern style of 
code. 

1090
00:52:36,080 --> 00:52:38,800
So even for tasks like using a 
library, you've got to be 

1091
00:52:38,800 --> 00:52:41,440
careful about what the training 
data is out there. 

1092
00:52:42,120 --> 00:52:45,120
Those are practical tips for 
people, maybe some creative ideas

1093
00:52:45,120 --> 00:52:47,640
for how you apply AI in your 
day-to-day job, right? 

1094
00:52:47,640 --> 00:52:51,240
So I think extracting text from 
PDF also like coming up with 

1095
00:52:51,240 --> 00:52:53,760
different charts. 
I could also imagine like for 

1096
00:52:53,760 --> 00:52:57,280
example downloading data from a 
typical API, right? 

1097
00:52:57,280 --> 00:53:00,320
Because if you're not familiar 
with the API, maybe AI can help 

1098
00:53:00,320 --> 00:53:03,120
accelerate that. 
Or maybe just like transforming 

1099
00:53:03,120 --> 00:53:05,280
data into like a different 
format. 

1100
00:53:05,400 --> 00:53:08,440
That is probably also another 
small task that you can use to 

1101
00:53:08,440 --> 00:53:12,440
solve the problem using AI. 
So David, it's been quite a great,

1102
00:53:12,720 --> 00:53:15,080
insightful, you know, conversation 
about data analysis. 

1103
00:53:15,200 --> 00:53:18,000
As we wrap up the conversation, 
I have one last question that I 

1104
00:53:18,000 --> 00:53:20,680
would like to ask you which is 
something that I call three 

1105
00:53:20,680 --> 00:53:23,560
technical leadership wisdom. 
So if you can think of it just 

1106
00:53:23,560 --> 00:53:25,920
like an advice that you want to 
give to the listeners here, 

1107
00:53:25,920 --> 00:53:28,440
maybe if you can share your 
version of three technical 

1108
00:53:28,440 --> 00:53:30,560
leadership wisdom. 
Yeah, sure. 

1109
00:53:30,840 --> 00:53:35,240
So one of the things that I 
think differentiates like good 

1110
00:53:35,240 --> 00:53:38,600
analysts from the best analysts 
is curiosity. 

1111
00:53:39,040 --> 00:53:41,920
Just wanting to find out the 
answer to something. 

1112
00:53:42,240 --> 00:53:44,520
I don't know if that's good 
advice because I don't know if 

1113
00:53:44,520 --> 00:53:47,080
you can teach that to someone. 
I don't know if you can learn to

1114
00:53:47,080 --> 00:53:50,320
be more curious. 
But even if just mechanically 

1115
00:53:50,520 --> 00:53:55,160
you try to find the answer to 
things and persevere beyond the 

1116
00:53:55,160 --> 00:53:58,240
first answer or beyond the 
obvious answer, that is really 

1117
00:53:58,240 --> 00:54:01,560
such an important skill in life.
But particularly in data 

1118
00:54:01,560 --> 00:54:04,440
analysis, like really wanting to
dig in to find the answer and 

1119
00:54:04,440 --> 00:54:07,720
not resting until you're 
satisfied with the answer is a 

1120
00:54:07,720 --> 00:54:10,840
particularly good skill. 
And just from that, I think 

1121
00:54:10,840 --> 00:54:14,640
practicing your skills, just 
whatever your tech skills are, 

1122
00:54:14,640 --> 00:54:17,480
just constantly practicing them 
is so important. 

1123
00:54:17,480 --> 00:54:19,960
I mean, I'm just writing, you 
know, this book is a project 

1124
00:54:19,960 --> 00:54:22,560
based book full of practice 
opportunities for people. 

1125
00:54:22,800 --> 00:54:25,200
Because I think that's really 
once you've got the foundations,

1126
00:54:25,200 --> 00:54:27,200
the best way to learn is to 
apply your knowledge and 

1127
00:54:27,200 --> 00:54:29,760
practice. 
And so just keep doing that. In 

1128
00:54:29,760 --> 00:54:31,720
the data realm, 
that just might mean solving 

1129
00:54:31,720 --> 00:54:35,280
problems for yourself, even if 
it's optimizing your fantasy 

1130
00:54:35,280 --> 00:54:38,440
football team or whatever. 
It doesn't have to be a money 

1131
00:54:38,440 --> 00:54:40,520
making business opportunity 
every time. 

1132
00:54:40,520 --> 00:54:43,320
Just something that solves a 
problem with data is great. 

1133
00:54:43,600 --> 00:54:46,680
So practicing your skills to 
stay relevant and to keep them 

1134
00:54:46,680 --> 00:54:49,240
fresh and to learn new things is
vital. 

1135
00:54:49,680 --> 00:54:53,360
And just on the sort of more 
organizational side of things, I

1136
00:54:53,360 --> 00:54:56,640
think it's very important to 
have a good reason 

1137
00:54:56,800 --> 00:54:58,960
to do data 
work in the first place. 

1138
00:54:59,320 --> 00:55:03,240
I think there is a maybe 
mistaken belief in business that

1139
00:55:03,400 --> 00:55:05,840
doing data analysis is 
inherently useful. 

1140
00:55:05,840 --> 00:55:08,440
It's just inherently a good 
thing that we should be doing. 

1141
00:55:08,840 --> 00:55:11,440
But if you don't have a purpose,
if you don't have an end goal, 

1142
00:55:11,640 --> 00:55:14,600
if you don't have a good reason 
to do it, it's not going to be 

1143
00:55:14,600 --> 00:55:16,840
successful. 
And that's true for analysts as 

1144
00:55:16,840 --> 00:55:20,520
much as entire organizational 
strategies: just have a good 

1145
00:55:20,520 --> 00:55:22,680
reason to do data work in the 
first place. 

1146
00:55:23,400 --> 00:55:25,440
Very interesting 
last wisdom there. 

1147
00:55:25,440 --> 00:55:28,920
So I think I can actually relate
with that kind of advice that 

1148
00:55:28,920 --> 00:55:31,040
you just gave, right? 
Because many people think, OK, 

1149
00:55:31,040 --> 00:55:32,560
we have the data, we have the 
data. 

1150
00:55:32,840 --> 00:55:34,400
So let's just come up with the 
insights. 

1151
00:55:34,680 --> 00:55:36,640
What are the insights? 
Maybe they don't know what kind 

1152
00:55:36,640 --> 00:55:38,120
of insights they want to derive 
from it. 

1153
00:55:38,400 --> 00:55:40,840
So I think, yeah, know the 
reason why you want to tackle 

1154
00:55:40,840 --> 00:55:42,480
data problem. 
I think it's really, really 

1155
00:55:42,480 --> 00:55:44,960
important. 
So for people who love this 

1156
00:55:44,960 --> 00:55:47,000
conversation, they want to learn
from you further. 

1157
00:55:47,000 --> 00:55:50,000
Or maybe they just want to find 
out more about you and your 

1158
00:55:50,000 --> 00:55:51,800
book, maybe. 
Is there a place where they can 

1159
00:55:51,800 --> 00:55:55,040
find you online? 
Yeah, I think probably following

1160
00:55:55,040 --> 00:55:57,720
me on LinkedIn is probably the 
best place to look. 

1161
00:55:57,920 --> 00:56:00,320
So my name is pretty uncommon, 
so you can put it in the search 

1162
00:56:00,320 --> 00:56:03,840
bar and find it pretty easily. 
Yeah, I don't post much on other

1163
00:56:03,840 --> 00:56:06,880
social media outlets anymore, so
I think LinkedIn is probably the

1164
00:56:06,880 --> 00:56:08,840
way to see what I'm doing more 
day-to-day. 

1165
00:56:09,120 --> 00:56:11,560
I also have a website where you 
can check out the book and 

1166
00:56:11,560 --> 00:56:13,640
there's a link to the podcast 
and other things that I've 

1167
00:56:13,640 --> 00:56:15,720
written. 
And yeah, the book is available 

1168
00:56:15,720 --> 00:56:18,680
on the Manning website, and when
it comes out, it'll be available

1169
00:56:18,680 --> 00:56:20,160
on Amazon. 
Thank you. 

1170
00:56:20,160 --> 00:56:23,200
I wish you good luck in the 
process of publishing that, and 

1171
00:56:23,200 --> 00:56:27,160
I hope people who listen today 
are much better equipped for 

1172
00:56:27,160 --> 00:56:29,840
solving any data analysis 
problem, just like the title of 

1173
00:56:29,840 --> 00:56:31,600
your book. 
Thanks, Henry.

