1
00:00:00,180 --> 00:00:02,340
When I was really young, I remember 

2
00:00:02,370 --> 00:00:06,420
a piece of homework that I had to do for Geography in school.

3
00:00:06,510 --> 00:00:11,340
And the teacher asked us to basically watch the weather report every single day

4
00:00:11,400 --> 00:00:14,550
on the news and note down what was the temperature,

5
00:00:14,880 --> 00:00:17,280
what was the weather condition of that day.

6
00:00:17,640 --> 00:00:20,910
And then we would end up with the weather condition and temperature for the

7
00:00:20,910 --> 00:00:23,460
previous week. Now at the time,

8
00:00:23,460 --> 00:00:27,930
it was the point in time where a lot of teachers didn't actually really know what

9
00:00:27,930 --> 00:00:30,510
the internet did. So, um,

10
00:00:30,810 --> 00:00:35,550
I realized that I could actually just wait until the day before I have to give

11
00:00:35,550 --> 00:00:37,920
the homework in, go onto the internet,

12
00:00:37,980 --> 00:00:40,380
find the weather conditions for the previous seven days

13
00:00:40,500 --> 00:00:43,980
and then write it out instead of having to watch the weather forecast every

14
00:00:43,980 --> 00:00:46,770
single day. So I don't know what that says about me,

15
00:00:46,800 --> 00:00:51,800
but if you are somebody who enjoys taking shortcuts in life and refuses to do

16
00:00:52,410 --> 00:00:56,760
work that can be done by computers, then you are in the right place.

17
00:00:56,910 --> 00:00:59,310
Learning Python is definitely the way to go.

18
00:01:00,510 --> 00:01:02,730
So inside Google Sheets,

19
00:01:02,760 --> 00:01:06,420
which you can access by going to sheets.google.com,

20
00:01:06,930 --> 00:01:11,930
I have created a new spreadsheet of the days of the week and the temperature of

21
00:01:13,140 --> 00:01:18,120
each of those days in Celsius and also the weather condition on those days.

22
00:01:18,630 --> 00:01:22,560
This is a replica of that homework from many, many moons ago,

23
00:01:23,040 --> 00:01:28,040
but we're going to be working with this data to learn how we can read data files

24
00:01:28,290 --> 00:01:31,350
and then analyze them and how we can do lots of things with it.

25
00:01:31,860 --> 00:01:36,330
So the first thing I want you to do is to head over to the link in the course

26
00:01:36,330 --> 00:01:39,090
resources which will take you to this spreadsheet.

27
00:01:39,570 --> 00:01:43,320
And then once you've got it open, go to File, Download,

28
00:01:43,410 --> 00:01:48,410
and I want you to download it in the comma separated values or CSV format.

29
00:01:51,210 --> 00:01:53,970
And you should end up with a file like this.

30
00:01:54,540 --> 00:01:59,540
Now I want you to rename this file so that it's just weather_data.csv,

31
00:01:59,700 --> 00:02:04,260
and then I want you to create a new PyCharm project

32
00:02:04,410 --> 00:02:06,120
which you can call anything you want.

33
00:02:06,180 --> 00:02:10,169
I've called it day 25. And also create your main.py.

34
00:02:10,860 --> 00:02:15,860
Now I'm going to drag my weather_data.csv into my folder day-25 and click

35
00:02:16,980 --> 00:02:21,570
refactor to move that file into my project folder. Now,

36
00:02:21,600 --> 00:02:22,433
at this point,

37
00:02:22,440 --> 00:02:27,440
PyCharm recognizes that this is a CSV file and it asks you whether you want to

38
00:02:27,750 --> 00:02:32,550
install some plugins to make it easier to view this file. Now,

39
00:02:32,550 --> 00:02:33,060
at this point,

40
00:02:33,060 --> 00:02:38,060
you can click cancel because we want to view the data as the raw data,

41
00:02:38,700 --> 00:02:41,880
which is in this CSV format. Now,

42
00:02:41,910 --> 00:02:46,710
CSVs are a very common way of representing tabular data,

43
00:02:46,710 --> 00:02:51,710
so data that fits into tables like a spreadsheet. And CSV,

44
00:02:51,900 --> 00:02:55,260
as you've already seen stands for comma separated values.

45
00:02:56,250 --> 00:02:58,050
So that's why when you look at the data,

46
00:02:58,080 --> 00:03:03,080
you can see each row here is a single set of data and each piece of data is

47
00:03:06,100 --> 00:03:08,830
separated by a comma without space.

48
00:03:09,610 --> 00:03:13,060
So we've seen how we can open files, read files, write

49
00:03:13,060 --> 00:03:15,070
to files. As a challenge,

50
00:03:15,100 --> 00:03:20,100
I want you to go ahead and open up this weather_data.csv file inside your

51
00:03:20,560 --> 00:03:25,560
main.py and add each line of data into a list which we'll call data.

52
00:03:28,870 --> 00:03:30,580
Pause the video and give that a go.

53
00:03:30,630 --> 00:03:31,463
Okay.

54
00:03:33,390 --> 00:03:36,360
All right. So we know that we're going to need to open the file.

55
00:03:36,660 --> 00:03:40,710
So it's stored inside the same folder as our day-25.

56
00:03:40,770 --> 00:03:45,770
So we can just use a relative file path to tap into this weather_data.csv.

57
00:03:46,890 --> 00:03:50,070
And we're going to save this data as a data_file.

58
00:03:51,600 --> 00:03:55,290
And then we're going to get the data by reading this data file.

59
00:03:55,620 --> 00:03:58,830
So data = data_file.read.

60
00:03:59,280 --> 00:04:02,850
And not only are we going to read it, we're going to use readlines

61
00:04:02,880 --> 00:04:07,880
which we know will take each line in this file and turn it into an item in a

62
00:04:09,150 --> 00:04:12,450
list. Now, if I go ahead and print my data,

63
00:04:16,170 --> 00:04:21,170
then you can see I've got this list now where each item is a row in that list.
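
The file-reading step described above could look roughly like this. It is a minimal sketch: the file name weather_sample.csv and the three sample rows are made up so the snippet runs on its own (only the temp column name comes from the lesson), so point it at your own weather_data.csv instead.

```python
from pathlib import Path

# Assumed stand-in for the downloaded file; these rows are illustrative only.
Path("weather_sample.csv").write_text(
    "day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\nWednesday,15,Rain\n"
)

with open("weather_sample.csv") as data_file:
    # readlines() returns a list with one string per line, newline included.
    data = data_file.readlines()

print(data)
```

Each item is still one raw string, commas and trailing newline included, which is why it is awkward to work with.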

64
00:04:23,130 --> 00:04:27,030
But as you can imagine, it would be pretty painful to work with the data,

65
00:04:27,300 --> 00:04:29,400
which is all in a string format.

66
00:04:29,460 --> 00:04:32,400
And they are still separated by commas.

67
00:04:32,730 --> 00:04:36,930
It would take a lot of cleaning to actually be able to extract each column and

68
00:04:36,930 --> 00:04:41,760
each row. So what can we do instead? Well,

69
00:04:41,760 --> 00:04:46,760
there's actually an inbuilt library that helps us with CSVs because Python is a

70
00:04:48,000 --> 00:04:52,290
language that's used really heavily for data processing, data analysis.

71
00:04:52,590 --> 00:04:56,700
There are a lot of great tools for working with tabular data,

72
00:04:56,910 --> 00:05:01,320
like our weather data. First, we're going to import the CSV library,

73
00:05:01,740 --> 00:05:04,440
and then we're going to, again, open up a file,

74
00:05:04,530 --> 00:05:07,740
weather_data.csv as our data file,

75
00:05:08,550 --> 00:05:11,670
and then we're going to use this CSV library.

76
00:05:12,210 --> 00:05:14,850
And it has a function called reader,

77
00:05:15,450 --> 00:05:19,380
which takes the file in question, which has already been opened

78
00:05:19,410 --> 00:05:24,410
so this is going to be our data_file, and it can read it and output

79
00:05:24,480 --> 00:05:25,313
the data.

80
00:05:25,710 --> 00:05:29,550
So now let's go ahead and print this data and let's see what we've got.

81
00:05:30,420 --> 00:05:33,720
You can see that it's created a CSV reader

82
00:05:33,750 --> 00:05:37,560
object. This object can be looped through.

83
00:05:37,920 --> 00:05:42,120
So if we wanted to get each row inside this data,

84
00:05:42,120 --> 00:05:44,790
we can say for row in data,

85
00:05:45,090 --> 00:05:49,860
go ahead and print each row. And once you've done that,

86
00:05:49,890 --> 00:05:54,540
you can see it's taken each of the rows inside our weather_data.csv,

87
00:05:55,050 --> 00:06:00,050
and separated out each item into a single value.

88
00:06:00,770 --> 00:06:03,980
So for example, on the Monday row,

89
00:06:04,010 --> 00:06:06,740
we've got the Monday as a string,

90
00:06:06,740 --> 00:06:10,850
we've got the temperature as a string and also the condition as a string.

91
00:06:11,150 --> 00:06:14,660
So now it's much easier for us to work with this data.
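
A minimal sketch of the csv.reader loop just described. The file name weather_sample.csv and its rows are assumptions made so the snippet is self-contained; the rows list is an extra added here only to keep what the loop prints.

```python
import csv
from pathlib import Path

# Assumed stand-in for the downloaded file; these rows are illustrative only.
Path("weather_sample.csv").write_text(
    "day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\nWednesday,15,Rain\n"
)

rows = []
# newline="" is what the csv module documentation recommends when opening files.
with open("weather_sample.csv", newline="") as data_file:
    data = csv.reader(data_file)  # a reader object that can be looped through
    for row in data:
        print(row)        # each row is a list of strings
        rows.append(row)  # kept only so the rows can be inspected afterwards
```

Note that the loop has to run while the file is still open, because the reader pulls lines from the file as it goes.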

92
00:06:15,680 --> 00:06:18,320
Using what you know about Python lists

93
00:06:18,590 --> 00:06:22,790
I want you to create a new list called temperatures,

94
00:06:23,360 --> 00:06:28,360
and this list is going to contain all of the temperatures that are inside this

95
00:06:29,060 --> 00:06:32,990
weather_data.csv, like this 12 degrees, 14, 15,

96
00:06:33,350 --> 00:06:36,140
and it's going to be in the format of an integer.

97
00:06:36,200 --> 00:06:40,040
I don't want to see it as a string with quotation marks around.

98
00:06:40,370 --> 00:06:43,460
It should be a pure number so that we can work with it more easily.

99
00:06:44,090 --> 00:06:45,890
So this is your challenge.

100
00:06:46,190 --> 00:06:50,390
Pause the video and see if you can extract all of the temperatures from this

101
00:06:50,390 --> 00:06:52,430
file into this new list.

102
00:06:52,630 --> 00:06:53,463
Okay.

103
00:06:55,600 --> 00:06:58,660
All right. So when we printed out each of the rows,

104
00:06:58,720 --> 00:07:03,720
we can see that we've created several lists where each list contains an entire

105
00:07:04,840 --> 00:07:07,600
row of data from our weather data CSV.

106
00:07:08,260 --> 00:07:10,840
If we wanted to get the temperature,

107
00:07:10,930 --> 00:07:15,730
then it's going to be the item at index one in that list. For example,

108
00:07:15,730 --> 00:07:18,970
if we wanted to get the Monday temperature,

109
00:07:19,300 --> 00:07:24,300
then all we have to do is to tap into each of these rows and get the item at

110
00:07:24,520 --> 00:07:27,670
index 1. If I go ahead and print this,

111
00:07:27,700 --> 00:07:30,730
then you can see we get all of the temperatures,

112
00:07:31,090 --> 00:07:33,550
but also the label for that column.

113
00:07:34,090 --> 00:07:36,730
So if we want to exclude that label,

114
00:07:36,760 --> 00:07:41,760
then all we have to do is use an if statement and check if row at index one

115
00:07:43,090 --> 00:07:47,320
does not equal temp, which is the name of that column label,

116
00:07:47,740 --> 00:07:52,740
then we're going to tap into our list of temperatures and append this row at

117
00:07:53,410 --> 00:07:56,410
index one, which is going to be a temperature number.

118
00:07:57,040 --> 00:07:59,980
So now after we've done the entire for loop,

119
00:08:00,010 --> 00:08:04,480
then we can print out our list of temperatures. And if you take a look,

120
00:08:04,510 --> 00:08:07,030
you can see we've now got a list of all the temperatures

121
00:08:07,300 --> 00:08:12,130
excluding that column title, but they are all in the format of strings.

122
00:08:12,550 --> 00:08:15,130
So if we want to convert that into an integer,

123
00:08:15,370 --> 00:08:18,580
then all we have to do is wrap that inside an int().
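
Putting the steps of this solution together, one possible version looks like the sketch below. The sample file name and rows are assumptions for illustration; the temp header check and the int() conversion follow the walkthrough.

```python
import csv
from pathlib import Path

# Assumed stand-in for the downloaded file; these rows are illustrative only.
Path("weather_sample.csv").write_text(
    "day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\nWednesday,15,Rain\n"
)

temperatures = []
with open("weather_sample.csv", newline="") as data_file:
    data = csv.reader(data_file)
    for row in data:
        # Skip the header row, whose second item is the column label "temp".
        if row[1] != "temp":
            # The reader gives strings, so convert each one to an integer.
            temperatures.append(int(row[1]))

print(temperatures)
```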

124
00:08:19,900 --> 00:08:23,530
So that's the goal of the challenge. Now you can of course,

125
00:08:23,530 --> 00:08:26,320
separate out this line into many more lines,

126
00:08:26,620 --> 00:08:31,620
but I think this should make enough sense for you at this stage. While CSV is the

127
00:08:32,710 --> 00:08:36,580
inbuilt CSV reading and writing library,

128
00:08:37,210 --> 00:08:42,210
notice how much faff was involved in just simply getting a single column of

129
00:08:43,059 --> 00:08:47,500
data. What are we going to do if we have more data,

130
00:08:47,500 --> 00:08:50,920
more complex data with way more columns, way more rows and

131
00:08:51,140 --> 00:08:53,440
we want to do more interesting things with it?

132
00:08:53,920 --> 00:08:56,460
This is going to be quite painful to work with.

133
00:08:57,060 --> 00:09:00,840
This is the point where we want to get the help of some pandas.

134
00:09:01,170 --> 00:09:03,750
Not these kinds of pandas. As cute as they are

135
00:09:03,750 --> 00:09:06,300
they're not going to help us with our data analysis.

136
00:09:06,630 --> 00:09:11,630
But instead, I'm talking about the Pandas library and this is a Python data

137
00:09:12,780 --> 00:09:14,190
analysis library

138
00:09:14,520 --> 00:09:19,520
which is super helpful and super powerful for performing data analysis on tabular

139
00:09:21,090 --> 00:09:25,410
data like the one that we have. In order to work with it,

140
00:09:25,440 --> 00:09:30,000
we have to import this library. But because it's not inbuilt,

141
00:09:30,060 --> 00:09:32,700
you'll need to install it into your project.

142
00:09:33,090 --> 00:09:36,870
So the shortcut way of doing this is simply to type import pandas

143
00:09:36,960 --> 00:09:38,400
and then once you see the red line,

144
00:09:38,700 --> 00:09:42,840
go ahead and hover over it and then click install package pandas,

145
00:09:43,020 --> 00:09:47,250
and then you can watch the progress down here. Now, while that's installing,

146
00:09:47,310 --> 00:09:50,940
I want to quickly introduce you to the documentation for this library.

147
00:09:51,240 --> 00:09:54,690
It's really well documented and it's really powerful,

148
00:09:54,690 --> 00:09:58,470
so it has a lot of things in the documentation. If you head over

149
00:09:58,470 --> 00:10:02,970
to pandas.pydata.org and then click on documentation,

150
00:10:03,300 --> 00:10:06,360
then you will be able to see all the things that you can do with it.

151
00:10:06,780 --> 00:10:08,730
So there's an API reference,

152
00:10:08,730 --> 00:10:13,730
there's a quick getting started guide as well as a user guide on the key

153
00:10:13,770 --> 00:10:18,770
concepts of pandas. When you're using a new library of any sort, a good idea is

154
00:10:20,490 --> 00:10:23,700
to take a look at their getting started guide if they have one,

155
00:10:24,060 --> 00:10:28,530
because it tells you how you can install it and a number of questions that you

156
00:10:28,530 --> 00:10:29,790
might have, for example,

157
00:10:29,790 --> 00:10:34,790
what kind of data does pandas handle or how can I read and write tabular data.

158
00:10:35,100 --> 00:10:38,100
And it's actually done really, really well. So if you have a moment,

159
00:10:38,220 --> 00:10:43,080
take a quick look at this page. And once you head back to PyCharm,

160
00:10:43,140 --> 00:10:46,800
your packages should have installed successfully. Now,

161
00:10:46,800 --> 00:10:48,840
once we've installed our pandas,

162
00:10:48,930 --> 00:10:51,960
you'll see that it's still grey because we're not using it yet.

163
00:10:52,470 --> 00:10:57,150
If we want to use pandas, all we have to do is say pandas.

164
00:10:57,900 --> 00:11:00,840
and in our case, we actually want to read our CSV.

165
00:11:01,320 --> 00:11:03,480
So we can say read_csv

166
00:11:04,050 --> 00:11:08,550
and inside this method, you can do lots and lots of things,

167
00:11:08,610 --> 00:11:13,140
as you can see by all of the parameter names. But most of these are optional.

168
00:11:13,560 --> 00:11:18,270
The only one that is not optional is the path that leads to the CSV file.

169
00:11:18,750 --> 00:11:22,440
So if we get hold of our weather_data.csv,

170
00:11:22,800 --> 00:11:25,860
then we can read that CSV using pandas.

171
00:11:26,160 --> 00:11:29,940
So notice how we don't have to open the file as a data file,

172
00:11:30,150 --> 00:11:34,350
or use a CSV reader, it's one step and you're done.

173
00:11:34,800 --> 00:11:39,060
So now we've got hold of our data and if I print out the data,

174
00:11:39,090 --> 00:11:42,570
you can see how beautifully formatted it is.

175
00:11:43,110 --> 00:11:45,930
It's being printed out as an actual table,

176
00:11:45,960 --> 00:11:50,790
it's got the column headings on top of each of the columns and each of the rows

177
00:11:50,790 --> 00:11:55,600
gets given an index so that we can more easily identify how many records we have

178
00:11:55,900 --> 00:11:57,220
and where each one is.
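
A minimal sketch of the pandas version, assuming pandas is already installed in the project. As before, weather_sample.csv and its rows are made-up stand-ins so the snippet runs by itself.

```python
from pathlib import Path

import pandas

# Assumed stand-in for the downloaded file; these rows are illustrative only.
Path("weather_sample.csv").write_text(
    "day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\nWednesday,15,Rain\n"
)

# One call opens, parses and loads the file; the first row becomes the headings.
data = pandas.read_csv("weather_sample.csv")

# Prints a table with column headings and a numeric index for each row.
print(data)
```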

179
00:11:58,390 --> 00:12:01,840
If we wanted to think about that previous task that we tried to do

180
00:12:01,840 --> 00:12:06,840
where we just try to get hold of a single column of data from this table,

181
00:12:08,830 --> 00:12:10,240
then using pandas,

182
00:12:10,330 --> 00:12:14,950
it is literally as easy as saying data and then square brackets,

183
00:12:15,070 --> 00:12:19,750
and then the name of that column. So in our case, it's temp.

184
00:12:20,860 --> 00:12:23,620
And now if I go ahead and print this out,

185
00:12:24,640 --> 00:12:29,640
you can see it's basically already identified the column and it's printed out

186
00:12:29,890 --> 00:12:31,720
all of the data in that column.

187
00:12:32,140 --> 00:12:34,720
So the really smart thing that pandas is doing here is

188
00:12:34,760 --> 00:12:39,700
it takes that first row to be the names of each column

189
00:12:40,120 --> 00:12:43,480
and it automatically knows how to find the data

190
00:12:43,750 --> 00:12:47,380
when you just specify the name of the column like this.
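
Selecting a single column by name, as described above, can be sketched like this. The sample file and rows are again assumptions for illustration; only the temp column name is taken from the lesson.

```python
from pathlib import Path

import pandas

# Assumed stand-in for the downloaded file; these rows are illustrative only.
Path("weather_sample.csv").write_text(
    "day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\nWednesday,15,Rain\n"
)

data = pandas.read_csv("weather_sample.csv")

# Square brackets with the column name pull out just that column.
temperatures = data["temp"]
print(temperatures)
```

Here the values already come back as numbers rather than strings, so no int() conversion is needed.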

191
00:12:48,280 --> 00:12:51,220
So three lines versus eight lines,

192
00:12:51,730 --> 00:12:56,110
and we get better formatting. It's no wonder that most Python developers,

193
00:12:56,170 --> 00:12:58,240
as soon as they encounter a CSV file,

194
00:12:58,270 --> 00:13:00,910
they will start using pandas to work with it

195
00:13:01,210 --> 00:13:03,970
no matter how simple the task or project.

196
00:13:04,300 --> 00:13:07,660
So that was a quick introduction to CSV data,

197
00:13:08,020 --> 00:13:13,020
working with CSV data and how to get started using pandas. In the next lesson,

198
00:13:14,080 --> 00:13:18,100
we're going to dive deeper into this library and see all of the common things that

199
00:13:18,100 --> 00:13:19,300
we can do with pandas.

