1
00:00:00,510 --> 00:00:04,950
Now that you've learned how to do the basics of data analysis with pandas,

2
00:00:05,280 --> 00:00:08,250
it's time to put your knowledge to the test.

3
00:00:08,700 --> 00:00:13,700
And we're going to be working with a really interesting data set. Back in 2018/

4
00:00:13,950 --> 00:00:18,390
2019, a bunch of volunteers went into New York's Central Park

5
00:00:18,450 --> 00:00:23,100
and they basically combed the entire park to find all of the squirrels.

6
00:00:23,730 --> 00:00:28,680
So this resulted in a huge dataset of squirrels, squirrel numbers,

7
00:00:28,680 --> 00:00:31,410
squirrel fur color and a whole bunch of other things.

8
00:00:31,860 --> 00:00:36,860
And all of this data can be found on the New York City Open Data website

9
00:00:37,020 --> 00:00:40,740
which we'll link to in the course resources. When you head over here,

10
00:00:40,740 --> 00:00:44,130
you can take a look at all of the data that they collected.

11
00:00:44,490 --> 00:00:48,540
If you take a look at all of the columns in this dataset and click show

12
00:00:48,540 --> 00:00:52,770
all, you can see they've logged the location of each of the squirrels,

13
00:00:53,010 --> 00:00:55,320
they gave each squirrel a unique ID,

14
00:00:55,560 --> 00:00:59,730
they evaluated whether the squirrel was an adult or a juvenile,

15
00:01:00,030 --> 00:01:04,379
and they looked at what the primary fur color was. So it could be gray, cinnamon

16
00:01:04,379 --> 00:01:07,860
which is red, or black. Now, if you scroll down

17
00:01:07,860 --> 00:01:10,830
there's actually a really interesting visualization of this data,

18
00:01:11,090 --> 00:01:13,130
where they've plotted all of the squirrels

19
00:01:13,130 --> 00:01:15,800
onto a map of Central Park.

20
00:01:16,220 --> 00:01:21,220
So you can take a look at the distribution of the squirrel population in Central

21
00:01:21,380 --> 00:01:21,680
Park.

22
00:01:21,680 --> 00:01:24,830
So you can see where the red squirrels are, where the gray ones are and where the

23
00:01:24,830 --> 00:01:27,860
black ones are. So there's only three colors, really.

24
00:01:28,040 --> 00:01:30,830
So what you're going to do is you're going to go to this website,

25
00:01:30,860 --> 00:01:34,670
click on export and download the CSV data.

26
00:01:35,780 --> 00:01:36,830
Now, once you do that,

27
00:01:36,830 --> 00:01:41,830
you'll end up with a 2018 Central Park squirrel data file like this.

28
00:01:42,890 --> 00:01:47,890
Now you're going to pull that data into your day-25 project folder and click

29
00:01:48,470 --> 00:01:49,770
refactor to move it in.

30
00:01:50,240 --> 00:01:55,130
And you can see that this is a lot bigger than whatever data we had before.

31
00:01:55,430 --> 00:01:57,800
There's thousands of entries,

32
00:01:57,860 --> 00:02:00,740
and it's actually real dedication that these volunteers went around

33
00:02:00,740 --> 00:02:04,160
logging all of this data, observing all the squirrels.

34
00:02:04,340 --> 00:02:07,040
It must've been a very tedious task.

35
00:02:07,340 --> 00:02:09,740
But now that somebody else has carried out the hard part,

36
00:02:10,039 --> 00:02:12,980
we can do some data analysis on that data.

37
00:02:13,430 --> 00:02:18,020
So I want you to comment out everything that you've got so far in your day-25

38
00:02:18,020 --> 00:02:18,853
project

39
00:02:19,070 --> 00:02:24,070
and the goal is for you to use that data and use what you've learned about

40
00:02:24,290 --> 00:02:25,123
pandas

41
00:02:25,160 --> 00:02:30,160
to be able to create a CSV that's called squirrel_count that has a small table

42
00:02:32,330 --> 00:02:36,050
which just contains the fur color, so there's only three fur colors,

43
00:02:36,170 --> 00:02:40,580
and they are logged under the primary fur color column.

44
00:02:41,090 --> 00:02:44,990
And it can either basically be gray, cinnamon

45
00:02:45,020 --> 00:02:47,060
which is red, or black.

46
00:02:47,090 --> 00:02:50,810
There's only three possible values in that column. Now,

47
00:02:50,840 --> 00:02:55,460
what you're going to do is you are going to figure out how many gray squirrels

48
00:02:55,460 --> 00:02:56,510
there are in total,

49
00:02:56,540 --> 00:03:01,540
how many cinnamon ones and how many black ones based on that primary fur color

50
00:03:02,020 --> 00:03:02,853
column.

51
00:03:03,040 --> 00:03:07,330
And then you're gonna take that data and build a new data frame from it,

52
00:03:07,660 --> 00:03:08,590
and using that,

53
00:03:08,620 --> 00:03:13,620
create this final CSV using pandas. Now that you know what you need to do,

54
00:03:14,260 --> 00:03:18,160
have a think about the problem and see if you can complete this challenge.

55
00:03:18,400 --> 00:03:22,690
Pause the video now. All right.

56
00:03:22,690 --> 00:03:25,870
If you've commented out the line where you've imported pandas,

57
00:03:25,900 --> 00:03:27,760
then you'll obviously have to do that again.

58
00:03:28,330 --> 00:03:32,320
So let's think about what we want to do. We want to isolate the column

59
00:03:32,320 --> 00:03:36,190
which is the primary fur color. And if it helps you,

60
00:03:36,190 --> 00:03:40,120
you can actually better visualize the data on this website where they've got a

61
00:03:40,120 --> 00:03:44,740
table preview. So you can see that here is the primary fur color,

62
00:03:44,800 --> 00:03:47,410
and you can see the colors that have been logged.

63
00:03:48,130 --> 00:03:52,180
So our goal is to somehow get hold of this data series

64
00:03:52,210 --> 00:03:56,170
which contains this entire column, figure out how many of them are gray,

65
00:03:56,170 --> 00:04:01,060
how many of them are black and how many are cinnamon. How do we do this?

66
00:04:01,150 --> 00:04:03,820
Well, firstly, let's get hold of our data.

67
00:04:03,820 --> 00:04:08,820
So we're going to use our pandas.read_csv method.

68
00:04:09,670 --> 00:04:14,650
And then we're going to direct it towards this 2018 Central Park squirrel census

69
00:04:14,650 --> 00:04:16,029
data. Now, if you want,

70
00:04:16,029 --> 00:04:19,779
you can actually right-click refactor and rename it to something a little bit

71
00:04:19,779 --> 00:04:23,890
shorter. But because I know that PyCharm will actually fill this in for me

72
00:04:23,950 --> 00:04:25,720
as long as I start out with a string

73
00:04:26,260 --> 00:04:28,840
and that that file is in the same folder,

74
00:04:29,110 --> 00:04:33,460
it'll actually type it all out if I just hit enter, then it doesn't really matter.

75
00:04:34,000 --> 00:04:34,780
But of course,

76
00:04:34,780 --> 00:04:37,600
make sure that you don't have any typos in here if you're typing it out,

77
00:04:37,870 --> 00:04:41,290
because otherwise when you hit run, you're going to get a whole bunch of error

78
00:04:41,300 --> 00:04:43,000
text inside your console.

79
00:04:43,870 --> 00:04:46,600
So once I've successfully read that CSV,

80
00:04:46,630 --> 00:04:50,080
I've now got a data frame. From that data frame,

81
00:04:50,110 --> 00:04:53,980
I can get hold of the column that I'm interested in

82
00:04:54,220 --> 00:04:56,530
which is called primary fur color. Now,

83
00:04:56,530 --> 00:05:01,150
because it's got spaces, it's easier to access that data by using a square

84
00:05:01,150 --> 00:05:04,900
bracket and then putting in the name of that column like this.
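
Here is a minimal sketch of the two steps so far: reading the CSV and picking out one column with square brackets. It assumes the column header is spelled exactly "Primary Fur Color", and it uses a few made-up rows in place of the downloaded file so that it runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded census CSV (made-up rows, not real data).
sample_csv = io.StringIO(
    "Unique Squirrel ID,Age,Primary Fur Color\n"
    "A-1,Adult,Gray\n"
    "A-2,Juvenile,Cinnamon\n"
    "A-3,Adult,Gray\n"
    "A-4,Adult,Black\n"
)

# In the project you would pass the name of the CSV file instead of sample_csv.
data = pd.read_csv(sample_csv)

# The column name contains spaces, so attribute-style access is not an option;
# square brackets with the exact column name return the column as a Series.
fur_colors = data["Primary Fur Color"]
print(fur_colors)
```

If the header in your file is spelled differently, printing `data.columns` shows the exact names to use.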

85
00:05:04,990 --> 00:05:07,630
This is one of the methods that we showed you. Now,

86
00:05:07,630 --> 00:05:09,790
once I've gotten hold of that column, well

87
00:05:09,790 --> 00:05:14,080
the next thing I need to do is to find all of the rows in that column

88
00:05:14,110 --> 00:05:17,470
where the data is equal to each of the colors.

89
00:05:17,500 --> 00:05:20,980
So there was the color which is gray,

90
00:05:20,980 --> 00:05:24,250
so gray with 'ay', not 'ey'.

91
00:05:24,970 --> 00:05:28,600
And then once we've got hold of all the gray squirrels,

92
00:05:28,870 --> 00:05:31,840
then we're going to pull that out from our data.

93
00:05:32,440 --> 00:05:36,130
So now we should have a bunch of gray squirrels,

94
00:05:36,580 --> 00:05:39,010
and it's probably a good idea to print them out

95
00:05:39,040 --> 00:05:41,320
just to see if that actually worked.

96
00:05:41,650 --> 00:05:45,280
And because I expect there'll be lots of rows with gray squirrels,

97
00:05:45,490 --> 00:05:49,900
it makes sense to make it a plural, grey_squirrels. So now if I hit run,

98
00:05:50,200 --> 00:05:55,200
you can see listed here all of the rows that contain a gray squirrel.

99
00:05:57,260 --> 00:05:59,570
Now it truncated this table

100
00:05:59,570 --> 00:06:03,710
so that it can actually display it in the console because we know that there's many,

101
00:06:03,710 --> 00:06:05,900
many columns and there's many, many rows.

102
00:06:06,140 --> 00:06:10,340
It's just showing you the first few rows and then the last few rows and also the

103
00:06:10,340 --> 00:06:12,470
first few columns and the last few columns.

104
00:06:12,890 --> 00:06:16,760
So we can be pretty sure that we've managed to get hold of all of the rows that

105
00:06:16,760 --> 00:06:20,990
contain gray squirrels as their primary fur color. Now,

106
00:06:20,990 --> 00:06:25,910
what if we wanted to know the gray squirrels count? Well,

107
00:06:25,910 --> 00:06:30,910
we could use the len() function because remember, once we get hold of the rows

108
00:06:31,610 --> 00:06:35,000
it kind of gets treated a bit like an iterable,

109
00:06:35,330 --> 00:06:40,330
like a list, and we can use functions like len() on this data.

110
00:06:41,060 --> 00:06:44,330
So now if we print grey_squirrels_count,

111
00:06:45,350 --> 00:06:50,350
you can see that we've got a total of 2,473 gray squirrels.
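
The filter-and-count step could look roughly like this. It is a sketch on a few made-up rows rather than the real file, and it assumes the value in the column is spelled "Gray" with a capital G.

```python
import io

import pandas as pd

# Hypothetical stand-in for the census CSV (made-up rows, not real data).
sample_csv = io.StringIO(
    "Unique Squirrel ID,Primary Fur Color\n"
    "A-1,Gray\n"
    "A-2,Cinnamon\n"
    "A-3,Gray\n"
    "A-4,Black\n"
)
data = pd.read_csv(sample_csv)

# The comparison produces a True/False Series, and indexing with it
# keeps only the rows where the fur color is "Gray".
grey_squirrels = data[data["Primary Fur Color"] == "Gray"]

# len() on a data frame gives the number of rows.
grey_squirrels_count = len(grey_squirrels)
print(grey_squirrels_count)
```

The comparison is case-sensitive, so "gray" in lowercase would match nothing if the file stores "Gray".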

112
00:06:51,620 --> 00:06:56,420
So now all I need to do is to repeat this process for the other colored

113
00:06:56,420 --> 00:07:00,440
squirrels. So I'm going to call it a red squirrel, even though theoretically,

114
00:07:00,470 --> 00:07:02,570
their fur color is cinnamon.

115
00:07:02,990 --> 00:07:07,580
So I'm just going to copy and paste that in there in case I make any typos and

116
00:07:07,580 --> 00:07:10,400
the final squirrel is the black squirrel.

117
00:07:11,270 --> 00:07:13,910
So those are all three squirrel types,

118
00:07:14,000 --> 00:07:17,240
and if I go ahead and print out all of these,

119
00:07:17,510 --> 00:07:22,510
so the red squirrel count, the black squirrel count and the gray squirrel count,

120
00:07:23,390 --> 00:07:28,390
you can see that you've got mostly gray squirrels, a few red ones and very

121
00:07:28,610 --> 00:07:30,710
rarely do you actually see a black squirrel.
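
As a sketch, the same count can be repeated for all three colors without copying and pasting. The rows below are made up, and the spellings "Gray", "Cinnamon" and "Black" are assumptions about how the file stores the values. The last lines show `value_counts()`, which is not the approach used in this video but should give the same numbers.

```python
import io

import pandas as pd

# Hypothetical stand-in for the census CSV (made-up rows, not real data).
sample_csv = io.StringIO(
    "Unique Squirrel ID,Primary Fur Color\n"
    "A-1,Gray\n"
    "A-2,Cinnamon\n"
    "A-3,Gray\n"
    "A-4,Black\n"
)
data = pd.read_csv(sample_csv)

# Same filter-and-len() step as before, once per color.
counts = {}
for color in ["Gray", "Cinnamon", "Black"]:
    counts[color] = len(data[data["Primary Fur Color"] == color])
print(counts)

# Alternative (not the video's method): let pandas count each distinct value.
alt_counts = data["Primary Fur Color"].value_counts()
print(alt_counts)
```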

122
00:07:30,920 --> 00:07:35,540
I certainly haven't seen one recently. Now that we've got our three values,

123
00:07:35,570 --> 00:07:39,980
it's time to construct our data frame. So to construct our data frame,

124
00:07:40,100 --> 00:07:43,640
the easiest way is to actually just create a dictionary.

125
00:07:44,000 --> 00:07:48,470
So I'm going to create a data dictionary and this dictionary is going to have

126
00:07:48,530 --> 00:07:50,450
two key-value pairs.

127
00:07:50,630 --> 00:07:53,750
So the first key is going to be the fur color

128
00:07:55,070 --> 00:07:59,690
and this is going to contain the three fur colors, which are

129
00:07:59,720 --> 00:08:04,400
gray, cinnamon or red, and black.

130
00:08:05,210 --> 00:08:06,830
And then,

131
00:08:06,890 --> 00:08:10,760
the next key-value pair is going to be the count.

132
00:08:11,420 --> 00:08:15,830
So now we can create a list where the first value is going to be the gray squirrel

133
00:08:15,830 --> 00:08:19,520
count, next is going to be the red squirrel count and finally,

134
00:08:19,520 --> 00:08:24,050
it's going to be the black squirrel count. So now that we've got our dictionary

135
00:08:24,320 --> 00:08:26,750
and this is what it looks like,

136
00:08:27,500 --> 00:08:31,700
then we can go ahead and actually turn this into a data frame.

137
00:08:32,090 --> 00:08:32,929
So to do that,

138
00:08:32,929 --> 00:08:37,220
we need to get hold of pandas and then get hold of the DataFrame class,

139
00:08:37,580 --> 00:08:41,120
and then initialize it using this data dictionary.

140
00:08:41,570 --> 00:08:44,300
So I'm going to save that as df, df for data frame.

141
00:08:44,780 --> 00:08:48,860
And then the final thing I need to do is to get my df

142
00:08:48,920 --> 00:08:51,620
to convert to a CSV.

143
00:08:52,190 --> 00:08:54,530
So now I get to specify the name of the file,

144
00:08:54,530 --> 00:08:59,250
which I will call squirrel_count.csv

145
00:08:59,730 --> 00:09:03,540
and once I hit run, you'll see that new file show up right here

146
00:09:03,990 --> 00:09:08,990
and you can see that it's constructed my new CSV file with all of the data that

147
00:09:09,840 --> 00:09:13,260
I've extracted from my Central Park squirrel census

148
00:09:13,620 --> 00:09:17,220
and I've now got a new table with the data that I'm interested in.
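
Putting the last steps together, a sketch of the dictionary, the data frame and the CSV export might look like this. The gray count is the figure quoted earlier in this video; the other two counts are placeholders, not the real numbers, and the key names "Fur Color" and "Count" are just one reasonable choice. The file is written to a temporary folder here so the sketch leaves your project untouched.

```python
import os
import tempfile

import pandas as pd

grey_squirrels_count = 2473   # figure quoted in the video
red_squirrels_count = 20      # placeholder, not the real count
black_squirrels_count = 10    # placeholder, not the real count

# Each key becomes a column name and each list becomes that column's values.
data_dict = {
    "Fur Color": ["Gray", "Cinnamon", "Black"],
    "Count": [grey_squirrels_count, red_squirrels_count, black_squirrels_count],
}
df = pd.DataFrame(data_dict)

# In the project you would simply write df.to_csv("squirrel_count.csv").
output_path = os.path.join(tempfile.mkdtemp(), "squirrel_count.csv")
df.to_csv(output_path)

with open(output_path) as file:
    print(file.read())
```

By default `to_csv` also writes the row index as a first, unnamed column; passing `index=False` leaves it out.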

149
00:09:17,790 --> 00:09:21,780
So did you manage to complete this challenge? If you found it tricky

150
00:09:21,780 --> 00:09:25,950
working with the data frames and figuring out how to get hold of the columns or

151
00:09:25,950 --> 00:09:29,490
how to get hold of the rows depending on the conditions we're interested

152
00:09:29,490 --> 00:09:33,360
in, then I strongly recommend that you head back to the last lesson,

153
00:09:33,630 --> 00:09:37,830
just try to write out the code that we're doing in each step of the video

154
00:09:37,830 --> 00:09:38,663
yourself,

155
00:09:38,790 --> 00:09:41,910
just to make sure that you're a hundred percent sure with what's going on.

156
00:09:42,630 --> 00:09:44,820
Once you are ready, head over to the next lesson,

157
00:09:44,880 --> 00:09:49,350
where we're going to finally build our US states game.

158
00:09:50,030 --> 00:09:52,410
For all of that and more, I'll see you on the next lesson.

