1
00:00:00,690 --> 00:00:01,859
Now here's a question.

2
00:00:02,310 --> 00:00:07,310
Have you ever gotten into a situation where it's evening time and you wanna

3
00:00:07,920 --> 00:00:10,710
watch a movie, but you don't know what to watch?

4
00:00:11,160 --> 00:00:15,000
Like there's just no ideas coming to your mind and everybody's sitting around

5
00:00:15,000 --> 00:00:16,500
the TV, like, what do we do?

6
00:00:17,400 --> 00:00:22,400
So what if we scraped a list of the top 100 movies of all time and you pick one

7
00:00:25,320 --> 00:00:26,153
from there?

8
00:00:26,700 --> 00:00:31,700
So I recently came across this article on Empire where they list the 100

9
00:00:32,040 --> 00:00:34,590
greatest movies of all time.

10
00:00:35,220 --> 00:00:40,220
And you can see that it goes from 100 all the way down to one.

11
00:00:41,820 --> 00:00:45,960
These are supposed to be the top rated movies that have ever been produced

12
00:00:46,440 --> 00:00:51,440
and I always wanted to watch through all 100 of them and check off each one as I

13
00:00:51,810 --> 00:00:56,790
go along. So this is what we're going to be doing for our final project today.

14
00:00:57,180 --> 00:01:02,040
You're going to be scraping from this website the 100 movies

15
00:01:02,430 --> 00:01:06,570
and you are going to be using Python code to create a text file called movies

16
00:01:06,570 --> 00:01:11,040
.txt that lists them in order, starting from one.

17
00:01:11,760 --> 00:01:16,760
Notice how each of these lines is just the title of one of these movies.

18
00:01:17,730 --> 00:01:20,160
Essentially, it's this part that you want.

19
00:01:20,730 --> 00:01:23,520
And because the listing starts from 100,

20
00:01:23,790 --> 00:01:27,090
you have to figure out how to flip it the other way so that you get it

21
00:01:27,090 --> 00:01:30,270
starting from one. But essentially this is the goal.

22
00:01:30,450 --> 00:01:32,910
And this is the project for today.

23
00:01:32,970 --> 00:01:35,190
You're going to have to use what you learned about web scraping,

24
00:01:35,460 --> 00:01:38,580
but also other things that you've learned along the way about Python.

25
00:01:39,330 --> 00:01:42,540
Pause the video now and try to complete this challenge.

26
00:01:45,230 --> 00:01:45,770
Okay.

27
00:01:45,770 --> 00:01:49,340
So here I've created a brand new blank project,

28
00:01:49,370 --> 00:01:52,670
which I've called top100-movies, but of course, that doesn't matter.

29
00:01:53,210 --> 00:01:56,900
But main.py is where we're going to be doing all the serious stuff.

30
00:01:57,200 --> 00:02:01,160
So the first thing we'll need is to copy the URL.

31
00:02:01,670 --> 00:02:05,870
And at the time when you're doing this project, this URL might change.

32
00:02:06,110 --> 00:02:08,090
So don't get it from here. Instead,

33
00:02:08,360 --> 00:02:11,480
go to the course resources and copy the URL from there.

34
00:02:12,380 --> 00:02:13,790
Once you've got the URL,

35
00:02:13,820 --> 00:02:18,020
we're going to paste it into our main.py and save it as a constant.

36
00:02:18,710 --> 00:02:21,980
In addition, we're going to have to import all the things that we need,

37
00:02:22,010 --> 00:02:23,480
including requests,

38
00:02:23,870 --> 00:02:27,500
and also the bs4 module

39
00:02:28,790 --> 00:02:31,280
where we can get hold of our Beautiful Soup.

40
00:02:33,980 --> 00:02:37,640
I'm going to need to install both of these because I don't have them yet.

41
00:02:38,510 --> 00:02:42,440
So I'm going to click on the red squiggly line, install requests,

42
00:02:43,070 --> 00:02:47,330
and also install Beautiful Soup. Once that's all done,

43
00:02:47,390 --> 00:02:52,390
then my red squiggly lines should be gone and I can now start using my requests

44
00:02:52,880 --> 00:02:53,870
module to

45
00:02:53,870 --> 00:02:58,870
make a get request to this particular URL. And I'll save the response

46
00:02:59,100 --> 00:03:01,750
I get back as the response.

47
00:03:03,280 --> 00:03:05,200
The actual HTML files,

48
00:03:05,200 --> 00:03:10,200
so the website HTML, is actually under response.text.

49
00:03:11,560 --> 00:03:14,440
So this is going to be the raw HTML text,

50
00:03:14,800 --> 00:03:18,820
and this is what I'm going to use Beautiful Soup to parse.

51
00:03:19,480 --> 00:03:24,010
So now we make soup and we're going to use Beautiful Soup and we're going to

52
00:03:24,010 --> 00:03:26,110
pass in our website HTML

53
00:03:26,530 --> 00:03:29,170
and also the html.parser

54
00:03:29,620 --> 00:03:33,220
which is the parser that we're going to use to parse through this website.
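
As a rough sketch of the steps so far: fetch the page with requests, then hand response.text to Beautiful Soup with html.parser. The URL below is a made-up placeholder (the real one comes from the course resources), the function names fetch_html and make_soup are only illustrative, and the timeout and raise_for_status call are assumptions added for safety rather than something shown in the video. The demo parses a small sample string so it runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical placeholder; copy the real URL from the course resources.
URL = "https://example.com/top-100-movies"


def fetch_html(url: str) -> str:
    """Make a GET request and return the raw HTML text of the page."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def make_soup(html: str) -> BeautifulSoup:
    """Parse raw HTML using Python's built-in html.parser."""
    return BeautifulSoup(html, "html.parser")


# In the real project this would be: soup = make_soup(fetch_html(URL))
# Here we parse a made-up sample so the sketch runs offline.
sample_html = "<html><body><h3 class='title'>1) Example</h3></body></html>"
soup = make_soup(sample_html)
print(soup.prettify())
```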

55
00:03:34,360 --> 00:03:35,710
Now that we've made soup,

56
00:03:35,740 --> 00:03:40,720
let's go ahead and print out our soup in a prettified format so that we can just

57
00:03:40,720 --> 00:03:41,980
see what it looks like.

58
00:03:42,160 --> 00:03:46,660
So let's run our main.py and we can see that we've got basically that

59
00:03:46,660 --> 00:03:49,750
entire website's HTML being printed out

60
00:03:49,750 --> 00:03:54,750
here. Now comes the part where we need to use our Google Chrome inspector.

61
00:03:56,020 --> 00:04:00,700
The part that we want from this website is just these lines. Now,

62
00:04:00,730 --> 00:04:02,500
of course, if we didn't know how to code,

63
00:04:02,740 --> 00:04:06,490
we would be here copying and pasting for hours on end,

64
00:04:06,910 --> 00:04:11,890
and we would die of boredom or we'd get repetitive strain injury from doing so

65
00:04:11,890 --> 00:04:15,760
much copying and pasting. But because we know code, we know better than that.

66
00:04:15,910 --> 00:04:19,510
So let's get hold of the part that we want. Let's right-

67
00:04:19,510 --> 00:04:22,270
click and click inspect. Now,

68
00:04:22,270 --> 00:04:27,220
we can see that this lives inside an h3 with the class of title.

69
00:04:27,790 --> 00:04:31,630
Now let's just check against one of the other pieces that we want and just make

70
00:04:31,630 --> 00:04:33,880
sure that they've got the same structure.

71
00:04:34,330 --> 00:04:38,470
So this is also inside an h3 with the class of title.

72
00:04:38,950 --> 00:04:43,180
So basically as long as we can scrape this entire page and get all of the 

73
00:04:43,210 --> 00:04:48,040
h3 with the class of title and get the text that's contained inside the 

74
00:04:48,070 --> 00:04:52,120
h3 element, then we're golden. Let's go ahead and do that.

75
00:04:52,600 --> 00:04:55,120
So instead of printing our soup.prettify,

76
00:04:55,150 --> 00:04:57,550
I'm going to tap into soup and I'm going to say

77
00:04:57,550 --> 00:05:02,550
find_all. The thing that I want to find has the tag name of an h3 and it

78
00:05:04,360 --> 00:05:07,300
has the class of title.

79
00:05:08,080 --> 00:05:11,710
So that all came from our inspections right here.

80
00:05:12,400 --> 00:05:15,940
This should get us a list of all of the 

81
00:05:15,970 --> 00:05:17,620
h3 elements with this class

82
00:05:18,160 --> 00:05:23,110
and we should be able to save that into a variable called all_movies.

83
00:05:23,980 --> 00:05:28,450
Let's print all_movies and see what we get.

84
00:05:30,190 --> 00:05:33,430
Now, we've got a list of all of our h3s,

85
00:05:33,970 --> 00:05:38,970
and we're now going to go one step further and fetch the text from within these

86
00:05:41,380 --> 00:05:45,340
h3 elements. We do that using the

87
00:05:45,640 --> 00:05:49,000
getText method. But we can't do that on the list

88
00:05:49,030 --> 00:05:51,820
so we're going to use list comprehension.

89
00:05:52,390 --> 00:05:57,390
So we're going to say the movie_titles is equal to a new list

90
00:05:58,850 --> 00:05:59,780
and in this list,

91
00:05:59,930 --> 00:06:04,930
each item is going to be formed from a movie in the all_movies list.

92
00:06:09,710 --> 00:06:13,970
And this new item is going to be created by taking each of the movies in the

93
00:06:13,970 --> 00:06:17,900
list and then calling getText on it. Now,

94
00:06:17,900 --> 00:06:21,380
if I go ahead and print my movie_titles

95
00:06:21,410 --> 00:06:25,940
instead of all_movies, then this is what we get.

96
00:06:25,940 --> 00:06:29,930
We get all of the titles of all 100 movies.
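
Here's a small sketch of the find_all and list comprehension steps. The sample HTML is made up to imitate the h3 elements with the class of title seen in the inspector; the real page's markup may differ by the time you try this.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the structure described in the lesson.
sample_html = """
<h3 class="title">3) Movie C</h3>
<h3 class="title">2) Movie B</h3>
<h3 class="title">1) Movie A</h3>
<h3 class="other">Not a movie</h3>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list of every h3 tag that has the class "title".
all_movies = soup.find_all(name="h3", class_="title")

# getText works on a single tag, not on the list, so call it per item.
movie_titles = [movie.getText() for movie in all_movies]
print(movie_titles)
```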

97
00:06:32,120 --> 00:06:37,120
Now we want to reverse this list so that we can put it into a text file starting

98
00:06:38,060 --> 00:06:40,490
from 1, going down to 100.

99
00:06:41,000 --> 00:06:43,070
So there's a couple of ways that we could do this.

100
00:06:43,130 --> 00:06:46,520
One is we can use the Python slice operator.

101
00:06:46,910 --> 00:06:49,100
So we add a set of square brackets

102
00:06:49,520 --> 00:06:52,220
and then we add a ::-1,

103
00:06:54,650 --> 00:06:58,400
and this will reverse the order. And as always,

104
00:06:58,760 --> 00:07:03,260
you can find out this information either by Googling or through what you've

105
00:07:03,260 --> 00:07:07,700
learned before in previous lessons. So this is the syntax that we're using,

106
00:07:08,180 --> 00:07:12,380
which comes from the slice operator where we have a start,

107
00:07:12,680 --> 00:07:15,620
a stop, and a step. So in this case,

108
00:07:15,800 --> 00:07:18,410
the start is at the very beginning of the list,

109
00:07:18,470 --> 00:07:20,330
the stop is at the very end of the list so

110
00:07:20,360 --> 00:07:22,940
we don't have to specify those because they're the defaults,

111
00:07:23,360 --> 00:07:28,360
and then the step is basically -1 and this syntax will reverse that list.
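
A quick sketch of that slice, using a short made-up list in place of the 100 scraped titles:

```python
# Stand-in for the scraped titles, which start from the highest number.
movie_titles = ["3) Movie C", "2) Movie B", "1) Movie A"]

# Start and stop are left at their defaults; a step of -1 walks backwards.
movies = movie_titles[::-1]
print(movies)
```

The slice builds a new list, so movie_titles itself keeps its original order.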

112
00:07:31,040 --> 00:07:32,480
Now, alternatively, you can,

113
00:07:32,480 --> 00:07:37,480
of course also use a for loop where you create some sort of n in a range and the

114
00:07:40,370 --> 00:07:44,150
range again, can take a start, stop and step.

115
00:07:44,630 --> 00:07:46,820
So you could start at the end,

116
00:07:46,850 --> 00:07:50,930
so the length of our movie_titles-1

117
00:07:51,020 --> 00:07:55,370
because remember lists in Python start numbering from zero.

118
00:07:55,730 --> 00:08:00,680
So the very last item is actually at the total number minus one.

119
00:08:01,370 --> 00:08:05,690
And next is going to be the end, which is going to be zero, and finally

120
00:08:05,690 --> 00:08:09,770
it's going to be the step, which is minus one each time.

121
00:08:09,890 --> 00:08:13,790
So this time we start off from the very end of the list,

122
00:08:13,790 --> 00:08:18,620
go back to the beginning, stepping by minus one each time. And this way,

123
00:08:18,620 --> 00:08:19,880
if we print n,

124
00:08:19,940 --> 00:08:24,940
you can see that this is going to give us basically all the way from 99 down to

125
00:08:26,090 --> 00:08:26,923
1.

126
00:08:27,590 --> 00:08:32,590
And we can use that to tap into our movie titles and get hold of each of the items

127
00:08:35,330 --> 00:08:36,620
at index n.

128
00:08:41,110 --> 00:08:45,580
And because range actually stops before it reaches the end value

129
00:08:45,610 --> 00:08:50,110
we actually have to put -1 there if we want to get the very final one

130
00:08:50,110 --> 00:08:55,000
which is number 100. And you'll notice also that there's a

131
00:08:55,200 --> 00:08:58,110
bit of a typo here for number 93,

132
00:08:58,530 --> 00:09:00,990
and this is actually not our fault at all.

133
00:09:00,990 --> 00:09:04,380
It's, in fact, in the original Empire post.

134
00:09:04,440 --> 00:09:09,440
They actually screwed up and this should be number 93. Because we're scraping

135
00:09:10,650 --> 00:09:12,150
data we can't really be picky.

136
00:09:12,480 --> 00:09:15,090
We're just going to end up with what we end up with.

137
00:09:15,870 --> 00:09:20,870
So I'm going to choose the method where we actually take the movie title

138
00:09:20,910 --> 00:09:25,910
and then we use the slice to get hold of it in reverse.

139
00:09:27,090 --> 00:09:29,460
And I'll call that the movies.

140
00:09:30,690 --> 00:09:35,640
And now we can create our new text file. So with open

141
00:09:35,730 --> 00:09:38,670
we're going to create a new file called movies.txt,

142
00:09:39,090 --> 00:09:44,090
and we of course have to change the mode to write mode so that we can actually

143
00:09:44,490 --> 00:09:47,370
create this file. And then

144
00:09:47,430 --> 00:09:51,810
because this file movies.txt doesn't exist, once this line runs

145
00:09:51,810 --> 00:09:55,230
it's going to create that file and then we're going to write to it.

146
00:09:55,230 --> 00:09:57,060
So we're going to say file.write

147
00:09:57,510 --> 00:10:02,510
and we're going to write each of the lines in our list of movies.

148
00:10:03,060 --> 00:10:05,160
We can again use a for loop

149
00:10:05,460 --> 00:10:08,370
so for movie in movies,

150
00:10:08,850 --> 00:10:13,850
let's go ahead and write the name of the movie

151
00:10:16,800 --> 00:10:20,040
and then let's add a newline character,

152
00:10:20,040 --> 00:10:25,040
so \n, so that we get each of the movies onto its own line.
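
The file-writing step could be sketched like this. The movies list here is a short made-up stand-in for the reversed titles, and the encoding argument is an assumption added so that titles with accented characters are written the same way on any system.

```python
# Stand-in for the reversed list of scraped titles.
movies = ["1) Movie A", "2) Movie B", "3) Movie C"]

# Write mode creates movies.txt if it doesn't exist (and overwrites it if it does).
with open("movies.txt", mode="w", encoding="utf-8") as file:
    for movie in movies:
        # \n puts each movie on its own line.
        file.write(f"{movie}\n")
```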

153
00:10:26,250 --> 00:10:28,710
Now, finally, if I run this code,

154
00:10:28,830 --> 00:10:32,790
then it's going to create our movies.txt. And if I take a look at it,

155
00:10:32,820 --> 00:10:37,820
you can see it's now got all of the 100 movies listed on here from 1 down to

156
00:10:38,250 --> 00:10:39,083
100.

157
00:10:39,900 --> 00:10:43,710
And now you can go through this list and delete the ones that you've already

158
00:10:43,710 --> 00:10:47,970
watched and then continue watching through the rest of the list.

159
00:10:48,660 --> 00:10:52,470
I hope you had fun trying out web scraping by yourself in this project.

160
00:10:52,740 --> 00:10:56,100
So that's all for today. Take a look at what you've done.

161
00:10:56,130 --> 00:11:00,600
If there's anything confusing about the class or ID or HTML,

162
00:11:00,750 --> 00:11:04,410
then be sure to review some of the lessons in the previous four days.

163
00:11:05,400 --> 00:11:06,420
That's all for today.

