1
00:00:00,420 --> 00:00:05,420
In the last lesson, we got started using Beautiful Soup and we saw how we could

2
00:00:05,430 --> 00:00:10,430
use it to parse through the HTML of a website and pull out the pieces that we're

3
00:00:10,560 --> 00:00:11,393
interested in.

4
00:00:12,210 --> 00:00:16,920
Now it's no fun scraping a website that you've already got access to locally.

5
00:00:17,490 --> 00:00:21,900
It's much better if we can get hold of something that's currently live on the

6
00:00:21,900 --> 00:00:26,520
internet. So I'm going to go ahead and comment out all of this code,

7
00:00:27,960 --> 00:00:32,960
and I'm going to be using Beautiful Soup to get hold of a live website and grab

8
00:00:33,900 --> 00:00:34,800
data from it.

9
00:00:35,580 --> 00:00:40,170
And the website that we're going to be using is Y Combinator's Hacker

10
00:00:40,170 --> 00:00:41,003
News website.

11
00:00:41,670 --> 00:00:46,650
This is where anybody can post a link to a news piece that they've discovered

12
00:00:46,650 --> 00:00:51,300
that's tech related, or you could show off things that you've built.

13
00:00:51,660 --> 00:00:52,200
For example,

14
00:00:52,200 --> 00:00:56,430
I've just been looking at this guy's website that he built called My Desk Tour

15
00:00:57,030 --> 00:00:59,760
where you can post a picture of your desk setup,

16
00:01:00,150 --> 00:01:04,110
and you can see all of the tools and gear that they've got.

17
00:01:04,860 --> 00:01:08,160
So we're going to be scraping the main Hacker News website

18
00:01:08,190 --> 00:01:10,980
which is under this particular URL.

19
00:01:11,520 --> 00:01:15,450
And this is usually where I go to find the latest tech news.

20
00:01:16,200 --> 00:01:20,760
We're going to copy this URL, and we're going to go back to our main.py.

21
00:01:21,270 --> 00:01:25,050
And in order to download the data from that website,

22
00:01:25,080 --> 00:01:28,860
we're going to be using our handy friend, which is requests.

23
00:01:29,520 --> 00:01:34,520
Now requests allows us to get hold of the data from a particular URL,

24
00:01:36,060 --> 00:01:39,240
which in this case is news.ycombinator.com.

25
00:01:40,500 --> 00:01:43,080
And once we've made that request,

26
00:01:43,110 --> 00:01:48,110
then we can save the data that we get back in a response variable.

27
00:01:49,070 --> 00:01:50,840
And once we've got the response,

28
00:01:50,870 --> 00:01:54,350
we can actually print out the text of the response

29
00:01:54,890 --> 00:01:59,890
and this is basically equivalent to what we did when we opened up our HTML file

30
00:02:00,590 --> 00:02:04,040
and we read the file contents, the text of the file.
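Here is a minimal sketch of that download step, assuming the requests package is installed. The fetch_page helper, the HN_URL constant and the timeout value are illustrative additions, not part of the lesson's code.

```python
import requests

HN_URL = "https://news.ycombinator.com/"  # the site named in the lesson


def fetch_page(url: str) -> str:
    """Download a page and return its HTML as a string (hypothetical helper)."""
    response = requests.get(url, timeout=10)  # timeout is an assumed value
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text  # the same kind of text we read from the local file
```

Calling fetch_page(HN_URL) and printing the result should show the same HTML you see under view page source.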

31
00:02:04,700 --> 00:02:06,950
So now if I go ahead and run this,

32
00:02:07,010 --> 00:02:10,850
then you can see that it's going to print out loads of stuff,

33
00:02:11,150 --> 00:02:15,860
but this is basically the code that represents this particular page.

34
00:02:16,310 --> 00:02:19,700
So in fact, if you right click and click view page source,

35
00:02:19,730 --> 00:02:24,470
you'll see that this is exactly the same HTML code that we're getting back over

36
00:02:24,470 --> 00:02:28,790
here. So we don't actually want all of this jumbled mess.

37
00:02:29,180 --> 00:02:34,180
What we're more interested in is the specific titles and the links

38
00:02:34,760 --> 00:02:37,100
for each of these pieces.

39
00:02:37,730 --> 00:02:41,480
It shows by default 30 of the top articles,

40
00:02:41,870 --> 00:02:43,940
and this is ranked by an algorithm.

41
00:02:43,970 --> 00:02:48,970
So it's most recent and also getting a lot of traction,

42
00:02:49,040 --> 00:02:50,420
so a lot of upvotes,

43
00:02:50,840 --> 00:02:55,730
but it doesn't represent the most upvoted items. So you can see here, in fact,

44
00:02:55,760 --> 00:02:58,340
the most upvoted at least today anyways,

45
00:02:58,430 --> 00:03:01,210
is Mozilla laying off 250 employees.

46
00:03:01,960 --> 00:03:06,960
So what if I wanted to get hold of the article title and the link of the post

47
00:03:07,840 --> 00:03:10,300
from this page that has the highest points?

48
00:03:10,690 --> 00:03:14,590
I don't want to have to manually check all of this. I want to do it with code.

49
00:03:15,280 --> 00:03:18,040
So let's go ahead and scrape it. Now,

50
00:03:18,070 --> 00:03:21,970
if I right click on each of these titles and I click on inspect,

51
00:03:22,420 --> 00:03:26,530
then it takes me to the precise line of code in the HTML

52
00:03:26,560 --> 00:03:30,070
that's responsible for rendering this component.

53
00:03:30,550 --> 00:03:34,210
This is actually an anchor tag, so that's the

54
00:03:34,240 --> 00:03:39,240
a tag. And the text in the anchor tag is the title of the article,

55
00:03:41,470 --> 00:03:45,430
and then the href is the link that will take me to the actual story. So

56
00:03:45,460 --> 00:03:47,230
if I click on this, you can see

57
00:03:47,230 --> 00:03:51,820
it takes me to the actual news piece about Joan Feynman.

58
00:03:52,960 --> 00:03:54,880
So what about this point? Well,

59
00:03:54,880 --> 00:03:57,850
let's go ahead and right click on it and click inspect.

60
00:03:58,180 --> 00:04:03,180
You can see this is in a span and it has a class of score while this title is in

61
00:04:06,910 --> 00:04:10,570
an anchor tag and it has a class of storylink.

62
00:04:11,170 --> 00:04:13,420
So with those two pieces of information,

63
00:04:13,570 --> 00:04:17,350
we can use Beautiful Soup to scrape all of the titles,

64
00:04:17,500 --> 00:04:20,019
all of the links and all of their points.

65
00:04:20,470 --> 00:04:24,550
And we can compare all those points and figure out which one has the highest

66
00:04:24,550 --> 00:04:28,030
points on this page. So let's go ahead and do that.

67
00:04:29,020 --> 00:04:33,310
So I'm gonna save the yc_webpage as the response.text,

68
00:04:34,300 --> 00:04:38,710
and then I'm going to use Beautiful Soup to parse that webpage.

69
00:04:39,220 --> 00:04:40,480
So BeautifulSoup

70
00:04:40,540 --> 00:04:45,540
and then I'm going to pass in the actual HTML document that we want to parse.

71
00:04:45,910 --> 00:04:48,220
So this is the YC webpage,

72
00:04:48,820 --> 00:04:52,750
and then we provide the parser that we're going to use to parse it.

73
00:04:52,750 --> 00:04:56,890
So html.parser, with an ER at the end.

74
00:04:57,580 --> 00:05:01,930
And this is our soup. Once we've created our soup,

75
00:05:02,620 --> 00:05:06,580
the next step is actually to dig in the soup and find the parts that we want.

76
00:05:06,730 --> 00:05:11,290
So, for example, if I want to get hold of the title of this page,

77
00:05:11,290 --> 00:05:16,060
then I can just say print soup.title. And now you'll see

78
00:05:16,060 --> 00:05:19,000
it gives me the title which is Hacker News.
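As a rough sketch of making the soup and reading the page title, assuming the beautifulsoup4 package is installed. The small HTML string below is a made-up stand-in for the downloaded page, so the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for response.text; the real page is far larger.
yc_webpage = """
<html>
  <head><title>Hacker News</title></head>
  <body><p>placeholder content</p></body>
</html>
"""

# "html.parser" is the parser built into Python's standard library.
soup = BeautifulSoup(yc_webpage, "html.parser")

print(soup.title)         # <title>Hacker News</title>
print(soup.title.string)  # Hacker News
```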

79
00:05:19,780 --> 00:05:24,100
And that's the same as what you see here in the tab bar.

80
00:05:24,910 --> 00:05:26,710
Now, what if I didn't want the title?

81
00:05:26,710 --> 00:05:31,150
What if I actually wanted to get hold of this text here,

82
00:05:31,750 --> 00:05:34,120
the title of each of these articles?

83
00:05:34,810 --> 00:05:39,810
See if you can figure out how to get hold of this text and print it out in your

84
00:05:40,330 --> 00:05:44,590
code. Remember it has a class of storylink,

85
00:05:44,950 --> 00:05:46,360
and it's an anchor tag.

86
00:05:46,960 --> 00:05:50,410
Pause the video and see if you can get this title,

87
00:05:50,590 --> 00:05:53,380
so yours might be different from what I've got on screen of course.

88
00:05:53,440 --> 00:05:57,590
It depends on what's showing up on Hacker News on the day you are doing this.

89
00:05:58,010 --> 00:06:02,810
But get the title of the first article printed out using BeautifulSoup.

90
00:06:03,490 --> 00:06:04,323
Okay.

91
00:06:05,800 --> 00:06:08,920
All right. What we want to do is we want to use find.

92
00:06:09,430 --> 00:06:14,430
So we're going to find the first instance from this webpage where the actual

93
00:06:15,850 --> 00:06:20,080
name of the tag is equal to a, so that's an anchor tag,

94
00:06:20,770 --> 00:06:25,770
and then the class is equal to storylink.

95
00:06:27,250 --> 00:06:29,530
So I'm just gonna copy that and paste it in.

96
00:06:30,220 --> 00:06:34,600
Remember that in order to not clash with the reserved class keyword,

97
00:06:34,630 --> 00:06:37,210
we have to add an underscore afterwards.

98
00:06:38,200 --> 00:06:41,350
Now this should be our article tag.

99
00:06:42,100 --> 00:06:44,230
And if we go ahead and print it,

100
00:06:44,470 --> 00:06:49,470
then you can see that we get this exact anchor tag.

101
00:06:50,350 --> 00:06:54,130
But if we want to get hold of the text that's actually in the anchor tag,

102
00:06:54,190 --> 00:06:58,450
then we have to go one step further and call the getText method

103
00:06:58,570 --> 00:07:03,190
that's also from Beautiful Soup. So now when I run that, you can see

104
00:07:03,310 --> 00:07:06,670
I only get the actual text of the article.
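A small sketch of this find step, run against made-up sample markup rather than the live site. The sample assumes the structure described in the lesson (an anchor tag with a class of storylink); the live site's markup may differ. The code uses get_text(), the snake-case spelling of the getText method mentioned in the lesson.

```python
from bs4 import BeautifulSoup

# Assumed sample markup modelled on the lesson's description.
sample_html = """
<table>
  <tr><td><a class="storylink" href="https://example.com/first">First story</a></td></tr>
  <tr><td><a class="storylink" href="https://example.com/second">Second story</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first match; class_ avoids Python's `class` keyword.
article_tag = soup.find(name="a", class_="storylink")
article_text = article_tag.get_text()

print(article_tag)   # the whole anchor tag
print(article_text)  # First story
```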

105
00:07:07,510 --> 00:07:12,310
Let's work on some of the other pieces. So this is the article text.

106
00:07:13,840 --> 00:07:18,840
And then if we want to get hold of the article_link and the article_upvotes.

107
00:07:22,300 --> 00:07:26,020
See if you can figure out how to complete these two parts as well.

108
00:07:26,200 --> 00:07:30,880
So we want the HTML link that is, of course, all of this HTTP,

109
00:07:30,880 --> 00:07:34,330
et cetera. And then we also want to get hold of the upvote

110
00:07:34,390 --> 00:07:37,150
which is this little number right here.

111
00:07:38,110 --> 00:07:41,080
It's inside a span with a class of score.

112
00:07:41,380 --> 00:07:42,213
Right?

113
00:07:44,500 --> 00:07:48,190
All right. So let's do the first thing, which is article link. Well,

114
00:07:48,190 --> 00:07:52,840
we can actually already tap into the same article tag we already got up here.

115
00:07:53,260 --> 00:07:55,060
And instead of saying getText,

116
00:07:55,090 --> 00:08:00,090
we can use the get method to get the specific value of an attribute.

117
00:08:02,080 --> 00:08:04,990
So what we want is of course, the href.

118
00:08:06,520 --> 00:08:09,190
And then the article_upvote,

119
00:08:09,250 --> 00:08:14,250
we'll have to tap into our soup and find the tag with a name that is span

120
00:08:17,140 --> 00:08:21,400
because this is what we're looking for, and has a class of score.

121
00:08:21,880 --> 00:08:22,713
Right?
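Sketched against made-up sample markup, those two lookups could look like this. The class names storylink and score come from the lesson; the URL and the sample HTML are placeholders.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the real page has many more rows and attributes.
sample_html = """
<table>
  <tr><td><a class="storylink" href="https://example.com/first">First story</a></td></tr>
  <tr><td><span class="score">19 points</span></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

article_tag = soup.find(name="a", class_="storylink")
article_link = article_tag.get("href")  # value of the href attribute

score_tag = soup.find(name="span", class_="score")  # still a tag, not text

print(article_link)  # https://example.com/first
print(score_tag)     # <span class="score">19 points</span>
```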

122
00:08:24,400 --> 00:08:29,400
Like this. Finding this particular tag is not enough.

123
00:08:29,680 --> 00:08:34,570
This actually just gets us the tag. If we want to go further

124
00:08:34,570 --> 00:08:38,350
and we actually want to get the text that's inside that span

125
00:08:38,650 --> 00:08:41,230
which is of course the 19 points,

126
00:08:41,799 --> 00:08:45,280
then we have to dig one step deeper and call the

127
00:08:45,310 --> 00:08:49,450
getText method like this. Now,

128
00:08:49,480 --> 00:08:54,480
if I go ahead and print out the article_text and the article_link,

129
00:08:55,710 --> 00:08:59,550
and also finally the article_upvote,

130
00:09:00,090 --> 00:09:01,200
then you can see

131
00:09:01,200 --> 00:09:05,730
I get all three pieces of data that I'm interested in. Now,

132
00:09:05,790 --> 00:09:10,620
instead of getting the first occurrence, I want to get all of the ones that are

133
00:09:10,650 --> 00:09:15,570
on this page, so all 30 results. Now, in order to do that,

134
00:09:15,810 --> 00:09:20,810
I have to change the find to find_all both here and here.

135
00:09:23,280 --> 00:09:23,790
This way

136
00:09:23,790 --> 00:09:28,790
we get a list of all of the articles and I'll get a list of all of the article

137
00:09:31,860 --> 00:09:32,693
_upvotes.

138
00:09:33,990 --> 00:09:38,040
So now it's going to be a little bit different. In order to get all of the text

139
00:09:38,070 --> 00:09:41,730
and all of the links, then I have to use a for loop.

140
00:09:42,450 --> 00:09:46,410
So I'll say for article tag in articles,

141
00:09:46,740 --> 00:09:47,940
so articles is of course,

142
00:09:47,940 --> 00:09:52,410
this list where we find all of the anchor tags with a class of storylink,

143
00:09:53,130 --> 00:09:56,820
and then I'm going to loop through each one of those and for each of the tags,

144
00:09:56,850 --> 00:10:00,030
I'm going to get the text and also get the href.

145
00:10:02,100 --> 00:10:06,150
I'm going to create two new lists, article_texts and article_links.

146
00:10:08,550 --> 00:10:12,930
And then I'm going to save each of the new articles into those lists.

147
00:10:19,700 --> 00:10:20,533
...

148
00:10:20,690 --> 00:10:23,000
Like this. And in fact,

149
00:10:23,030 --> 00:10:28,030
we could probably simplify this a little bit by refactoring and renaming the

150
00:10:30,020 --> 00:10:31,040
article text,

151
00:10:31,070 --> 00:10:35,240
so the singular version into just text and the article_link

152
00:10:36,770 --> 00:10:38,420
to just the link.

153
00:10:39,740 --> 00:10:44,300
So now let's print out the lists. So the article_texts,

154
00:10:45,530 --> 00:10:49,010
the article_links and the article_upvotes.

155
00:10:49,760 --> 00:10:51,440
And this find_all

156
00:10:51,440 --> 00:10:55,760
gives me a list and I can't call getText on the list.

157
00:10:56,120 --> 00:10:59,720
So I'll also need to create a new list.

158
00:11:00,050 --> 00:11:02,360
So I'm going to choose to use list comprehension here.

159
00:11:03,260 --> 00:11:08,260
So I'm going to say for score in all of the scores,

160
00:11:10,370 --> 00:11:13,850
we're going to create a list using each of those scores

161
00:11:14,090 --> 00:11:17,450
and we're going to call getText in order to get each of them.

162
00:11:18,500 --> 00:11:21,560
This is the same as writing out a for loop like this

163
00:11:21,650 --> 00:11:26,180
but it's obviously much shorter. Now, when I hit run,

164
00:11:26,210 --> 00:11:28,880
you can see that each of my lists is ordered.

165
00:11:29,150 --> 00:11:33,950
So this is the first article's text, this is the first article's link,

166
00:11:34,220 --> 00:11:36,560
and this is the first article's points.
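Putting the find_all version together, a sketch might look like the following. It uses made-up sample markup with two stories, so the numbers and URLs are placeholders, and get_text() is the snake-case spelling of getText.

```python
from bs4 import BeautifulSoup

# Assumed sample markup with two stories and their scores.
sample_html = """
<table>
  <tr><td><a class="storylink" href="https://example.com/first">First story</a></td></tr>
  <tr><td><span class="score">40 points</span></td></tr>
  <tr><td><a class="storylink" href="https://example.com/second">Second story</a></td></tr>
  <tr><td><span class="score">1312 points</span></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns a list-like collection of every matching tag.
articles = soup.find_all(name="a", class_="storylink")

article_texts = []
article_links = []
for article_tag in articles:
    text = article_tag.get_text()
    article_texts.append(text)
    link = article_tag.get("href")
    article_links.append(link)

# get_text() works on one tag at a time, so build the list with a comprehension.
article_upvotes = [
    score.get_text() for score in soup.find_all(name="span", class_="score")
]

print(article_texts)
print(article_links)
print(article_upvotes)
```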

167
00:11:38,660 --> 00:11:43,660
What we want to do is we want to get the article_upvotes into a number format,

168
00:11:45,080 --> 00:11:47,750
so an integer. And to do that,

169
00:11:47,780 --> 00:11:51,380
we of course have to get rid of the word "points" that comes afterwards.

170
00:11:51,830 --> 00:11:56,740
But notice how each of these items is a string. So that means we can split the string

171
00:11:56,890 --> 00:12:01,270
by the space and only get hold of the first item from that split.

172
00:12:02,110 --> 00:12:06,070
Let me show you what I mean. So we've got all of the article upvotes,

173
00:12:06,190 --> 00:12:09,670
let's go ahead and just print out the first item.

174
00:12:10,380 --> 00:12:11,213
Right.

175
00:12:13,770 --> 00:12:18,090
So now we just get the first item, which is 40 points. Now,

176
00:12:18,090 --> 00:12:21,840
if I take that item and I call the split method,

177
00:12:21,900 --> 00:12:26,400
then it's going to split every word in the sentence. By default,

178
00:12:26,400 --> 00:12:31,350
it splits on whitespace. Now, if I run this code,

179
00:12:32,220 --> 00:12:32,910
you can see

180
00:12:32,910 --> 00:12:37,910
I get a list where I've got the first item being 40 and the second being points.

181
00:12:39,390 --> 00:12:42,900
So it's basically split that string by the space.

182
00:12:43,740 --> 00:12:48,300
Now the next stage is I could get hold of just the first item that comes from

183
00:12:48,300 --> 00:12:50,130
that list, which is now 40.

184
00:12:50,820 --> 00:12:54,180
If I now finally wrap it inside int(),

185
00:12:54,270 --> 00:12:59,130
then I can turn that into an actual number. Don't worry if your number changes

186
00:12:59,130 --> 00:13:03,600
because you're pulling data live from a website. That upvote number can change in

187
00:13:03,600 --> 00:13:06,540
any second. So this is the method

188
00:13:06,540 --> 00:13:10,260
by which we can get hold of the actual number from the upvotes.
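Step by step, that conversion can be sketched in plain Python. The string "40 points" is the example value from the lesson; a live page will show different numbers.

```python
first_upvote = "40 points"  # example value; live numbers will differ

parts = first_upvote.split()     # no argument: split on whitespace
number_text = parts[0]           # the first item is the number, still a string
upvote_count = int(number_text)  # convert the string into an integer

print(parts)         # ['40', 'points']
print(upvote_count)  # 40
```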

189
00:13:10,890 --> 00:13:14,790
Now we're going to apply all of this .split

190
00:13:14,850 --> 00:13:19,740
and also getting hold of the first item into our list comprehension.

191
00:13:20,100 --> 00:13:22,920
So for each of the scores that soup finds,

192
00:13:23,160 --> 00:13:27,420
we're going to get hold of the text and then split the text and then get the first

193
00:13:27,420 --> 00:13:30,150
item from the text. And then finally,

194
00:13:30,180 --> 00:13:35,180
we wrap all of this inside int() and turn it into an integer. Then if I go ahead

195
00:13:36,960 --> 00:13:39,120
and uncomment all these lines of code,

196
00:13:39,540 --> 00:13:43,350
then you can see I've got all of these numbers being printed out,

197
00:13:43,980 --> 00:13:46,110
which means I can now sort them.
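The same idea inside a list comprehension might look like this. To keep the sketch free of any network or parsing step, score_texts is a hypothetical stand-in for the text pulled from each score span.

```python
# Hypothetical stand-in for the text of each score span on the page.
score_texts = ["40 points", "1312 points", "19 points"]

# For each string: split on whitespace, take the first piece, convert to int.
article_upvotes = [int(text.split()[0]) for text in score_texts]

print(article_upvotes)  # [40, 1312, 19]
```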

198
00:13:47,670 --> 00:13:51,930
I want to get the index of the list item that has the highest value.

199
00:13:52,140 --> 00:13:55,950
And then I want to use that index to pick out the title text

200
00:13:56,100 --> 00:13:58,920
and also the link from these two lists,

201
00:13:59,400 --> 00:14:02,610
because they're all ordered in exactly the same way.

202
00:14:02,610 --> 00:14:07,610
So this first item corresponds to this first link, which corresponds to this first

203
00:14:07,950 --> 00:14:12,690
upvote. And I want to pose this to you as a challenge.

204
00:14:13,050 --> 00:14:16,080
Can you print out the title and link for the Hacker

205
00:14:16,080 --> 00:14:18,840
News story with the highest number of upvotes?

206
00:14:19,350 --> 00:14:22,080
Since we're working with three different lists at this point,

207
00:14:22,320 --> 00:14:27,320
you'll have to find the index of the largest number inside the article_upvotes

208
00:14:27,390 --> 00:14:28,830
list to accomplish this.

209
00:14:29,310 --> 00:14:32,910
I'll give you a few seconds to pause the video before I show you the solution.

210
00:14:36,330 --> 00:14:36,750
All right,

211
00:14:36,750 --> 00:14:41,750
here's the solution. We can use the max function that Python comes with to get

212
00:14:42,660 --> 00:14:46,440
the largest number from our article_upvotes.

213
00:14:48,450 --> 00:14:52,970
And then we can print this largest number and see if it works.

214
00:14:53,660 --> 00:14:58,040
So we've got 1,312. Now,

215
00:14:58,040 --> 00:15:00,650
once we've gotten hold of the largest number,

216
00:15:00,800 --> 00:15:04,160
then we can find its index from this list.

217
00:15:04,610 --> 00:15:07,640
So we can say article_upvotes.index

218
00:15:07,700 --> 00:15:11,300
and then we find the index of this largest number.

219
00:15:11,410 --> 00:15:12,243
Right.

220
00:15:16,270 --> 00:15:19,780
Now, if we hit run, you can see that we're getting index number 27.

221
00:15:20,470 --> 00:15:22,870
So instead of just printing out that index,

222
00:15:22,960 --> 00:15:27,960
we can print, instead, the article_texts with that index, so passing in the largest

223
00:15:30,010 --> 00:15:35,010
index, and also the article_links and passing in the same index.

224
00:15:37,690 --> 00:15:38,980
So now if I hit run,

225
00:15:39,070 --> 00:15:44,070
you can see that the most popular article at the moment on this page has this

226
00:15:45,100 --> 00:15:48,640
title text and this particular link. Of course for you

227
00:15:48,640 --> 00:15:52,870
it will be different because it depends on what's currently showing up on Hacker

228
00:15:52,870 --> 00:15:55,600
News. But if I refresh this page,

229
00:15:55,630 --> 00:16:00,630
you can see that this article with 1,313 points is of course the most popular

230
00:16:02,500 --> 00:16:05,890
article and it is the one that's about Mozilla.
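A compact sketch of that solution, using three short made-up lists in place of the scraped data so it runs on its own. The list names follow the lesson; the values are placeholders.

```python
# Placeholder data standing in for the three scraped lists.
article_texts = ["First story", "Second story", "Third story"]
article_links = [
    "https://example.com/first",
    "https://example.com/second",
    "https://example.com/third",
]
article_upvotes = [40, 1312, 19]

largest_number = max(article_upvotes)                  # highest score
largest_index = article_upvotes.index(largest_number)  # its position in the list

print(article_texts[largest_index])  # Second story
print(article_links[largest_index])  # https://example.com/second
```

Note that index() returns the first position when two stories tie, and this approach only works if all three lists line up item for item. If any story on the page had no score span, the lists would be different lengths and the index would point at the wrong title.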

231
00:16:07,030 --> 00:16:12,030
You can imagine a use case for this where every day we scrape all the data on Y

232
00:16:12,250 --> 00:16:16,840
Combinator and then we send ourselves, through a text message or an email,

233
00:16:17,170 --> 00:16:20,860
the most upvoted article's title and link

234
00:16:21,100 --> 00:16:24,040
so that we can just look at that one thing.

235
00:16:25,150 --> 00:16:30,150
And you've seen now how we can use the requests module to get hold of the text,

236
00:16:31,090 --> 00:16:34,750
the HTML code from a particular website,

237
00:16:35,020 --> 00:16:38,350
and then use Beautiful Soup to parse through that website

238
00:16:38,770 --> 00:16:43,770
and then to get hold of these specific parts that we want by using find_all or

239
00:16:44,470 --> 00:16:45,303
find,

240
00:16:45,460 --> 00:16:49,900
and then getting hold of the text or getting hold of the link or getting hold of

241
00:16:49,900 --> 00:16:51,640
any other thing that we want.

242
00:16:52,480 --> 00:16:56,920
So now that we've seen how we can do web scraping using Beautiful Soup, in the

243
00:16:56,920 --> 00:16:57,670
next lesson

244
00:16:57,670 --> 00:17:02,670
I want to talk a little bit about the ethics of scraping websites and when to do

245
00:17:03,370 --> 00:17:06,339
it and what you can use the data you get from this

246
00:17:06,339 --> 00:17:10,420
for. So for all of that and more, I'll see you in the next lesson.