1
00:00:00,210 --> 00:00:02,190
Now there's two things you'll notice here.

2
00:00:02,580 --> 00:00:06,900
One is we're only getting hold of the first, for example,

3
00:00:06,900 --> 00:00:11,730
p tag or a tag, but we're not getting hold of any of the other ones.

4
00:00:12,120 --> 00:00:14,970
So what if we wanted to get all of the anchor tags,

5
00:00:15,000 --> 00:00:18,420
all of the paragraphs in our website? Well,

6
00:00:18,450 --> 00:00:23,450
then we can use a function that comes with Beautiful Soup called find_all.

7
00:00:24,570 --> 00:00:27,990
This is probably one of the most commonly used methods when it comes to

8
00:00:27,990 --> 00:00:28,823
Beautiful Soup.

9
00:00:29,730 --> 00:00:34,730
And here we can search by a bunch of things. We could search by name,

10
00:00:34,950 --> 00:00:35,783
so we can say

11
00:00:35,790 --> 00:00:40,790
find all of the tags where the tag name is equal to a.

12
00:00:43,080 --> 00:00:46,050
So this is going to give us all of the anchor tags,

13
00:00:46,190 --> 00:00:47,023
right?

14
00:00:49,220 --> 00:00:50,840
And if I print that,

15
00:00:51,140 --> 00:00:56,140
you can see that it gives us a list and it gives us all three of the links that

16
00:00:57,650 --> 00:00:59,450
exist in our website.
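
A minimal sketch of the find_all call described above. The lesson's actual page is not shown in this transcript, so the HTML below is a hypothetical stand-in with three links and one paragraph, and the URLs are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lesson's page (the real file is not shown).
html = """
<html><body>
<h1 id="name">Jane Doe</h1>
<p><em>Founder of <strong><a href="https://example.com/company">Example Co</a></strong>.</em></p>
<ul>
<li><a href="https://example.com/blog">Blog</a></li>
<li><a href="https://example.com/contact">Contact</a></li>
</ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, not just the first one.
all_anchor_tags = soup.find_all(name="a")
print(all_anchor_tags)

# Swapping the name searches for a different tag, here the paragraphs.
all_paragraphs = soup.find_all(name="p")
print(all_paragraphs)
```

The result behaves like a list, so it can be indexed, measured with len, or looped over.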

17
00:01:00,350 --> 00:01:03,290
And if I change this to p for example,

18
00:01:03,470 --> 00:01:08,470
then it'll find all of the paragraphs and we can change this to basically any of

19
00:01:09,650 --> 00:01:13,850
the tag names in our website. Now,

20
00:01:13,850 --> 00:01:16,280
what if we wanted to drill a little bit deeper?

21
00:01:16,820 --> 00:01:19,190
We've got a list of all the anchor tags,

22
00:01:19,490 --> 00:01:23,870
but what if I only wanted the text in that anchor tag?

23
00:01:23,870 --> 00:01:28,730
So I just wanted this part. Well, how would I get hold of all of them? Well,

24
00:01:28,760 --> 00:01:31,730
firstly, we would probably need a for loop.

25
00:01:32,030 --> 00:01:37,030
So we could say for tag in all anchor tags and we can loop through all of those

26
00:01:39,560 --> 00:01:44,560
anchor tags and use a method called tag.getText.

27
00:01:46,220 --> 00:01:49,040
And now if I go ahead and print this,

28
00:01:49,130 --> 00:01:54,130
you can see that it's basically going to print out the text that is

29
00:01:55,490 --> 00:01:59,810
in all three of the anchor tags that it found. Now,

30
00:01:59,810 --> 00:02:02,420
what if I didn't want to get the text,

31
00:02:02,510 --> 00:02:07,340
but instead I wanted to get hold of the actual href,

32
00:02:07,370 --> 00:02:11,870
so the link, right? So let's print our all_anchor_tags again.

33
00:02:12,650 --> 00:02:13,483
All right.
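
The loop over the anchor tags described a moment ago could look roughly like this. The HTML is a hypothetical stand-in for the lesson's page. The sketch calls get_text, and getText as spoken in the lesson is the camelCase alias for the same method.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lesson's page.
html = """
<p><a href="https://example.com/company">Example Co</a></p>
<ul>
<li><a href="https://example.com/blog">Blog</a></li>
<li><a href="https://example.com/contact">Contact</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
all_anchor_tags = soup.find_all(name="a")

# Collect only the text inside each anchor tag, without the markup around it.
anchor_texts = []
for tag in all_anchor_tags:
    anchor_texts.append(tag.get_text())

print(anchor_texts)
```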

34
00:02:15,800 --> 00:02:20,150
And you can see that there is an attribute called href

35
00:02:20,420 --> 00:02:25,190
which stores the actual link that the tag goes to.

36
00:02:25,700 --> 00:02:30,200
So very often, you'll want to isolate that link. So how would you do that?

37
00:02:30,800 --> 00:02:31,280
Well,

38
00:02:31,280 --> 00:02:36,280
you can tap into each of the tags and you can use another method called get.

39
00:02:38,390 --> 00:02:42,140
And here you can get the value of any of the attributes.
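
As a sketch of the get method, assuming the same kind of hypothetical page with three links (the URLs are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lesson's page.
html = """
<p><a href="https://example.com/company">Example Co</a></p>
<ul>
<li><a href="https://example.com/blog">Blog</a></li>
<li><a href="https://example.com/contact">Contact</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
all_anchor_tags = soup.find_all(name="a")

# get() reads the value of one attribute from a tag.
links = [tag.get("href") for tag in all_anchor_tags]
print(links)

# Asking for an attribute the tag does not have gives None rather than an error.
print(all_anchor_tags[0].get("title"))
```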

40
00:02:42,590 --> 00:02:46,250
So if I pass in href here and I print this,

41
00:02:46,940 --> 00:02:51,200
then it's going to give me all of the links and it's not going to give me

42
00:02:51,200 --> 00:02:54,080
anything else. It's basically just stripped out the link

43
00:02:54,200 --> 00:02:59,090
which is what I'm interested in. Similarly, when we use find_all

44
00:02:59,920 --> 00:03:04,920
we can also find things by their attribute, so at the moment we're searching by

45
00:03:05,410 --> 00:03:08,950
the tag name, but we can also get hold of things

46
00:03:09,040 --> 00:03:11,860
by the attribute name. For example,

47
00:03:11,860 --> 00:03:16,780
if I wanted to get hold of this item, I can of course search for an h1.

48
00:03:17,080 --> 00:03:19,660
But what if I had lots of h1s? Well,

49
00:03:19,660 --> 00:03:24,430
then I could isolate it by this ID. So I could say,

50
00:03:27,210 --> 00:03:27,990
soup

51
00:03:27,990 --> 00:03:32,990
.find_all which will give me a list of all of the items that match the search

52
00:03:34,680 --> 00:03:35,513
query,

53
00:03:35,760 --> 00:03:40,760
or I can use the find method to only find the first item that matches the query.

54
00:03:42,900 --> 00:03:45,450
In my case, there's only one thing I'm looking for.

55
00:03:45,690 --> 00:03:50,610
And this particular tag has a name of h1

56
00:03:51,270 --> 00:03:56,270
but it's also got an ID of name.

57
00:03:58,440 --> 00:03:59,280
As you can see,

58
00:03:59,520 --> 00:04:03,510
this ID is equal to name and it's also an h1 tag.

59
00:04:04,110 --> 00:04:07,290
So this will give us that particular element.

60
00:04:07,380 --> 00:04:12,380
So if I print out this heading and let's comment out everything else,

61
00:04:16,920 --> 00:04:21,600
then now you can see I've just isolated that one h1.

62
00:04:22,260 --> 00:04:25,830
And this also means if I just add another h1 here,

63
00:04:26,130 --> 00:04:26,963
...

64
00:04:29,220 --> 00:04:31,080
and I run this code again,

65
00:04:31,350 --> 00:04:36,150
that is not going to show up because I've said it has to have a name of h1

66
00:04:36,570 --> 00:04:40,020
and an ID that matches this particular value.
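
A small sketch of searching by tag name and ID together. The two h1 elements below are hypothetical, and the second one plays the role of the extra h1 added in the lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two h1 tags, only one of which has id="name".
html = """
<h1 id="name">Jane Doe</h1>
<h1>Another heading</h1>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns only the first match; both conditions must hold.
heading = soup.find(name="h1", id="name")
print(heading)

# Searching by name alone would match both h1 tags.
all_h1s = soup.find_all(name="h1")
print(len(all_h1s))
```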

67
00:04:41,670 --> 00:04:42,900
Now, as you can imagine,

68
00:04:42,900 --> 00:04:47,040
you can also do the same thing with the class attribute.

69
00:04:47,640 --> 00:04:49,110
So we can say,

70
00:04:52,920 --> 00:04:53,430
...

71
00:04:53,430 --> 00:04:56,520
soup.find because again, I'm only looking for one.

72
00:04:57,240 --> 00:05:01,290
And the thing that I'm looking for has a name

73
00:05:01,680 --> 00:05:04,290
which is an h3

74
00:05:04,380 --> 00:05:05,213
...

75
00:05:07,140 --> 00:05:12,140
but it's also got a class that's equal to heading.

76
00:05:13,200 --> 00:05:17,220
So I'm just going to copy that and paste that in here. Now,

77
00:05:17,250 --> 00:05:22,250
one of the things you'll get here is an error because this class keyword is a

78
00:05:23,700 --> 00:05:25,710
reserved keyword in Python.

79
00:05:26,190 --> 00:05:29,100
And what that means is that it's a special word

80
00:05:29,370 --> 00:05:34,170
which can only be used for creating classes. Now, in this case,

81
00:05:34,200 --> 00:05:38,100
we're definitely not creating a class or an object here. Instead,

82
00:05:38,100 --> 00:05:39,990
we're trying to tap into an attribute.

83
00:05:40,470 --> 00:05:43,410
So in order to not clash with the class keyword,

84
00:05:43,680 --> 00:05:46,980
this attribute is actually called class_.

85
00:05:48,330 --> 00:05:52,920
Now it's going to look for all of the h3s where the class attribute is

86
00:05:52,920 --> 00:05:54,360
equal to heading.
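
A minimal sketch of the class search. Beautiful Soup's keyword argument is class_ with a trailing underscore, which avoids the clash with Python's class keyword. The HTML is a hypothetical stand-in for the lesson's page.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one h3 with class="heading" and one without.
html = """
<h3 class="heading">Books and Teaching</h3>
<h3>Something else</h3>
"""

soup = BeautifulSoup(html, "html.parser")

# Writing class="heading" here would be a SyntaxError, hence class_.
section_heading = soup.find(name="h3", class_="heading")
print(section_heading)
```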

87
00:05:54,970 --> 00:05:59,970
Let's go ahead and print this section_heading and you should see now we'll get that 

88
00:06:01,220 --> 00:06:05,810
h3 with the class of heading show up. And again,

89
00:06:05,810 --> 00:06:09,080
if we wanted to get hold of the text

90
00:06:09,110 --> 00:06:14,060
that's contained in that h3, then we simply use the getText method,

91
00:06:14,600 --> 00:06:19,490
or if we want to know the name of that particular tag,

92
00:06:19,550 --> 00:06:21,080
then we can say .name.
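
Once a tag has been found, its text, tag name and attribute values can be read from it. A rough sketch on a hypothetical h3 (get_text is the same method as the getText alias used in the lesson):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the h3 on the lesson's page.
html = '<h3 class="heading">Books and Teaching</h3>'

soup = BeautifulSoup(html, "html.parser")
section_heading = soup.find(name="h3", class_="heading")

heading_text = section_heading.get_text()      # the text inside the tag
heading_name = section_heading.name            # the tag name itself
heading_class = section_heading.get("class")   # the value of the class attribute

print(heading_text)
print(heading_name)
print(heading_class)
```

Because an element can carry several classes, Beautiful Soup returns the class value as a list rather than a single string.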

93
00:06:23,920 --> 00:06:24,370
Okay.

94
00:06:24,370 --> 00:06:29,230
And if we want to get hold of the value of an attribute, for example,

95
00:06:29,260 --> 00:06:32,680
get the class value,

96
00:06:32,830 --> 00:06:36,430
then we can do something like this. Now,

97
00:06:36,460 --> 00:06:41,460
while that's a pretty good way of selecting elements from the entire website,

98
00:06:42,130 --> 00:06:46,360
there's certain cases where it might not work. For example,

99
00:06:46,990 --> 00:06:51,100
at the moment here, we've got our three anchor tags.

100
00:06:51,640 --> 00:06:55,300
If we wanted to get hold of a specific anchor tag,

101
00:06:55,540 --> 00:06:59,950
let's say we wanted this anchor tag, then what do we do?

102
00:07:00,340 --> 00:07:00,640
Well,

103
00:07:00,640 --> 00:07:05,640
then we could just simply find all of the anchor tags and then find the first

104
00:07:06,820 --> 00:07:09,280
one. But as you can imagine,

105
00:07:09,310 --> 00:07:12,400
this is an incredibly simple website.

106
00:07:12,700 --> 00:07:15,490
Most websites will have thousands

107
00:07:15,520 --> 00:07:19,000
if not tens of thousands of links. In that situation,

108
00:07:19,180 --> 00:07:24,180
it's really hard to know which particular link you want from the list of all of

109
00:07:24,880 --> 00:07:25,713
the anchor tags.

110
00:07:26,320 --> 00:07:31,320
So we want to have a way where we can drill down into a particular element.

111
00:07:32,440 --> 00:07:35,770
What's unique about this particular anchor tag? Well,

112
00:07:35,800 --> 00:07:40,800
it sits inside a strong tag and it sits inside an emphasis tag and it sits

113
00:07:41,830 --> 00:07:46,390
inside a paragraph tag, which itself is in the body.

114
00:07:47,080 --> 00:07:52,080
We can narrow it down using these steps. In our current website,

115
00:07:52,540 --> 00:07:57,340
nowhere else is there an anchor tag that sits inside a paragraph tag.

116
00:07:57,940 --> 00:08:02,940
And you'll remember from our previous lessons on CSS that you can use CSS

117
00:08:03,670 --> 00:08:08,670
selectors in order to narrow down on a particular element in order to specify

118
00:08:09,580 --> 00:08:13,600
its style. And if we were to write CSS code,

119
00:08:15,370 --> 00:08:17,680
then it would look something like this.

120
00:08:19,000 --> 00:08:24,000
So we would select first the paragraph and then we would select the anchor tag

121
00:08:24,670 --> 00:08:29,670
and then we can specify what the style should be

122
00:08:32,200 --> 00:08:32,830
.

123
00:08:32,830 --> 00:08:34,690
Now, when we're using Beautiful Soup,

124
00:08:34,840 --> 00:08:37,990
we can also use the CSS selectors.

125
00:08:38,620 --> 00:08:43,270
I can get hold of that company URL by simply saying soup,

126
00:08:43,720 --> 00:08:48,370
and instead of using find or find_all, I'm going to use select_one.

127
00:08:49,210 --> 00:08:54,210
There's select and select_one. Select_one will give us the first matching

128
00:08:54,430 --> 00:08:58,620
item and select will give us all of the matching items in a list.

129
00:08:59,280 --> 00:09:04,050
Now we get to specify the selector as a string. And again,

130
00:09:04,080 --> 00:09:07,020
I'm going to use the same selector that I showed you before.

131
00:09:07,350 --> 00:09:11,820
So we're looking for an a tag which sits inside a p tag.

132
00:09:12,240 --> 00:09:17,040
And this string is the CSS selector. So you can write anything in here

133
00:09:17,040 --> 00:09:21,810
really. This means that we'll be able to get that anchor tag.

134
00:09:22,080 --> 00:09:27,080
And then once I've gotten hold of the company URL and print it out,

135
00:09:27,720 --> 00:09:31,290
you can see it's that exact anchor tag that we wanted.
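
A sketch of select_one with the "p a" selector, on a hypothetical page where only one link sits inside a paragraph. The nesting (a inside strong inside em inside p) follows the lesson's description, and the URLs are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lesson's page.
html = """
<p><em><strong><a href="https://example.com/company">Example Co</a></strong></em></p>
<ul>
<li><a href="https://example.com/blog">Blog</a></li>
<li><a href="https://example.com/contact">Contact</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# "p a" matches any a tag nested somewhere inside a p tag.
company_url = soup.select_one(selector="p a")
print(company_url)

# select takes the same selector but returns every match.
links_in_paragraphs = soup.select("p a")
print(len(links_in_paragraphs))
```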

136
00:09:32,610 --> 00:09:36,270
We don't have to just stick to the HTML selectors.

137
00:09:36,300 --> 00:09:40,890
You can also use the class or the ID in your CSS selector.

138
00:09:41,280 --> 00:09:44,220
So remember, to select on an ID,

139
00:09:44,520 --> 00:09:47,550
we use the pound sign. So let's say

140
00:09:47,880 --> 00:09:51,720
we want to get hold of this h1, which has an ID of name,

141
00:09:51,990 --> 00:09:53,820
we can say #name,

142
00:09:54,180 --> 00:09:58,740
and now this is going to be equal to my name.

143
00:09:59,160 --> 00:10:01,740
And if I now run it, you can see that last one,

144
00:10:01,950 --> 00:10:06,570
the element that was picked out is the h1 with the ID of name.
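
The ID selector can be sketched like this, again on a hypothetical page:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two h1 tags, only one of which has id="name".
html = """
<h1 id="name">Jane Doe</h1>
<h1>Another heading</h1>
"""

soup = BeautifulSoup(html, "html.parser")

# "#name" selects the element whose id attribute is "name".
name = soup.select_one(selector="#name")
print(name)
```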

145
00:10:07,650 --> 00:10:12,210
And finally you can use a CSS selector to select an element by class.

146
00:10:12,390 --> 00:10:16,680
So for example, here, we've got heading and here we've got heading as well.

147
00:10:17,160 --> 00:10:22,160
So if we want to select all of the elements that have a class of heading,

148
00:10:22,980 --> 00:10:26,940
then we could say soup.select so this will give us a list

149
00:10:27,060 --> 00:10:32,060
and then the selector is the first item that goes into the method.

150
00:10:32,520 --> 00:10:37,470
So similar to this, we can have this keyword argument there or we can delete it.

151
00:10:38,070 --> 00:10:43,070
And this selector will be using the .heading in order to select the element

152
00:10:45,060 --> 00:10:47,130
that has a class of heading.

153
00:10:49,410 --> 00:10:53,430
And this is now going to be a list if we print it out.
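
A sketch of selecting by class, assuming a hypothetical page with two elements that share the class heading. The selector is passed as the first positional argument here, and writing selector=".heading" would be equivalent.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two elements with class="heading" and one without.
html = """
<h3 class="heading">Books and Teaching</h3>
<h3 class="heading">Other Pages</h3>
<p>Not a heading</p>
"""

soup = BeautifulSoup(html, "html.parser")

# ".heading" matches every element that has the class heading.
headings = soup.select(".heading")
print(headings)
```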

154
00:10:55,620 --> 00:10:56,580
Right here.

155
00:10:56,630 --> 00:10:58,530
So,

156
00:10:58,880 --> 00:11:03,500
you can use everything that you've learned about CSS selectors to select a

157
00:11:03,500 --> 00:11:06,560
particular item out of an HTML file.

158
00:11:07,040 --> 00:11:11,330
And this is usually really useful because a lot of these elements will be nested

159
00:11:11,330 --> 00:11:14,600
inside divs and the div will have an ID

160
00:11:14,870 --> 00:11:19,190
and then all you have to do is to narrow down on the div and then narrow down on the

161
00:11:19,190 --> 00:11:20,023
element you want.

162
00:11:20,240 --> 00:11:25,240
So you can basically drill through using CSS selectors to get to any item you

163
00:11:26,180 --> 00:11:27,260
want on the page.
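
To illustrate drilling down through a div, here is a rough sketch. The page, the sidebar and footer IDs, and the link paths are all hypothetical and do not come from the lesson's file.

```python
from bs4 import BeautifulSoup

# Hypothetical page with links nested inside two divs that have IDs.
html = """
<div id="sidebar">
  <ul><li><a href="/about">About</a></li></ul>
</div>
<div id="footer">
  <a href="/terms">Terms</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow down to the div first, then to the link inside it.
sidebar_link = soup.select_one("div#sidebar a")
print(sidebar_link.get("href"))

# The same idea with select returns every link inside the footer div.
footer_links = soup.select("#footer a")
print([link.get_text() for link in footer_links])
```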

164
00:11:28,460 --> 00:11:33,460
Now that we've looked at how to find various items from HTML using Beautiful

165
00:11:33,980 --> 00:11:37,610
Soup, in the next lesson, I've got a quiz for you

166
00:11:37,910 --> 00:11:42,910
so you can have a go and get some practice at selecting and finding elements

167
00:11:43,850 --> 00:11:46,760
from an HTML file using Beautiful Soup.

168
00:11:47,240 --> 00:11:50,330
So for all of that and more, head over to the next lesson.

