1
00:00:00,480 --> 00:00:01,230
In this lesson,

2
00:00:01,230 --> 00:00:06,230
we're going to get started using a library called Beautiful Soup to parse an HTML

3
00:00:06,930 --> 00:00:08,280
file. Parsing

4
00:00:08,280 --> 00:00:13,280
the HTML file is the first step to extracting the data contained in a website. To

5
00:00:13,620 --> 00:00:17,160
get started, the first thing I want you to do is to head over to

6
00:00:17,160 --> 00:00:21,480
the course resources and download the starting project for today.

7
00:00:21,930 --> 00:00:26,010
It's called bs4-start and once you've extracted it and opened it in

8
00:00:26,010 --> 00:00:30,600
PyCharm, then I want you to take a look inside. First, we've got an empty

9
00:00:30,600 --> 00:00:31,800
main.py file

10
00:00:31,830 --> 00:00:36,810
which we're going to write in, in order to use Beautiful Soup and extract the data

11
00:00:36,810 --> 00:00:41,670
that we want. And the data is going to come from this website.html.

12
00:00:42,270 --> 00:00:46,590
Now, a really nice thing about PyCharm is when you have an HTML file,

13
00:00:46,620 --> 00:00:49,620
you can always just click on your favorite browser icon here

14
00:00:50,160 --> 00:00:52,770
and it will open up the website as it is.

15
00:00:53,430 --> 00:00:58,430
And you can see that this is a simplified version of our HTML CV page that we

16
00:00:59,130 --> 00:01:02,190
built on day 41. Now,

17
00:01:02,220 --> 00:01:05,550
if you've skipped days 41 to 44

18
00:01:05,580 --> 00:01:08,070
because you already know HTML and CSS,

19
00:01:08,640 --> 00:01:11,190
then have a quick read through the HTML.

20
00:01:11,520 --> 00:01:15,420
It's a very simple document with a bunch of different HTML tags,

21
00:01:15,840 --> 00:01:18,030
attributes like class and ID,

22
00:01:18,570 --> 00:01:23,190
and also a list. I've kept this as simple as possible

23
00:01:23,220 --> 00:01:25,560
just so that it's easy for us to work through it

24
00:01:25,890 --> 00:01:27,600
when we're trying to get hold of things

25
00:01:27,630 --> 00:01:30,840
using Beautiful Soup. In the course resources,

26
00:01:30,900 --> 00:01:34,380
I've also got a link to the Beautiful Soup documentation.

27
00:01:34,860 --> 00:01:39,120
So this is where you can find out everything that you can do with Beautiful 

28
00:01:39,120 --> 00:01:43,920
Soup. But I wanna walk you through some of the most commonly used components.

29
00:01:44,610 --> 00:01:45,360
Beautiful Soup,

30
00:01:45,360 --> 00:01:50,360
as they say, is a Python library for pulling data out of HTML and XML files.

31
00:01:51,420 --> 00:01:56,190
So HTML and XML are both structural languages and they're responsible for

32
00:01:56,190 --> 00:02:00,960
structuring data like the data in a website using these tags.

33
00:02:01,680 --> 00:02:03,330
And the great thing about Beautiful Soup

34
00:02:03,380 --> 00:02:07,850
is it's super easy to use, and it can save you hours or

35
00:02:07,910 --> 00:02:12,910
days of work to get hold of the data that you want from a particular website.

36
00:02:13,520 --> 00:02:18,410
And you can see the documentation's also been translated by kind users into some

37
00:02:18,440 --> 00:02:21,320
other languages as well. So if you find it easier,

38
00:02:21,590 --> 00:02:24,170
then these languages might help you as well. Now,

39
00:02:24,170 --> 00:02:29,170
the first thing I have to do here in my main.py is to actually get hold of

40
00:02:30,290 --> 00:02:32,270
this particular file.

41
00:02:32,840 --> 00:02:35,600
Now you might remember from previous lessons,

42
00:02:35,660 --> 00:02:38,240
how we open a file in Python.

43
00:02:38,900 --> 00:02:43,900
So have a quick think and see if you can figure out how to get a hold of the

44
00:02:44,330 --> 00:02:49,330
content in this HTML file as a string or as a piece of text.

45
00:02:50,060 --> 00:02:53,840
Pause the video and see if you can complete this challenge. Just as a hint,

46
00:02:53,960 --> 00:02:57,380
you might need the keyword with and the

47
00:02:57,380 --> 00:02:58,330
built-in function open.

48
00:03:01,030 --> 00:03:04,660
All right. So the way we do this is we say with open,

49
00:03:05,020 --> 00:03:09,010
and then we provide the name of the file, which is website.html.

50
00:03:09,520 --> 00:03:13,540
And we can open this as an alias name which we'll call file,

51
00:03:14,140 --> 00:03:18,580
and now we have access to this file and we can say file.read().

52
00:03:19,240 --> 00:03:21,010
Now, once we've read this file,

53
00:03:21,040 --> 00:03:24,730
we can save this to a variable

54
00:03:24,730 --> 00:03:29,590
which we'll call contents. And once we've gotten hold of these contents,

55
00:03:29,770 --> 00:03:32,740
then we can start using Beautiful Soup.
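
As a minimal sketch of the file-reading step just described: the sample markup and the temporary folder below are assumptions made so the example runs anywhere without touching your real website.html. In the lesson project you would simply open "website.html" directly.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the course's website.html (not the real file).
sample_html = "<html><head><title>My Personal Site</title></head><body></body></html>"

with tempfile.TemporaryDirectory() as folder:
    # Write the sample into a throwaway folder so nothing real is overwritten.
    path = Path(folder) / "website.html"
    path.write_text(sample_html, encoding="utf-8")

    # The step from the lesson: open the file and read it as one string.
    with open(path, encoding="utf-8") as file:
        contents = file.read()

print(contents)
```

The with block closes the file for you once the indented code finishes, which is why contents is still usable afterwards.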

56
00:03:33,460 --> 00:03:36,940
So as always, we always start with the import,

57
00:03:37,120 --> 00:03:42,120
so the name of the library that we're going to install is called bs4

58
00:03:43,600 --> 00:03:45,850
and this is Beautiful Soup version four

59
00:03:45,970 --> 00:03:49,210
which is currently the latest version of Beautiful Soup.

60
00:03:49,870 --> 00:03:53,620
And from this particular package, we want to import the class

61
00:03:53,620 --> 00:03:57,130
which is called Beautiful Soup. Now,

62
00:03:57,160 --> 00:04:00,460
if you downloaded this project from the course resources,

63
00:04:00,820 --> 00:04:05,820
you should see that there's no errors underlining this line because bs4 has

64
00:04:06,430 --> 00:04:09,610
already been installed. If you see some squiggly underlines,

65
00:04:09,670 --> 00:04:13,660
just click on the red light bulb and install the required module

66
00:04:13,750 --> 00:04:17,230
which is called bs4. Now,

67
00:04:17,230 --> 00:04:20,380
once we've got hold of our BeautifulSoup class

68
00:04:20,500 --> 00:04:24,220
then we're ready to make soup. In order to make soup,

69
00:04:24,280 --> 00:04:29,280
we use our BeautifulSoup class and we create a new object from that class

70
00:04:29,770 --> 00:04:34,360
and we pass in a string. So this is the markup,

71
00:04:34,720 --> 00:04:39,460
and that's the same M that you find in HTML and XML.

72
00:04:39,760 --> 00:04:44,760
It's the hypertext markup language and the extensible markup language.

73
00:04:47,170 --> 00:04:51,580
So the markup refers to basically all of this.

74
00:04:52,570 --> 00:04:55,480
We've already gotten hold of it through this contents

75
00:04:55,720 --> 00:04:57,520
so let's go ahead and pass that in.

76
00:04:58,300 --> 00:05:02,680
So now that we've specified what it is we want to use to create our soup,

77
00:05:03,370 --> 00:05:06,340
the next thing we have to provide is the parser.

78
00:05:06,850 --> 00:05:09,370
This is going to help the BeautifulSoup module

79
00:05:09,430 --> 00:05:14,430
understand what language this particular content is structured in.

80
00:05:15,370 --> 00:05:18,490
As they mentioned, it can parse HTML and XML

81
00:05:18,700 --> 00:05:22,660
so we have to tell it what particular type of document we've got.

82
00:05:23,260 --> 00:05:26,920
And the easiest way is just to use the Python

83
00:05:27,040 --> 00:05:28,420
html.parser.

84
00:05:29,650 --> 00:05:34,030
So after we've passed in the text that we want to turn into soup,

85
00:05:34,600 --> 00:05:39,600
we're going to add the parser as html.parser.

86
00:05:41,740 --> 00:05:45,940
And this is going to help Beautiful Soup understand these contents.
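
A minimal sketch of "making soup", assuming the bs4 package is installed. The contents string here is a made-up stand-in for the text read from website.html, not the actual course file.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the contents of website.html.
contents = """<!DOCTYPE html>
<html>
<head><title>My Personal Site</title></head>
<body><h1>Hello</h1></body>
</html>"""

# First argument: the markup string. Second argument: the name of the parser.
soup = BeautifulSoup(contents, "html.parser")

print(type(soup).__name__)
```

"html.parser" ships with Python itself, so this form needs nothing beyond bs4.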

87
00:05:47,200 --> 00:05:50,590
Now, depending on the website that you're working with, occasionally,

88
00:05:50,590 --> 00:05:55,390
you might need to use the lxml parser. And to do that,

89
00:05:55,480 --> 00:05:59,990
all you have to do is say import lxml

90
00:06:00,290 --> 00:06:05,290
and then install this particular package, and here in a string

91
00:06:06,290 --> 00:06:10,730
instead of using html.parser, you can use lxml.

92
00:06:11,450 --> 00:06:16,400
And this is basically just a different way of parsing or understanding the

93
00:06:16,400 --> 00:06:18,380
content that you're passing to Beautiful Soup.

94
00:06:18,890 --> 00:06:23,750
And I found that with certain websites the html.parser might not work

95
00:06:23,750 --> 00:06:26,990
and you might get an error that tells you something about your parser not

96
00:06:26,990 --> 00:06:31,070
working. So then you might consider using lxml instead.
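
One hedged way to sketch the parser choice: try "lxml" first and fall back to "html.parser" if the lxml package is not installed. The fallback pattern and the parser_used variable are my own illustration, not something shown in the lesson. As far as I know, Beautiful Soup only needs lxml to be installed and looks it up by the string name, so the explicit import is optional.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup standing in for the contents of website.html.
contents = "<html><head><title>My Personal Site</title></head><body></body></html>"

try:
    # Works only if the separate lxml package has been installed.
    soup = BeautifulSoup(contents, "lxml")
    parser_used = "lxml"
except FeatureNotFound:
    # bs4 raises FeatureNotFound when the requested parser is unavailable.
    soup = BeautifulSoup(contents, "html.parser")
    parser_used = "html.parser"

print(parser_used, soup.title.string)
```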

97
00:06:32,450 --> 00:06:36,710
So this one line of code basically completes our parsing.

98
00:06:37,280 --> 00:06:42,280
And this soup is now an object that allows us to tap in to various parts of the

99
00:06:43,550 --> 00:06:47,270
website, but using Python code. For example,

100
00:06:47,270 --> 00:06:51,470
if I wanted this title tag out of this whole website,

101
00:06:51,770 --> 00:06:55,610
all I have to do is say soup.title.

102
00:06:57,680 --> 00:07:01,760
And now if I print this soup.title,

103
00:07:02,360 --> 00:07:07,360
you can see that we'll get the title tag being printed out in its entirety.

104
00:07:09,710 --> 00:07:14,710
Once Beautiful Soup has made sense of this website by parsing the HTML,

105
00:07:17,300 --> 00:07:19,460
we can now tap into that object

106
00:07:19,520 --> 00:07:23,510
which is the HTML code as if it were a Python object.

107
00:07:23,840 --> 00:07:25,730
So we can tap into the title,

108
00:07:26,000 --> 00:07:30,200
but we can dig even deeper. Instead of just tapping into the title,

109
00:07:30,500 --> 00:07:35,500
we can also get hold of other things like the title.name,

110
00:07:36,200 --> 00:07:41,200
and this is going to give us the name of that particular title tag.

111
00:07:41,960 --> 00:07:44,900
So remember this gave us the title tag,

112
00:07:44,930 --> 00:07:49,930
so all of this, and this next stage drilling even deeper into the name of this

113
00:07:50,900 --> 00:07:54,200
tag, you'll see that it gives us title.

114
00:07:54,710 --> 00:07:57,590
So the name of this title tag is called title,

115
00:07:58,250 --> 00:08:01,460
and we can also get hold of the string

116
00:08:01,700 --> 00:08:06,700
which is contained in the title tag by simply using .title.string.

117
00:08:07,280 --> 00:08:11,840
And you can see this is the actual string that's inside that title tag.
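
The three levels just described can be sketched side by side. The markup and its title text are assumptions for illustration, so the real website.html will print something different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the contents of website.html.
contents = "<html><head><title>My Personal Site</title></head><body></body></html>"
soup = BeautifulSoup(contents, "html.parser")

print(soup.title)         # the whole tag: <title>My Personal Site</title>
print(soup.title.name)    # the tag's name: title
print(soup.title.string)  # the text inside the tag: My Personal Site
```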

118
00:08:13,280 --> 00:08:16,130
If we think about it, this entire soup

119
00:08:16,190 --> 00:08:19,520
object now represents our HTML code.

120
00:08:20,030 --> 00:08:24,440
So I can also actually just print out the entire soup object.

121
00:08:24,950 --> 00:08:29,180
And you can see that this is basically just all HTML.

122
00:08:29,990 --> 00:08:34,100
And if you want to, there's even a method called prettify

123
00:08:34,460 --> 00:08:37,820
which will indent your soup HTML code.

124
00:08:38,120 --> 00:08:43,120
So now, compare this where everything's all on one line, with this prettified

125
00:08:43,909 --> 00:08:47,900
version where everything's all indented properly and easier to read.
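
A small sketch of the comparison, assuming a made-up one-line document in place of website.html: printing the soup gives the markup back as it was, while prettify() returns a string with each tag on its own indented line.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup standing in for the contents of website.html.
contents = "<html><head><title>My Personal Site</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(contents, "html.parser")

plain = str(soup)         # same thing print(soup) would show
pretty = soup.prettify()  # indented, one tag per line

print(plain)
print(pretty)
```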

126
00:08:49,130 --> 00:08:53,780
In addition to getting the title tag, we can also get hold of,

127
00:08:53,810 --> 00:08:56,700
for example, the a tag.

128
00:08:57,060 --> 00:09:02,060
So this is going to give us the first anchor tag that it finds in our website,

129
00:09:02,700 --> 00:09:05,220
which happens to be this one right here.

130
00:09:06,180 --> 00:09:11,180
And we can swap that with maybe the first li or the first paragraph.
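
To illustrate that the tag-name shortcut only ever returns the first match, here is a sketch with an assumed document containing two of each tag. The links and list items are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two anchors, two paragraphs and two list items.
contents = """<html><body>
<p>First paragraph</p>
<p>Second paragraph</p>
<a href="https://example.com/one">Link one</a>
<a href="https://example.com/two">Link two</a>
<ul><li>Item A</li><li>Item B</li></ul>
</body></html>"""
soup = BeautifulSoup(contents, "html.parser")

print(soup.a)   # only the first anchor tag
print(soup.li)  # only the first list item
print(soup.p)   # only the first paragraph
```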

131
00:09:14,910 --> 00:09:18,450
Essentially what we're doing with Beautiful Soup is we're just drilling down

132
00:09:18,480 --> 00:09:23,370
into this HTML file, finding the HTML tags that we're interested in,

133
00:09:23,850 --> 00:09:26,040
and then getting hold of

134
00:09:26,100 --> 00:09:30,660
either the name of the tag or the actual text of the tag.

135
00:09:31,470 --> 00:09:36,470
But what if we wanted all of the paragraphs or all of the anchor tags in our

136
00:09:37,380 --> 00:09:41,880
website, how would we do that? In the next lesson,

137
00:09:42,030 --> 00:09:47,030
we're going to dive deeper into searching through websites for all of the

138
00:09:47,190 --> 00:09:49,500
components that we're looking for. For example,

139
00:09:49,740 --> 00:09:52,410
all of the P tags or all of the anchor tags.

140
00:09:52,770 --> 00:09:57,770
And we're going to see how we can refine our search and specify exactly what it

141
00:09:58,140 --> 00:10:02,610
is that we want. So for all of that and more, I'll see you in the next lesson.