1
00:00:00,060 --> 00:00:00,750
Hey guys,

2
00:00:00,750 --> 00:00:05,330
welcome to Day 45 of 100 Days of Code. Now,

3
00:00:05,330 --> 00:00:08,450
today, we're going to be getting back to coding with Python,

4
00:00:08,900 --> 00:00:12,080
and we're going to be learning how to scrape the web for data

5
00:00:12,320 --> 00:00:14,810
using a module called BeautifulSoup.

6
00:00:15,770 --> 00:00:19,040
We've been working with APIs for quite a while now,

7
00:00:19,550 --> 00:00:20,600
and we know that 

8
00:00:20,630 --> 00:00:25,630
we can use a website's API to access its data or to interact with the

9
00:00:26,120 --> 00:00:27,860
website using code.

10
00:00:28,520 --> 00:00:32,810
But some websites don't have an API, or their API

11
00:00:32,810 --> 00:00:35,630
doesn't let us do all the things that we want to do.

12
00:00:36,500 --> 00:00:40,460
So this is where we start thinking about using web scraping,

13
00:00:41,090 --> 00:00:44,540
where we look through the underlying HTML code

14
00:00:44,570 --> 00:00:48,110
of a website to get hold of the information that we want.

15
00:00:49,130 --> 00:00:52,880
So the aim of today is to learn how to make soup,

16
00:00:53,390 --> 00:00:57,410
but not this kind of soup. We're going to be making BeautifulSoup.

17
00:00:57,980 --> 00:00:59,960
What exactly is BeautifulSoup? Well,

18
00:00:59,990 --> 00:01:04,989
it's a module that helps developers like us make sense of websites.

19
00:01:06,170 --> 00:01:09,770
We could think of a lot of websites as a bit of a spaghetti soup,

20
00:01:10,190 --> 00:01:14,000
even something seemingly as simple as the Google front page,

21
00:01:14,270 --> 00:01:16,850
when you right-click on it and view the page source,

22
00:01:17,120 --> 00:01:20,060
you can see that it's horrendously complicated.

23
00:01:20,570 --> 00:01:25,100
And if you want to make sense of this webpage and pull out the relevant parts

24
00:01:25,100 --> 00:01:25,933
of the data,

25
00:01:26,270 --> 00:01:31,130
then you'll need an HTML parser like BeautifulSoup so that you can

26
00:01:31,130 --> 00:01:36,130
find and pull out the HTML elements that you're interested in from this

27
00:01:36,680 --> 00:01:39,140
soup of jumbled HTML code.
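
To make that idea concrete, here is a minimal sketch of parsing a page with BeautifulSoup. The HTML string, its tag names and the `title` class are made up for illustration and are not taken from any real site, and the sketch assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page's source (an assumption for this sketch).
html = """
<html>
  <head><title>Example Movie List</title></head>
  <body>
    <h3 class="title">1) Example Movie A</h3>
    <h3 class="title">2) Example Movie B</h3>
    <p>Unrelated text we do not care about.</p>
  </body>
</html>
"""

# Parse the markup using Python's built-in "html.parser" backend.
soup = BeautifulSoup(html, "html.parser")

# Pull out only the elements we are interested in.
page_title = soup.title.get_text()
movie_headings = [tag.get_text() for tag in soup.find_all("h3", class_="title")]

print(page_title)
print(movie_headings)
```

On a real website the tag names and classes would be different, so you would inspect the page source first to decide what to search for.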

28
00:01:39,770 --> 00:01:42,380
And once we've mastered this skill,

29
00:01:42,500 --> 00:01:46,040
then we'll be able to take any website, for example,

30
00:01:46,070 --> 00:01:48,980
Empire's 100 Greatest Movies Of All Time,

31
00:01:49,220 --> 00:01:53,000
this is a huge list of a hundred movies that apparently everyone should have

32
00:01:53,000 --> 00:01:54,950
watched at some point in their life,

33
00:01:55,400 --> 00:01:58,460
and we can pull out the parts relevant to us,

34
00:01:58,700 --> 00:02:02,540
namely the title and the ranking of each movie,

35
00:02:02,840 --> 00:02:07,840
and we're going to use it to compile a list of movies that we have to watch so

36
00:02:07,880 --> 00:02:11,390
that we can look at the list, cross out the ones that we've already seen,

37
00:02:11,720 --> 00:02:16,280
and then pick one at random from the list so that we can watch all of the

38
00:02:16,280 --> 00:02:19,550
hundred greatest movies of all time. That's the goal.
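
The watch-list plan described above could be sketched roughly like this, using only Python's standard library. The movie titles, the `watched` set and the `pick_next_movie` helper are hypothetical placeholders, not real scraped data.

```python
import random

# Hypothetical scraped results: (ranking, title) pairs with placeholder titles.
movies = [
    (1, "Example Movie A"),
    (2, "Example Movie B"),
    (3, "Example Movie C"),
]

# Titles we have already seen and want to cross out.
watched = {"Example Movie B"}


def pick_next_movie(movies, watched):
    """Return a random (ranking, title) pair that is not yet watched, or None."""
    remaining = [movie for movie in movies if movie[1] not in watched]
    return random.choice(remaining) if remaining else None


choice = pick_next_movie(movies, watched)
print(choice)
```

Because the pick is random, the printed movie will vary from run to run, but it will never be one of the titles in `watched`.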

39
00:02:19,760 --> 00:02:24,290
And once you're ready, head over to the next lesson and we're going to get started

40
00:02:24,350 --> 00:02:25,850
using BeautifulSoup.

