1
00:00:00,300 --> 00:00:02,969
Instructor: Hey guys, welcome to day 53

2
00:00:02,969 --> 00:00:05,430
of 100 Days of Code.

3
00:00:05,430 --> 00:00:09,060
Today it's time for your Capstone project,

4
00:00:09,060 --> 00:00:11,880
and it's time to review everything that we've learned

5
00:00:11,880 --> 00:00:14,190
over the last 10 days or so.

6
00:00:14,190 --> 00:00:17,070
Everything to do with web scraping.

7
00:00:17,070 --> 00:00:18,870
The project that we're gonna be tackling

8
00:00:18,870 --> 00:00:21,360
is a data entry job.

9
00:00:21,360 --> 00:00:23,970
Now, there are a lot of data entry jobs out there

10
00:00:23,970 --> 00:00:27,510
where you're kind of just meant to transfer data

11
00:00:27,510 --> 00:00:29,040
from one format to another.

12
00:00:29,040 --> 00:00:32,700
So maybe you have it in a physical print copy,

13
00:00:32,700 --> 00:00:35,370
or maybe it's on a website, maybe it's in a PDF,

14
00:00:35,370 --> 00:00:37,620
and you just have to transfer it somewhere else,

15
00:00:37,620 --> 00:00:40,800
usually typing it into a spreadsheet.

16
00:00:40,800 --> 00:00:43,890
The inspiration for this project came from

17
00:00:43,890 --> 00:00:45,810
when I was browsing Reddit actually

18
00:00:45,810 --> 00:00:48,090
on the r/Python subreddit,

19
00:00:48,090 --> 00:00:51,540
which is a really good community for you to actually look at

20
00:00:51,540 --> 00:00:54,000
and see what other people are doing with Python

21
00:00:54,000 --> 00:00:56,880
and seeing the latest and greatest things built

22
00:00:56,880 --> 00:00:59,100
or news about Python.

23
00:00:59,100 --> 00:01:01,650
Now, one of the posts I saw was asking

24
00:01:01,650 --> 00:01:04,890
whether anyone has automated their job completely,

25
00:01:04,890 --> 00:01:06,810
basically using Python.

26
00:01:06,810 --> 00:01:09,570
Now we've seen how powerful Python can be,

27
00:01:09,570 --> 00:01:11,640
especially when we apply it to web scraping

28
00:01:11,640 --> 00:01:14,400
using Beautiful Soup and Selenium.

29
00:01:14,400 --> 00:01:17,040
And looking through all the comments,

30
00:01:17,040 --> 00:01:19,590
there's actually a lot of people who have done this,

31
00:01:19,590 --> 00:01:22,170
including this one guy

32
00:01:22,170 --> 00:01:27,170
who pretty much automated his entire job.

33
00:01:27,600 --> 00:01:30,420
And the jobs that tend to be easily automated

34
00:01:30,420 --> 00:01:33,300
using Python are data entry jobs,

35
00:01:33,300 --> 00:01:36,270
moving data from one format to another.

36
00:01:36,270 --> 00:01:39,690
And if you think about it, if that job is in fact remote,

37
00:01:39,690 --> 00:01:44,690
so if you search on Indeed.com for a remote data entry job

38
00:01:44,790 --> 00:01:47,700
and you get up and running with the company

39
00:01:47,700 --> 00:01:49,440
and you start doing it manually,

40
00:01:49,440 --> 00:01:52,800
and then once you've understood what it is you have to do,

41
00:01:52,800 --> 00:01:56,910
for example, gathering statistical data, preparing reports,

42
00:01:56,910 --> 00:01:58,830
and maintaining spreadsheets,

43
00:01:58,830 --> 00:02:02,190
if you realize that this is a large part of your job

44
00:02:02,190 --> 00:02:05,610
and you can automate it pretty much with Python,

45
00:02:05,610 --> 00:02:08,940
then you could probably get Python to do 70% of your job

46
00:02:08,940 --> 00:02:11,160
and you spend the remaining 30% of the day

47
00:02:11,160 --> 00:02:13,950
doing the rest of the job, but still being paid

48
00:02:13,950 --> 00:02:16,950
as a full on worker with full benefits.

49
00:02:16,950 --> 00:02:18,450
So this is something that a lot of people

50
00:02:18,450 --> 00:02:22,050
in the Python community have talked about and explored.

51
00:02:22,050 --> 00:02:24,960
And this is something that we are going to be trying out

52
00:02:24,960 --> 00:02:29,430
using both Beautiful Soup and Selenium in this project.

53
00:02:29,430 --> 00:02:31,140
In our case, we're gonna be tackling

54
00:02:31,140 --> 00:02:33,480
a research data entry job

55
00:02:33,480 --> 00:02:36,450
where we're researching house prices

56
00:02:36,450 --> 00:02:40,080
that fit particular criteria for a client

57
00:02:40,080 --> 00:02:41,580
on the Zillow website

58
00:02:41,580 --> 00:02:45,300
and then we're gonna be transferring that data into a form,

59
00:02:45,300 --> 00:02:48,420
which will create a spreadsheet in Google Sheets.

60
00:02:48,420 --> 00:02:52,380
And that is usually how, as a data entry person,

61
00:02:52,380 --> 00:02:54,180
we would make our money.

62
00:02:55,230 --> 00:02:58,110
Now, because this is a Capstone project,

63
00:02:58,110 --> 00:03:00,360
we're gonna be using everything that we've learned

64
00:03:00,360 --> 00:03:02,100
in this section.

65
00:03:02,100 --> 00:03:05,430
That means Beautiful Soup as well as Selenium.

66
00:03:05,430 --> 00:03:06,810
You might have to brush up

67
00:03:06,810 --> 00:03:08,220
on some of the things you learned,

68
00:03:08,220 --> 00:03:10,290
especially the stuff on Beautiful Soup,

69
00:03:10,290 --> 00:03:12,090
which we covered a few days ago

70
00:03:12,090 --> 00:03:14,010
and we're gonna combine all the skills

71
00:03:14,010 --> 00:03:15,720
that you've learned so far.

72
00:03:15,720 --> 00:03:17,400
And this project really is gonna test

73
00:03:17,400 --> 00:03:20,640
all of your web scraping skills that you've acquired so far

74
00:03:20,640 --> 00:03:23,070
and see how far you can run with it.

75
00:03:23,070 --> 00:03:24,480
Because it's a Capstone project,

76
00:03:24,480 --> 00:03:26,610
there's not gonna be a lot of guidance,

77
00:03:26,610 --> 00:03:28,230
so you're gonna have to persevere

78
00:03:28,230 --> 00:03:30,570
and try to see if you can solve your own problems

79
00:03:30,570 --> 00:03:33,900
and see if you can get to the end outcome.

80
00:03:33,900 --> 00:03:36,960
Let's say that you have a client who wants you

81
00:03:36,960 --> 00:03:41,100
to compile a list of all the places that they can rent

82
00:03:41,100 --> 00:03:45,570
in San Francisco up to $3,000 per month,

83
00:03:45,570 --> 00:03:48,870
and it has to have at least one bedroom.

84
00:03:48,870 --> 00:03:50,790
Now, San Francisco is notorious

85
00:03:50,790 --> 00:03:53,820
for really expensive housing,

86
00:03:53,820 --> 00:03:55,440
and it's also often really difficult

87
00:03:55,440 --> 00:03:58,350
to actually find somewhere that you wanna live.

88
00:03:58,350 --> 00:04:01,320
On Zillow, you can already filter on these things.

89
00:04:01,320 --> 00:04:03,900
So for example, you could say, this is the area,

90
00:04:03,900 --> 00:04:06,810
San Francisco, California that I want to rent

91
00:04:06,810 --> 00:04:09,060
and then of course, changing it to for rent

92
00:04:09,060 --> 00:04:10,440
rather than for sale,

93
00:04:10,440 --> 00:04:15,390
switching the maximum price up to $3,000

94
00:04:15,390 --> 00:04:17,760
and then adding in the extra requirement

95
00:04:17,760 --> 00:04:20,910
that it must have at least one bedroom.

96
00:04:20,910 --> 00:04:24,270
Now, we could use the live version of Zillow

97
00:04:24,270 --> 00:04:25,860
to do this project,

98
00:04:25,860 --> 00:04:29,280
but the problem is that websites frequently get updated.

99
00:04:29,280 --> 00:04:32,790
Companies like Zillow continuously improve their site,

100
00:04:32,790 --> 00:04:36,060
so they might change the structure of their HTML,

101
00:04:36,060 --> 00:04:38,460
the names of their CSS classes

102
00:04:38,460 --> 00:04:42,660
and add popups to the website or introduce CAPTCHAs

103
00:04:42,660 --> 00:04:45,300
which cause issues for Selenium.

104
00:04:45,300 --> 00:04:46,890
For this course, I wanna make sure

105
00:04:46,890 --> 00:04:49,410
that we can all practice writing our code

106
00:04:49,410 --> 00:04:52,410
in a stable environment that doesn't change

107
00:04:52,410 --> 00:04:55,710
and that my code solution continues to work.

108
00:04:55,710 --> 00:04:59,070
That's why I've created a clone of Zillow's website

109
00:04:59,070 --> 00:05:01,590
so that you can practice and test your knowledge

110
00:05:01,590 --> 00:05:03,693
of Beautiful Soup and Selenium.

111
00:05:04,590 --> 00:05:08,370
Open up the Zillow clone website inside your Chrome browser.

112
00:05:08,370 --> 00:05:10,590
And as you can see, I've created a snapshot

113
00:05:10,590 --> 00:05:12,060
of the Zillow site

114
00:05:12,060 --> 00:05:15,030
where I've already narrowed down the search criteria.

115
00:05:15,030 --> 00:05:17,850
I've picked San Francisco as the location,

116
00:05:17,850 --> 00:05:21,240
and I've picked for rent rather than for sale,

117
00:05:21,240 --> 00:05:24,810
and I've specified the price as up to $3,000

118
00:05:24,810 --> 00:05:27,990
for a one bedroom apartment.

119
00:05:27,990 --> 00:05:31,170
So the URL you should use with Selenium for this project

120
00:05:31,170 --> 00:05:36,170
should read https://appbrewery.github.io/Zillow-Clone.

121
00:05:42,630 --> 00:05:44,910
Now, in addition to using that URL,

122
00:05:44,910 --> 00:05:47,490
you're also gonna be using Beautiful Soup

123
00:05:47,490 --> 00:05:50,820
to scrape through all of this data.

124
00:05:50,820 --> 00:05:54,930
And what we want is the price, the address,

125
00:05:54,930 --> 00:05:58,830
and also the URL that this will link to.

126
00:05:58,830 --> 00:06:01,080
So for example, when I click on this,

127
00:06:01,080 --> 00:06:04,680
it will link to the actual listing of the place.

128
00:06:04,680 --> 00:06:06,510
And then once you've scraped all of that data

129
00:06:06,510 --> 00:06:10,260
using Beautiful Soup, then you're gonna be using Selenium

130
00:06:10,260 --> 00:06:13,740
to auto-fill a Google Form.

131
00:06:13,740 --> 00:06:15,870
So we're gonna be adding in the address of the property,

132
00:06:15,870 --> 00:06:18,210
the price per month, and the link to the property

133
00:06:18,210 --> 00:06:21,120
and of course, we're gonna fill out one of these forms

134
00:06:21,120 --> 00:06:24,000
per listing that we have on Zillow.

135
00:06:24,000 --> 00:06:26,700
And once all of that form data's been compiled,

136
00:06:26,700 --> 00:06:31,470
then you'll have the option to turn it into a spreadsheet.

137
00:06:31,470 --> 00:06:34,770
Whenever you create a form in Google Forms,

138
00:06:34,770 --> 00:06:37,290
you can see that when you go to the responses tab,

139
00:06:37,290 --> 00:06:39,420
you can click on this button

140
00:06:39,420 --> 00:06:42,690
in order to create a Google Sheet from the responses

141
00:06:42,690 --> 00:06:44,100
that have been submitted

142
00:06:44,100 --> 00:06:45,570
and this is what you end up with,

143
00:06:45,570 --> 00:06:49,290
a spreadsheet with the address of the property,

144
00:06:49,290 --> 00:06:52,380
the price per month, and a link to the property.

145
00:06:52,380 --> 00:06:54,810
So this way, once you've done this research,

146
00:06:54,810 --> 00:06:56,760
then you can send it to your client

147
00:06:56,760 --> 00:06:59,100
so that they can filter down on each of the listings

148
00:06:59,100 --> 00:07:02,880
that match their criteria and decide which one they wanna go

149
00:07:02,880 --> 00:07:04,740
and view.

150
00:07:04,740 --> 00:07:07,530
So this of course makes their job a little bit easier,

151
00:07:07,530 --> 00:07:10,230
and this is our research task

152
00:07:10,230 --> 00:07:12,360
that we're gonna complete today.

153
00:07:12,360 --> 00:07:14,760
So the first part of scraping the data

154
00:07:14,760 --> 00:07:17,190
for the relevant listings is gonna be done

155
00:07:17,190 --> 00:07:20,160
using Beautiful Soup and then the second part

156
00:07:20,160 --> 00:07:23,790
where we're gonna be filling in this form is gonna be done

157
00:07:23,790 --> 00:07:25,620
using Selenium.

158
00:07:25,620 --> 00:07:28,020
So that is the project.

159
00:07:28,020 --> 00:07:30,930
And once you're ready, head over to the next lesson

160
00:07:30,930 --> 00:07:33,990
and take a look at the requirements of the project

161
00:07:33,990 --> 00:07:37,203
and we can get started with the Capstone project.

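If you want a rough idea of what the scraping half of the project could look like, here is a minimal sketch. It only assumes that each listing exposes an address, a price, and a link, as described in the video. The HTML sample, the tag and class names (`listing`, `listing-link`, `listing-price`), the example.com URLs, and the `scrape_listings` helper are all made up for illustration; the real Zillow clone page will very likely use different markup, so inspect it in Chrome and adjust the selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these tags, class names, and URLs are invented for
# illustration and are NOT taken from the actual Zillow clone page.
SAMPLE_HTML = """
<ul>
  <li class="listing">
    <a class="listing-link" href="https://example.com/listing/1">
      <address>  123 Example St, San Francisco, CA  </address>
    </a>
    <span class="listing-price">$2,895/mo</span>
  </li>
  <li class="listing">
    <a class="listing-link" href="https://example.com/listing/2">
      <address>456 Sample Ave, San Francisco, CA</address>
    </a>
    <span class="listing-price">$2,650/mo</span>
  </li>
</ul>
"""


def scrape_listings(html):
    """Return one dict (address, price, link) per listing found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for card in soup.select("li.listing"):
        link_tag = card.select_one("a.listing-link")
        listings.append({
            # get_text(strip=True) removes the surrounding whitespace
            "address": card.select_one("address").get_text(strip=True),
            "price": card.select_one(".listing-price").get_text(strip=True),
            "link": link_tag["href"],
        })
    return listings


listings = scrape_listings(SAMPLE_HTML)
for listing in listings:
    print(listing)
```

Each dictionary in `listings` then corresponds to one Google Form submission, which is the part you would hand over to Selenium.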
