Web Scraping With Python
WEB SCRAPING IN PYTHON
Thomas Laetsch
Data Scientist, NYU
Business Savvy
What are businesses looking for?
Comparing prices
Satisfaction of customers
Generating potential leads
...and much more!
What could you do?
Search for your favorite memes on your favorite sites.
Automatically look through classified ads for your favorite gadgets.
Scrape social site content looking for hot topics.
Scrape cooking blogs looking for particular recipes, or recipe reviews.
...and much more!
Setup: Understand what we want to do. Find sources to help us do it.
Acquisition: Read in the raw data from online. Format these data to be usable.
Processing: Many options!
Our Focus: Acquisition! (Using scrapy via Python)
HTML tags: <html> ... </html>, <body> ... </body>, <div> ... </div>, <p> ... </p>
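To make the nesting of these tags concrete, here is a minimal sketch that prints the tag hierarchy of a small page. The sample page and the TreePrinter class are made up for this illustration, and the sketch uses Python's standard-library html.parser rather than scrapy, simply so that it runs without extra installs.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
page = """
<html>
  <body>
    <div>
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
    <p>Closing paragraph</p>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Records each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # closing a tag returns to the parent's level


printer = TreePrinter()
printer.feed(page)

# Indent each tag by its depth to show parent/child relationships.
for depth, tag in printer.outline:
    print("  " * depth + tag)
```

Each level of indentation in the printed outline corresponds to one generation in the HTML tree: body is a child of html, the div and the last p are children of body, and so on.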
Information within HTML tags can be valuable:
Extract link URLs
Easier way to select elements
We've seen tag names such as html, div, and p. The attribute name is followed by = followed by information assigned to that attribute, usually quoted text.
The id attribute should be unique.
The class attribute doesn't need to be unique.
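As a rough illustration of the difference, the sketch below indexes a small page by id and by class. The page, the attribute values, and the AttributeIndex class are all hypothetical, and the standard-library html.parser stands in for scrapy here. It also assumes the usual HTML convention that one class attribute may hold several space-separated class names.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, invented for this illustration.
page = """
<div id="main" class="box wide">
  <p class="note">first</p>
  <p class="note">second</p>
</div>
"""


class AttributeIndex(HTMLParser):
    """Indexes tags by their id and class attributes."""

    def __init__(self):
        super().__init__()
        self.by_id = {}     # id value -> tag name (one element per id)
        self.by_class = {}  # class name -> list of tag names (many allowed)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "id" in attrs:
            self.by_id[attrs["id"]] = tag
        # A class attribute may list several names separated by spaces.
        for name in (attrs.get("class") or "").split():
            self.by_class.setdefault(name, []).append(tag)


index = AttributeIndex()
index.feed(page)
print(index.by_id)
print(index.by_class)
```

The id "main" maps to a single element, while the class "note" is shared by both p elements.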
a tags are for hyperlinks.
The href attribute tells what link to go to.
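Extracting link URLs then comes down to reading the href attribute of every a tag. The following is only a sketch under stated assumptions: the sample markup, its URLs, and the LinkCollector class are made up, and it relies on the standard-library html.parser instead of scrapy.

```python
from html.parser import HTMLParser

# Hypothetical sample markup with placeholder URLs.
page = (
    '<p>See <a href="https://www.example.com/a">page A</a> '
    'and <a href="/b">page B</a>.</p>'
)


class LinkCollector(HTMLParser):
    """Collects the href value of every a tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip href attributes that were written without a value.
            if name == "href" and value is not None:
                self.links.append(value)


collector = LinkCollector()
collector.feed(page)
print(collector.links)
```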
xpath = '/html/body/div[2]'
Simple XPath:
A single forward slash / is used to move forward one generation.
Tag names between slashes give direction to which element(s).
Brackets [] after a tag name tell us which of the selected siblings to choose.
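The path '/html/body/div[2]' can be tried out on a small page. This sketch makes two assumptions: the sample page is invented, and it uses the standard-library xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup, as a stand-in for scrapy. Because ElementTree paths are evaluated relative to the root element (here html), the absolute path is written as './body/div[2]'.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree expects valid XML).
page = """
<html>
  <body>
    <div><p>first div</p></div>
    <div><p>second div</p></div>
  </body>
</html>
"""

root = ET.fromstring(page)  # root is the html element itself

# '/html/body/div[2]': from html go to body, then to the 2nd div child.
# Relative to root (html), that is './body/div[2]'; positions start at 1.
second_div = root.find("./body/div[2]")
print(second_div.find("p").text)
```

Note that the bracket index counts from 1, not from 0 as Python lists do.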
Direct to all table elements within the entire HTML code:
xpath = '//table'
Direct to all table elements which are descendants of the 2nd div child of the body element:
xpath = '/html/body/div[2]//table'
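A minimal sketch of how the double forward slash behaves, again assuming an invented sample page and using the standard-library xml.etree.ElementTree (limited XPath subset, paths relative to the root html element) in place of scrapy. The table ids are hypothetical labels used only to tell the tables apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with three tables.
page = """
<html>
  <body>
    <div><table id="t1"/></div>
    <div>
      <table id="t2"/>
      <div><table id="t3"/></div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)  # root is the html element

# '//table': every table element, at any depth in the document.
all_tables = [t.get("id") for t in root.findall(".//table")]

# '/html/body/div[2]//table': only tables that descend (at any depth)
# from the 2nd div child of body.
nested_tables = [t.get("id") for t in root.findall("./body/div[2]//table")]

print(all_tables)
print(nested_tables)
```

The second query skips the table in the first div but still reaches the table nested one level deeper, because // looks through all future generations rather than only direct children.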