XPath Navigation
WEB SCRAPING IN PYTHON
Thomas Laetsch
Data Scientist, NYU
Slashes and Brackets
Single forward slash / looks forward one generation
Double forward slash // looks forward all future generations
Square brackets [] help narrow in on specific elements
xpath = '/html/body'
xpath = '/html[1]/body[1]'
Both give the same selection.
xpath = '/html/body/p'
xpath = '/html/body/div/p'
xpath = '/html/body/div/p[2]'
xpath = '//p'
xpath = '//p[1]'
xpath = '/html/body/*'
The asterisk * is the "wildcard": it matches any element, whatever its tag
@ represents "attribute"
@class
@id
@href
xpath = '//p[@class="class-1"]'
xpath = '//*[@id="uid"]'
xpath = '//div[@id="uid"]/p[2]'
XPath contains notation: contains( @attri-name, "string-expr" )
xpath = '//*[contains(@class,"class-1")]'
xpath = '//*[@class="class-1"]'
xpath = '/html/body/div/p[2]'
xpath = '/html/body/div/p[2]/@class'
from scrapy import Selector

html = '''
<html>
  <body>
    <div class="hello datacamp">
      <p>Hello World!</p>
    </div>
    <p>Enjoy DataCamp!</p>
  </body>
</html>
'''

sel = Selector( text = html )
This creates a scrapy Selector object from a string containing the HTML code.
The selector sel has selected the entire HTML document.
We can use the xpath call within a Selector to create new Selectors of specific pieces
The return is a SelectorList of Selector objects
sel.xpath("//p")
# outputs the SelectorList:
[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
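Use the extract() method to get the selected data as a list of strings
>>> sel.xpath("//p")
[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
>>> sel.xpath("//p").extract()
['<p>Hello World!</p>',
 '<p>Enjoy DataCamp!</p>']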
We can use extract_first() to get the first element of the list
>>> sel.xpath("//p").extract_first()
ps = sel.xpath('//p')
second_p = ps[1]
second_p.extract()
from scrapy import Selector
import requests

url = 'https://www.datacamp.com/courses/all'
html = requests.get( url ).content
sel = Selector( text = html )