XPath Navigation: Web Scraping in Python, by Thomas Laetsch


SLIDE 1

XPath Navigation

WEB SCRAPING IN PYTHON

Thomas Laetsch

Data Scientist, NYU

SLIDE 2

Slashes and Brackets

Single forward slash / looks forward one generation
Double forward slash // looks forward all future generations
Square brackets [] help narrow in on specific elements

SLIDE 3

To Bracket or not to Bracket

xpath = '/html/body'
xpath = '/html[1]/body[1]'

Both give the same selection

SLIDE 4

A Body of P

xpath = '/html/body/p'

SLIDE 5

The Birds and the Ps

xpath = '/html/body/div/p'
xpath = '/html/body/div/p[2]'

SLIDE 6

Double Slashing the Brackets

xpath = '//p'
xpath = '//p[1]'

SLIDE 7

The Wildcard

xpath = '/html/body/*'

The asterisk * is the "wildcard"

SLIDE 8

Xposé

WEB SCRAPING IN PYTHON

SLIDE 9

Off the Beaten XPath

WEB SCRAPING IN PYTHON

Thomas Laetsch

Data Scientist, NYU

SLIDE 10

(At)tribute

@ represents "attribute"
@class
@id
@href

SLIDE 11

Brackets and Attributes

SLIDE 12

Brackets and Attributes

xpath = '//p[@class="class-1"]'

SLIDE 13

Brackets and Attributes

xpath = '//*[@id="uid"]'

SLIDE 14

Brackets and Attributes

xpath = '//div[@id="uid"]/p[2]'

SLIDE 15

Content with Contains

XPath contains notation: contains( @attri-name, "string-expr" )

SLIDE 16

Contain This

xpath = '//*[contains(@class,"class-1")]'

SLIDE 17

Contain This

xpath = '//*[@class="class-1"]'

SLIDE 18

Get Classy

xpath = '/html/body/div/p[2]'

SLIDE 19

Get Classy

xpath = '/html/body/div/p[2]/@class'

SLIDE 20

End of the Path

WEB SCRAPING IN PYTHON

SLIDE 21

Introduction to the scrapy Selector

WEB SCRAPING IN PYTHON

Thomas Laetsch

Data Scientist, NYU

SLIDE 22

Setting up a Selector

from scrapy import Selector

html = '''
<html>
  <body>
    <div class="hello datacamp">
      <p>Hello World!</p>
    </div>
    <p>Enjoy DataCamp!</p>
  </body>
</html>
'''

sel = Selector(text=html)

Created a scrapy Selector object using a string with the HTML code
The selector sel has selected the entire HTML document

SLIDE 23

Selecting Selectors

We can use the xpath call within a Selector to create new Selectors of specific pieces of the HTML code

The return is a SelectorList of Selector objects

sel.xpath("//p")
# outputs the SelectorList:
[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

SLIDE 24

Extracting Data from a SelectorList

Use the extract() method

>>> sel.xpath("//p")
out: [<Selector xpath='//p' data='<p>Hello World!</p>'>,
      <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

>>> sel.xpath("//p").extract()
out: ['<p>Hello World!</p>',
      '<p>Enjoy DataCamp!</p>']

We can use extract_first() to get the first element of the list

>>> sel.xpath("//p").extract_first()
out: '<p>Hello World!</p>'

SLIDE 25

Extracting Data from a Selector

ps = sel.xpath('//p')
second_p = ps[1]
second_p.extract()

out: '<p>Enjoy DataCamp!</p>'

SLIDE 26

Select This Course!

WEB SCRAPING IN PYTHON

SLIDE 27

"Inspecting the HTML"

WEB SCRAPING IN PYTHON

Thomas Laetsch, PhD

Data Scientist, NYU

SLIDE 28

"Source" = HTML Code

SLIDE 29

Inspecting Elements

SLIDE 30

HTML text to Selector

from scrapy import Selector
import requests

url = 'https://www.datacamp.com/courses/all'
html = requests.get(url).content
sel = Selector(text=html)

SLIDE 31

You Know Our Secrets

WEB SCRAPING IN PYTHON