
CSS Locators (Web Scraping in Python), Thomas Laetsch



  1. CSS Locators (Web Scraping in Python). Thomas Laetsch, Data Scientist, NYU

  2. Rosetta CSStone
     "/" is replaced by ">" (except the first character)
         XPath: /html/body/div
         CSS Locator: html > body > div
     "//" is replaced by a blank space (except the first character)
         XPath: //div/span//p
         CSS Locator: div > span p
     [N] is replaced by :nth-of-type(N)
         XPath: //div/p[2]
         CSS Locator: div > p:nth-of-type(2)

  3. Rosetta CSStone
     XPath:
         xpath = '/html/body//div/p[2]'
     CSS Locator:
         css = 'html > body div > p:nth-of-type(2)'
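
The three substitution rules from the Rosetta CSStone slides can be sketched as a small helper. This is only an illustration, not something from the slides: the name xpath_to_css is hypothetical, and the function covers just the "/", "//" and [N] rules. It does not handle attribute predicates such as [@class="..."].

```python
import re


def xpath_to_css(xpath: str) -> str:
    """Translate a simple XPath into a CSS Locator (hypothetical helper)."""
    # The leading "/" or "//" has no CSS counterpart, so drop it.
    path = xpath.lstrip("/")
    # [N] becomes :nth-of-type(N)
    path = re.sub(r"\[(\d+)\]", r":nth-of-type(\1)", path)
    # "//" becomes a blank space; handle it before the single "/"
    path = path.replace("//", " ")
    # "/" becomes ">"
    path = path.replace("/", " > ")
    return path


print(xpath_to_css('/html/body/div'))
print(xpath_to_css('//div/span//p'))
print(xpath_to_css('/html/body//div/p[2]'))
```

The order of the two replace calls matters: replacing "/" first would also rewrite both slashes of every "//".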

  4. Attributes in CSS
     To find an element by class, use a period (.).
         Example: p.class-1 selects all paragraph elements belonging to class-1
     To find an element by id, use a pound sign (#).
         Example: div#uid selects the div element with id equal to uid

  5. Attributes in CSS
     Select paragraph elements of class class1 that are children of the div with id uid:
         css_locator = 'div#uid > p.class1'
     Select all elements whose class attribute includes class1:
         css_locator = '.class1'

  6. Class Status
         css = '.class1'
     Matches every element whose class attribute includes class1, even when other classes are present.

  7. Class Status
         xpath = '//*[@class="class1"]'
     Matches only elements whose class attribute is exactly "class1".

  8. Class Status
         xpath = '//*[contains(@class,"class1")]'
     Matches every element whose class attribute contains the text class1, including longer class names such as class12.

  9. Selectors with CSS
         from scrapy import Selector

         html = '''
         <html>
           <body>
             <div class="hello datacamp">
               <p>Hello World!</p>
             </div>
             <p>Enjoy DataCamp!</p>
           </body>
         </html>
         '''
         sel = Selector(text=html)

         >>> sel.css("div > p")
         out: [<Selector xpath='...' data='<p>Hello World!</p>'>]
         >>> sel.css("div > p").extract()
         out: ['<p>Hello World!</p>']

  10. C(SS) You Soon!

  11. Attribute and Text Selection

  12. You Must Have Guts to Use Your Colon
      Using XPath: <xpath-to-element>/@attr-name
          xpath = '//div[@id="uid"]/a/@href'
      Using CSS Locator: <css-to-element>::attr(attr-name)
          css_locator = 'div#uid > a::attr(href)'

  13. Text Extraction
          <p id="p-example">
            Hello world!
            Try <a href="http://www.datacamp.com">DataCamp</a> today!
          </p>
      In XPath, use text():
          sel.xpath('//p[@id="p-example"]/text()').extract()
          # result: ['\n Hello world!\n Try ', ' today!\n']
          sel.xpath('//p[@id="p-example"]//text()').extract()
          # result: ['\n Hello world!\n Try ', 'DataCamp', ' today!\n']

  14. Text Extraction
          <p id="p-example">
            Hello world!
            Try <a href="http://www.datacamp.com">DataCamp</a> today!
          </p>
      For CSS Locator, use ::text:
          sel.css('p#p-example::text').extract()
          # result: ['\n Hello world!\n Try ', ' today!\n']
          sel.css('p#p-example ::text').extract()
          # result: ['\n Hello world!\n Try ', 'DataCamp', ' today!\n']

  15. Scoping the Colon

  16. Getting Ready to Crawl

  17. Let's Respond
      Selector vs. Response:
      The Response has all the tools we learned with Selectors: the xpath and css methods, followed by the extract and extract_first methods.
      The Response also keeps track of the URL where the HTML code was loaded from.
      The Response helps us move from one site to another, so that we can "crawl" the web while scraping.

  18. What We Know!
      The xpath method works like a Selector:
          response.xpath('//div/span[@class="bio"]')
      The css method works like a Selector:
          response.css('div > span.bio')
      Chaining works like a Selector:
          response.xpath('//div').css('span.bio')
      Data extraction works like a Selector:
          response.xpath('//div').css('span.bio').extract()
          response.xpath('//div').css('span.bio').extract_first()

  19. What We Don't Know
      The response keeps track of the URL within the response url variable:
          response.url
          >>> 'http://www.DataCamp.com/courses/all'
      The response lets us "follow" a new link with the follow() method:
          # next_url is the string path of the next url we want to scrape
          response.follow(next_url)
      We'll learn more about follow later.
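
response.follow accepts relative paths, which are resolved against response.url. The resolution step alone can be sketched with the standard library. This does not call Scrapy at all, and both URL values are placeholders chosen for illustration.

```python
from urllib.parse import urljoin

# Placeholder values: the page we are on and a relative link found on it
current_url = 'http://www.datacamp.com/courses/all'
next_url = '/courses/intermediate-r'

# Resolve the relative path against the current page's URL
absolute_url = urljoin(current_url, next_url)

print(absolute_url)
```

This is why the scraped href values can be passed along as they are, without prepending the domain by hand.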

  20. In Response

  21. Scraping For Reals

  22. DataCamp Site: https://www.datacamp.com/courses/all

  23. What's the Div, Yo?
          # response loaded with HTML from https://www.datacamp.com/courses/all
          course_divs = response.css('div.course-block')
          print(len(course_divs))
          >>> 185

  24. Inspecting course-block
          first_div = course_divs[0]
          children = first_div.xpath('./*')
          print(len(children))
          >>> 3

  25. The first child
          first_div = course_divs[0]
          children = first_div.xpath('./*')
          first_child = children[0]
          print(first_child.extract())
          >>> <a class=... />

  26. The second child
          first_div = course_divs[0]
          children = first_div.xpath('./*')
          second_child = children[1]
          print(second_child.extract())
          >>> <div class=... />

  27. The forgotten child
          first_div = course_divs[0]
          children = first_div.xpath('./*')
          third_child = children[2]
          print(third_child.extract())
          >>> <span class=... />

  28. Listful
      In one CSS Locator:
          links = response.css('div.course-block > a::attr(href)').extract()
      Stepwise:
          # step 1: course blocks
          course_divs = response.css('div.course-block')
          # step 2: hyperlink elements
          hrefs = course_divs.xpath('./a/@href')
          # step 3: extract the links
          links = hrefs.extract()

  29. Get Schooled
          for l in links:
              print(l)
          >>> /courses/free-introduction-to-r
          >>> /courses/data-table-data-manipulation-r-tutorial
          >>> /courses/dplyr-data-manipulation-r-tutorial
          >>> /courses/ggvis-data-visualization-r-tutorial
          >>> /courses/reporting-with-r-markdown
          >>> /courses/intermediate-r
          ...

  30. Links Achieved
