CSS Locators
WEB SCRAPING IN PYTHON
Thomas Laetsch
Data Scientist, NYU
/ is replaced by > (except the first character)
XPath: /html/body/div CSS Locator: html > body > div
// is replaced by a blank space (except the first character)
XPath: //div/span//p CSS Locator: div > span p
[N] replaced by :nth-of-type(N)
XPath: //div/p[2] CSS Locator: div > p:nth-of-type(2)
XPATH
xpath = '/html/body//div/p[2]'
CSS
css = 'html > body div > p:nth-of-type(2)'
To find an element by class, use a period (.)
Example: p.class-1 selects all paragraph elements belonging to class-1
To find an element by id, use a pound sign (#)
Example: div#uid selects the div element with id equal to uid
Select paragraph elements within class class1:
css_locator = 'div#uid > p.class1'
Select all elements whose class attribute belongs to class1:
css_locator = '.class1'
In CSS, selecting by class membership is simple:
css = '.class1'
In XPath, testing @class with equality matches only elements whose class attribute is exactly "class1":
xpath = '//*[@class="class1"]'
To also match elements that list class1 among other classes, use contains:
xpath = '//*[contains(@class,"class1")]'
from scrapy import Selector

html = '''
<html>
  <body>
    <div class="hello datacamp">
      <p>Hello World!</p>
    </div>
    <p>Enjoy DataCamp!</p>
  </body>
</html>
'''

sel = Selector(text=html)

>>> sel.css("div > p")
>>> sel.css("div > p").extract()
Using XPath: <xpath-to-element>/@attr-name
xpath = '//div[@id="uid"]/a/@href'
Using CSS Locator: <css-to-element>::attr(attr-name)
css_locator = 'div#uid > a::attr(href)'
<p id="p-example">
  Hello world!
  Try <a href="http://www.datacamp.com">DataCamp</a> today!
</p>
In XPath, use text()

sel.xpath('//p[@id="p-example"]/text()').extract()
# result: ['\n Hello world!\n Try ', ' today!\n']

sel.xpath('//p[@id="p-example"]//text()').extract()
# result: ['\n Hello world!\n Try ', 'DataCamp', ' today!\n']
For a CSS Locator, use ::text

sel.css('p#p-example::text').extract()
# result: ['\n Hello world!\n Try ', ' today!\n']

sel.css('p#p-example ::text').extract()
# result: ['\n Hello world!\n Try ', 'DataCamp', ' today!\n']
Selector vs. Response: the Response has all the tools we learned with Selectors: the xpath and css methods, followed by the extract and extract_first methods.
The Response also keeps track of the URL where the HTML code was loaded from, and it helps us move from one site to another, so that we can "crawl" the web while scraping.
xpath method works like a Selector
response.xpath( '//div/span[@class="bio"]' )
css method works like a Selector
response.css( 'div > span.bio' )
Chaining works like a Selector
response.xpath('//div').css('span.bio')
Data extraction works like a Selector
response.xpath('//div').css('span.bio').extract()
response.xpath('//div').css('span.bio').extract_first()
The response keeps track of the URL within its url attribute.
response.url
>>> 'http://www.DataCamp.com/courses/all'
The response lets us "follow" a new link with the follow() method
# next_url is the string path of the next url we want to scrape
response.follow( next_url )
We'll learn more about follow later.
https://www.datacamp.com/courses/all
# response loaded with HTML from https://www.datacamp.com/courses/all
course_divs = response.css('div.course-block')
print( len(course_divs) )
>>> 185
first_div = course_divs[0]
children = first_div.xpath('./*')
print( len(children) )
>>> 3

first_child = children[0]
print( first_child.extract() )
>>> <a class=... />

second_child = children[1]
print( second_child.extract() )
>>> <div class=... />

third_child = children[2]
print( third_child.extract() )
>>> <span class=... />
In one CSS Locator:

links = response.css('div.course-block > a::attr(href)').extract()

Stepwise:

# step 1: course blocks
course_divs = response.css('div.course-block')
# step 2: hyperlink elements
hrefs = course_divs.xpath('./a/@href')
# step 3: extract the links
links = hrefs.extract()
for l in links:
    print( l )
>>> /courses/free-introduction-to-r
>>> /courses/data-table-data-manipulation-r-tutorial
>>> /courses/dplyr-data-manipulation-r-tutorial
>>> /courses/ggvis-data-visualization-r-tutorial
>>> /courses/reporting-with-r-markdown
>>> /courses/intermediate-r
...