
A Classy Spider - Web Scraping in Python - Thomas Laetsch - PowerPoint PPT Presentation

  1. A Classy Spider - Web Scraping in Python - Thomas Laetsch, Data Scientist, NYU

  2. Your Spider

     import scrapy
     from scrapy.crawler import CrawlerProcess

     class SpiderClassName(scrapy.Spider):
         name = "spider_name"
         # the code for your spider
         ...

     process = CrawlerProcess()
     process.crawl(SpiderClassName)
     process.start()

  3. Your Spider

     Required imports:

         import scrapy
         from scrapy.crawler import CrawlerProcess

     The part we will focus on: the actual spider:

         class SpiderClassName(scrapy.Spider):
             name = "spider_name"
             # the code for your spider
             ...

     Running the spider:

         # initiate a CrawlerProcess
         process = CrawlerProcess()

         # tell the process which spider to use
         process.crawl(SpiderClassName)

         # start the crawling process
         process.start()

  4. Weaving the Web

     class DCspider(scrapy.Spider):
         name = 'dc_spider'

         def start_requests(self):
             urls = ['https://www.datacamp.com/courses/all']
             for url in urls:
                 yield scrapy.Request(url=url, callback=self.parse)

         def parse(self, response):
             # simple example: write out the html
             html_file = 'DC_courses.html'
             with open(html_file, 'wb') as fout:
                 fout.write(response.body)

     We need to have a function called start_requests.
     We need to have at least one parser function to handle the HTML code.
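Note: as a minimal sketch of how this spider is actually run, the class above can be combined with the CrawlerProcess boilerplate from slide 3. The filename DC_courses.html comes from the slide; running it requires scrapy installed and network access, and the saved HTML is whatever the page serves at crawl time.

     import scrapy
     from scrapy.crawler import CrawlerProcess

     class DCspider(scrapy.Spider):
         name = 'dc_spider'

         def start_requests(self):
             urls = ['https://www.datacamp.com/courses/all']
             for url in urls:
                 yield scrapy.Request(url=url, callback=self.parse)

         def parse(self, response):
             # simple example: write out the html
             with open('DC_courses.html', 'wb') as fout:
                 fout.write(response.body)

     # initiate the process, point it at the spider class, start crawling
     process = CrawlerProcess()
     process.crawl(DCspider)
     process.start()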

  5. We'll Weave the Web Together

  6. A Request for Service - Thomas Laetsch, Data Scientist, NYU

  7. Spider Recall

     import scrapy
     from scrapy.crawler import CrawlerProcess

     class SpiderClassName(scrapy.Spider):
         name = "spider_name"
         # the code for your spider
         ...

     process = CrawlerProcess()
     process.crawl(SpiderClassName)
     process.start()

  8. Spider Recall

     class DCspider(scrapy.Spider):
         name = "dc_spider"

         def start_requests(self):
             urls = ['https://www.datacamp.com/courses/all']
             for url in urls:
                 yield scrapy.Request(url=url, callback=self.parse)

         def parse(self, response):
             # simple example: write out the html
             html_file = 'DC_courses.html'
             with open(html_file, 'wb') as fout:
                 fout.write(response.body)

  9. The Skinny on start_requests

     def start_requests(self):
         urls = ['https://www.datacamp.com/courses/all']
         for url in urls:
             yield scrapy.Request(url=url, callback=self.parse)

     def start_requests(self):
         url = 'https://www.datacamp.com/courses/all'
         yield scrapy.Request(url=url, callback=self.parse)

     scrapy.Request here will fill in a response variable for us.
     The url argument tells us which site to scrape.
     The callback argument tells us where to send the response variable for processing.
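Note: the callback does not have to be the method named parse; it just has to be a spider method that accepts the response. A hypothetical sketch (the method name parse_html is made up for illustration, and both methods live inside the spider class):

     def start_requests(self):
         url = 'https://www.datacamp.com/courses/all'
         # send the resulting response to self.parse_html instead of self.parse
         yield scrapy.Request(url=url, callback=self.parse_html)

     def parse_html(self, response):
         # receives the response object that scrapy.Request filled in for us
         print(response.url)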

  10. Zoom Out

     class DCspider(scrapy.Spider):
         name = "dc_spider"

         def start_requests(self):
             urls = ['https://www.datacamp.com/courses/all']
             for url in urls:
                 yield scrapy.Request(url=url, callback=self.parse)

         def parse(self, response):
             # simple example: write out the html
             html_file = 'DC_courses.html'
             with open(html_file, 'wb') as fout:
                 fout.write(response.body)

  11. End Request

  12. Move Your Bloomin' Parse - Thomas Laetsch, Data Scientist, NYU

  13. Once Again

     class DCspider(scrapy.Spider):
         name = "dcspider"

         def start_requests(self):
             urls = ['https://www.datacamp.com/courses/all']
             for url in urls:
                 yield scrapy.Request(url=url, callback=self.parse)

         def parse(self, response):
             # simple example: write out the html
             html_file = 'DC_courses.html'
             with open(html_file, 'wb') as fout:
                 fout.write(response.body)

  14. You Already Know!

     def parse(self, response):
         # input parsing code with response that you already know!
         # output to a file, or...
         # crawl the web!
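Note: in other words, the response passed into parse supports the same .css(), .xpath(), and .extract() calls as the Selector objects from earlier chapters. A small sketch, reusing the course-link locator that appears on the next slide:

     def parse(self, response):
         # same selector methods you already used on Selector objects
         links = response.css('div.course-block > a::attr(href)').extract()
         print(len(links), 'course links found')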

  15. DataCamp Course Links: Save to File

     class DCspider(scrapy.Spider):
         name = "dcspider"

         def start_requests(self):
             urls = ['https://www.datacamp.com/courses/all']
             for url in urls:
                 yield scrapy.Request(url=url, callback=self.parse)

         def parse(self, response):
             links = response.css('div.course-block > a::attr(href)').extract()
             filepath = 'DC_links.csv'
             with open(filepath, 'w') as f:
                 f.writelines([link + '\n' for link in links])
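Note: after running this spider with the same CrawlerProcess boilerplate as before, the output can be sanity-checked with plain Python; the filename DC_links.csv comes from the slide, and the count depends on the live page.

     with open('DC_links.csv') as f:
         links = [line.strip() for line in f if line.strip()]
     print(len(links), 'links saved')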

  16. DataCamp Course Links: Parse Again

     class DCspider(scrapy.Spider):
         name = "dcspider"

         def start_requests(self):
             urls = ['https://www.datacamp.com/courses/all']
             for url in urls:
                 yield scrapy.Request(url=url, callback=self.parse)

         def parse(self, response):
             links = response.css('div.course-block > a::attr(href)').extract()
             for link in links:
                 yield response.follow(url=link, callback=self.parse2)

         def parse2(self, response):
             # parse the course sites here!
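Note: parse2 is left as a stub on this slide. A hedged sketch of what it might do, borrowing the course-title selector that appears later in the capstone (slide 22) and assuming the page structure shown there:

     def parse2(self, response):
         # extract and clean the course title from the followed course page
         crs_title = response.xpath('//h1[contains(@class,"title")]/text()').extract_first(default='').strip()
         print(response.url, '->', crs_title)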

  17. WEB SCRAPING IN PYTHON

  18. Johnny Parsin'

  19. Capstone - Thomas Laetsch, Data Scientist, NYU

  20. Inspecting Elements

     import scrapy
     from scrapy.crawler import CrawlerProcess

     class DC_Chapter_Spider(scrapy.Spider):
         name = "dc_chapter_spider"

         def start_requests(self):
             url = 'https://www.datacamp.com/courses/all'
             yield scrapy.Request(url=url, callback=self.parse_front)

         def parse_front(self, response):
             ## Code to parse the front courses page

         def parse_pages(self, response):
             ## Code to parse course pages
             ## Fill in dc_dict here

     dc_dict = dict()

     process = CrawlerProcess()
     process.crawl(DC_Chapter_Spider)
     process.start()

  21. Parsing the Front Page

     def parse_front(self, response):
         # Narrow in on the course blocks
         course_blocks = response.css('div.course-block')
         # Direct to the course links
         course_links = course_blocks.xpath('./a/@href')
         # Extract the links (as a list of strings)
         links_to_follow = course_links.extract()
         # Follow the links to the next parser
         for url in links_to_follow:
             yield response.follow(url=url, callback=self.parse_pages)
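Note: chaining a CSS Locator into an XPath call here is a stylistic choice; the single CSS Locator used on slide 15 should select the same hrefs, assuming the page structure shown in these slides:

     # one-step equivalent of the css + xpath chain above
     links_to_follow = response.css('div.course-block > a::attr(href)').extract()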

  22. Parsing the Course Pages

     def parse_pages(self, response):
         # Direct to the course title text
         crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
         # Extract and clean the course title text
         crs_title_ext = crs_title.extract_first().strip()
         # Direct to the chapter titles text
         ch_titles = response.css('h4.chapter__title::text')
         # Extract and clean the chapter titles text
         ch_titles_ext = [t.strip() for t in ch_titles.extract()]
         # Store this in our dictionary
         dc_dict[crs_title_ext] = ch_titles_ext
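Note: a sketch of the assembled capstone, plugging the parsers from slides 21 and 22 into the skeleton from slide 20. The selectors assume the page structure shown in the slides (which may have changed since), and dc_dict is the module-level dictionary from slide 20.

     import scrapy
     from scrapy.crawler import CrawlerProcess

     class DC_Chapter_Spider(scrapy.Spider):
         name = "dc_chapter_spider"

         def start_requests(self):
             url = 'https://www.datacamp.com/courses/all'
             yield scrapy.Request(url=url, callback=self.parse_front)

         def parse_front(self, response):
             # follow each course link to parse_pages
             links_to_follow = response.css('div.course-block > a::attr(href)').extract()
             for url in links_to_follow:
                 yield response.follow(url=url, callback=self.parse_pages)

         def parse_pages(self, response):
             # map course title -> list of chapter titles
             crs_title = response.xpath('//h1[contains(@class,"title")]/text()').extract_first(default='').strip()
             ch_titles = [t.strip() for t in response.css('h4.chapter__title::text').extract()]
             dc_dict[crs_title] = ch_titles

     dc_dict = dict()

     process = CrawlerProcess()
     process.crawl(DC_Chapter_Spider)
     process.start()

     # after the crawl finishes, dc_dict maps each course title to its chapter titles
     for course, chapters in dc_dict.items():
         print(course, '-', len(chapters), 'chapters')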

  23. It's Time to Weave

  24. Stop Scratching and Start Scraping! - Thomas Laetsch, Data Scientist, NYU

  25. Feeding the Machine

  26. Scraping Skills

     Objective: scrape a website computationally.
     How? We decide to use scrapy.
     How? We need to work with Selector and Response objects, and maybe even create a Spider.
     How? We need to learn XPath or CSS Locator notation (see the short comparison after this list).
     How? We need to understand the structure of HTML.
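Note: as a quick comparison of the two notations, here are the course links from the earlier slides located both ways. The snippet runs inside a parse method where a response is in scope, and the XPath version is an assumed rough equivalent of the CSS Locator from the slides.

     # same course links, located two ways
     links_css = response.css('div.course-block > a::attr(href)').extract()
     links_xpath = response.xpath('//div[contains(@class,"course-block")]/a/@href').extract()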

  27. What'd'ya Know?

     The structure of HTML
     XPath and CSS Locator notation
     How to use Selector and Response objects in scrapy
     How to set up a spider
     How to scrape the web

  28. EOT
