Information Extraction from Websites
Michael Haas <haas@computerlinguist.org>
Service-Center Forschungsdaten, Universität Mannheim
22.01.2013

Lessons Learned - Context
■ My background: B.A. in Computational Linguistics, Universität Heidelberg
■ Projects at the Service-Center:
  ■ Manuel Trenz: monitoring price changes for a given set of products on online shops and price-comparison sites
  ■ Dominic Nyhuis: searching the online archives of 10 newspapers via screen scraping
  ■ Georg Wernicke: NER and sentiment analysis on newspaper articles
■ Goals for today:
  ■ Screen-scraping tutorial
  ■ Slides as a reference
  ■ Lessons learned as hints/best practices

Task
■ A client needs data from a website
■ Manual extraction with student assistants (HiWis) and copy & paste is too laborious
■ Automate!

Python
Concretely: the client wants to monitor product prices over an extended period.

    >>> import urllib2
    >>> content = urllib2.urlopen("http://host/produkt/id")
    HTTPError: HTTP Error 403: denied - contact webmaster@host

Python - Ninja Level 1
The website does not like our User-Agent!

    >>> request = urllib2.Request("http://host/produkt/123")
    >>> request.add_header('User-Agent', 'Mozilla/5.0')
    >>> opener = urllib2.build_opener()
    >>> content = opener.open(request).read()
    >>> content[0:30]
    '<!DOCTYPE HTML><html lang="de"'
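The slide uses Python 2's `urllib2`. In Python 3 that module was folded into `urllib.request`, so the same idea might look roughly like the sketch below. The helper names `build_request` and `fetch` are hypothetical and not part of the original slides, and the URL is the placeholder from the slide. Nothing is downloaded until `fetch` is actually called.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent="Mozilla/5.0", timeout=10):
    """Download the page body as bytes (performs a real HTTP request)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# "http://host/produkt/123" is the placeholder URL from the slide.
request = build_request("http://host/produkt/123")
```

Splitting request construction from the download keeps the header logic inspectable without touching the network.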

Python - Iteration
The client wants to monitor several products.

    >>> for p in products * 100:
    ...     request = urllib2.Request("http://host/produkt/" + p)
    ...     request.add_header('User-Agent', 'Mozilla/5.0')
    ...     opener = urllib2.build_opener()
    ...     content = opener.open(request).read()
    HTTPError: HTTP Error 421: too fast - contact webmaster@host

Python - Ninja Level 2

    >>> import time
    >>> for p in products * 100:
    ...     request = urllib2.Request("http://host/product/" + p)
    ...     request.add_header('User-Agent', 'Mozilla/5.0')
    ...     opener = urllib2.build_opener()
    ...     content = opener.open(request).read()
    ...     time.sleep(5)  # wait 5 seconds between requests

Python - Paranoid Ninja
The admin might be monitoring the access logs, so keep the intervals between requests random!

    >>> import random, time
    >>> for p in products * 100:
    ...     request = urllib2.Request("http://host/product/" + p)
    ...     request.add_header('User-Agent', 'Mozilla/5.0')
    ...     opener = urllib2.build_opener()
    ...     content = opener.open(request).read()
    ...     time.sleep(random.uniform(1, 5))  # random pause of 1 to 5 seconds
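The randomised-delay loop can be wrapped in a small reusable function. The following is a minimal Python 3 sketch under stated assumptions: `fetch_politely` is a hypothetical name, the fetcher and the sleep function are passed in as parameters so the logic can be tried without network access, and, unlike the slide, it pauses only between requests rather than after the last one.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomised pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetcher and a recording sleep, so nothing
# is downloaded and nothing actually waits.
pauses = []
pages = fetch_politely(
    ["http://host/product/1", "http://host/product/2", "http://host/product/3"],
    fetch=lambda url: "body of " + url,
    sleep=pauses.append,
)
```

In real use, `fetch` would be a function that performs the HTTP request and `sleep` would be left at its default.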

Python - there is a lib for that
All too complicated! Better:

    $ sudo easy_install-2.7 leechi
    $ ipython2
    >>> import leechi
    >>> l = leechi.Leechi()
    >>> for p in products:
    ...     content = l.fetchDelayed("http://host/product/" + p)

Python - Leechi

    >>> l = leechi.Leechi(cookies=True, retry=3)
    >>> l.chooseRandomUA()
    >>> l.setCustomUA("Wget/1.9.1")
    >>> handle = l.obtainHandle("http://host/")
    >>> content = handle.read()
    >>> handle = l.obtainHandleDelayed("http://host/")

Python - Leechi - Source
■ http://github.com/mhaas/leechi/
■ http://pypi.python.org/pypi/Leechi/0.2
■ Send patches:
  ■ Periodic rotation of the UA
  ■ Support for (anonymous) proxy servers
  ■ Tests

Python - HTML Parsing
And now?

    >>> content = """<html> <body>
    ... <p class="price">preis ist: 5 euro</p>
    ... </body></html>"""
    >>> import re
    >>> re.search(ur'preis ist: (\d{0,4}) euro', content).group(1)
    '5'

Python - HTML Parsing - Never RegEx
■ HTML/XML are context-free languages
■ Regular expressions describe regular languages
■ Context-free languages are more powerful than regular languages, so a regular expression cannot match arbitrarily nested tags [1]
[1] "Parsing HTML with regex summons tainted souls into the realm of the living." http://stackoverflow.com/a/1732454
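A small illustration of the nesting problem, as a Python 3 sketch rather than anything from the original slides. It uses the standard library's `html.parser` instead of BeautifulSoup purely to stay dependency-free; the class name `OuterDivText` and the sample markup are made up for the example.

```python
import re
from html.parser import HTMLParser

html = "<div>outer <div>inner</div> tail</div>"

# A regular expression cannot count nesting depth: the non-greedy match
# stops at the first closing tag and cuts the outer element short.
regex_result = re.search(r"<div>(.*?)</div>", html).group(1)


class OuterDivText(HTMLParser):
    """Collect all text inside the outermost <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = OuterDivText()
parser.feed(html)
parser.close()
parser_result = "".join(parser.chunks)

print(repr(regex_result))   # 'outer <div>inner'
print(repr(parser_result))  # 'outer inner tail'
```

The parser keeps a depth counter, which is exactly the kind of state a regular expression lacks.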

Python - HTML Parsing - BeautifulSoup
■ Better: BeautifulSoup 4
■ Handles everything, including tag soup:
  ■ missing closing tags
  ■ badly encoded special characters
■ http://www.crummy.com/software/BeautifulSoup/

Python - HTML Parsing - BeautifulSoup - Navigation
The HTML attribute class is very useful as a target.

    $ sudo easy_install-2.7 beautifulsoup4
    $ python2
    >>> content = """<html>
    ... <body>
    ... <p class="price">preis ist: 5e</p>
    ... </body>
    ... </html>"""
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(content)
    >>> soup.find(class_="price")
    <p class="price">preis ist: 5e</p>

Python - HTML Parsing - BeautifulSoup - Navigation
Containers for lists

    >>> content = """<ul id="priceList">
    ... <li>preis: 5e</li>
    ... <li>preis: 10e</li>
    ... <li>preis: 15e</li>
    ... </ul>"""
    >>> soup = BeautifulSoup(content)
    >>> priceList = soup.find('ul', id='priceList')
    >>> for node in priceList.children:  # or priceList.contents
    ...     match = re.search(ur"preis: (\d{1,4})e", node.string)
    ...     if match:  # skip whitespace-only text nodes between the <li> tags
    ...         print match.group(1)
    5
    10
    15

Python - HTML Parsing - BeautifulSoup - Navigation
Working your way through the tree

    >>> content = ('<div>'
    ...            '<h1 class="section-header">Preise</h1>'
    ...            '<h3>Zubehoer</h3>'
    ...            '<ul><li>preis: 5e</li></ul>'
    ...            '</div>')
    >>> soup = BeautifulSoup(content)
    >>> # No whitespace between the tags, so each sibling is a tag, not a text node.
    >>> soup.div.h1.nextSibling.nextSibling
    <ul><li>preis: 5e</li></ul>
    >>> soup.div.contents[2]
    <ul><li>preis: 5e</li></ul>
    >>> soup.find(class_="section-header").nextSibling.nextSibling
    <ul><li>preis: 5e</li></ul>

Python - HTML Parsing - BeautifulSoup - Navigation

    >>> for string in soup.strings:
    ...     print string
    Preise
    Zubehoer
    preis: 5e

■ soup.stripped_strings: the same strings with surrounding whitespace removed
■ soup.descendants: depth-first traversal of all nodes

BeautifulSoup - Searching
■ By tag name
■ By attributes
■ By text
■ Combinations

Python - HTML Parsing - BeautifulSoup - Searching - Tag

    >>> soup.h1
    <h1 class="section-header">Preise</h1>
    >>> soup.find('h1')
    <h1 class="section-header">Preise</h1>
    >>> soup.find_all('h1')[0]
    <h1 class="section-header">Preise</h1>

Python - HTML Parsing - BeautifulSoup - Searching - Attribute

    >>> content = """<div>
    ... <h1>Preise</h1>
    ... <h3 id="header">Zubehoer</h3>
    ... <ul>
    ... <li>preis: 5e</li>
    ... </ul>
    ... </div>"""
    >>> soup = BeautifulSoup(content)
    >>> soup.find(id="header")
    <h3 id="header">Zubehoer</h3>
    >>> soup.find("h3", id="header")
    <h3 id="header">Zubehoer</h3>
    >>> soup.find(id=re.compile('head'))
    <h3 id="header">Zubehoer</h3>
    >>> soup.find(id=re.compile('head'))["id"]
    'header'

Python - HTML Parsing - BeautifulSoup - Searching - Text

    >>> soup.find(text="Zubehoer")
    u'Zubehoer'
    >>> soup.find(text="Zubehoer").parent
    <h3 id="header">Zubehoer</h3>
    >>> # Output shown without the whitespace-only strings between tags.
    >>> soup.find_all(text=True)
    [u'Preise', u'Zubehoer', u'preis: 5e']
    >>> soup.find_all(text=re.compile("[pP]reis"))
    [u'Preise', u'preis: 5e']

Interim Status
We can:
■ download content undetected
■ parse content and extract information
Special case: automating search queries on websites!

Search Forms - Automating Forms
■ Search forms are <form> elements with <input> fields
■ Submitted via HTTP GET or POST
■ Caution: often requires cookies for session management!

    Leechi(cookies=True)
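How a form submission maps onto GET and POST can be sketched with the Python 3 standard library alone. This is an illustrative sketch, not Leechi's implementation: the function `build_search_request`, the URL `http://host/search`, and the field names `q` and `page` are all assumptions made up for the example. The cookie-aware opener plays the role that `cookies=True` plays in Leechi.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# An opener with a cookie jar keeps session cookies across requests.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)


def build_search_request(action_url, fields, method="GET"):
    """Encode form fields the way a browser would submit them."""
    encoded = urllib.parse.urlencode(fields)
    if method.upper() == "POST":
        # POST: fields travel in the request body.
        return urllib.request.Request(action_url, data=encoded.encode("utf-8"))
    # GET: fields are appended to the URL as a query string.
    return urllib.request.Request(action_url + "?" + encoded)


# Hypothetical search form with two <input> fields, "q" and "page".
get_request = build_search_request("http://host/search", {"q": "preis", "page": 1})
post_request = build_search_request(
    "http://host/search", {"q": "preis", "page": 1}, method="POST"
)
# opener.open(get_request) would send the request and store any cookies.
```

The field names to send are the `name` attributes of the form's `<input>` elements, and the target URL is the form's `action` attribute.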
