Image Classification for Mobile Web Browsing
Takuya Maekawa*, Takahiro Hara**, Shojiro Nishio**
NTT*, Osaka Univ.**
International World Wide Web Conference 2006
Background
Many commercial products and research studies focus on how to browse large Web pages on mobile devices with a small screen.
- Web page reconfiguration
- Web page analysis
Background
Some Web images (contents) are discarded or downsized to fit in the page layout of the small screen.
Web page reconfiguration
- Deleting images for page layout
- Reducing photographic images
- Discarding contents (Personalization)
Background
Overcoming the limitations of mobile devices and supporting Web browsing activities by analyzing Web pages.
Web page analysis
- Automatically provide important images in the page
- Component recognition
[Figure: example news page (Live 8 coverage)]
Problems
Most studies and commercial products are prone to serious errors in detecting Web images because of their simple image role detection mechanisms.
Role of a Web image
- Menu
- Content
- Ad
...
Problems
Web page reconfiguration
- Deleting images for page layout
- Reducing photographic images
- Discarding contents (Personalization)
[Figure: example page listing movie titles, with images marked "Delete" and "Reduce"]
- Delete image and ignore table tag
- Delete hyperlink
Advantage
Web page analysis
- Automatically provide important images in the page
- Component recognition
Example: small white images
Simple rule: if W and H < 10 pix, the image is for layout
32% error
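The simple rule above can be written as a one-line check. This is only a sketch of the baseline that the slide reports as having a 32% error; the function name and the `max_side` parameter are illustrative assumptions, not names from the talk.

```python
def is_layout_image(width, height, max_side=10):
    """Baseline rule from the slide: an image whose width AND height
    are both below 10 pixels is treated as a layout image."""
    return width < max_side and height < max_side


if __name__ == "__main__":
    print(is_layout_image(1, 1))     # typical spacer image
    print(is_layout_image(468, 60))  # banner-sized image
```

A rule this coarse ignores every other cue (hyperlinks, position, repetition), which is the motivation for the feature-based classification that follows.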
Goal
Automatic classification of Web images into categories according to image role
1. Collect 3901 images from 40 Web sites.
2. Define 11 categories of Web images.
3. Manually categorize the 3901 images into the 11 categories.
4. Select 37 image features that categorize Web images well automatically.
Collecting Web images
From 120 pages in 40 sites
Selected 3 pages per site, including the index page.
Collected 3901 images in total.
11 image categories
String images: MENU, SECTION, DECORATION, BUTTON
Small images: ITEM, ICON
Others: TITLE, MAP, AD, CONTENT, LAYOUTER
Image categories
MENU
Images for the site menu. They are set in line horizontally in the upper and/or lower portion of the page.
67.6% of them had more than two horizontally in-line images at the same height.
They usually have small aspect ratios (average: 0.320).
Image categories
SECTION
Headers of a section or a column of the page.
They have text following them (92.8%).
They usually have small aspect ratios (average: 0.142).
Image categories
DECORATION
Images for decorative text. They represent text that would be difficult to create using only HTML tags.
These images do not have hyperlinks.
Image categories
BUTTON
Images with hyperlinks. These images have neighboring text and hyperlinks to the associated pages.
Position of the neighboring text: above: 16.1%, below: 8.0%, left: 36.8%, right: 13.8%
Image categories
ITEM
Line-head images of an itemization.
ITEM images with the same width are set in line vertically (74.6%).
ITEM images have neighboring text on the right (99.4%).
ITEM images usually have aspect ratios of about 1 (average: 1.052).
Image categories
ICON
Images that represent some kind of object.
ICON images have neighboring text on the right or left (right: 58.3%, left: 22.0%).
ICON images usually have aspect ratios of about 1 (average: 0.942).
Image categories
TITLE
Title images of the page. TITLE images have hyperlinks to the index page of the site or to themselves.
MAP
Image maps.
<MAP NAME="world"> <AREA href="map.gif" … > </MAP>
Image categories
AD
Advertisement images.
Some AD images have hyperlinks to other domains (average: 25.5%).
AD images usually have small aspect ratios (average: 0.459).
Image categories
CONTENT
Content images that are associated with the main contents of the page.
CONTENT images have neighboring text on the right or below them (right: 35.1%, below: 51.7%).
55.4% of the CONTENT images were in JPEG format (other images: 6.6%).
Image categories
LAYOUTER
Images to control the design and layout of other images and/or text on the page.
Most LAYOUTER images are a single solid color.
LAYOUTER images usually appear many times on a page (average: 10.7).
Distribution of collected images
We manually categorized collected images.
Category      Number
MENU          686
SECTION       469
DECORATION    69
BUTTON        87
ITEM          311
ICON          264
TITLE         141
MAP           53
AD            329
CONTENT       951
LAYOUTER      541
Image features
We defined 37 image features (F1-F37) to classify Web images.
Not all mobile devices can extract all features, so we grouped the features according to their sources.
- F1-F20: HTML source analysis
- F21, F22: Web server
- F23-F30: Rendering information
- F31-F37: Image processing
Image features (HTML)
F1: Dimension
F2: Width
F3: Height
F4: Aspect ratio
F5: Uses Map or not {TRUE, FALSE}
F6: Has a hyperlink or not {TRUE, FALSE}
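A minimal sketch of how F1-F4 might be derived from the width and height attributes of an image tag. The slides do not define the terms, so two assumptions are made here: "dimension" is taken to be the pixel area (width x height), and "aspect ratio" is taken to be height / width, which matches the small averages reported for wide MENU and SECTION images. The function and key names are illustrative.

```python
def size_features(width, height):
    """Return F1-F4 for one image.

    Assumptions (not stated in the slides):
      - dimension    = width * height (pixel area)
      - aspect ratio = height / width
    """
    return {
        "F1_dimension": width * height,
        "F2_width": width,
        "F3_height": height,
        # Guard against zero-width images declared in broken HTML.
        "F4_aspect_ratio": height / width if width else 0.0,
    }


if __name__ == "__main__":
    # A wide, menu-like image gives a small aspect ratio.
    print(size_features(100, 32))
```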
Image features (HTML)
F7: Has an outlink or not {TRUE, FALSE}
Outlink: a hyperlink to another domain
F8: Has a loop-back-link or not {TRUE, FALSE}
Loop-back-link: a hyperlink to the index page of the site or to the page that the image is on.
TITLE images and MENU images are usually set as ‘TRUE’.
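The outlink and loop-back-link definitions can be sketched with the standard library's `urllib.parse`. This is an illustration under stated assumptions, not the authors' implementation: "another domain" is approximated by comparing host names, and the set of paths treated as the site's index page (`index_paths`) is a hypothetical parameter.

```python
from urllib.parse import urljoin, urlparse


def _host_and_path(url):
    """Split a URL into (lower-cased host, path), treating an empty path as '/'."""
    parts = urlparse(url)
    return parts.netloc.lower(), parts.path or "/"


def link_features(page_url, href, index_paths=("/", "/index.html")):
    """Return F6-F8 for an image on page_url whose enclosing anchor has href.

    Assumptions: hosts stand in for domains, and index_paths lists the
    paths considered to be the site's index page.
    """
    if not href:
        return {"F6_has_link": False, "F7_outlink": False, "F8_loop_back": False}
    target = urljoin(page_url, href)  # resolve relative links
    page_host, page_path = _host_and_path(page_url)
    host, path = _host_and_path(target)
    outlink = host != page_host
    loop_back = (not outlink) and (path == page_path or path in index_paths)
    return {"F6_has_link": True, "F7_outlink": outlink, "F8_loop_back": loop_back}


if __name__ == "__main__":
    page = "http://example.com/news/a.html"
    print(link_features(page, "/"))                         # link to the index page
    print(link_features(page, "http://ads.example.net/x"))  # link to another host
```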
Image features (HTML)
F9: Has an ALT string or not {TRUE, FALSE}
String images and other text images are usually set as ‘TRUE’.
MENU:85.4%, SECTION:74.0%, DECORATION:66.7%, BUTTON:63.2%
F10: Number of characters in an ALT string
Image features (HTML)
F11: Number of characters in neighboring text
F12: JPEG image or not {TRUE, FALSE}
F13: Index in the HTML source
The index is the order of the corresponding tag in the HTML source. TITLE images have small values (average: 48.4; average of all images: 424.7).
Image features (HTML)
F14: Number of appearances on a page
F15: Number of images with the same dimension on a page
CONTENT: 7.5, ICON: 4.3, ITEM: 4.0
Image features (HTML)
F16: Number of images with the same width on a page
CONTENT: 8.1, AD: 3.5, ICON: 4.3, ITEM:4.5, SECTION: 4.4
F17: Number of images with the same height on a page
CONTENT: 8.1, MENU: 8.5, SECTION: 4.8, ICON: 4.4, ITEM: 4.8
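The repetition features F14-F17 amount to counting images that share a source, a size, a width, or a height on one page. The sketch below assumes each image is a dict with `src`, `width`, and `height` keys (a hypothetical representation), that "same dimension" means the same width and height pair, and that each count includes the image itself; the slides do not settle any of these points.

```python
from collections import Counter


def repetition_features(images):
    """Compute F14-F17 for every image on a page.

    images: list of dicts with 'src', 'width', 'height' (assumed layout).
    Counts include the image itself (an assumption).
    """
    by_src = Counter(img["src"] for img in images)
    by_dim = Counter((img["width"], img["height"]) for img in images)
    by_width = Counter(img["width"] for img in images)
    by_height = Counter(img["height"] for img in images)
    return [
        {
            "F14_appearances": by_src[img["src"]],
            "F15_same_dimension": by_dim[(img["width"], img["height"])],
            "F16_same_width": by_width[img["width"]],
            "F17_same_height": by_height[img["height"]],
        }
        for img in images
    ]


if __name__ == "__main__":
    page = [
        {"src": "spacer.gif", "width": 1, "height": 1},
        {"src": "spacer.gif", "width": 1, "height": 1},
        {"src": "menu1.gif", "width": 80, "height": 20},
        {"src": "menu2.gif", "width": 60, "height": 20},
    ]
    for row in repetition_features(page):
        print(row)
```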
Image features (HTML)
F18-F20: Number of neighboring images with the same attribute (height, width, dimension)
Image features (Web server)
F21: Byte size
F22: Byte size per dimension
CONTENT: 0.83, AD: 0.71, ICON: 1.2, ITEM: 1.0, LAYOUTER: 8.9
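As a rough illustration of F22, the sketch below divides the byte size by the pixel area, assuming that "dimension" means width x height (the slides do not say so explicitly). Under that reading, a tiny spacer image yields a large value because file-format overhead dominates, which is consistent with the high LAYOUTER average. The function name and the sample sizes are hypothetical.

```python
def byte_size_features(byte_size, width, height):
    """Return F21 and F22, assuming 'dimension' is the pixel area."""
    area = width * height
    return {
        "F21_byte_size": byte_size,
        # Avoid division by zero for images declared with a zero side.
        "F22_bytes_per_dimension": byte_size / area if area else 0.0,
    }


if __name__ == "__main__":
    print(byte_size_features(43, 1, 1))         # tiny spacer: overhead dominates
    print(byte_size_features(24000, 200, 150))  # photo-sized image
```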
Image features (Rendering info.)
F23-F30: Features extracted when rendering the page
- X coordinate
- Y coordinate
- Number of images with the same X coordinate
- Number of images with the same Y coordinate
...
Image features (Image processing)
F31: Number of colors
F32: Number of concolorous regions
F33: Minimum similarity to neighboring images
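One way to approximate F31 and F32 is to count distinct pixel values and then count connected regions of identical color with a flood fill. The sketch below works on a plain 2D list of color values rather than a decoded image file, and it assumes 4-connectivity and exact color equality; the slides do not state how the authors defined a concolorous region.

```python
from collections import deque


def color_features(pixels):
    """Return F31 (distinct colors) and F32 (same-color connected regions).

    pixels: list of rows, each a list of hashable color values.
    Assumes 4-connectivity and exact color matches.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    colors = {c for row in pixels for c in row}
    seen = [[False] * width for _ in range(height)]
    regions = 0
    for y in range(height):
        for x in range(width):
            if seen[y][x]:
                continue
            # Unvisited pixel: start a new region and flood-fill it.
            regions += 1
            seen[y][x] = True
            queue = deque([(x, y)])
            while queue:
                cx, cy = queue.popleft()
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and pixels[ny][nx] == pixels[cy][cx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
    return {"F31_num_colors": len(colors), "F32_num_regions": regions}


if __name__ == "__main__":
    grid = [
        [0, 0, 1],
        [0, 1, 1],
        [2, 2, 0],
    ]
    print(color_features(grid))
```

A solid-colored spacer gives one color and one region, while photographic content gives many of both.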
Image features (Image processing)
F34: Animated GIF or not
14.29% of AD images were animated GIFs (other images: 0.36%).
F35: Has a rounded-corner rectangle or not (BUTTON: 37.9%)
Image features (Image processing)
F36: Text region occupancy ratio
LAYOUTER: 0.40%, SECTION: 37.89%, DECORATION: 55.19%, TITLE: 44.85%
F37: Number of text regions
AD: 2.75, MENU: 1.04, SECTION: 1.19
Experiment
We performed forty classification tests with a decision tree.
Training set: images from thirty-nine sites
Test set: images from the remaining site
[Conditions]
- C1: HTML source analysis (F1-F20)
- C2: HTML + Web server (F1-F22)
- C3: HTML + Web server + Rendering info. (F1-F30)
- C4: HTML + Web server + Image processing (F1-F22, F31-F37)
- C5: All features
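The protocol described here, with forty tests that each hold out one site, is commonly called leave-one-group-out cross-validation. The sketch below shows the feature subsets for conditions C1-C5 and a generator for the per-site splits. The sample layout `(site, features, label)` and all names are assumptions for illustration, and the decision tree learner itself is left out because the slides do not name the algorithm or tool used.

```python
def feature_ids(*ranges):
    """Expand inclusive (low, high) ranges into feature names such as 'F1'."""
    return [f"F{i}" for low, high in ranges for i in range(low, high + 1)]


# Feature subsets for the five experimental conditions.
CONDITIONS = {
    "C1": feature_ids((1, 20)),            # HTML source analysis
    "C2": feature_ids((1, 22)),            # + Web server
    "C3": feature_ids((1, 30)),            # + rendering information
    "C4": feature_ids((1, 22), (31, 37)),  # HTML + Web server + image processing
    "C5": feature_ids((1, 37)),            # all features
}


def leave_one_site_out(samples):
    """Yield (held_out_site, train, test) once per site.

    samples: list of (site, features, label) tuples (assumed layout).
    """
    sites = sorted({site for site, _, _ in samples})
    for held_out in sites:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    data = [
        ("site-a", {"F2": 80}, "MENU"),
        ("site-a", {"F2": 1}, "LAYOUTER"),
        ("site-b", {"F2": 300}, "CONTENT"),
        ("site-c", {"F2": 468}, "AD"),
    ]
    for site, train, test in leave_one_site_out(data):
        print(site, len(train), len(test))
```

Splitting by site rather than by image keeps images from the same site out of both sets at once, so the test measures how well the classifier transfers to an unseen site.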
Result
Precision
C1     C2     C3     C4     C5
0.749  0.768  0.796  0.766  0.831

F-Measure
            C1    C2    C3    C4    C5
MENU        0.85  0.89  0.85  0.83  0.88
SECTION     0.75  0.74  0.86  0.77  0.87
DECORATION  0.18  0.11  0.11  0.17  0.29
BUTTON      0.39  0.36  0.36  0.37  0.46
ITEM        0.58  0.65  0.80  0.57  0.83
ICON        0.39  0.48  0.65  0.54  0.67
TITLE       0.68  0.73  0.73  0.68  0.80
MAP         0.98  0.98  0.98  0.97  0.97
AD          0.71  0.74  0.65  0.71  0.69
CONTENT     0.89  0.88  0.91  0.91  0.91
LAYOUTER    0.75  0.76  0.78  0.84  0.88
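For readers unfamiliar with the metrics named above, the sketch below computes per-category precision, recall, and F-measure from true and predicted labels. It assumes the standard balanced F-measure (the harmonic mean of precision and recall); the slides do not state which variant was used, and the function name is illustrative.

```python
def precision_recall_f(true_labels, predicted_labels, category):
    """Return (precision, recall, F-measure) for one category.

    Assumes the balanced F-measure: 2 * P * R / (P + R).
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    truth = ["AD", "AD", "MENU", "MENU"]
    guess = ["AD", "MENU", "MENU", "MENU"]
    print(precision_recall_f(truth, guess, "AD"))
    print(precision_recall_f(truth, guess, "MENU"))
```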
Result