

SLIDE 1

International World Wide Web Conference 2006

Image Classification for Mobile Web Browsing

Takuya Maekawa*, Takahiro Hara**, Shojiro Nishio** NTT* Osaka Univ.**

SLIDE 2

Background

Many commercial products and research studies focus on how to browse large Web pages on mobile devices with a small screen.

  • Web page reconfiguration
  • Web page analysis

SLIDE 3

Background

Some Web images (contents) are discarded or downsized to fit in the page layout of the small screen.

Web page reconfiguration

  • Deleting images for page layout
  • Reducing photographic images
  • Discarding contents (Personalization)

SLIDE 4

Background

Overcoming the limitations of mobile devices and supporting Web browsing activities by analyzing Web pages.

Web page analysis

  • Automatically provide important images in the page
  • Component recognition

[Figure: example news page with headlines ("Live 8 - Globe unites to fight poverty", "Thousands march in Edinburgh", "Brad Pitt shows support at Live 8")]

SLIDE 5

Problems

Most studies and commercial products are prone to serious errors in detecting Web images because of their simple image role detection mechanisms.

Role of a Web image

  • Menu
  • Content
  • Ad

...

SLIDE 6

Problems

Web page reconfiguration

  • Deleting images for page layout
  • Reducing photographic images
  • Discarding contents (Personalization)

[Figure: reconfiguration example on a movie list page (2 Fast 2 Furious, 28 Days Later, The 40-Year-Old Virgin, Aeon Flux, Alien vs. Predator, A Man Apart). Labels: Delete; Reduce; Delete image and ignore table tag; Delete hyperlink]

SLIDE 7

Advantage

Web page analysis

  • Automatically provide important images in the page
  • Component recognition

White small images

Simple rule: if W and H < 10 pix, the image is an image for layout

32% error
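
The simple rule on this slide can be written down directly. This is a minimal sketch, assuming the rule means that both the width and the height are below 10 pixels; the function name `is_layout_image_simple` and the sample sizes are hypothetical and do not come from the slides.

```python
def is_layout_image_simple(width: int, height: int, limit: int = 10) -> bool:
    """Size-only rule: treat the image as a layout image when both
    width and height are below `limit` pixels (assumed reading of
    "W and H<10pix")."""
    return width < limit and height < limit


# Hypothetical (width, height) pairs: a spacer, a small bullet,
# a thin vertical divider, and a banner-sized image.
samples = [(1, 1), (8, 8), (9, 120), (468, 60)]
labels = [is_layout_image_simple(w, h) for w, h in samples]
print(labels)
```

A rule like this looks only at size, so it cannot tell a small bullet or icon from a spacer, which is one plausible source of the error rate quoted above.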

SLIDE 8

Goal

Automatic classification of Web images into categories according to image role

  • 1. Collect 3901 images from 40 Web sites
  • 2. Define 11 categories of Web images
  • 3. Categorize 3901 images into 11 categories manually
  • 4. Select 37 image features to automatically categorize Web images well

SLIDE 9

Collecting Web images

From 120 pages in 40 sites

  • Selected 3 pages per site, including the index page
  • Collected 3901 images in total

SLIDE 10

11 image categories

String images: MENU, SECTION, DECORATION, BUTTON

Small images: ITEM, ICON

TITLE, MAP, AD, CONTENT, LAYOUTER

SLIDE 11

Image categories

MENU

Images for the site menu. They are set in line horizontally in the upper and/or lower portion of the page.

67.6% of them had more than two horizontally in-line images at the same height.

They usually have small aspect ratios (average: 0.320).

SLIDE 12

Image categories

SECTION

Headers of a section or a column of the page. They have text following them (92.8%). They usually have small aspect ratios (average: 0.142).

SLIDE 13

Image categories

DECORATION

Images for decorative text. They represent text which would be difficult to create by using only HTML tags. These images don’t have hyperlinks.

SLIDE 14

Image categories

BUTTON

Images with hyperlinks. These images have neighboring text and hyperlinks to the associated pages.

They have text around them (above: 16.1%, below: 8.0%, left: 36.8%, right: 13.8%).

SLIDE 15

Image categories

ITEM

Line head images of an itemization. ITEM images with the same width are set in line vertically (74.6%). They have neighboring text on the right (99.4%). ITEM images usually have aspect ratios of about 1 (average: 1.052).

SLIDE 16

Image categories

ICON

Images that represent some kind of object. ICON images have neighboring text on the right or left. (right: 58.3%, left: 22.0%) ICON images usually have aspect ratios of about 1 (average: 0.942).

SLIDE 17

Image categories

TITLE

Title images of the page. TITLE images have hyperlinks to the index page of the site or to themselves.

MAP

Image maps.

<MAP NAME="world"> <AREA href="map.gif" … > </MAP>

SLIDE 18

Image categories

AD

Advertisement images. Some AD images have hyperlinks to other domains (average: 25.5%).

AD images usually have small aspect ratios (average: 0.459).

SLIDE 19

Image categories

CONTENT

Content images that are associated with the main contents of the page. CONTENT images have neighboring text on the right or below them (right: 35.1%, below: 51.7%). 55.4% of the CONTENT images were in JPEG format (other images: 6.6%).

SLIDE 20

Image categories

LAYOUTER

Images to control the design and layout of other images and/or text on the page.

Most LAYOUTER images are single-colored. LAYOUTER images usually appear many times on a page (average: 10.7).

SLIDE 21

Distribution of collected images

We manually categorized the collected images.

Category      Number
MENU             686
SECTION          469
DECORATION        69
BUTTON            87
ITEM             311
ICON             264
TITLE            141
MAP               53
AD               329
CONTENT          951
LAYOUTER         541
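
As a quick arithmetic check on the table above, the sketch below sums the per-category counts and computes each category's share of the data set. Only the counts come from the slide; the variable names are illustrative.

```python
# Per-category counts as listed on the slide.
counts = {
    "MENU": 686, "SECTION": 469, "DECORATION": 69, "BUTTON": 87,
    "ITEM": 311, "ICON": 264, "TITLE": 141, "MAP": 53,
    "AD": 329, "CONTENT": 951, "LAYOUTER": 541,
}

total = sum(counts.values())

# Share of each category in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<11}{counts[name]:>5}{share:>7}%")
print("total", total)
```

The counts add up to the 3901 collected images, and the categories are far from balanced: CONTENT is the largest, while DECORATION, BUTTON, and MAP are small.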

SLIDE 22

Image features

We defined 37 image features (F1-F37) to classify Web images. Not all mobile devices can extract all of the features. We grouped the features according to their sources.

  • F1-F20: HTML source analysis
  • F21, F22: Web server
  • F23-F30: Rendering information
  • F31-F37: Image processing

SLIDE 23

Image features (HTML)

  • F1: Dimension
  • F2: Width
  • F3: Height
  • F4: Aspect ratio
  • F5: Uses Map or not {TRUE, FALSE}
  • F6: Has a hyperlink or not {TRUE, FALSE}
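
One way F1-F6 might be read from an HTML source is sketched below with Python's standard `html.parser`. The slides do not define the features precisely, so several points are assumptions: "dimension" is taken as width × height, "aspect ratio" as height / width, "uses Map" as the presence of a `usemap` attribute, and "has a hyperlink" as being nested inside an `<a href=...>` element. The class and key names are hypothetical.

```python
from html.parser import HTMLParser


def _pixels(value):
    """Return an int for a plain pixel value such as "120", else None."""
    return int(value) if value and value.isdigit() else None


class ImgFeatureParser(HTMLParser):
    """Collects F1-F6 for each <img> that declares width and height."""

    def __init__(self):
        super().__init__()
        self.link_stack = []  # one bool per open <a>: does it carry href?
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.link_stack.append("href" in attrs)
        elif tag == "img":
            w, h = _pixels(attrs.get("width")), _pixels(attrs.get("height"))
            if not w or not h:
                return  # size is not available from the HTML source alone
            self.images.append({
                "F1_dimension": w * h,        # assumed: width x height
                "F2_width": w,
                "F3_height": h,
                "F4_aspect_ratio": h / w,     # assumed: height / width
                "F5_uses_map": "usemap" in attrs,
                "F6_has_hyperlink": any(self.link_stack),
            })

    def handle_endtag(self, tag):
        if tag == "a" and self.link_stack:
            self.link_stack.pop()


parser = ImgFeatureParser()
parser.feed('<a href="/news"><img src="b.gif" width="100" height="32"></a>'
            '<img src="m.gif" width="50" height="50" usemap="#world">')
features = parser.images
print(features)
```

Images without explicit `width` and `height` attributes are skipped here; obtaining their size would need the image file or the rendered page, which belongs to the other feature groups.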

SLIDE 24

Image features (HTML)

F7: Has an outlink or not {TRUE, FALSE}

Outlink: a hyperlink to another domain

F8: Has a loop-back-link or not {TRUE, FALSE}

A loop-back-link: a hyperlink to the index page of the site or a link to the page that it is on.

TITLE images and MENU images are usually set as ‘TRUE’.
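
The two link features defined above can be sketched with `urllib.parse`. This is an illustration under stated assumptions, not the authors' implementation: "another domain" is approximated by a different host name, and the set of paths treated as the index page (`index_paths`) is a hypothetical parameter.

```python
from urllib.parse import urljoin, urlsplit


def link_features(page_url, href, index_paths=("", "/", "/index.html")):
    """Compute F7 (outlink) and F8 (loop-back-link) for one image link.

    page_url: URL of the page that contains the image.
    href: the (possibly relative) link target of the image.
    """
    page = urlsplit(page_url)
    target = urlsplit(urljoin(page_url, href))  # resolve relative links
    same_host = target.hostname == page.hostname
    is_loop_back = same_host and (
        target.path in index_paths or target.path == page.path
    )
    return {"F7_outlink": not same_host, "F8_loop_back_link": is_loop_back}


page = "http://example.com/news/a.html"
print(link_features(page, "/"))                         # index page
print(link_features(page, "a.html"))                    # the page itself
print(link_features(page, "b.html"))                    # another page
print(link_features(page, "http://ads.example.net/x"))  # other host
```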

SLIDE 25

Image features (HTML)

F9: Has an ALT string or not {TRUE, FALSE}

String images and other text images are usually set as ‘TRUE’.

MENU:85.4%, SECTION:74.0%, DECORATION:66.7%, BUTTON:63.2%

F10: Number of characters in an ALT string

SLIDE 26

Image features (HTML)

  • F11: Number of characters in neighboring text
  • F12: JPEG image or not {TRUE, FALSE}
  • F13: Index in the HTML source

The index is the order of the corresponding tag in an HTML source. TITLE images have small values (average: 48.4, average of all images: 424.7).

SLIDE 27

Image features (HTML)

F14: Number of appearances on a page

F15: Number of images with the same dimension on a page

CONTENT:7.5, ICON:4.3, ITEM:4.0

SLIDE 28

Image features (HTML)

F16: Number of images with the same width on a page

CONTENT: 8.1, AD: 3.5, ICON: 4.3, ITEM:4.5, SECTION: 4.4

F17: Number of images with the same height on a page

CONTENT: 8.1, MENU: 8.5, SECTION: 4.8, ICON: 4.4, ITEM: 4.8
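
The per-page counting features F14-F17 can be sketched with `collections.Counter`. The slides do not say whether an image counts itself or how "dimension" is compared, so this sketch assumes that each count includes the image itself, that F14 counts images sharing the same `src`, and that "dimension" means width × height. The dictionary keys are hypothetical.

```python
from collections import Counter


def count_features(images):
    """images: list of dicts with 'src', 'width', 'height' for one page.

    Returns one feature dict per image, in the same order.
    """
    by_src = Counter(img["src"] for img in images)
    by_dim = Counter(img["width"] * img["height"] for img in images)
    by_width = Counter(img["width"] for img in images)
    by_height = Counter(img["height"] for img in images)
    return [
        {
            "F14_appearances": by_src[img["src"]],
            "F15_same_dimension": by_dim[img["width"] * img["height"]],
            "F16_same_width": by_width[img["width"]],
            "F17_same_height": by_height[img["height"]],
        }
        for img in images
    ]


# Hypothetical page: one bullet image used twice plus two other images.
page_images = [
    {"src": "a.gif", "width": 10, "height": 10},
    {"src": "a.gif", "width": 10, "height": 10},
    {"src": "b.gif", "width": 10, "height": 20},
    {"src": "c.gif", "width": 20, "height": 10},
]
feats = count_features(page_images)
print(feats)
```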

SLIDE 29

Image features (HTML)

F18-F20: Number of neighboring images with the same attribute

  • Height
  • Width
  • Dimension

SLIDE 30

Image features (Web server)

F21: Byte size

F22: Byte size per dimension

CONTENT: 0.83, AD: 0.71, ICON: 1.2, ITEM: 1.0, LAYOUTER: 8.9
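
F22 can be read as bytes per pixel. The sketch below assumes that "dimension" means width × height; the function name and the example file sizes are illustrative and not taken from the slides.

```python
def byte_size_per_dimension(byte_size: int, width: int, height: int) -> float:
    """F22 under the assumption dimension = width * height (pixels)."""
    area = width * height
    if area <= 0:
        raise ValueError("width and height must be positive")
    return byte_size / area


# Illustrative values: a tiny 1x1 spacer file and a larger photo.
spacer = byte_size_per_dimension(43, 1, 1)
photo = byte_size_per_dimension(24000, 200, 150)
print(spacer, photo)
```

Under this reading, very small images have a high ratio because fixed file overhead dominates, which is consistent with the comparatively high LAYOUTER value listed above.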

SLIDE 31

Image features (Rendering info.)

F23-F30: Features extracted when rendering the page

  • X coordinate
  • Y coordinate
  • Number of images with the same X coordinate
  • Number of images with the same Y coordinate

...

SLIDE 32

Image features (Image processing)

  • F31: Number of colors
  • F32: Number of concolorous regions
  • F33: Minimum similarity to neighboring images

SLIDE 33

Image features (Image processing)

F34: Animated GIF or not

14.29% of AD images were animated GIFs (other images: 0.36%).

F35: Has a rounded-corner rectangle or not (BUTTON: 37.9%)

SLIDE 34

Image features (Image processing)

F36: Text region occupancy ratio

LAYOUTER: 0.40%, SECTION: 37.89%, DECORATION: 55.19%, TITLE: 44.85%

F37: Number of text regions

AD: 2.75, MENU: 1.04, SECTION: 1.19

SLIDE 35

Experiment

We performed forty classification tests with a decision tree.

  • Training set: images from thirty-nine sites
  • Test set: images from the remaining site

[Conditions]

  • C1: HTML source analysis (F1-F20)
  • C2: HTML + Web server (F1-F22)
  • C3: HTML + Web server + Rendering info. (F1-F30)
  • C4: HTML + Web server + Image processing (F1-F22, F31-F37)
  • C5: All features
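
The evaluation protocol described above (forty tests, each holding out one site) and the five feature conditions can be sketched as follows. This only shows the data split and the feature subsets; training the decision tree itself is left out, since the slides do not name the algorithm or tool. The sample layout `(site_id, features, label)` and all names are assumptions made for illustration.

```python
# Feature numbers used under each condition, as listed on the slide.
CONDITIONS = {
    "C1": list(range(1, 21)),                        # F1-F20
    "C2": list(range(1, 23)),                        # F1-F22
    "C3": list(range(1, 31)),                        # F1-F30
    "C4": list(range(1, 23)) + list(range(31, 38)),  # F1-F22, F31-F37
    "C5": list(range(1, 38)),                        # F1-F37
}


def leave_one_site_out(samples):
    """Yield (held_out_site, train, test) once per site.

    samples: list of (site_id, features, label) tuples (hypothetical layout).
    """
    sites = sorted({site for site, _, _ in samples})
    for held_out in sites:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Toy data: 40 sites with two placeholder images each.
toy = [(site, {}, "MENU") for site in range(40) for _ in range(2)]
folds = list(leave_one_site_out(toy))
print(len(folds), len(folds[0][1]), len(folds[0][2]))
```

Splitting by site rather than by image keeps images from the same site out of both the training and the test set in any one fold.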

SLIDE 36

Result

Precision

C1     C2     C3     C4     C5
0.749  0.768  0.796  0.766  0.831

F-Measure

Category    C1    C2    C3    C4    C5
MENU        0.85  0.89  0.85  0.83  0.88
SECTION     0.75  0.74  0.86  0.77  0.87
DECORATION  0.18  0.11  0.11  0.17  0.29
BUTTON      0.39  0.36  0.36  0.37  0.46
ITEM        0.58  0.65  0.80  0.57  0.83
ICON        0.39  0.48  0.65  0.54  0.67
TITLE       0.68  0.73  0.73  0.68  0.80
MAP         0.98  0.98  0.98  0.97  0.97
AD          0.71  0.74  0.65  0.71  0.69
CONTENT     0.89  0.88  0.91  0.91  0.91
LAYOUTER    0.75  0.76  0.78  0.84  0.88
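
For reference, per-category precision, recall, and F-measure can be computed from true and predicted labels as sketched below. This uses the standard definitions (F-measure as the harmonic mean of precision and recall); the slides do not state the exact formulas used, so treat this as an assumption. The labels in the example are made up.

```python
def precision_recall_f(true_labels, predicted_labels, category):
    """Return (precision, recall, F-measure) for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up labels for five images.
true = ["AD", "AD", "MENU", "MENU", "AD"]
pred = ["AD", "MENU", "MENU", "AD", "AD"]
p, r, f = precision_recall_f(true, pred, "AD")
print(round(p, 3), round(r, 3), round(f, 3))
```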

SLIDE 37

Result

SLIDE 38

Conclusion

Image classification for mobile Web browsing

  • 3901 images
  • 11 categories
  • 83.1% precision