Combining Text and Image Processing in an Automatic Image Annotation System


SLIDE 1

Combining Text and Image Processing in an Automatic Image Annotation System

Iulian Ilieș (SHSS, Jacobs University)
Joint work with Arne Jacobs, Otthein Herzog (TZI, Universität Bremen), and Adalbert Wilhelm (SHSS, Jacobs University)
Supported by the Deutsche Forschungsgemeinschaft (DFG)

SLIDE 2

Overview

• Motivation and approach
• Current work:
  ◦ Framework of concept propagation
  ◦ Data and algorithms employed
  ◦ Comparison of different classifiers
  ◦ Effect of visual vocabulary size
• Summary and outlook

SLIDE 3

Motivation

• Continuously increasing quantity of image data available on the Internet, which necessitates efficient classification and indexing methods for easy access and usage
• Existing methods, especially mainstream, do not exploit all available information:
  ◦ Text-based search, using file names and/or captions
  ◦ Pure visual search, relying only on image features
  ◦ Semantic search, via image understanding techniques

SLIDE 4

Approach

• Combine the advantages of these different viewpoints into an integrated framework, which would allow the classification of images using keywords, features, or both
• Focus on the construction of a dual-layered linkage scheme between images, based on the co-occurrence of keywords, and on similarities between visual features
• Define visual words, and associate them to keywords

SLIDE 5

Framework

[Diagram. Labels: Images; Captions; Textual concept detector; Keywords; Visual concept detector; Clustering algorithm; Visual words (prototype features); Classifier]

SLIDE 6

Concept propagation

• Directly transfer the associations with keywords from captions to related images, and further to the visual features found in these images
• For each visual word, average across the visual features that have it as prototype, and contrast the obtained value with the corresponding global average
• These operations can be performed in reversed order!
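
The propagation steps listed on this slide can be sketched roughly as follows. This is only an illustration under simplifying assumptions: keyword associations are taken to be binary (0/1) per image, and each visual feature is assumed to be already assigned to a visual word. The function name `propagate_keyword` and the toy data are hypothetical and do not come from the presentation.

```python
from collections import defaultdict

def propagate_keyword(features, image_has_keyword):
    """Propagate one keyword from images to visual words.

    features: list of (image_id, visual_word_id) pairs, one per visual feature.
    image_has_keyword: dict mapping image_id to 0 or 1 (keyword in caption).
    Returns (global_average, {visual_word_id: within_cluster_average}).
    """
    # Step 1: every feature inherits the keyword association of its image.
    labels = [(word, image_has_keyword[image]) for image, word in features]

    # Step 2: global average of the association over all features.
    global_average = sum(label for _, label in labels) / len(labels)

    # Step 3: average of the association within each visual word (cluster).
    sums = defaultdict(float)
    counts = defaultdict(int)
    for word, label in labels:
        sums[word] += label
        counts[word] += 1
    within = {word: sums[word] / counts[word] for word in counts}
    return global_average, within

# Toy data (assumed): two images, three features each, two visual words.
toy_features = [("a", 0), ("a", 0), ("a", 1), ("b", 1), ("b", 1), ("b", 0)]
toy_labels = {"a": 1, "b": 0}
global_average, within = propagate_keyword(toy_features, toy_labels)
```

In this toy case visual word 0 occurs mostly in the image carrying the keyword, so its within-cluster average lies above the global average, while visual word 1 falls below it. How the two averages are contrasted is the job of the transfer functions described later in the deck.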

SLIDE 7

Training / Testing

[Diagram of the training and testing stages. Labels: Captions; Images; Test images; Visual features; Visual words (clusters); Feature-concept associations; Cluster-concept associations; Image-concept associations; Classifier]

SLIDE 8

Data employed

• Images and related text (e.g. captions, titles) harvested from news websites
• Strongly structured articles that can be parsed automatically

SLIDE 9

Concept detectors

• Specialized keyword detector:
  ◦ Person names extracted from captions by a named entity recognizer (NER; Drozdzynski et al. 2004), complemented by manual annotations
• Generic visual feature detector:
  ◦ Interest-point descriptors extracted from images by the SIFT algorithm (Lowe 1999), clustered into a vocabulary of visual words (Sivic & Zisserman 2003)
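
As a rough illustration of the visual-vocabulary idea, the sketch below clusters a handful of toy descriptors with a plain k-means loop and then describes an image by counting how many of its descriptors fall on each visual word. It is a simplified stand-in, not the authors' pipeline: real SIFT descriptors are 128-dimensional and would come from an image library, the initialization here is deliberately naive (the first k points), and all function names are hypothetical.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )

def kmeans(points, k, iterations=10):
    """Plain Lloyd iterations; the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for idx, group in enumerate(groups):
            if group:  # keep the old center if a cluster becomes empty
                centers[idx] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def bag_of_visual_words(descriptors, centers):
    """Histogram of visual-word occurrences for one image."""
    histogram = [0] * len(centers)
    for d in descriptors:
        histogram[nearest(d, centers)] += 1
    return histogram

# Toy 2-D "descriptors" (assumed data) forming two obvious groups.
training_descriptors = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
vocabulary = kmeans(training_descriptors, k=2)
histogram = bag_of_visual_words([(0, 0), (9, 9), (10, 12)], vocabulary)
```

The cluster centers play the role of the visual words (prototype features), and the histogram is one simple way to summarize an image in terms of that vocabulary.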

SLIDE 10

Data set

• Approx. 1000 images (some duplicated) and associated captions, harvested from German news websites
• Over 50 different person names detected in the captions by the NER algorithm:
  ◦ 81% precision and 87% recall vs. ground-truth
• Approx. 175,000 interest point descriptors extracted from the images with the SIFT algorithm

SLIDE 11

Current experiments

• Used a standard classification procedure:
  ◦ Partitioned the data set into 6 stratified subsets: 5 cross-validation sets and a test-only set
  ◦ Trained with respect to the F1-measure (the harmonic mean of precision and recall)
  ◦ Used the simplex search algorithm of Lagarias et al. (1998) for objective function maximization
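
For reference, the F1-measure used as the training objective can be computed as in the short sketch below. The function names are illustrative only; the formula itself is the standard harmonic mean of precision and recall.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(true_positives, false_positives, false_negatives):
    """F1 computed from raw detection counts."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return f1_from_precision_recall(precision, recall)

# Precision and recall reported for the NER algorithm on the "Data set" slide.
ner_f1 = f1_from_precision_recall(0.81, 0.87)
```

Plugging in the 81% precision and 87% recall reported for the NER algorithm gives an F1-measure of roughly 84%. That figure is derived here for illustration and is not stated on the slides.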

SLIDE 12

Transfer functions

• Defined several methods for calculating association probabilities between keywords and visual prototypes:
  ◦ Use the significance of the chi-square test contrasting the within-cluster (-prototype) and global averages
  ◦ Apply a sigmoid function to the ratio of these averages
  ◦ Apply a sigmoid to the logarithm of the ratio
  ◦ Simply truncate the ratio to an interval centered at or near 1, and then map to the unit interval
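
Three of the four transfer functions above can be sketched as follows. The slides do not give the exact parameterization, so the `slope`, `midpoint`, `center`, and `half_width` parameters are assumptions introduced here for illustration (presumably values of this kind are what the training step tunes). The chi-square variant is left out because the slides do not specify how the test is set up.

```python
import math

def sigmoid(x, slope=1.0, midpoint=0.0):
    """Logistic function with an assumed slope and midpoint."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))

def sigmoid_of_ratio(within_avg, global_avg, slope=1.0, midpoint=1.0):
    """Sigmoid applied to the ratio within-cluster / global average."""
    return sigmoid(within_avg / global_avg, slope, midpoint)

def sigmoid_of_log_ratio(within_avg, global_avg, slope=1.0, midpoint=0.0):
    """Sigmoid applied to the logarithm of the ratio."""
    if within_avg == 0:
        return 0.0  # limit as log(ratio) tends to minus infinity (slope > 0)
    return sigmoid(math.log(within_avg / global_avg), slope, midpoint)

def truncated_ratio(within_avg, global_avg, center=1.0, half_width=1.0):
    """Clip the ratio to [center - half_width, center + half_width],
    then map that interval linearly onto [0, 1]."""
    low, high = center - half_width, center + half_width
    clipped = min(max(within_avg / global_avg, low), high)
    return (clipped - low) / (high - low)

# A cluster whose average equals the global average carries no evidence.
neutral = truncated_ratio(0.5, 0.5)
# A cluster far above the global average saturates at the top of the range.
saturated = truncated_ratio(0.9, 0.1)
```

With these default parameters, a ratio of exactly 1 maps to 0.5 under all three functions, which matches the intuition that a visual word whose within-cluster average equals the global average says nothing about the keyword. All functions assume a strictly positive global average.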

SLIDE 13

Experiment 1 - classifying procedures

• Used visual vocabularies of 100 words (clusters), obtained with the k-means algorithm
• Tested the four methods for calculating the degrees of association between visual prototypes and keywords
• Tested three training strategies: for each keyword separately, globally, and with predefined parameters
• Trained using ground-truth or caption-based associations

SLIDE 14

Experiment 1 - results

• Minor differences between the four averaging methods
• Best results obtained when using ground-truth data, and training each concept separately:
  ◦ F1-score of 56% at training and 34% at testing

SLIDE 15

Experiment 2 - vocabulary size

• Different clustering algorithms and numbers of clusters:
  ◦ K-means with 100 clusters (6 hrs)
  ◦ K-medians with 100 clusters (10 hrs)
  ◦ TwoStep (SPSS algorithm for large data sets) with 100, 500, 1000, and 2000 clusters (10 min to 2 hrs)
• Using caption-based data only (realistic setting), and training each concept separately (best performance)

SLIDE 16

Experiment 2 - results

• Performance increased with the number of clusters, with close to perfect training at approximately 2000 clusters
• (Data did not have enough variance to produce more clusters with the default settings for TwoStep)

SLIDE 17

Experiment 3

• Repeated the first experiment (testing different classifiers) at the optimal vocabulary size:
  ◦ Significantly improved results, with F1-scores on the test images of 65% to 71% and close to perfect training

SLIDE 18

Experiment 3 - further results

• Best performance using ground-truth data, training each concept separately: F1-score of 71% on test images
• No difference between training each concept separately and training globally when using the captions as source data or measuring the performance on test images
• The impact of training data (ground-truth vs. captions-based) is significantly reduced on testing images

SLIDE 19

Some examples

[Figure: example images. Panel titles: Training; Testing]

SLIDE 20

Summary

• Presented a novel image classification approach, relying on keywords extracted from captions, and on a visual vocabulary of features extracted from the actual images
• Described a method of propagating associations with keywords from training images to visual prototypes, and subsequently to test images
• Successfully applied the developed methodology in a person classification task

SLIDE 21

Future work

• Generic keyword detector (in progress):
  ◦ Clusters of stemmed words, obtained via latent semantic analysis (LSA, Deerwester et al. 1990) performed on the TF-IDF weighted term-image matrix
  ◦ Using an enlarged data set of approx. 150,000 images
• Include color data and spatial information about the (relative) position of (groups of) visual features
• Further generalization with respect to concepts (e.g. news category) and data (unstructured websites)
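
To make the "TF-IDF weighted term-image matrix" concrete, here is a small sketch that builds such a matrix from a few toy captions. It uses one common weighting variant (raw term count times the logarithm of the inverse document frequency); the slides do not say which variant is intended. Stemming and the subsequent LSA step (a truncated singular value decomposition of this matrix) are omitted. The function name and the captions are made up for illustration.

```python
import math
from collections import Counter

def tfidf_term_image_matrix(captions):
    """Return (vocabulary, matrix) with one row per term, one column per image.

    Entry [t][i] = count of term t in caption i * log(N / document frequency).
    """
    docs = [Counter(caption.lower().split()) for caption in captions]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: number of captions that contain the term.
    doc_freq = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    matrix = [
        [d[t] * math.log(n_docs / doc_freq[t]) for d in docs]
        for t in vocabulary
    ]
    return vocabulary, matrix

# Toy captions (assumed data), one per image.
toy_captions = [
    "chancellor visits berlin",
    "chancellor speech",
    "berlin marathon",
]
vocabulary, matrix = tfidf_term_image_matrix(toy_captions)
```

Terms that appear in only one caption (such as "marathon") receive the largest weight, while terms shared by several captions are down-weighted. A term appearing in every caption would get weight zero under this variant.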

SLIDE 22

References

Drozdzynski W., Krieger H.-U., Piskorski J., Schäfer U., and Xu F. (2004). Shallow Processing with Unification and Typed Feature Structures - Foundations and Applications. Künstliche Intelligenz, vol. 1, pp. 17-23.

Lagarias J. C., Reeds J. A., Wright M. H., and Wright P. E. (1998). Convergence Properties of the Nelder-Mead Simplex Method in Low Dimensions. SIAM Journal on Optimization, vol. 9(1), pp. 112-147.

Lowe D. G. (1999). Object Recognition from Local Scale-Invariant Features. Proceedings of the International Conference on Computer Vision, vol. 2, p. 1150.

Salton G., and Buckley C. (1988). Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, vol. 24, pp. 513-523.

Sivic J., and Zisserman A. (2003). Video Google: A text retrieval approach to object matching in videos. Proceedings of the 9th IEEE International Conference on Computer Vision, Nice, France, pp. 1470-1477.

Preliminary results in:

Jacobs A., Herzog O., Wilhelm A.F.X., and Ilies I. (2008). Relaxation-based data mining on images and text from news web sites. 4th World Conference of the IASC, Yokohama, Japan, pp. 736-743.

Ilies I., Jacobs A., Wilhelm A.F.X., and Herzog O. (2009). Classification of News Images Using Captions and a Visual Vocabulary. Technical Report No. 50, TZI, Universität Bremen.

SLIDE 23

Thank you!

Any questions?
