Searching for patterns in the World Color Survey, Gerhard Jäger



slide-1
SLIDE 1

Searching for patterns in the World Color Survey

Gerhard Jäger
gerhard.jaeger@uni-tuebingen.de

July 2, 2009

University of Frankfurt

1/115

slide-2
SLIDE 2

Overview

Structure of the talk:
the psychological color space
Berlin and Kay’s 1969 study
the World Color Survey
the distribution of focal colors
categorization
Principal Component Analysis
clustering
color categories are (more or less) convex

2/115

slide-3
SLIDE 3

The psychological color space

physical color space has infinite dimensionality: every wavelength within the visible spectrum is one dimension
psychological color space is only 3-dimensional
this fact is employed in technical devices like computer screens (additive color space) or color printers (subtractive color space)
[Figures: additive color space; subtractive color space]

3/115

slide-4
SLIDE 4

The psychological color space

a psychologically correct color space should correctly represent not only the topology of colors but also the distances between them
distance is an inverse function of perceived similarity
L*a*b* color space has this property
three axes: black–white, red–green, blue–yellow
irregularly shaped 3d color solid
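As a minimal sketch of what "distance" means here, the following assumes plain Euclidean distance between (L*, a*, b*) triples (this is the CIE76 color difference; the slides do not say which metric is used). The three coordinates are illustrative values, not measured Munsell chips, and `lab_distance` is a hypothetical helper name.

```python
import math

def lab_distance(c1, c2):
    """Euclidean distance between two (L*, a*, b*) triples."""
    return math.dist(c1, c2)

# Illustrative coordinates only (assumption), not real chip measurements.
red_like = (50.0, 60.0, 40.0)
orange_like = (60.0, 40.0, 60.0)
blue_like = (40.0, 10.0, -50.0)

d_near = lab_distance(red_like, orange_like)  # similar colors: small distance
d_far = lab_distance(red_like, blue_like)     # dissimilar colors: large distance
print(d_near, d_far)
```

Under this reading, a smaller distance corresponds to higher perceived similarity.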

4/115

slide-5
SLIDE 5

The color solid

5/115

slide-6
SLIDE 6

The Munsell chart

for psychological investigations, the Munsell chart is used
2d-rendering of the surface of the color solid:
  8 levels of lightness
  40 hues
plus: black–white axis with 8 shades of grey in between
neighboring chips differ in the minimally perceivable way

6/115

slide-7
SLIDE 7

Berlin and Kay 1969

pilot study of how different languages carve up the color space into categories
informants: speakers of 20 typologically distant languages (who happened to be around the Bay Area at the time)
questions (using the Munsell chart):
  What are the basic color terms of your native language?
  What is the extension of these terms?
  What are the prototypical instances of these terms?
results are not random: they indicate that there are universal tendencies in color naming systems

7/115

slide-8
SLIDE 8

Berlin and Kay 1969

distribution of focal colors: essentially correspond to the centers of the English categories black, white, red, green, yellow, blue, purple, orange, brown, grey, pink

8/115

slide-9
SLIDE 9

Berlin and Kay 1969

extensions

Arabic 9/115

slide-10
SLIDE 10

Berlin and Kay 1969

extensions

Bahasa Indonesia 10/115

slide-11
SLIDE 11

Berlin and Kay 1969

extensions

Bulgarian 11/115

slide-12
SLIDE 12

Berlin and Kay 1969

extensions

Cantonese 12/115

slide-13
SLIDE 13

Berlin and Kay 1969

extensions

Catalan 13/115

slide-14
SLIDE 14

Berlin and Kay 1969

extensions

English 14/115

slide-15
SLIDE 15

Berlin and Kay 1969

extensions

Hebrew 15/115

slide-16
SLIDE 16

Berlin and Kay 1969

extensions

Hungarian 16/115

slide-17
SLIDE 17

Berlin and Kay 1969

extensions

Ibibio 17/115

slide-18
SLIDE 18

Berlin and Kay 1969

extensions

Japanese 18/115

slide-19
SLIDE 19

Berlin and Kay 1969

extensions

Korean 19/115

slide-20
SLIDE 20

Berlin and Kay 1969

extensions

Mandarin 20/115

slide-21
SLIDE 21

Berlin and Kay 1969

extensions

Mexican Spanish 21/115

slide-22
SLIDE 22

Berlin and Kay 1969

extensions

Pomo 22/115

slide-23
SLIDE 23

Berlin and Kay 1969

extensions

Swahili 23/115

slide-24
SLIDE 24

Berlin and Kay 1969

extensions

Tagalog 24/115

slide-25
SLIDE 25

Berlin and Kay 1969

extensions

Thai 25/115

slide-26
SLIDE 26

Berlin and Kay 1969

extensions

Tzeltal 26/115

slide-27
SLIDE 27

Berlin and Kay 1969

extensions

Urdu 27/115

slide-28
SLIDE 28

Berlin and Kay 1969

extensions

Vietnamese 28/115

slide-29
SLIDE 29

Berlin and Kay 1969

identification of absolute and implicational universals, like:
  all languages have words for black and white
  if a language has a word for yellow, it has a word for red
  if a language has a word for pink, it has a word for blue
  ...
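Universals of this kind can be checked mechanically against term inventories. The sketch below is only an illustration: the three inventories are hypothetical toy data, not Berlin and Kay's findings, and the function names are made up for this example.

```python
def holds_absolute(inventories, required):
    """True if every language has all terms in `required`."""
    return all(required <= terms for terms in inventories.values())

def holds_implication(inventories, antecedent, consequent):
    """True if every language that has `antecedent` also has `consequent`."""
    return all(consequent in terms
               for terms in inventories.values()
               if antecedent in terms)

# Hypothetical toy inventories (assumption, for illustration only).
inventories = {
    "lang_A": {"black", "white"},
    "lang_B": {"black", "white", "red"},
    "lang_C": {"black", "white", "red", "yellow", "green", "blue", "pink"},
}

print(holds_absolute(inventories, {"black", "white"}))
print(holds_implication(inventories, "yellow", "red"))
print(holds_implication(inventories, "red", "yellow"))
```

The last call shows that an implication is directional: in this toy data "yellow implies red" holds while "red implies yellow" does not.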

29/115

slide-30
SLIDE 30

The World Color Survey

B&K was criticized for methodological reasons
in response, in 1976 Kay and co-workers launched the World Color Survey
investigation of 110 non-written languages from around the world
around 25 informants per language
two tasks:
  the 330 Munsell chips were presented to each test person one after the other in random order; they had to assign each chip to some basic color term from their native language
  for each native basic color term, each informant identified the prototypical instance(s)
data are publicly available at http://www.icsi.berkeley.edu/wcs/

30/115

slide-31
SLIDE 31

Data digging in the WCS

distribution of focal colors across all informants:

[Figure: Distribution of focal colors; axes: Munsell chips, # named as focal color]

31/115

slide-32
SLIDE 32

Data digging in the WCS

distribution of focal colors across all informants:

32/115

slide-33
SLIDE 33

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

33/115

slide-34
SLIDE 34

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

34/115

slide-35
SLIDE 35

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

35/115

slide-36
SLIDE 36

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

36/115

slide-37
SLIDE 37

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

37/115

slide-38
SLIDE 38

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

38/115

slide-39
SLIDE 39

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

39/115

slide-40
SLIDE 40

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

40/115

slide-41
SLIDE 41

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

41/115

slide-42
SLIDE 42

Data digging in the WCS

partition of a randomly chosen informant from a randomly chosen language

42/115

slide-43
SLIDE 43

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

43/115

slide-44
SLIDE 44

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

44/115

slide-45
SLIDE 45

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

45/115

slide-46
SLIDE 46

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

46/115

slide-47
SLIDE 47

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

47/115

slide-48
SLIDE 48

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

48/115

slide-49
SLIDE 49

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

49/115

slide-50
SLIDE 50

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

50/115

slide-51
SLIDE 51

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

51/115

slide-52
SLIDE 52

Data digging in the WCS

extension of a randomly chosen term from a randomly chosen language, averaged over all informants from that language

52/115

slide-53
SLIDE 53

What is the extension of categories?

data from individual informants are extremely noisy
averaging over all informants from a language helps, but there is still noise, plus dialectal variation
desirable: distinction between “genuine” variation and noise
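The averaging step can be sketched as follows, assuming each informant's naming data is a list with one term per chip. The three informants and four chips are hypothetical toy data, and `averaged_extension` is an illustrative name, not part of the WCS tooling.

```python
def averaged_extension(responses, term):
    """For each chip, the fraction of informants who named it with `term`."""
    n_informants = len(responses)
    n_chips = len(responses[0])
    return [sum(r[chip] == term for r in responses) / n_informants
            for chip in range(n_chips)]

# Hypothetical data: 3 informants, 4 chips, two terms "A" and "B".
responses = [
    ["A", "A", "B", "B"],
    ["A", "B", "B", "B"],
    ["A", "A", "A", "B"],
]

ext_A = averaged_extension(responses, "A")
print(ext_A)
```

Chips on which informants disagree receive intermediate values, which is where noise and genuine variation are hard to tell apart.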

53/115

slide-54
SLIDE 54

Principal Component Analysis

technique to reduce dimensionality of data
input: set of vectors in an n-dimensional space
first step: rotate the coordinate system, such that
  the new n coordinates are orthogonal to each other
  the variations of the data along the new coordinates are uncorrelated
second step:
  choose a suitable m < n
  project the data on those m new coordinates where the data have the highest variance
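The two steps can be sketched without external libraries. The version below is a minimal illustration, not the procedure used for the talk: it finds the leading directions of the covariance matrix by power iteration with deflation (a swapped-in technique chosen to keep the example dependency-free; a numerical library would normally be used), and the six 3-dimensional points are hypothetical.

```python
import math
import random

def center(data):
    """Subtract the per-dimension mean from every row."""
    n = len(data[0])
    means = [sum(row[j] for row in data) / len(data) for j in range(n)]
    return [[row[j] - means[j] for j in range(n)] for row in data]

def covariance(centered):
    """Sample covariance matrix (n x n) of already-centered rows."""
    n, k = len(centered[0]), len(centered)
    return [[sum(r[i] * r[j] for r in centered) / (k - 1) for j in range(n)]
            for i in range(n)]

def top_components(cov, m, iters=200, seed=0):
    """Leading m eigenpairs of a symmetric matrix (power iteration + deflation)."""
    n = len(cov)
    work = [row[:] for row in cov]
    rng = random.Random(seed)
    variances, components = [], []
    for _ in range(m):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(n))
                  for i in range(n))
        variances.append(lam)
        components.append(v)
        # Deflation: remove the direction just found before the next pass.
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return variances, components

def project(centered, components):
    """Coordinates of each row on the chosen m components."""
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]

# Hypothetical data: most variance along (1, 1, 0), little along the third axis.
data = [[2, 2, 0], [-2, -2, 0], [0.5, -0.5, 0],
        [-0.5, 0.5, 0], [0, 0, 0.1], [0, 0, -0.1]]
centered = center(data)
variances, components = top_components(covariance(centered), m=2)
scores = project(centered, components)
print(variances)
```

Here n = 3 and m = 2: the third axis carries almost no variance and is dropped by the projection.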

54/115

slide-55
SLIDE 55

Principal Component Analysis

alternative formulation:
  choose an m-dimensional linear sub-manifold of your n-dimensional space
  project your data onto this manifold
  when doing so, pick your sub-manifold such that the average squared distance of the data points from the sub-manifold is minimized
intuition behind this formulation:
  data are “actually” generated in an m-dimensional space
  observations are disturbed by n-dimensional noise
  PCA is a way to reconstruct the underlying data distribution
applications: picture recognition, latent semantic analysis, statistical data analysis in general, data visualization, ...

55/115

slide-56
SLIDE 56

Applying PCA to WCS-categories

data: informant-category pairs
330 dimensions (each Munsell color is one dimension)
each informant-category pair assigns 1 to the colors that belong to that category, and 0 otherwise
[Figure: proportion of variance explained per principal component]
first seven principal components jointly explain 60% of the variance in the data
each PC after PC10 only marginally increases the proportion of variance explained
so let’s say m = 10
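The 0/1 encoding of informant-category pairs described on this slide might look as follows. This is a sketch under assumptions: an informant's data is taken to be a list of 330 terms, one per chip, and the two-term informant and the name `category_vectors` are hypothetical.

```python
def category_vectors(naming, n_chips=330):
    """Turn one informant's naming data into one 0/1 vector per category.

    naming: list of length n_chips, the term used for each Munsell chip.
    Returns {term: [0 or 1 for each chip]}.
    """
    if len(naming) != n_chips:
        raise ValueError("expected one term per chip")
    return {term: [1 if t == term else 0 for t in naming]
            for term in set(naming)}

# Hypothetical informant who uses only two terms (assumption).
naming = ["dark"] * 200 + ["light"] * 130
vectors = category_vectors(naming)
print({term: sum(vec) for term, vec in vectors.items()})
```

Each resulting vector is one data point in the 330-dimensional space to which PCA is applied.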

56/115

slide-57
SLIDE 57

PC1

green/blue vs. white/red/yellow

57/115

slide-58
SLIDE 58

PC2

white vs. red

58/115

slide-59
SLIDE 59

PC3

black vs. red/white

59/115

slide-60
SLIDE 60

PC4

yellow vs. black/white/blue/red

60/115

slide-61
SLIDE 61

PC5

black vs. red/green/blue

61/115

slide-62
SLIDE 62

PC6

blue/yellow vs. red/green

62/115

slide-63
SLIDE 63

PC7

purple vs. red/blue/black

63/115

slide-64
SLIDE 64

PC8

pink vs. red/yellow/white

64/115

slide-65
SLIDE 65

PC9

brown vs. black/pink

65/115

slide-66
SLIDE 66

PC10

brown vs. light blue/yellow/black

66/115

slide-67
SLIDE 67

Projecting observed data on 10d-manifold

noise removal: project observed data onto the lower-dimensional submanifold that was obtained via PCA
in our case: noisy binary categories are mapped to smoothed fuzzy categories (= probability distributions over Munsell chips)
some examples:

67/115

slide-68
SLIDE 68

Projecting observed data on 10d-manifold

68/115

slide-69
SLIDE 69

Projecting observed data on 10d-manifold

69/115

slide-70
SLIDE 70

Projecting observed data on 10d-manifold

70/115

slide-71
SLIDE 71

Projecting observed data on 10d-manifold

71/115

slide-72
SLIDE 72

Projecting observed data on 10d-manifold

72/115

slide-73
SLIDE 73

Projecting observed data on 10d-manifold

73/115

slide-74
SLIDE 74

Projecting observed data on 10d-manifold

74/115

slide-75
SLIDE 75

Projecting observed data on 10d-manifold

75/115

slide-76
SLIDE 76

Projecting observed data on 10d-manifold

76/115

slide-77
SLIDE 77

Projecting observed data on 10d-manifold

77/115

slide-78
SLIDE 78

Projecting observed data on 10d-manifold

78/115

slide-79
SLIDE 79

Projecting observed data on 10d-manifold

79/115

slide-80
SLIDE 80

Projecting observed data on 10d-manifold

80/115

slide-81
SLIDE 81

Projecting observed data on 10d-manifold

81/115

slide-82
SLIDE 82

Projecting observed data on 10d-manifold

82/115

slide-83
SLIDE 83

Projecting observed data on 10d-manifold

83/115

slide-84
SLIDE 84

Projecting observed data on 10d-manifold

84/115

slide-85
SLIDE 85

Projecting observed data on 10d-manifold

85/115

slide-86
SLIDE 86

Projecting observed data on 10d-manifold

86/115

slide-87
SLIDE 87

Projecting observed data on 10d-manifold

87/115

slide-88
SLIDE 88

Smoothed partitions of the color space

vocabulary of a given language does not always form a partition
many cases of (near) synonymy, hyponymy, and overlap
for instance language 1 (Abidji, Ivory Coast):

88/115

slide-89
SLIDE 89

Smoothed partitions of the color space

89/115

slide-90
SLIDE 90

Smoothed partitions of the color space

if two categories of one language have a correlation of at least .5, they are treated as synonyms
process is repeated until the remaining categories are independent or negatively correlated
after this process, each Munsell chip c is assigned to the category that assigns the highest probability to c
for Abidji, we get
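A rough sketch of this merge-then-assign procedure follows. Several details are assumptions not stated on the slide: merged synonyms are represented by the average of their fuzzy extensions, the most strongly correlated pair is merged first, and merging stops once no pair reaches the threshold. The three categories over four chips are hypothetical toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def merge_synonyms(categories, threshold=0.5):
    """Repeatedly merge the most correlated pair until none reaches threshold."""
    cats = dict(categories)
    while True:
        names = sorted(cats)
        best = None
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                r = pearson(cats[a], cats[b])
                if r >= threshold and (best is None or r > best[0]):
                    best = (r, a, b)
        if best is None:
            return cats
        _, a, b = best
        # Assumption: a merged category is the average of its two members.
        cats[a + "+" + b] = [(p + q) / 2
                             for p, q in zip(cats.pop(a), cats.pop(b))]

def partition(cats):
    """Assign each chip to the category giving it the highest value."""
    names = sorted(cats)
    n_chips = len(cats[names[0]])
    return [max(names, key=lambda k: cats[k][c]) for c in range(n_chips)]

# Hypothetical fuzzy categories over 4 chips; t1 and t2 are near-synonyms.
cats = {
    "t1": [0.9, 0.8, 0.1, 0.0],
    "t2": [0.8, 0.9, 0.2, 0.1],
    "t3": [0.0, 0.1, 0.9, 0.8],
}
merged = merge_synonyms(cats)
cells = partition(merged)
print(sorted(merged), cells)
```

The result is a proper partition: every chip gets exactly one label, even though the original vocabulary overlapped.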

90/115

slide-91
SLIDE 91

Smoothed partitions of the color space

some more examples: Waorani (Ecuador)

91/115

slide-92
SLIDE 92

Smoothed partitions of the color space

some more examples: Arabela (Peru)

92/115

slide-93
SLIDE 93

Smoothed partitions of the color space

some more examples: Camsa (Colombia)

93/115

slide-94
SLIDE 94

Smoothed partitions of the color space

some more examples: Candoshi (Peru)

94/115

slide-95
SLIDE 95

Smoothed partitions of the color space

some more examples: Chinanteco (Mexico)

95/115

slide-96
SLIDE 96

Smoothed partitions of the color space

some more examples: Guarijio (Mexico)

96/115

slide-97
SLIDE 97

Smoothed partitions of the color space

some more examples: Gunu (Cameroon)

97/115

slide-98
SLIDE 98

Smoothed partitions of the color space

some more examples: Kalam (Papua New Guinea)

98/115

slide-99
SLIDE 99

Smoothed partitions of the color space

some more examples: Menye (Papua New Guinea)

99/115

slide-100
SLIDE 100

Smoothed partitions of the color space

some more examples: Tifal (Papua New Guinea)

100/115

slide-101
SLIDE 101

Convexity

note: so far, we only used information from the WCS
the location of the 330 Munsell chips in L*a*b* space played no role so far
still, apparently partition cells always form continuous clusters in L*a*b* space
Hypothesis (Gärdenfors): extensions of color terms always form convex regions of L*a*b* space

101/115

slide-102
SLIDE 102

Support Vector Machines

supervised learning technique
smart algorithm to classify data in a high-dimensional space by a (for instance) linear boundary
minimizes number of mis-classifications if the training data are not linearly separable
[Figure: SVM classification plot; classes green and red, axes x and y]

102/115

slide-103
SLIDE 103

Convex partitions

a binary linear classifier divides an n-dimensional space into two convex half-spaces
intersection of two convex sets is itself convex
hence: intersection of k binary classifications leads to convex sets
procedure: if a language partitions the Munsell space into m categories, train m(m−1)/2 many binary SVMs, one for each pair of categories in L*a*b* space
leads to m convex sets (which need not split the L*a*b* space exhaustively)
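The construction of convex cells from pairwise linear boundaries can be illustrated as below. The boundaries here are hand-picked, hypothetical lines in a 2-dimensional toy space, not trained SVMs and not L*a*b* coordinates; the point is only that a cell defined as an intersection of half-spaces is convex.

```python
from itertools import combinations

def n_pairwise_classifiers(m):
    """Number of category pairs: m(m-1)/2."""
    return m * (m - 1) // 2

def in_region(point, category, classifiers):
    """True if `point` wins every pairwise decision involving `category`.

    classifiers: {(a, b): (w, bias)}, where w.x + bias > 0 means "a, not b".
    """
    for (a, b), (w, bias) in classifiers.items():
        score = sum(wi * xi for wi, xi in zip(w, point)) + bias
        if a == category and score <= 0:
            return False
        if b == category and score >= 0:
            return False
    return True

categories = ["A", "B", "C"]
pairs = list(combinations(categories, 2))

# Hypothetical linear boundaries (assumption), one per pair of categories.
classifiers = {
    ("A", "B"): ((1.0, 0.0), 0.0),   # A where x > 0
    ("A", "C"): ((0.0, 1.0), 0.0),   # A where y > 0
    ("B", "C"): ((-1.0, 1.0), 0.0),  # B where y > x
}

p, q = (1.0, 1.0), (3.0, 5.0)
midpoint = tuple((a + b) / 2 for a, b in zip(p, q))
print(len(pairs), in_region(midpoint, "A", classifiers))
```

Since both `p` and `q` lie in the cell for A, so does their midpoint, as convexity requires.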

103/115

slide-104
SLIDE 104

Convex approximation

Waorani (Ecuador)

104/115

slide-105
SLIDE 105

Convex approximation

Arabela (Peru)

105/115

slide-106
SLIDE 106

Convex approximation

Camsa (Colombia)

106/115

slide-107
SLIDE 107

Convex approximation

Candoshi (Peru)

107/115

slide-108
SLIDE 108

Convex approximation

Chinanteco (Mexico)

108/115

slide-109
SLIDE 109

Convex approximation

Guarijio (Mexico)

109/115

slide-110
SLIDE 110

Convex approximation

Gunu (Cameroon)

110/115

slide-111
SLIDE 111

Convex approximation

Kalam (Papua New Guinea)

111/115

slide-112
SLIDE 112

Convex approximation

Menye (Papua New Guinea)

112/115

slide-113
SLIDE 113

Convex approximation

Tifal (Papua New Guinea)

113/115

slide-114
SLIDE 114

Convex approximation

on average, 93.7% of all Munsell chips are correctly classified by convex approximation
[Figure: proportion of correctly classified Munsell chips]

114/115

slide-115
SLIDE 115

Convex approximation

compare to the outcome of the same procedure without PCA:

[Figure: proportion of correctly classified Munsell chips, without PCA vs. with PCA]

115/115

slide-116
SLIDE 116

Conclusion

empirical support for Gärdenfors’ thesis that natural properties are convex sets
quantitative data analysis reveals robust universal tendencies
techniques from statistical pattern recognition are useful for typological studies
R is a great tool

116/115