Representativeness of Knowledge Bases with the Generalized Benford's Law


SLIDE 1

Representativeness of Knowledge Bases with the Generalized Benford's Law

Arnaud Soulet, Arnaud Giacometti, Béatrice Markhoff and Fabian M. Suchanek
University of Tours, Telecom ParisTech

SLIDE 2

Reliability of queries on Knowledge Bases

Statistical query: How many cities are small (<1k inhabitants) in France/Yemen?

[Auer et al., 2007]

ISWC 2018 - Monterey, CA

SLIDE 3

Reliability of queries on Knowledge Bases

Statistical query

Does Yemen really not have any small cities?

[Auer et al., 2007]

SLIDE 4

Reliability of queries on Knowledge Bases

Crowdsourcing

Voluntary bias [Callahan and Herring, 2011; Wagner et al., 2015]

Statistical query

SLIDE 5

Reliability of queries on Knowledge Bases

Crowdsourcing

Voluntary bias [Callahan and Herring, 2011; Wagner et al., 2015]

Statistical query

We do not know the KB biases, but can the statistics give us a hint?

SLIDE 6

Missing facts

Several methods exist for estimating the completeness of facts [Darari et al., 2016; Galarraga et al., 2017; Lajus and Suchanek, 2018; Razniewski et al., 2015; Razniewski et al., 2016]

Yemeni city: Sanaa, Aden, Taiz

Population: 1,937,451; 760,923; 615,222; missing

SLIDE 7

Missing facts ≠ Missing entities + missing facts

Missing facts due to missing entities are ignored!

Yemeni city: Sanaa, Aden, Taiz, Haid al-Jazil

Population: 1,937,451; 760,923; 615,222; few; missing; missing; missing

SLIDE 8

Completeness

= #present facts / (#present facts + #missing facts)

Assuming that 𝒦* is an ideal KB (= correct + complete):

[Figure: two panels over small cities and big cities, one comparing 𝒦* with 𝒦1 and one comparing 𝒦* with 𝒦2]

Which KB is better for statistical queries, 𝒦1 or 𝒦2?

SLIDE 9

Completeness ≠ Representativeness

Representativeness is more important than completeness for statistics!

Assuming that 𝒦* is an ideal KB (= correct + complete):

[Figure: two panels over small cities and big cities]

𝒦* vs. 𝒦1: more complete, less representative

𝒦* vs. 𝒦2: less complete, more representative

SLIDE 10

Representativeness of Knowledge Bases

A KB 𝒦 is representative of 𝒦* iff the distribution remains the same for all uniform-sampling invariant measures.

[Figure: distributions over 𝒦* and 𝒦 compared measure by measure (∼), for example cities with <1k inhab. vs. ≥1k inhab.]
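To make the contrast between completeness and representativeness concrete, here is a minimal Python sketch. The city counts are hypothetical and chosen only for illustration; they are not taken from the talk. It compares two partial KBs with an ideal one on a single statistic, the share of small cities.

```python
def completeness(present, ideal_total):
    """Fraction of the ideal KB's facts that are present."""
    return present / ideal_total


def share_small(small, big):
    """Share of small cities among all cities of a KB."""
    return small / (small + big)


# Hypothetical (small cities, big cities) counts; not from the slides.
ideal = (80, 20)
kb1 = (30, 20)  # more complete, but big cities are over-represented
kb2 = (32, 8)   # less complete, but keeps the ideal proportions

for name, (small, big) in [("ideal", ideal), ("KB1", kb1), ("KB2", kb2)]:
    print(name,
          "completeness:", completeness(small + big, sum(ideal)),
          "share of small cities:", share_small(small, big))
```

Under these assumed counts, KB1 is more complete (0.5 vs. 0.4) yet answers the statistical query with 0.6 instead of the ideal 0.8, while KB2 returns the ideal share.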

SLIDE 11

Representativeness of Knowledge Bases

A KB 𝒦 is representative of 𝒦* iff the distribution remains the same for all uniform-sampling invariant measures.

[Figure: distributions over 𝒦* and 𝒦 compared measure by measure (∼), for example cities with <1k inhab. vs. ≥1k inhab.]

Challenge: How to estimate the representativeness?

The ideal knowledge base 𝒦* is unknown!

SLIDE 12

Example: population of capitals

Abidjan, Bangkok, Conakry, Kingston, Mogadishu, Santiago
Abuja, Beijing, Dakar, Kinshasa, Montevideo, Seoul
Accra, Belgrade, Damascus, Kuala Lumpur, Nairobi, Sofia
Addis Ababa, Berlin, Dhaka, Lilongwe, Niamey, Taipei
Algiers, Bogota, Doha, Lima, Ouagadougou, Tashkent
Amman, Brasilia, Erbil, London, Paris, Tbilisi
Ankara, Brazzaville, Freetown, Luanda, Phnom Penh, Tegucigalpa
Antananarivo, Bucharest, Havana, Lusaka, Prague, Tokyo
Ashgabat, Budapest, Islamabad, Madrid, Pyongyang, Tripoli
Bahawalpur, Buenos Aires, Jakarta, Managua, Quito, Tunis
Baku, Cairo, Kabul, Maputo, Riyadh, Ulaanbaatar
Bamako, Caracas, Khartoum, Mexico City, Sana'a, Vienna

SLIDE 13

Example: population of capitals

4 707 404, 8 280 925, 1 660 973, 1 041 084, 1 750 000, 6 158 080
1 235 880, 21 700 000, 1 146 053, 10 125 000, 1 305 082, 9 971 111
2 291 352, 1 166 763, 1 711 000, 1 768 000, 3 138 369, 1 260 120
3 384 569, 3 610 156, 6 970 105, 1 077 116, 1 302 910, 2 704 974
3 415 811, 7 878 783, 1 351 000, 8 852 000, 1 626 950, 2 309 600
4 007 526, 2 556 149, 1 025 000, 8 673 713, 2 229 621, 1 118 035
4 587 558, 1 827 000, 1 050 301, 2 825 311, 1 501 725, 1 157 509
1 613 375, 1 883 425, 2 106 146, 1 742 979, 1 267 449, 13 617 445
1 031 992, 1 759 407, 1 900 000, 3 141 991, 2 581 076, 1 126 000
1 052 000, 2 890 151, 9 607 787, 2 205 676, 2 671 191, 1 056 247
2 122 300, 10 230 350, 3 678 034, 1 766 184, 7 125 180, 1 372 000
1 809 106, 3 273 863, 5 185 000, 8 918 653, 1 937 451, 1 852 997

SLIDE 14

Example: population of capitals

(same populations as on slide 13)

What is the distribution of the first significant digit of capital inhabitants?
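The question on this slide can be answered mechanically. The following Python sketch extracts the first significant digit of each value and tabulates relative frequencies; the function names are hypothetical, and the sample is just the first six populations listed on the previous slide.

```python
from collections import Counter


def first_significant_digit(n):
    """Return the leading digit of a positive integer."""
    if n <= 0:
        raise ValueError("expected a positive integer")
    while n >= 10:
        n //= 10
    return n


def fsd_distribution(values):
    """Map each digit 1..9 to its relative frequency among the values."""
    counts = Counter(first_significant_digit(v) for v in values)
    total = len(values)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# First six populations listed on the slide.
sample = [4707404, 8280925, 1660973, 1041084, 1750000, 6158080]
print(fsd_distribution(sample))
```

Running the same function over the full list gives the distribution the slide asks about.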

SLIDE 15

Benford's law

[Figure: frequency (0.00 to 0.30) of each first significant digit (1 to 9) for the population of cities, compared with Benford's law]

SLIDE 16

Benford's law

[Figure: frequency (0.00 to 0.30) of each first significant digit (1 to 9) in three panels: Population of cities, Length of rivers, Discharge of rivers]

SLIDE 17

Benford's law

[Figure: frequency (0.00 to 0.30) of each first significant digit (1 to 9) in three panels: Population of cities, Length of rivers, Discharge of rivers]

$P(\text{first digit } X = d) = \log_{10}\left(1 + \frac{1}{d}\right)$

[Newcomb, 1881; Benford, 1938]
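As a quick numerical illustration of Benford's law, this short Python sketch evaluates the probability of each first digit with the base-10 logarithm. The function name is hypothetical.

```python
import math


def benford_probability(d):
    """P(first digit = d) under Benford's law, for d in 1..9."""
    if d not in range(1, 10):
        raise ValueError("d must be an integer between 1 and 9")
    return math.log10(1 + 1 / d)


benford = {d: benford_probability(d) for d in range(1, 10)}
for d, p in benford.items():
    print(d, round(p, 3))
```

The nine probabilities sum to 1 and decrease from about 0.301 for digit 1 to about 0.046 for digit 9, which matches the 0.00 to 0.30 range of the plots.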

SLIDE 18

The Generalized Benford's Law

[Figure: frequency (0.00 to 0.30) of each first significant digit (1 to 9) in three panels: Population of cities, Length of rivers, Discharge of rivers; α → 0 for each panel]

$P(\text{first digit } X = d) = \frac{(1 + d)^{\alpha} - d^{\alpha}}{10^{\alpha} - 1}$

[Hürlimann, 2014]
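The role of the parameter α can be checked numerically. The Python sketch below evaluates the Generalized Benford's Law and falls back to the classic law when α is (numerically) zero, since the formula is then of the form 0/0; the function name and the cutoff `1e-9` are assumptions made for this illustration.

```python
import math


def gbl_probability(d, alpha):
    """P(first digit = d) under the Generalized Benford's Law.

    When alpha is close to 0 the closed form is 0/0, so we return
    its limit, which is the classic Benford's law.
    """
    if abs(alpha) < 1e-9:
        return math.log10(1 + 1 / d)
    return ((d + 1) ** alpha - d ** alpha) / (10 ** alpha - 1)


for alpha in (-0.486, 0.0, 1.0):
    print(alpha, [round(gbl_probability(d, alpha), 3) for d in range(1, 10)])
```

For any nonzero α the numerators telescope to 10^α − 1, so the nine probabilities sum to 1; α = 1 gives the uniform distribution, and negative α puts more weight on digit 1 than Benford's law does.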

SLIDE 19

The Generalized Benford's Law

[Figure: frequency of each first significant digit (1 to 9) in six panels. Population of cities, Length of rivers and Discharge of rivers (frequencies 0.00 to 0.30): α → 0 for each. Actors per movie, Persons per birth place and Out-degree of Wikipedia pages (frequencies 0.00 to 0.75): fitted values α=-0.155, α=-0.149, α=-0.486]

SLIDE 20

Key idea of our method

representativeness = compliance with the Generalized Benford's Law

= #present_facts / (#present_facts + #missing_facts_for_compliance)

[Figure: first-significant-digit distributions (digits 1 to 9) in DBpedia]

Population in France: Representativeness = 97%

Population in Yemen: Representativeness = 79%

SLIDE 21

Our method in supervised context

Using the known distribution of the first significant digit (fsd)

[Figure: distribution of the fsd for the facts of r on 𝒦* and the facts of r on 𝒦, compared with Benford's law]

SLIDE 22

Our method in supervised context

Computing the minimum number of facts for retrieving Benford's law

[Figure: Population in Yemen. Number of facts (50 to 200) per first significant digit (1 to 9) for the facts of r on 𝒦* and the facts of r on 𝒦, compared with Benford's law: 378 present facts, 101 missing facts]

Representativeness = 378 / (378 + 101) = 79%
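One simple way to read "the minimum number of facts for retrieving Benford's law" is the following: facts can only be added, never removed, so look for the smallest total under which every observed digit count fits the target distribution. The Python sketch below implements that reading. It is an assumption made for illustration and not necessarily the authors' exact procedure (the talk speaks of statistical compliance); the function names and the toy counts are hypothetical.

```python
def missing_facts(counts, target):
    """Smallest (real-valued) number of facts to add to match target.

    counts: observed number of facts per class (e.g. per first digit).
    target: target probability per class, same order, summing to 1.
    Since facts can only be added, the total must be at least
    counts[i] / target[i] for every class i.
    """
    total = max(c / p for c, p in zip(counts, target))
    return sum(total * p - c for c, p in zip(counts, target))


def representativeness(present, missing):
    """#present facts / (#present facts + #missing facts for compliance)."""
    return present / (present + missing)


# Toy three-class example with hypothetical counts.
counts = [50, 25, 5]
target = [0.5, 0.25, 0.25]
missing = missing_facts(counts, target)
print(missing, representativeness(sum(counts), missing))

# Figures reported on the slide for the population in Yemen.
print(round(representativeness(378, 101), 2))
```

In the toy example the third class is under-represented, so 20 facts must be added to it and the representativeness is 80 / 100. With the slide's figures the ratio 378 / 479 rounds to 0.79.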

SLIDE 23

Our method in unsupervised context

Learning the parameter α of the Generalized Benford's Law

The ideal distribution is unknown!

[Figure: distribution of the fsd for the facts of r on 𝒦* and the facts of r on 𝒦, compared with the GBL with α=0.12]
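The slide does not say how α is learned. As a rough illustration only, the Python sketch below fits α by a grid search that minimizes the squared error between observed first-digit frequencies and the Generalized Benford's Law. The grid bounds, the squared-error criterion and the function names are all assumptions of this sketch, not the authors' method.

```python
import math


def gbl(d, alpha):
    """Generalized Benford's Law, with the Benford limit at alpha = 0."""
    if abs(alpha) < 1e-9:
        return math.log10(1 + 1 / d)
    return ((d + 1) ** alpha - d ** alpha) / (10 ** alpha - 1)


def fit_alpha(freqs, lo=-2.0, hi=2.0, steps=4000):
    """Grid search for the alpha whose GBL is closest to freqs.

    freqs: dict mapping each digit 1..9 to its observed frequency.
    """
    best_alpha, best_err = None, float("inf")
    for i in range(steps + 1):
        alpha = lo + (hi - lo) * i / steps
        err = sum((freqs[d] - gbl(d, alpha)) ** 2 for d in range(1, 10))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha


# Synthetic check: frequencies generated with alpha = -0.5 are recovered.
observed = {d: gbl(d, -0.5) for d in range(1, 10)}
print(fit_alpha(observed))
```

On a real, incomplete relation the observed frequencies are themselves biased, so a fit like this is only a starting point.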

SLIDE 24

Our method in unsupervised context

Computing the minimum number of facts for retrieving Benford's law

The ideal distribution is unknown!

[Figure: Population in Yemen. Number of facts (50 to 200) per first significant digit (1 to 9) for the facts of r on 𝒦* and the facts of r on 𝒦, compared with the GBL with α=0.12: 378 present facts, 78 missing facts]

Representativeness = 378 / (378 + 78) = 82%

SLIDE 25

Experimental study

Evaluation protocol

  • 1. Take a correct and complete relation as gold standard
  • 2. Degrade the completeness by discarding facts
  • 3. Approximate the representativeness

Gold standard: population of French cities according to government statistics

Degradation:

  • Most-populated: keep the most populated cities (remove the least populated ones)
  • Least-populated: keep the least populated cities (remove the most populated ones)
  • Random: remove cities randomly
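The degradation step of this protocol can be sketched in a few lines of Python. The function name, the `keep` parameter, the fixed random seed and the gold-standard values are hypothetical; the three strategies follow the names used on the slide.

```python
import random


def degrade(populations, keep, strategy, seed=0):
    """Keep `keep` facts from a gold-standard list of populations.

    strategy: "most-populated" keeps the largest cities,
              "least-populated" keeps the smallest cities,
              "random" keeps a uniform random sample.
    """
    if strategy == "most-populated":
        return sorted(populations, reverse=True)[:keep]
    if strategy == "least-populated":
        return sorted(populations)[:keep]
    if strategy == "random":
        return random.Random(seed).sample(populations, keep)
    raise ValueError(f"unknown strategy: {strategy}")


gold = [120, 950, 3400, 18000, 76000, 410000]  # hypothetical populations
for strategy in ("most-populated", "least-populated", "random"):
    kept = degrade(gold, 3, strategy)
    print(strategy, kept, "completeness:", len(kept) / len(gold))
```

All three degraded relations have the same completeness here, which is what lets the protocol isolate the effect of bias on the estimated representativeness.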

SLIDE 26

Population of French cities

Representativeness approximates the bias well

Representativeness is an upper bound of completeness

Most/least-populated degradation: tight bound if number of cities > 22k

Random degradation: the representativeness is high

SLIDE 27

Population of French cities

Learning the parameter α does not perturb the approximation

SLIDE 28

Auditing DBpedia (France)

1,487 relations (out of 2,920) have a distribution statistically compliant with the GBL

Present facts: 117 461 855

Missing facts: 45 972 923

Representativeness: 72%

SLIDE 29

Conclusion

Representativeness is more important than completeness for computing statistics.

First use of Benford's law for approximating the proportion of missing data

The approximate representativeness based on the GBL is an upper bound of the true representativeness and the true completeness.

Future work:

  • How to correct SPARQL queries with representativeness information?
  • How to scale up the approach to audit the LOD?