Author profiling (Zhiming Zhou, 2017/6/26), PowerPoint PPT Presentation

SLIDE 1

Author profiling

Zhiming Zhou

2017/6/26

SLIDE 2

OUTLINE

  • Introduction
  • Previous work
  • System overview
  • Algorithm
  • Conclusion

SLIDE 5

PREVIOUS WORK

SLIDE 7

SYSTEM OVERVIEW

  • Maximize the recall
  • Maximize the precision

SLIDE 9

ALGORITHM

Pre-processing

Clean the data:

1. Noisy First or Last Names
2. Mistakenly Separated or Merged Name Units
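
The slide names the two cleaning problems but not how they are handled. A minimal sketch of one possible approach, assuming that "noise" means stray digits and punctuation and that merged units can be detected at a lowercase-to-uppercase boundary; the function names `clean_name` and `unit_insensitive_key` are hypothetical:

```python
import re
import unicodedata


def clean_name(raw: str) -> str:
    """Return a lowercased name with noise removed (assumed cleaning rules)."""
    # Assumption: accents are dropped so "José" and "Jose" compare equal.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Assumption: anything other than letters, spaces, hyphens and
    # apostrophes is noise and is replaced by a space.
    text = re.sub(r"[^A-Za-z\s'\-]", " ", text)
    # Split units that were merged by mistake, e.g. "ZhimingZhou".
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Collapse repeated whitespace and lowercase.
    return " ".join(text.split()).lower()


def unit_insensitive_key(name: str) -> str:
    """Key that ignores unit boundaries, so differently split names collide."""
    return re.sub(r"[^a-z]", "", clean_name(name))
```

The boundary rule also splits legitimate names such as "McDonald", so a real cleaner would need an exception list. Comparing `unit_insensitive_key` values is one way to treat "Zhi Ming Zhou" and "Zhiming Zhou" as the same name despite the different unit split.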

SLIDE 10

ALGORITHM

Improving the Recall

String-based Consideration:

1. Levenshtein Edit Distance
2. Soundex Distance
3. Overlapping Name Units
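
The three string-based signals listed above can be sketched as follows. This is an illustration under assumptions, not the deck's implementation: the Soundex variant is standard American Soundex, the overlap measure is a Jaccard ratio over whitespace-separated units, and the slide does not say how the three values are combined.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


_SOUNDEX_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                   ("l", "4"), ("mn", "5"), ("r", "6"))
_SOUNDEX_CODES = {ch: d for letters, d in _SOUNDEX_GROUPS for ch in letters}


def soundex(name: str) -> str:
    """American Soundex code: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def unit_overlap(a: str, b: str) -> float:
    """Jaccard overlap of name units (an assumed measure)."""
    ua, ub = set(a.lower().split()), set(b.lower().split())
    if not ua or not ub:
        return 0.0
    return len(ua & ub) / len(ua | ub)
```

Since the aim of this stage is recall, a candidate pair would plausibly be kept when any one of the signals fires, for example a small edit distance, equal Soundex codes, or a high unit overlap.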

Name-Specific Consideration:

1. Name Suffixes and Prefixes
2. Nicknames
3. Name Initials
4. Asian Names and Western Names
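
A rough sketch of how the four name-specific rules could be folded into one compatibility check. Everything concrete here is an assumption: the suffix, prefix and nickname tables are tiny hypothetical samples, and the Asian versus Western case is modeled only as moving the family name between the first and last position.

```python
# Hypothetical lookup tables; real ones would be much larger.
SUFFIXES = {"jr", "sr", "ii", "iii", "phd"}
PREFIXES = {"dr", "prof", "mr", "mrs", "ms"}
NICKNAMES = {"bill": "william", "bob": "robert", "mike": "michael"}


def normalize_units(name: str) -> list:
    """Lowercase, drop prefixes/suffixes, map nicknames to full names."""
    units = [u.strip(".,").lower() for u in name.split()]
    units = [u for u in units if u and u not in SUFFIXES and u not in PREFIXES]
    return [NICKNAMES.get(u, u) for u in units]


def units_match(u: str, v: str) -> bool:
    """A single-letter unit is an initial and matches on the first letter."""
    if len(u) == 1 or len(v) == 1:
        return u[0] == v[0]
    return u == v


def names_compatible(a: str, b: str) -> bool:
    """True if the two names could plausibly refer to the same author."""
    ua, ub = normalize_units(a), normalize_units(b)
    if not ua or len(ua) != len(ub):
        return False
    # Same order, plus the two rotations that move the family name
    # from last to first position or back.
    candidates = (ub, ub[-1:] + ub[:-1], ub[1:] + ub[:1])
    return any(all(units_match(x, y) for x, y in zip(ua, cand))
               for cand in candidates)
```

For example, `names_compatible("Dr. Bill Gates Jr.", "William Gates")` and `names_compatible("Zhou Zhiming", "Zhiming Zhou")` both hold under these rules, while "John Smith" and "Jane Smith" are rejected.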

SLIDE 11

ALGORITHM

Improving the Precision

Meta-Path-based Similarity:

The selected meta-paths are APA, AOA, APAPA, APVPA, APKPA, APTPA and APYPA. The weights for them decrease progressively.
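
The slide does not expand the path abbreviations or give the scoring formula, so the sketch below rests on stated assumptions. It reads A as author, P as paper, O as organization, V as venue, K as keyword, T as title term and Y as year; it scores each path with a PathSim-style normalized path count, which is a swapped-in technique; and the weights are made-up values that only preserve the decreasing order given above.

```python
from collections import Counter

# Hypothetical weights, decreasing in the order the slide lists the paths.
META_PATH_WEIGHTS = {
    "APA": 7.0, "AOA": 6.0, "APAPA": 5.0, "APVPA": 4.0,
    "APKPA": 3.0, "APTPA": 2.0, "APYPA": 1.0,
}


def path_sim(half_a: Counter, half_b: Counter) -> float:
    """PathSim-style score for one symmetric meta-path.

    half_a / half_b count how often each author reaches every midpoint
    object of the path (for APVPA: how many papers per venue).
    """
    cross = sum(n * half_b[x] for x, n in half_a.items())
    self_a = sum(n * n for n in half_a.values())
    self_b = sum(n * n for n in half_b.values())
    denom = self_a + self_b
    return 2.0 * cross / denom if denom else 0.0


def combined_similarity(profile_a: dict, profile_b: dict,
                        weights: dict = META_PATH_WEIGHTS) -> float:
    """Weighted average of per-path scores; a missing path scores 0."""
    total = sum(weights.values())
    score = sum(
        w * path_sim(profile_a.get(p, Counter()), profile_b.get(p, Counter()))
        for p, w in weights.items()
    )
    return score / total if total else 0.0
```

Each author profile is assumed to be a dict from meta-path name to a `Counter` of midpoint objects, for example `{"APVPA": Counter({"KDD": 2})}` for an author ID with two papers at one venue.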

SLIDE 14

ALGORITHM

Improving the Precision

Ranking-based Merging

We scan from the top-ranked ID pair down to the lower-ranked ones to help infer the author entity. We skip conflicting IDs; when we find a pair that has high similarity and also passes the name-matching comparison, we believe these two IDs have a high probability of being real duplicates. After that, if A is a duplicate of B and B is a duplicate of C, we consider A a duplicate of C. Another important strategy is to expand the author names corresponding to the IDs once we are confident that two IDs are duplicates. This idea is useful because it can help avoid mistakenly detected conflicts.
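
The scan described above can be sketched with a union-find structure, which gives the transitive "A = B and B = C implies A = C" step for free. This is a sketch under assumptions: `merge_ranked_pairs`, its parameters and the `compatible` callback are hypothetical names, and treating a pair as conflict-free when any two known names are compatible is one reading of the name-expansion idea, not a rule stated on the slide.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple


def merge_ranked_pairs(
    pairs: Iterable[Tuple[Hashable, Hashable, float]],
    names: Dict[Hashable, Set[str]],
    compatible: Callable[[str, str], bool],
    min_score: float = 0.0,
) -> List[Set[Hashable]]:
    """Merge author IDs by scanning (id_a, id_b, score) from high to low."""
    parent: Dict[Hashable, Hashable] = {}
    # Names known for each cluster root; grows as clusters merge.
    known = {aid: set(ns) for aid, ns in names.items()}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in sorted(pairs, key=lambda p: p[2], reverse=True):
        if score < min_score:
            break  # everything below this point is ranked too low
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already merged, possibly transitively
        na, nb = known.get(ra, set()), known.get(rb, set())
        if not any(compatible(x, y) for x in na for y in nb):
            continue  # name conflict: skip this pair
        parent[rb] = ra
        known[ra] = na | nb  # name expansion for later comparisons

    groups: Dict[Hashable, Set[Hashable]] = {}
    for aid in list(parent):
        groups.setdefault(find(aid), set()).add(aid)
    return [g for g in groups.values() if len(g) > 1]
```

With pairs (1, 2), (2, 3) and (1, 4) in descending score order, IDs 1 and 3 end up in the same group even though they were never compared directly, while ID 4 stays out if its name conflicts with every name collected for the group.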

SLIDE 15

ALGORITHM

Post-processing

Low-confidence duplicate author IDs should be removed even though their names are compatible and their meta-path-based similarity scores are acceptable. This step is crucial because the later iterative framework requires highly confident output to gradually refine the results.

SLIDE 16

ALGORITHM

Iterative Framework

An iterative framework that takes the detected duplicates of the last iteration as part of the input:

1. We are able to generate much better meta-path-based similarity scores.
2. We can call the name expansion module introduced at the end of the p-step again.
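
The feedback loop can be sketched as a fixed-point iteration. The slide gives no stopping rule, so stopping when the duplicate set no longer changes (or after `max_rounds`) is an assumption, and `detect` is a hypothetical callable standing in for one full pass of the pipeline.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Pair = Tuple[int, int]


def iterate_until_stable(
    detect: Callable[[FrozenSet[Pair]], Iterable[Pair]],
    max_rounds: int = 10,
) -> FrozenSet[Pair]:
    """Re-run detection, feeding each round's duplicates into the next."""
    duplicates: FrozenSet[Pair] = frozenset()
    for _ in range(max_rounds):
        # One pass: similarity scoring, merging and post-processing,
        # all of which may use the duplicates found so far.
        new = frozenset(detect(duplicates))
        if new == duplicates:
            break  # no change, so another round would find nothing new
        duplicates = new
    return duplicates
```

Because each round sees the merges of the previous one, the similarity scores are computed on more complete author profiles and the expanded name sets are available from the start of the round.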

SLIDE 18

CONCLUSION

We have tried to disambiguate author names, and we have found a better algorithm which is undoubtedly practical in the KDD Cup Data Mining Contest 2013. But there is still a lot of work to be done. In the future, we need to adapt the code to our database, and we need to change some of the parameters to obtain the best result. I am looking forward to the day we complete the work, and I firmly believe that our work will turn out to be a very important improvement to Acemap.

SLIDE 19

Q&A

SLIDE 20

Thank You!