Author profiling
Zhiming Zhou
2017/6/26
OUTLINE
Introduction
Previous work
System overview
Algorithm
Conclusion
PREVIOUS WORK
SYSTEM OVERVIEW
ALGORITHM
Pre-processing
Clean the data:
1. Noisy First or Last Names
2. Mistakenly Separated or Merged Name Units
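The cleaning step above can be sketched as follows. This is a minimal illustration, not the cleaning rules actually used: the helper names `normalize_name` and `unit_insensitive_key` are hypothetical, and treating names as equal once spaces and hyphens are ignored is an assumption about how separated or merged name units might be reconciled.

```python
import re
import unicodedata


def normalize_name(raw: str) -> str:
    """Lowercase a name, strip accents and punctuation, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace everything except a-z, whitespace and hyphens with a space.
    text = re.sub(r"[^a-z\s-]", " ", text.lower())
    # Treat a hyphen as a unit separator (assumption).
    text = text.replace("-", " ")
    return " ".join(text.split())


def unit_insensitive_key(raw: str) -> str:
    """Key that ignores how a name was split into units (assumption)."""
    return normalize_name(raw).replace(" ", "")
```

With this key, "Zhi Ming Zhou", "Zhi-Ming Zhou" and "Zhiming Zhou" all map to the same string, so separated and merged variants can be grouped before matching.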
ALGORITHM
Improving the Recall
String-based Consideration:
1. Levenshtein Edit Distance
2. Soundex Distance
3. Overlapping Name Units
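The three string-based signals can be sketched in plain Python. The edit distance and the Soundex code follow their standard definitions; the function `is_candidate_pair`, its `max_edits` threshold, and the rule of accepting a pair when any one signal fires are assumptions made for illustration, not the thresholds used in the original system.

```python
# Standard Soundex digit groups.
_CODES = {}
for _digit, _letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for _letter in _letters:
        _CODES[_letter] = str(_digit)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def soundex(name: str) -> str:
    """Four-character American Soundex code of a single name unit."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate equal codes
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so repeated codes are kept
    return (result + "000")[:4]


def shared_units(a: str, b: str) -> set:
    """Name units (whitespace-separated tokens) that both names contain."""
    return set(a.lower().split()) & set(b.lower().split())


def is_candidate_pair(a: str, b: str, max_edits: int = 2) -> bool:
    """Recall-oriented test: accept if any signal fires (assumed rule)."""
    a, b = a.lower(), b.lower()
    if levenshtein(a, b) <= max_edits:
        return True
    if [soundex(u) for u in a.split()] == [soundex(u) for u in b.split()]:
        return True
    return bool(shared_units(a, b))
```

Because this stage targets recall, the test is deliberately loose; false positives are expected to be removed by the precision steps.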
Name-Specific Consideration:
1. Name Suffixes and Prefixes
2. Nicknames
3. Name Initials
4. Asian Names and Western Names
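A minimal sketch of how the name-specific rules might be combined into one compatibility check. The suffix, prefix and nickname tables are tiny hypothetical examples, and handling Asian versus Western names by also trying the reversed unit order is an assumption; names with a different number of units (for example a missing middle name) are simply rejected here.

```python
SUFFIXES = {"jr", "sr", "ii", "iii"}            # assumed, not exhaustive
PREFIXES = {"dr", "prof", "mr", "mrs"}          # assumed, not exhaustive
NICKNAMES = {"bill": "william", "bob": "robert", "mike": "michael"}  # toy table


def name_units(name: str) -> list:
    """Split a name into units, dropping affixes and resolving nicknames."""
    units = name.lower().replace(".", " ").replace(",", " ").split()
    units = [u for u in units if u not in SUFFIXES and u not in PREFIXES]
    return [NICKNAMES.get(u, u) for u in units]


def units_match(u: str, v: str) -> bool:
    """Two units match if they are equal or one is the other's initial."""
    if u == v:
        return True
    short, long_ = sorted((u, v), key=len)
    return len(short) == 1 and long_.startswith(short)


def names_compatible(a: str, b: str) -> bool:
    """Compare unit by unit, in the given order and in reversed order."""
    ua, ub = name_units(a), name_units(b)
    if not ua or len(ua) != len(ub):
        return False
    same_order = all(units_match(x, y) for x, y in zip(ua, ub))
    # Family-name-first versus given-name-first ordering (assumption).
    swapped = all(units_match(x, y) for x, y in zip(ua, reversed(ub)))
    return same_order or swapped
```

For example, `names_compatible("Z. Zhou", "Zhiming Zhou")` and `names_compatible("Zhou Zhiming", "Zhiming Zhou")` both hold, while "John Smith" and "Jane Smith" are rejected.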
ALGORITHM
Improving the Precision
Meta-Path-based Similarity:
The selected meta-paths are APA, AOA, APAPA, APVPA, APKPA, APTPA and APYPA. The weights assigned to them decrease progressively.
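A weighted meta-path score can be sketched as below. The slides do not spell out the letters or the scoring formula, so several things here are assumptions: A is read as author, P as paper and V as venue; each path is scored with the PathSim normalization (twice the path count between two authors, divided by the sum of their self-path counts); and the weights 1.0 and 0.5 are illustrative values that merely decrease from APA to APVPA.

```python
from collections import defaultdict


def compose(left, right):
    """Multiply two sparse count matrices stored as dict-of-dicts."""
    out = defaultdict(lambda: defaultdict(int))
    for x, row in left.items():
        for mid, n1 in row.items():
            for y, n2 in right.get(mid, {}).items():
                out[x][y] += n1 * n2
    return out


def transpose(matrix):
    """Swap rows and columns of a dict-of-dicts matrix."""
    out = defaultdict(dict)
    for x, row in matrix.items():
        for y, n in row.items():
            out[y][x] = n
    return out


def pathsim(commuting, a, b):
    """PathSim score of a and b under one symmetric meta-path."""
    paths = commuting.get(a, {}).get(b, 0)
    denom = commuting.get(a, {}).get(a, 0) + commuting.get(b, {}).get(b, 0)
    return 2 * paths / denom if denom else 0.0


def combined_similarity(a, b, commuting_by_path, weights):
    """Weighted sum of per-path scores (weights are assumed values)."""
    return sum(w * pathsim(commuting_by_path[p], a, b) for p, w in weights.items())


# Toy data: which author wrote which paper, and where it appeared.
author_paper = {"a1": {"p1": 1, "p2": 1}, "a2": {"p2": 1, "p3": 1}, "a3": {"p3": 1}}
paper_venue = {"p1": {"v1": 1}, "p2": {"v1": 1}, "p3": {"v2": 1}}

apa = compose(author_paper, transpose(author_paper))   # shared papers
apv = compose(author_paper, paper_venue)
apvpa = compose(apv, transpose(apv))                   # shared venues

WEIGHTS = {"APA": 1.0, "APVPA": 0.5}                   # hypothetical weights
score = combined_similarity("a1", "a2", {"APA": apa, "APVPA": apvpa}, WEIGHTS)
```

The other meta-paths on the slide would be built the same way, by composing the author-paper matrix with the relevant paper-attribute matrix and its transpose.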
ALGORITHM
Improving the Precision
Ranking-based Merging
We scan from the top-ranked ID pair down to the lower-ranked ones to help infer the author entity. We skip conflicting IDs and look for a pair that has high similarity and also passes the name-matching comparison; we believe two such IDs have a high probability of being real duplicates. After that, if A is a duplicate of B and B is a duplicate of C, we consider A a duplicate of C. Another important strategy is to expand the author names corresponding to the IDs once we are confident that two IDs are duplicates. This is useful because it helps avoid mistakenly detected conflicts.
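The scan described above can be sketched with a union-find structure, which gives the transitive rule (A = B and B = C implies A = C) for free. Everything named here is hypothetical: `merge_ranked_pairs`, the `(score, id_a, id_b)` tuple layout, and `toy_compatible`, which stands in for the real name-matching comparison. Name expansion is modelled by pooling the names of merged IDs and accepting a later pair if any pooled name is compatible.

```python
def toy_compatible(x: str, y: str) -> bool:
    """Stand-in name check: same last unit and same first initial."""
    ux = x.lower().replace(".", "").split()
    uy = y.lower().replace(".", "").split()
    return ux[-1] == uy[-1] and ux[0][0] == uy[0][0]


def merge_ranked_pairs(ranked_pairs, names, compatible=toy_compatible):
    """Merge author IDs by scanning pairs from highest to lowest score."""
    parent = {i: i for i in names}
    cluster_names = {i: {n} for i, n in names.items()}

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _score, a, b in sorted(ranked_pairs, reverse=True):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already merged through transitivity
        if not any(compatible(x, y)
                   for x in cluster_names[ra] for y in cluster_names[rb]):
            continue  # name conflict: skip this pair
        parent[rb] = ra
        # Name expansion: the merged cluster now carries both name sets.
        cluster_names[ra] |= cluster_names.pop(rb)

    clusters = {}
    for i in names:
        clusters.setdefault(find(i), set()).add(i)
    return [c for c in clusters.values() if len(c) > 1]


names = {1: "Z. Zhou", 2: "Zhiming Zhou", 3: "Zhiming Zhou", 4: "Y. Li"}
pairs = [(0.9, 1, 2), (0.8, 2, 3), (0.7, 1, 4)]
duplicates = merge_ranked_pairs(pairs, names)
```

In the toy data, IDs 1, 2 and 3 end up in one cluster, while the pair (1, 4) is skipped as a name conflict despite its acceptable score.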
ALGORITHM
Post-processing
Low-confidence duplicate author IDs should be removed even though their names are compatible and their meta-path-based similarity scores are acceptable. This step is crucial because the later iterative framework requires highly confident output to gradually refine the results.
ALGORITHM
Iterative Framework
An iterative framework takes the duplicates detected in the last iteration as part of the input. This lets us:
1. generate much better meta-path-based similarity scores
2. re-invoke the name expansion module introduced at the end of the p-step
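The control flow of such a loop might look like the sketch below. It only shows the feedback of one round's output into the next; `iterate_until_stable`, the `detect` callback, the `max_rounds` cap and stopping when the duplicate set no longer changes are all assumptions, and `toy_detect` is a stand-in for the full recall, precision and post-processing pipeline.

```python
def iterate_until_stable(detect, max_rounds: int = 10) -> frozenset:
    """Feed each round's duplicates back in until the set stops changing."""
    duplicates = frozenset()
    for _ in range(max_rounds):
        found = frozenset(detect(duplicates))
        if found == duplicates:
            break  # nothing new was learned in this round
        duplicates = found
    return duplicates


def toy_detect(previous):
    """Stand-in detector: knowing (a, b) lets it also find (b, c)."""
    found = set(previous) | {("a", "b")}
    if ("a", "b") in previous:
        found.add(("b", "c"))
    return found


final_duplicates = iterate_until_stable(toy_detect)
```

In the toy run, the pair ("b", "c") is only found in the second round, once ("a", "b") from the first round is available as input.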
CONCLUSION
We have worked on disambiguating author names, and we have found an algorithm that proved practical in the KDD Cup 2013 data mining contest. However, much work remains to be done. In the future, we need to adapt the code to our database and tune some of the parameters to obtain the best results. I am looking forward to the day we complete the work, and I firmly believe that it will turn out to be a very important improvement to Acemap.