author profiling

Author profiling - PowerPoint PPT Presentation

  1. Author profiling • Zhiming Zhou • 2017/6/26

  2. OUTLINE • Introduction • Previous work • System overview • Algorithm • Conclusion

  5. PREVIOUS WORK

  7. SYSTEM OVERVIEW • Maximize the recall • Maximize the precision

  9. ALGORITHM • Pre-processing • Clean the data: 1. Noisy first or last names 2. Mistakenly separated or merged name units

  10. ALGORITHM • Improving the Recall • Name-specific considerations: 1. Name suffixes and prefixes 2. Nicknames 3. Name initials 4. Asian names and Western names • String-based considerations: 1. Levenshtein edit distance 2. Soundex distance 3. Overlapping name units
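The string-based and initial-based checks on slide 10 can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: the function names, the rule that a single-letter unit matches any unit starting with that letter, and the `max_edits` threshold are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def units_compatible(u: str, v: str) -> bool:
    """Assumed rule: units match if equal, or if one is the initial of the other."""
    if u == v:
        return True
    short, long_ = sorted((u, v), key=len)
    return len(short) == 1 and long_.startswith(short)


def names_compatible(name_a: str, name_b: str, max_edits: int = 1) -> bool:
    """Accept a pair if its name units line up, or if the full strings are near-identical."""
    a = name_a.lower().replace(".", "").split()
    b = name_b.lower().replace(".", "").split()
    if len(a) == len(b) and all(units_compatible(u, v) for u, v in zip(a, b)):
        return True
    # Fall back to edit distance (max_edits is a hypothetical threshold).
    return levenshtein(" ".join(a), " ".join(b)) <= max_edits
```

A loose check like this favors recall: it deliberately lets through pairs such as "J. Smith" and "John Smith", leaving it to a later precision step to reject false matches.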

  11. ALGORITHM • Improving the Precision • Meta-path-based Similarity: The selected meta-paths are APA, AOA, APAPA, APVPA, APKPA, APTPA and APYPA. The weights for them decrease progressively.
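One way to combine several meta-path scores with decreasing weights is sketched below. The slides do not say how each per-path score is computed, so this example assumes a PathSim-style normalization; the weight values and the per-author profile layout (a `Counter` of the middle objects reached along each meta-path) are likewise hypothetical.

```python
from collections import Counter
from typing import Dict


def dot(x: Counter, y: Counter) -> int:
    """Number of path instances between two authors that meet at a shared object."""
    return sum(x[k] * y[k] for k in x if k in y)


def pathsim(ca: Counter, cb: Counter) -> float:
    """PathSim-style score (an assumption): 2 * paths(a, b) / (paths(a, a) + paths(b, b))."""
    denom = dot(ca, ca) + dot(cb, cb)
    return 2 * dot(ca, cb) / denom if denom else 0.0


def combined_similarity(profile_a: Dict[str, Counter],
                        profile_b: Dict[str, Counter],
                        weights: Dict[str, float]) -> float:
    """Weighted sum of per-meta-path scores; a missing path contributes 0."""
    return sum(w * pathsim(profile_a.get(path, Counter()),
                           profile_b.get(path, Counter()))
               for path, w in weights.items())


# Hypothetical weights, decreasing in the order the meta-paths are listed.
WEIGHTS = {"APA": 0.5, "AOA": 0.3, "APVPA": 0.2}
```

For the APA path a profile would count the author's papers, and for APVPA it would count how many of the author's papers appeared in each venue.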

  14. ALGORITHM • Improving the Precision • Ranking-based Merging: We scan from the top-ranked ID pair down to the lower-ranked ones to infer the author entities. We skip conflicting IDs; when a pair has high similarity and also passes the name-matching comparison, we consider the two IDs highly likely to be real duplicates. Duplication is then treated as transitive: if A is a duplicate of B and B is a duplicate of C, we consider A a duplicate of C. Another important strategy is to expand the set of author names associated with the IDs once we are confident that two IDs are duplicates. This helps avoid mistakenly detected conflicts.
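The scan, the transitive merge, and the name expansion can be sketched with a union-find structure. This is an illustrative reading of the slide, not the presenter's code: `rank_and_merge`, the `compatible` callback, and the rule that two clusters pass if any pair of their known name variants is compatible are assumptions.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def rank_and_merge(scored_pairs: Iterable[Tuple[float, int, int]],
                   names: Dict[int, str],
                   compatible: Callable[[str, str], bool]) -> List[Set[int]]:
    """Scan ID pairs from highest to lowest score and merge non-conflicting ones."""
    parent = {i: i for i in names}
    # Name expansion: each cluster remembers every name variant merged into it.
    cluster_names = {i: {n} for i, n in names.items()}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _score, a, b in sorted(scored_pairs, reverse=True):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already duplicates through transitivity
        if not any(compatible(x, y)
                   for x in cluster_names[ra] for y in cluster_names[rb]):
            continue  # name conflict: skip this pair
        parent[rb] = ra
        cluster_names[ra] |= cluster_names.pop(rb)

    groups: Dict[int, Set[int]] = {}
    for i in names:
        groups.setdefault(find(i), set()).add(i)
    return [g for g in groups.values() if len(g) > 1]
```

Using union-find makes the "A duplicates B, B duplicates C, so A duplicates C" rule automatic, since all three IDs end up under the same root.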

  15. ALGORITHM • Post-processing • Unconfident duplicate author IDs should be removed even though their names are compatible and their meta-path-based similarity scores are acceptable. This step is crucial because the later iterative framework requires highly confident output to gradually refine the results.

  16. ALGORITHM • Iterative Framework • An iterative framework takes the duplicates detected in the previous iteration as part of its input, so that: 1. we are able to generate much better meta-path-based similarity scores 2. we can recall the name expansion module introduced at the end of the p-step
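The control flow of such an iterative framework might look like the sketch below. The `detect` callback stands in for the whole scoring, merging, and post-processing pipeline; the stopping rule (no new pairs) and the `max_iters` cap are assumptions, since the slides do not state how iteration ends.

```python
from typing import Callable, FrozenSet, Set, Tuple

Pair = Tuple[int, int]


def iterate_until_stable(detect: Callable[[FrozenSet[Pair]], Set[Pair]],
                         max_iters: int = 10) -> Set[Pair]:
    """Feed the duplicates found so far back into detection until nothing new appears."""
    known: Set[Pair] = set()
    for _ in range(max_iters):
        found = detect(frozenset(known))
        if found <= known:
            break  # no new duplicates: results have stabilized
        known |= found
    return known
```

Because each round builds on the previous output, only high-confidence pairs should be fed back, which is why the post-processing filter matters.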

  18. CONCLUSION • We have tried to disambiguate author names, and we have found a better algorithm, one that is undoubtedly practical in the KDD Cup Data Mining Contest 2013. But there is still a lot of work to be done. In the future, we need to adapt the code to our database and tune some of the parameters to obtain the best result. I am looking forward to the day we complete the work, and I firmly believe that our work will turn out to be a very important improvement to Acemap.

  19. Q&A

  20. Thank You!
