
Alyona Medelyan @zelandiya Anna Divoli @annadivoli Problem 1 - PowerPoint PPT Presentation

Mining Unstructured Data: Practical Applications Alyona Medelyan @zelandiya Anna Divoli @annadivoli Problem 1 London New York How do lawyers scan, file, store & share clients' case documents efficiently? Images: Ambro /


  1. Mining Unstructured Data: Practical Applications Alyona Medelyan @zelandiya Anna Divoli @annadivoli

  2. Problem 1. London, New York. How do lawyers scan, file, store & share clients' case documents efficiently? Images: Ambro / FreeDigitalPhotos.net

  3. EHR, EMR, PHR. How do doctors, patients & researchers distribute & share medical records efficiently? Images: slambo_42@flickr, Anoto AB@flickr

  4. The FATCA Legislation. Problem 3. Takes effect 1 January 2013. [Diagram: Foreign Financial Institution, with IRS agreement / without IRS agreement; U.S. account holders; U.S. ownership entities; with waiver / without waiver; annual report; custodian bank; 30% withholding tax] How can a financial institution find U.S. citizens in masses of paperwork efficiently?

  5. How much time do we actually spend on … (IDC: Hidden cost of information, average hours / week) • Searching, gathering info: 17 • Writing emails: 14 • Creating docs: 13 • Analyzing info: 10 • Reviewing docs: 9 • Organizing docs: 7 • Creating presentations: 7 • Editing images: 6 • Entering data: 6 • Approving docs: 4 • Publishing docs: 4 • Translating docs: 1 Translates to annual costs: Search: 17h / week = $37,000 / year

  6. introduction; unstructured data & text analytics; real life problems: metadata in legal domain, healthcare records issues, compliance in finance; conclusions

  7. News, Social, Emails, Media, Images, Audio, Databases, Videos, Literature, Blogs

  8. unstructured data: Search, Linguistics, Statistics, Data Extraction, Text Processing, Document Organization, Machine Learning, Business Intelligence, Opinion Mining, Natural Language Processing, Text Mining

  9. What can one mine from unstructured data? keywords, tags, sentiment, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …

  10. News, Social, Emails, Media, Images, Audio, Databases, Videos, Literature, Blogs

  11. Structured & unstructured data interplay. People; U.S. politicians; News about U.S. politicians; News. Unique identifiers; Structured biological data; Literature references; Experts' annotation (free text)

  12. introduction; unstructured data & text analytics; real life problems: metadata in legal domain, healthcare records issues, compliance in finance; conclusions

  13. Legal document processing pipeline: scan, save, ocr, metadata, dms. New York, London. Images: Ambro / FreeDigitalPhotos.net

  14. Assigning metadata (approximation): 15 docs per day × 3 min per doc = 0.75 h per day; × 240 working days per year × $200 hourly charge = $36,000 per year per lawyer. Keyword extraction: 0.0027 min per doc, 10 min for a year's worth of docs. Image: jacockshaw@flickr
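The arithmetic on this slide can be checked with a short sketch. The figures are the ones on the slide; the constant and function names are illustrative assumptions, not something from the presentation.

```python
# Figures taken from the slide; all names here are illustrative.
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027


def manual_cost_per_year():
    """Yearly cost (USD) of one lawyer assigning metadata by hand."""
    hours_per_day = DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60
    return hours_per_day * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD


def automatic_minutes_per_year():
    """Minutes needed to extract keywords for one year's documents."""
    docs_per_year = DOCS_PER_DAY * WORKING_DAYS_PER_YEAR
    return docs_per_year * AUTO_MIN_PER_DOC


print(manual_cost_per_year())                  # 36000.0
print(round(automatic_minutes_per_year(), 2))  # 9.72
```

The 9.72 minutes for 3,600 documents is what the slide rounds to "10 min".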

  15. Integrating metadata extraction with scanning: http://www.youtube.com/watch?v=kluVp25upag

  16. Efficient (legal) document processing pipeline: keywords, tags, metadata, dms
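As a rough illustration of how keywords could be attached to a document as metadata, here is a minimal sketch that ranks words by frequency. This is a deliberately simple stand-in and not the extraction method used by the presenters; the stopword list, the `extract_keywords` function and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "for", "on",
    "is", "are", "was", "by", "with", "that", "this", "be", "as", "at",
}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword tokens in text."""
    tokens = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


doc = (
    "The tenant signed the lease agreement. The lease names the landlord "
    "and the tenant, and the lease sets the rent."
)
metadata = {"keywords": extract_keywords(doc, top_n=2)}
print(metadata)  # {'keywords': ['lease', 'tenant']}
```

Plain frequency counts ignore phrases and domain vocabulary, so they only hint at what a dedicated keyword extraction service would return.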

  17. introduction; unstructured data & text analytics; real life problems: metadata in legal domain, healthcare records issues, compliance in finance; conclusions

  18. EMR, PHR, EHR. Images: slambo_42@flickr, Anoto AB@flickr

  19. National Alliance for Health Information Technology (NAHIT) definitions: EMR; EHR ? PHR. Discontinued! 1. Name, birth date, blood type 2. Emergency contact(s) 3. Primary caregiver/phone number 4. Medicines, dosages, and how long taken 5. Allergies/allergic reactions 6. Date of last physical 7. Dates/results of tests and screenings 8. Major illnesses/surgeries and their dates 9. Chronic diseases 10. Family illness history 11. … http://www.nlm.nih.gov/medlineplus/magazine/ PHI: de-identification process

  20. Medical researchers use patient records for discoveries… … records with removed PHI: information from structured fields, but mostly from free text! AMIA 2012

  21. “The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules” “The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule” siliconangle.com/blog/ www.hcpro.com www.information-age.com

  22. 18 identifiers (PHI): • Names • Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code… • Dates (except year): birth, admission, discharge… • Phone / Fax numbers • Email addresses • Social security # • Medical records # • Health plan beneficiary # • Accounts # • Vehicle identifiers & serial numbers, incl. license plate numbers • Device identifiers & serial numbers • URLs / IP addresses • Biometric identifiers, including finger and voice prints • Face photo images & any comparable images • Any other unique IDs etc.
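Some identifier types of this kind follow fixed surface patterns and can be masked with regular expressions. The sketch below is only an illustration under that assumption: the patterns, labels and sample note are invented for the example, they cover just five pattern-like types (email addresses, URLs, IP addresses, U.S.-style social security and phone numbers), and they would not be sufficient for real de-identification.

```python
import re

# Illustrative patterns only; names, dates and addresses need other techniques.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "URL": re.compile(r"\bhttps?://\S+"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = (
    "Contact jane.doe@example.org or 555-123-4567; "
    "SSN 123-45-6789, host 192.168.0.1."
)
print(redact(note))
# Contact [EMAIL] or [PHONE]; SSN [SSN], host [IP].
```

Free-text identifiers such as names or street addresses have no fixed shape, which is where entity extraction would be needed instead of patterns.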

  23. Thanks for discussions: Nigam Shah, Stanford; Eneida Mendonca, UWisconsin, Madison; Irena Spasic, Cardiff University. keywords, tags. Images: slambo_42@flickr, Anoto AB@flickr

  24. introduction; unstructured data & text analytics; real life problems: metadata in legal domain, healthcare records issues, compliance in finance; conclusions

  25. The FATCA Legislation. Takes effect 1 January 2013. [Diagram: Foreign Financial Institution, with IRS agreement / without IRS agreement; U.S. account holders; U.S. ownership entities; with waiver / without waiver; annual report; custodian bank; 30% withholding tax]

  26. FATCA COMPLIANCE – STEP 1 Detect U.S. citizenship indicators

  27. Recommended Solution from FATCA Legislation: • “Query an electronic database using standard queries in programming languages” • “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements” • “Note that information, data, or files are not electronically searchable if they are stored as images”
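The first recommendation, querying an electronic database with standard queries, might look like the following sketch. Everything specific here is an assumption for illustration: the `clients` table, its columns, the sample rows, and the three indicators checked (country of birth, address country, a phone number with a +1 prefix) are not taken from the legislation or the slides.

```python
import sqlite3

# Hypothetical client table, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, "
    "birth_country TEXT, address_country TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO clients (name, birth_country, address_country, phone) "
    "VALUES (?, ?, ?, ?)",
    [
        ("A. Example", "US", "CH", "+41 44 000 0000"),
        ("B. Sample", "DE", "DE", "+49 30 000000"),
        ("C. Placeholder", "FR", "US", "+1 212 555 0100"),
    ],
)


def find_possible_us_clients(connection):
    """Return names of clients matching any assumed U.S. indicator."""
    rows = connection.execute(
        "SELECT name FROM clients "
        "WHERE birth_country = 'US' OR address_country = 'US' "
        "OR phone LIKE '+1 %' "
        "ORDER BY name"
    )
    return [name for (name,) in rows]


print(find_possible_us_clients(conn))  # ['A. Example', 'C. Placeholder']
```

A query like this only reaches structured fields. As the third quoted point notes, anything stored as an image is not searchable this way until it has been converted to text.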

  28. FATCA COMPLIANCE – STEP 2 Contact client for additional info or a waiver. Images: walmink, thomwatson@flickr

  29. Actual Solution for the FATCA Legislation: gather the trail client's data; link analysis; ocr; convert all images to text; entity extraction; detect locations, bank numbers; auto-categorize; analysis; check; resolve inconsistencies

  30. Efficient FATCA Compliance

  31. introduction; unstructured data & text analytics; real life problems: metadata in legal domain, healthcare records issues, compliance in finance; conclusions

  32. Alyona Medelyan, PhD (@zelandiya): Natural Language Processing, Text Mining, Wikipedia Mining, Machine Learning. Anna Divoli, PhD (@annadivoli): Biomedical Text Mining, Search User Interfaces, Human Factors, Knowledge Discovery. Try out text analytics provided by the Pingar API! Online demo: apidemo.pingar.com Free Sandbox account: pingar.com/get-the-api
