Alyona Medelyan @zelandiya Anna Divoli @annadivoli Problem 1 - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Mining Unstructured Data: Practical Applications

Alyona Medelyan @zelandiya Anna Divoli @annadivoli

slide-2
SLIDE 2

New York London

Problem 1

Images: Ambro / FreeDigitalPhotos.net

How do lawyers scan, file, store & share clients' case documents efficiently?

slide-3
SLIDE 3

slambo_42@flickr Anoto AB@flickr

EHR EMR PHR

How do doctors, patients & researchers distribute & share medical records efficiently?

slide-4
SLIDE 4

Foreign Financial Institution

with IRS agreement
annual report
30% withholding tax
waiver

with waiver
without waiver

U.S. account holders
U.S. ownership entities

30% withholding tax

Custodian bank

without IRS agreement

The FATCA Legislation

Takes effect 1 January 2013

Problem 3 How can a financial institution find U.S. citizens in masses of paperwork efficiently?

slide-5
SLIDE 5

How much time do we actually spend on …

Searching, gathering info: 17
Writing emails: 14
Creating docs: 13
Analyzing info: 10
Reviewing docs: 9
Organizing docs: 7
Creating presentations: 7
Editing images: 6
Entering data: 6
Approving docs: 4
Publishing docs: 4
Translating docs: 1

Translates to annual costs:

Search: 17h / week = $37,000 / year

IDC: Hidden cost of information

average hours / week

slide-6
SLIDE 6

introduction unstructured data real life problems unstructured data & text analytics metadata in legal domain healthcare records issues conclusions compliance in finance

slide-7
SLIDE 7

Videos Emails Literature Audio News Images Social Media Databases Blogs

slide-8
SLIDE 8

Text Mining Natural Language Processing

unstructured data

Opinion Mining Business Intelligence Document Organization Data Extraction Search Machine Learning Text Processing Statistics Linguistics

slide-9
SLIDE 9

What can one mine from unstructured data?

text text text text text text text text text text text text text text text text text text

sentiment keywords tags genre categories taxonomy terms entities names patterns biochemical entities …
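Of the items above, keywords are the simplest to illustrate. The following is a minimal sketch, assuming a plain frequency count over a tiny hand-picked stopword list. The function name `top_keywords` and the stopword set are hypothetical, and this is not meant to represent how any particular text analytics product works.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "by"}

def top_keywords(text, n=3):
    """Return the n most frequent non-stopword tokens in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

sample = "The contract binds the tenant. The tenant signs the contract and the tenant pays."
print(top_keywords(sample, 2))
```

Real keyword extraction usually goes further (multi-word phrases, statistical weighting, controlled vocabularies); the sketch only shows the basic idea of turning free text into a short list of terms.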

text text text text text text ! text text text ! text text text ! text text text! text text text!

slide-10
SLIDE 10

Videos Emails Literature Audio News Images Social Media Databases Blogs

slide-11
SLIDE 11

text text text text text text text text text text text text text text text text text text

People U.S. politicians News about U.S. politicians News

Structured biological data
Unique identifiers
Literature references
Experts' annotation (free text)

Structured & unstructured data interplay

slide-12
SLIDE 12

introduction unstructured data real life problems unstructured data & text analytics metadata in legal domain healthcare records issues conclusions compliance in finance

slide-13
SLIDE 13
scan
OCR
metadata
DMS
save

Legal document processing pipeline

Images: Ambro / FreeDigitalPhotos.net

New York London

slide-14
SLIDE 14

Assigning metadata

(approximation)

15 docs per day
3 min per doc
0.75 h per day
240 working days per year
$200 hourly charge
$36,000 per year per lawyer

Keyword extraction

0.0027 min per doc
10 min for a year's worth of docs
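The arithmetic behind these two estimates can be checked with a short sketch. All figures come from the slide; the constant and function names are hypothetical.

```python
# Figures taken from the slide; names are illustrative.
DOCS_PER_DAY = 15
MINUTES_PER_DOC_MANUAL = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
MINUTES_PER_DOC_AUTO = 0.0027

def manual_cost_per_year():
    """Yearly cost of a lawyer assigning metadata by hand."""
    hours_per_day = DOCS_PER_DAY * MINUTES_PER_DOC_MANUAL / 60  # 0.75 h
    return hours_per_day * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD

def automated_minutes_per_year():
    """Minutes of keyword extraction for a year's worth of documents."""
    docs_per_year = DOCS_PER_DAY * WORKING_DAYS_PER_YEAR  # 3,600 docs
    return docs_per_year * MINUTES_PER_DOC_AUTO

print(f"${manual_cost_per_year():,.0f} per year per lawyer")  # $36,000
print(f"{automated_minutes_per_year():.1f} min per year")     # 9.7, i.e. roughly 10 min
```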

jacockshaw@flickr

slide-15
SLIDE 15

Integrating metadata extraction with scanning

http://www.youtube.com/watch?v=kluVp25upag

slide-16
SLIDE 16

metadata DMS

Efficient (legal) document processing pipeline

keywords tags

slide-17
SLIDE 17

introduction unstructured data real life problems unstructured data & text analytics metadata in legal domain healthcare records issues conclusions compliance in finance

slide-18
SLIDE 18

EHR EMR PHR

slambo_42@flickr Anoto AB@flickr

slide-19
SLIDE 19

EMR

EHR

PHR

National Alliance for Health Information Technology (NAHIT) definitions

Discontinued!

?

1. Name, birth date, blood type
2. Emergency contact(s)
3. Primary caregiver/phone number
4. Medicines, dosages, and how long taken
5. Allergies/allergic reactions
6. Date of last physical
7. Dates/results of tests and screenings
8. Major illnesses/surgeries and their dates
9. Chronic diseases
10. Family illness history
11. …

http://www.nlm.nih.gov/medlineplus/magazine/

PHI

de-identification process

slide-20
SLIDE 20

Medical researchers use patient records for discoveries…

…records with removed PHI: information from structured fields but mostly from free text!

AMIA 2012

slide-21
SLIDE 21

www.hcpro.com
siliconangle.com/blog/
www.information-age.com

“The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”

“The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”

slide-22
SLIDE 22

PHI

18 identifiers!

Names
Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…
Dates (except year): birth, admission, discharge…
Phone / Fax numbers
Email addresses
Social security #
Medical records #
Health plan beneficiary #
Accounts #
Vehicle identifiers & serial numbers, incl. license plate numbers
Device identifiers & serial numbers
URLs / IP addresses
Biometric identifiers, including finger and voice prints
Face photo images & any comparable images
Any other unique IDs etc.
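A few of the identifier types listed above have a regular surface form and can be masked with pattern matching. The sketch below is a rough illustration under that assumption only. The patterns, the labels, and the `deidentify` function are hypothetical, they cover U.S.-style formats, and they do not address names, dates, addresses or the other identifiers that typically need more than regular expressions.

```python
import re

# Illustrative patterns for a handful of identifier types; applied in this order.
PATTERNS = {
    "URL": re.compile(r"\bhttps?://\S+"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}

def deidentify(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = ("Patient emailed jane.doe@example.org from 192.168.0.1; "
        "SSN 123-45-6789, phone 555-123-4567.")
print(deidentify(note))
```

Since most of the information in patient records sits in free text, a pattern-based pass like this would at best be one step in a larger de-identification process.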

slide-23
SLIDE 23

Thanks for discussions:
Nigam Shah, Stanford
Eneida Mendonca, UWisconsin, Madison
Irena Spasic, Cardiff University

keywords tags

slambo_42@flickr Anoto AB@flickr

text text text text text text ! text text text ! text text text ! text text text! text text text!

slide-24
SLIDE 24

introduction unstructured data real life problems metadata in legal domain conclusions compliance in finance unstructured data & text analytics healthcare records issues

slide-25
SLIDE 25

Foreign Financial Institution

with IRS agreement
annual report
30% withholding tax
waiver

with waiver
without waiver

U.S. account holders
U.S. ownership entities

30% withholding tax

Custodian bank

without IRS agreement

The FATCA Legislation

Takes effect 1 January 2013

slide-26
SLIDE 26

FATCA COMPLIANCE – STEP 1

Detect U.S. citizenship indicators

slide-27
SLIDE 27

Recommended Solution from FATCA Legislation:

  • “Query an electronic database using standard queries in programming languages”

  • “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements”

  • “Note that information, data, or files are not electronically searchable if they are stored as images”

slide-28
SLIDE 28

walmink, thomwatson@flickr

FATCA COMPLIANCE – STEP 2

Contact client for additional info or a waiver

slide-29
SLIDE 29

Actual Solution for the FATCA Legislation:

OCR
link analysis
entity extraction
analysis
gather the trail
client's data
convert all images to text
detect locations, bank numbers
auto-categorize
check
resolve inconsistencies
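As a rough illustration of the detection step, the sketch below scans text (assumed to be already produced by OCR) for a few surface cues that may point to a U.S. connection. The cue set, the pattern names, and the `find_us_indicators` function are assumptions made for this example; they are not taken from the legislation or from any product, and a real system would rely on proper entity extraction and follow-up checks.

```python
import re

# Hypothetical cues: a +1 phone number, a state abbreviation followed by a ZIP code,
# and an explicit country mention. The state list is truncated for brevity.
INDICATORS = {
    "us_phone": re.compile(r"\+1[-. ]\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "us_state_zip": re.compile(r"\b(?:NY|CA|TX|FL|IL),? \d{5}(?:-\d{4})?\b"),
    "us_country_mention": re.compile(r"\bUnited States\b|\bUSA\b"),
}

def find_us_indicators(text):
    """Return a dict of indicator name -> list of matching snippets (non-empty only)."""
    hits = {}
    for name, pattern in INDICATORS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits

doc = ("Place of birth: Boston, United States. Phone: +1 212 555 0147. "
       "Address: 5th Ave, New York, NY 10001.")
print(find_us_indicators(doc))
```

A document with hits would then go to the later steps on the slide (checking and resolving inconsistencies) rather than being treated as a confirmed match.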

slide-30
SLIDE 30

Efficient FATCA Compliance

slide-31
SLIDE 31

introduction unstructured data real life problems metadata in legal domain healthcare records issues conclusions compliance in finance unstructured data & text analytics healthcare records issues

slide-32
SLIDE 32

Alyona Medelyan, PhD @zelandiya Anna Divoli, PhD @annadivoli

Natural Language Processing Text Mining Wikipedia Mining Machine Learning

Try out text analytics provided by the Pingar API!

Online demo: apidemo.pingar.com Free Sandbox account: pingar.com/get-the-api

Biomedical Text Mining Search User Interfaces Human Factors Knowledge Discovery