Coding the Twitter Sphere: Humans and Machines Learning Together

SLIDE 1

Coding the Twitter Sphere: Humans and Machines Learning Together

Dr. Stuart Shulman
@stuartwshulman
stu@texifter.com


SLIDE 2

Acknowledgements

The National Science Foundation
Mark J. Hoy

SLIDE 3

Conflict of Interest Disclosure

I am the sole manager of Texifter
We sell DiscoverText licenses
We sell Gnip data licenses

SLIDE 4

A Master Metaphor: Sifter

SLIDE 5

An Open Source Kernel

SLIDE 6

Three Primary Tasks in CAT

SLIDE 7

Classification of Text

A 2,500-year-old problem
Plato argued it would be frustrating
It still is…

SLIDE 8

Grimmer & Stewart, “Text as Data,” Political Analysis (2013)

Volume is a problem for scholars
Coders are expensive
Groups struggle to accurately label text at scale
Validation of both humans and machines is “essential”
Some models are easier to validate than others
All models are wrong
Automated models enhance/amplify, but don’t replace humans
There is no one right way to do this
“Validate, validate, validate”

“What should be avoided then, is the blind use of any method without a validation step.”

SLIDE 9

(Patent Pending)

SLIDE 10

Three Important Books

SLIDE 11

One Particularly Important Idea

SLIDE 12

Five Pillars of Text Analytics

Search
Filter
Code
Cluster
Classify

You can execute all five using DiscoverText (DT)

SLIDE 13

Pillar #1: Search

SLIDE 14

Search for Negative Cases

SLIDE 15

Defined Search (Multi-term)

SLIDE 16

Pillar #2: Filters

SLIDE 17

Another Common Filter

SLIDE 18

SLIDE 19

Pillar #3: Human Coding

SLIDE 20

Keystroke Coding is Fast

SLIDE 21

Coding Off a List is Faster

SLIDE 22

Data Cleaning is Fundamental

SLIDE 23

Pillar #4: Clustering

SLIDE 24

SLIDE 25

Latent Dirichlet Allocation (LDA) Topic Models

SLIDE 26

LDA on the Christie Data

Topic 1: christie, sandy, christies, funds, relief, feds, investigating, daily, gov, feminized
Topic 2: with, daniel, didnt, after, murder, time, agatha, death, former, mayor
Topic 3: bridge, about, traffic, more, scandal, chris, nj, some, just, says
Topic 4: like, gop, bridgegate, what, 2016, know, now, will, bully, dont
Topic 5: obama, benghazi, impeachment, dem, have, probe, lawmaker, floats, possibility, gwb
Topic 6: jersey, over, stages, still, aides, grief, bogus, hes, news, subpoenas
Topic 7: rove, closures, karl, york, while, federal, party, tea, governor, president
Topic 8: irs, political, been, show, republicans, media, get, laws, word, scandals
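
For readers who want to see how topic lists like these can be produced, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling. It is not the implementation behind the slide. The function names, hyperparameters (`alpha`, `beta`), and the four toy documents are all assumptions made up for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # n(d, k)
    topic_word = [Counter() for _ in range(n_topics)]   # n(k, w)
    topic_total = [0] * n_topics                        # n(k)
    assign = []                                         # topic of each token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(k) is proportional to
                # (n(d,k) + alpha) * (n(k,w) + beta) / (n(k) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


def top_words(topic_word, n=3):
    """Most frequent words per topic, skipping words with a zero count."""
    return [[w for w, c in counts.most_common() if c > 0][:n]
            for counts in topic_word]


# Toy documents (invented for this sketch, not the Christie data).
docs = [
    "bridge traffic scandal bridge".split(),
    "traffic bridge closures scandal".split(),
    "sandy relief funds sandy".split(),
    "relief funds sandy feds".split(),
]
topics = lda_gibbs(docs, n_topics=2)
for number, words in enumerate(top_words(topics), start=1):
    print(f"Topic {number}: {', '.join(words)}")
```

As the slide's own output shows, topics mix signal with noise (stop words, unrelated "agatha" and "murder" items), which is one reason the later pillars pair machine output with human review.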

SLIDE 27

Pillar #5: Machine-Learning

SLIDE 28

Create a Dataset to Code

Any archive or bucket
Use the random sampling tool
Standard: All coders get all items
Triage: Coders get next uncoded item
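
The sampling step and the two assignment modes can be sketched roughly as follows. This is an illustration only, not DiscoverText's API: `sample_items`, `assign_standard`, and `TriageQueue` are hypothetical names, and the archive is placeholder data.

```python
import random


def sample_items(archive, k, seed=None):
    """Draw k items uniformly at random, without replacement."""
    return random.Random(seed).sample(archive, k)


def assign_standard(items, coders):
    """Standard mode: every coder receives every item."""
    return {coder: list(items) for coder in coders}


class TriageQueue:
    """Triage mode: each request hands out the next uncoded item, once."""

    def __init__(self, items):
        self._pending = list(items)

    def next_item(self):
        # Returns None when nothing is left to code.
        return self._pending.pop(0) if self._pending else None


# Placeholder archive and coder names, invented for this sketch.
archive = [f"tweet-{i}" for i in range(1000)]
dataset = sample_items(archive, 10, seed=42)

standard = assign_standard(dataset, ["ann", "bob"])
queue = TriageQueue(dataset)
first_for_ann = queue.next_item()
first_for_bob = queue.next_item()
```

Standard mode produces overlapping codes, which is what reliability checks need. Triage mode covers more items per coder-hour but yields no overlap to compare.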

SLIDE 29

Select from Three Coding Styles

Default: Mutually Exclusive Codes
Option 1: Non-Mutually Exclusive Codes
Option 2: User-Defined Codes (Grounded Theory)

SLIDE 30

Assign Peers to Code a Dataset

How many coders?
How many items need to be coded?
How many test or training sets?
There are no cookbook answers

SLIDE 31

Look at Inter-Rater Reliability

Highly reliable coding (easy tasks)
Unreliable coding (interesting tasks)
If humans can’t, neither can machines
Some tasks better suited for machines
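
The slide does not name a reliability statistic. As one common choice, the sketch below assumes Cohen's kappa for two coders, which corrects raw agreement for the agreement expected by chance. The labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each coder labeled at random with
    # their own observed label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up codes from two coders on eight items.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))  # 0.5: raw agreement 0.75, chance agreement 0.5
```

A low kappa on an "interesting" task is a warning that a classifier trained on those labels inherits the same disagreement.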

SLIDE 32

Adjudication: The Secret Sauce

Expert review or consensus process
Invalidate false positives
Identify strong and weak coders
Exclude false positives from training sets
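
A rough sketch of a consensus-style adjudication pass is shown below. It assumes majority vote stands in for the consensus process and that ties are routed to an expert. The function names, data layout, and coder names are hypothetical, and this is not a description of how DiscoverText implements adjudication.

```python
from collections import Counter


def majority_label(votes):
    """Return the majority label, or None on a tie (send to an expert)."""
    ranked = Counter(votes.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def adjudicate(codes):
    """Map each item to its adjudicated label (None means unresolved)."""
    return {item: majority_label(votes) for item, votes in codes.items()}


def coder_accuracy(codes, adjudicated):
    """Share of each coder's codes that survived adjudication."""
    hits, totals = Counter(), Counter()
    for item, votes in codes.items():
        gold = adjudicated[item]
        if gold is None:
            continue  # unresolved items do not count for or against anyone
        for coder, label in votes.items():
            totals[coder] += 1
            hits[coder] += label == gold
    return {coder: hits[coder] / totals[coder] for coder in totals}


def training_set(adjudicated):
    """Keep resolved items only, labeled with the adjudicated code."""
    return {item: gold for item, gold in adjudicated.items() if gold is not None}


# Made-up codes: item -> {coder: label}.
codes = {
    "t1": {"ann": "relevant", "bob": "relevant", "cat": "relevant"},
    "t2": {"ann": "relevant", "bob": "irrelevant", "cat": "irrelevant"},
    "t3": {"ann": "irrelevant", "bob": "irrelevant", "cat": "irrelevant"},
    "t4": {"ann": "relevant", "bob": "irrelevant"},  # tie, needs an expert
}
adjudicated = adjudicate(codes)
accuracy = coder_accuracy(codes, adjudicated)
training = training_set(adjudicated)
```

In this toy data, ann's "relevant" code on t2 is invalidated, so her accuracy drops and the training set carries the adjudicated label instead of her false positive.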

SLIDE 33

SLIDE 34

SLIDE 35

Use Classification Scores as Filters

Iteration plays a critical role
Train, classify, filter
Repeat until the model is trusted
Each round weeds out false positives
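
One round of train, classify, filter might look like the sketch below. The slides do not say which classifier is used, so a small multinomial Naive Bayes with add-one smoothing is assumed here purely for illustration. The threshold of 0.8, the label names, and the example texts are also invented.

```python
import math
from collections import Counter


def train_nb(examples):
    """Train on (tokens, label) pairs; return counts for a Naive Bayes model."""
    label_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in examples for w in tokens}
    return label_counts, word_counts, vocab


def score(model, tokens, target):
    """Posterior probability of `target`, using add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    log_post = {}
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        lp = math.log(n / total)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                lp += math.log((word_counts[label][w] + 1) / denom)
        log_post[label] = lp
    # Normalize in log space to avoid underflow.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return math.exp(log_post[target] - top) / norm


def filter_by_score(model, items, target, threshold):
    """Split items into confident hits and items needing human review."""
    keep, review = [], []
    for tokens in items:
        bucket = keep if score(model, tokens, target) >= threshold else review
        bucket.append(tokens)
    return keep, review


# Made-up adjudicated training data and uncoded items.
coded = [
    ("bridge traffic scandal".split(), "on_topic"),
    ("bridge closures aides".split(), "on_topic"),
    ("agatha christie murder".split(), "off_topic"),
    ("christie novel death".split(), "off_topic"),
]
uncoded = ["traffic scandal bridge".split(), "murder novel".split()]

model = train_nb(coded)
keep, review = filter_by_score(model, uncoded, "on_topic", threshold=0.8)
```

Items in `review` would go back to human coders, and their adjudicated labels would be added to `coded` before the next round. The loop stops when validation against held-out human codes is good enough to trust the model.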

SLIDE 36

Classifier Histograms: More Filtering

SLIDE 37

http://sifter.texifter.com

SLIDE 38
SLIDE 39

Thanks for Listening

Dr. Stuart Shulman
@stuartwshulman
stu@texifter.com
discovertext.com
sifter.texifter.com
