Classifying the Terms of Service: Capstone Presentation | Sam Beardsworth



SLIDE 1

Classifying the Terms of Service

Capstone Presentation | Sam Beardsworth

SLIDE 2

SLIDE 3

Goal

Build a model to make Terms of Service easier to read

How?

  • Identify the content
  • Extract the meaning
  • Highlight important terms
SLIDE 4

Approach

No shortage of data: it's literally on every website. But how do we make sense of it?

Answer:

Use a pre-classified dataset (courtesy of ToS;DR)

SLIDE 5

ToS;DR

  • started in June 2012
  • aims to review and score the Terms of Service policies of major web services
  • users can look up terms through the website / browser extension
  • public, transparent, community-driven
  • volunteer project
SLIDE 6

SLIDE 7

Data Gathering

API: broken, but the same information was available via public repos.

Additional challenges:

  • ToS;DR has had 3 incarnations
  • the API only has good data for incarnation #2
  • scrape all 3 and merge by ID

Some manual cleaning was needed.
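The merge-by-ID step above can be sketched with pandas. The frames and column names here are illustrative stand-ins for the three scraped incarnations, not the actual repo schemas:

```python
import pandas as pd

# Hypothetical extracts from the three ToS;DR incarnations
v1 = pd.DataFrame({"id": [1720, 1311], "quote": ["We use cookies...", None]})
v2 = pd.DataFrame({"id": [1311, 2261], "topic": ["Content", "Right to leave"]})
v3 = pd.DataFrame({"id": [1720, 2261], "point": ["bad", "good"]})

# Outer-merge on ID so a record present in any incarnation survives;
# gaps left behind are what the manual cleaning pass then fixes
merged = v1.merge(v2, on="id", how="outer").merge(v3, on="id", how="outer")
print(merged.sort_values("id").to_string(index=False))
```

An outer merge (rather than inner) matters here: the whole point of scraping all three incarnations is to keep records that only one of them has.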

SLIDE 8

Dataset

  • 1,688 observations (extracts)
  • mean length: 65 words
  • max length: 1,410 words!
  • 107,340 words total / 6,469 unique
  • 17 columns; 9 discarded as purely administrative

SLIDE 9

Dataset

ID    Status    Service   Source         Quote                                        Topic           Case                                 Point
1720  pending   facebook  Cookie Policy  'We use cookies to help us show ads...'      Tracking        Personal data used for advertising   bad
1311  approved  nokia     T&C            'Except as set forth in the Privacy          Content         Service retains deleted content      bad
                                         Policy...'
2261  approved  whatsapp  NA             'When you delete your WhatsApp account...'   Right to leave  Data deleted after account closure   good

Unique values (Service / Topic / Case / Point): 179 / 22 / 143 / 4

SLIDE 10

Dataset: Filling the gaps

SLIDE 11

EDA

SLIDE 12

SLIDE 13

SLIDE 14

Lemmatization
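The slide shows only the heading; the deck doesn't say which lemmatizer was used (NLTK's WordNetLemmatizer and spaCy are the usual candidates). The idea itself, collapsing inflected forms to a base form before vectorising, can be shown with a minimal lookup-based sketch:

```python
# Toy illustration of lemmatization: map inflected forms to a base form.
# Real pipelines use NLTK's WordNetLemmatizer or spaCy; this table is a stand-in.
LEMMA_TABLE = {
    "cookies": "cookie", "services": "service",
    "collected": "collect", "using": "use", "terms": "term",
}

def lemmatize(tokens):
    """Lowercase each token and replace it with its base form when known."""
    return [LEMMA_TABLE.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize(["We", "use", "cookies", "collected", "from", "services"]))
# ['we', 'use', 'cookie', 'collect', 'from', 'service']
```

The payoff for a small corpus like this one (6,469 unique words) is a tighter vocabulary: 'cookie' and 'cookies' become one feature instead of two.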

SLIDE 15

Topic Exploration

  • 22 topics, imbalanced
  • dropped topics with <25 observations
  • remember to balance during classification
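The rare-topic filter above is a one-liner in pandas. A minimal sketch, with a hypothetical extract table standing in for the real dataset:

```python
import pandas as pd

# Hypothetical 'topic' column; counts chosen to straddle the 25-observation cut
df = pd.DataFrame({"topic": ["Tracking"] * 30 + ["Content"] * 26 + ["Jurisdiction"] * 5})

# Keep only topics with at least 25 observations, as on the slide
counts = df["topic"].value_counts()
keep = counts[counts >= 25].index
filtered = df[df["topic"].isin(keep)]

print(sorted(filtered["topic"].unique()))  # ['Content', 'Tracking']
```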

SLIDE 16

SLIDE 17

Modelling

  • 19 topics
  • baseline accuracy: 0.117
  • 70-30 train-test split, stratified by topic
  • basic, untuned logistic regression
  • test accuracy: 0.615
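The baseline pipeline on the slide (stratified 70-30 split, untuned logistic regression) can be sketched as follows. The corpus here is a toy stand-in for the 1,688 extracts, so the accuracy it prints is not the deck's 0.615:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy stand-in corpus; labels mirror the dataset's topic column
texts = (["we use cookies to show ads"] * 20
         + ["you may delete your account at any time"] * 20
         + ["we may change these terms without notice"] * 20)
topics = ["Tracking"] * 20 + ["Right to leave"] * 20 + ["Changes to Terms"] * 20

# 70-30 split, stratified by topic, as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    texts, topics, test_size=0.3, stratify=topics, random_state=42)

# Bag-of-words features + basic, untuned logistic regression
vec = CountVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
acc = accuracy_score(y_test, clf.predict(vec.transform(X_test)))
print(f"test accuracy: {acc:.3f}")
```

The 0.117 baseline on the slide is what you get by always predicting the most common of the 19 topics; any model worth keeping has to beat that number.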

SLIDE 18

Improving the score

  • TF-IDF to reduce the feature importance of common words
  • imblearn's RandomOverSampler to reduce class imbalance in the training set
  • GridSearchCV for optimal Logistic Regression hyperparameters

Improved test accuracy: 0.641

SLIDE 19

Beyond Logistic Regression

The sklearn 'try everything' approach... optimised with GridSearch.
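The 'try everything' loop might look like the sketch below: each candidate estimator gets its own small grid, and GridSearchCV reports the best cross-validated score per model. The candidates and grids here are assumptions; the deck only names the approach:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Toy stand-in corpus; the real deck used the ToS extracts
texts = (["we use cookies to show ads"] * 20
         + ["you may delete your account at any time"] * 20
         + ["we may change these terms without notice"] * 20)
topics = ["Tracking"] * 20 + ["Right to leave"] * 20 + ["Changes to Terms"] * 20

X = TfidfVectorizer().fit_transform(texts)

# One small hyperparameter grid per candidate estimator
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5]}),
    "linearsvc": (LinearSVC(), {"C": [0.1, 1, 10]}),
}
scores = {name: GridSearchCV(est, grid, cv=3).fit(X, topics).best_score_
          for name, (est, grid) in candidates.items()}
print(scores)
```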

SLIDE 20

Model Comparison

SLIDE 21

Alternative Models

word2vec

  • 3.5 GB dictionary pre-trained on news articles
  • applied to pre-lemmatized tokens (corpus)
  • performed differently, but worse overall
  • accuracy score: 0.613

Principal Component Analysis / SVD

  • explanatory value relatively low
  • 19% across PC1-2, 37% across PC1-10
SLIDE 22

Alternative Models

Latent Dirichlet Allocation (LDA): "a technique to extract the hidden topics from large volumes of text... The challenge is how to extract good quality of topics that are clear, segregated and meaningful"

Some themes:

  • Consistently identified 'virtual currency' as a topic
  • Change and modification
  • Damage and waiver
SLIDE 23

LDA

Heatmap comparing unsupervised sorting into 19 topics versus human-classified topics

SLIDE 24

Where from here?

SLIDE 25

Quiz

You agree to provide Grammarly with accurate and complete registration information and to promptly notify Grammarly in the event of any changes to any such information.

Anonymity & Tracking    Personal Data
        ???                  ???

SLIDE 26

Quiz

You agree to provide Grammarly with accurate and complete registration information and to promptly notify Grammarly in the event of any changes to any such information.

Anonymity & Tracking    Personal Data
       Human                Model

SLIDE 27

Quiz

Nothing here should be considered legal advice. We express our opinion with no guarantee and we do not endorse any service in any way. Please refer to a qualified attorney for legal advice.

Governance    Guarantee
   ???           ???

SLIDE 28

Quiz

Nothing here should be considered legal advice. We express our opinion with no guarantee and we do not endorse any service in any way. Please refer to a qualified attorney for legal advice.

Governance    Guarantee
  Model          Human

SLIDE 29

Quiz

For revisions to this Privacy Policy that may be materially less restrictive on our use or disclosure of personal information you have provided to us, we will make reasonable efforts to notify you and obtain your consent before implementing revisions with respect to such information.

Personal Data    Changes to Terms
     ???               ???

SLIDE 30

Quiz

For revisions to this Privacy Policy that may be materially less restrictive on our use or disclosure of personal information you have provided to us, we will make reasonable efforts to notify you and obtain your consent before implementing revisions with respect to such information.

Personal Data    Changes to Terms
    Model              Human

SLIDE 31

Practical Application

Unfavourable Terms

  • classifying into good and bad
SLIDE 32

Extract Review

SLIDE 33

Model Performance

Same approach as before. Best performer:

  • K-Nearest Neighbours: 0.71

What if we focus solely on unfavourable terms?

SLIDE 34

Predicting Unfavourable Terms

  • Do people really care about good or neutral statements?
  • The real value is in being able to highlight potentially unfavourable terms.

Reclassify:

  • Good + Neutral = Neutral
  • Bad = Warning
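The reclassification above is a straightforward label mapping; with the points held in a pandas Series it is one `map` call:

```python
import pandas as pd

# Original three-way point labels from the dataset
points = pd.Series(["good", "neutral", "bad", "bad", "good"])

# Collapse to binary labels: good/neutral -> 'neutral', bad -> 'warning'
labels = points.map({"good": "neutral", "neutral": "neutral", "bad": "warning"})
print(labels.tolist())  # ['neutral', 'neutral', 'warning', 'warning', 'neutral']
```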
SLIDE 35

SLIDE 36

Binary Classification

Improved performance. Best performers:

  • K-Nearest Neighbours: 0.75
  • LinearSVC: 0.76

Additional benefit: the ability to tune the model to correctly predict more warning statements at the expense of more 'false' warnings.
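That recall/precision trade-off is typically made by moving the decision threshold on the classifier's predicted probabilities rather than retraining. A minimal sketch on synthetic two-class data (features and thresholds are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy 2-D features; class 1 = 'warning', deliberately the minority
X = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(1.5, 1, (20, 2))])
y = np.array([0] * 80 + [1] * 20)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Default 0.5 threshold vs. a lower one that flags more warnings
# (more true warnings caught, at the price of more false alarms)
default_flags = int((proba >= 0.5).sum())
eager_flags = int((proba >= 0.2).sum())
print(default_flags, eager_flags)
```

For a ToS reader, over-flagging is arguably the right failure mode: a spurious warning costs the user a few seconds, while a missed one defeats the tool's purpose.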

SLIDE 37

SLIDE 38

SLIDE 39

Evaluation / Next Steps

There are three areas for next steps:

  • 1. Build a proof of concept for an end-user classification tool
  • 2. Improve the model
  • 3. Bring in subject-matter expertise
SLIDE 40

Questions?