Classifying the Terms of Service
Capstone Presentation | Sam Beardsworth
Goal
Build a model to make Terms of Service easier to read
How?
Identify the content
Extract the meaning
Highlight important terms
Approach
No shortage of data: it's literally on every website.
But how to make sense of it?
Use a pre-classified dataset (courtesy of ToS;DR)
web services
extension
API: broken, but was able to obtain the same info via public repos
Additional challenges
Some manual cleaning needed
1688 observations (extracts)
mean length: 65 words
max length: 1410 words!
107340 words total / 6469 unique
17 columns, of which 9 were discarded as purely administrative
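The summary figures above can be reproduced with a few lines of plain Python. This is a minimal sketch on three made-up extracts, assuming whitespace tokenisation and lower-casing; the `extracts` list and the `summarise` helper are hypothetical names, not part of the original project.

```python
# Toy stand-ins for the quote column; the real dataset has 1688 extracts.
extracts = [
    "We use cookies to help us show ads",
    "When you delete your account your data is deleted",
    "We may change these terms at any time",
]


def summarise(texts):
    """Return word-count statistics for a list of text extracts."""
    # Assumption: words are split on whitespace and compared case-insensitively.
    tokenised = [text.lower().split() for text in texts]
    lengths = [len(tokens) for tokens in tokenised]
    vocabulary = {word for tokens in tokenised for word in tokens}
    return {
        "observations": len(texts),
        "mean_length": sum(lengths) / len(lengths),
        "max_length": max(lengths),
        "total_words": sum(lengths),
        "unique_words": len(vocabulary),
    }


stats = summarise(extracts)
print(stats)
```

A different tokeniser (for example one that strips punctuation) would give slightly different totals, so the exact counts depend on that choice.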
ID | Status | Service | Source | quote | Topic | Case | Point
1720 | pending | facebook | Cookie Policy | 'We use cookies to help us show ads...' | Tracking | Personal data used for advertising | bad
1311 | approved | nokia | T&C | 'Except as set forth in the Privacy Policy...' | Content | Service retains deleted content | bad
2261 | approved | whatsapp | NA | 'When you delete your WhatsApp account...' | Right to leave | Data deleted after account closure | good
unique: 179 22 143 4
classification
19 topics
Baseline accuracy: 0.117
70-30 train-test split, stratified by topic
Basic, untuned logistic regression
Test accuracy: 0.615
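The baseline, split and first model can be sketched as follows. This is an illustration on synthetic data, not the project's code: it assumes the baseline is the share of the largest topic, assumes TF-IDF features (the slides do not name the vectoriser), and uses made-up sentences for three topics instead of the real 19.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 10 extracts for each of 3 topics.
texts, topics = [], []
for i in range(10):
    texts.append(f"we use cookies to track visit {i} for ads")
    topics.append("Tracking")
    texts.append(f"you may delete account {i} and leave the service")
    topics.append("Right to leave")
    texts.append(f"we may change terms {i} without notice")
    topics.append("Changes to Terms")

# Baseline (assumed definition): always predict the most common topic.
baseline = Counter(topics).most_common(1)[0][1] / len(topics)

# 70-30 split, stratified so each topic keeps its share in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, topics, test_size=0.3, stratify=topics, random_state=0
)

# Logistic regression with default regularisation; max_iter is raised
# only so the solver converges cleanly.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"baseline={baseline:.3f} test accuracy={test_accuracy:.3f}")
```

On the toy corpus the classes are trivially separable, so the score says nothing about the real task; the point is only the order of operations.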
reduce class imbalance in the training set
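The slides do not say how the imbalance was reduced. One common option is random oversampling of the smaller classes, sketched here with the standard library only; the `oversample` helper and the toy labels are hypothetical, and the actual project may have used a different technique.

```python
import random
from collections import Counter, defaultdict


def oversample(texts, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(rows) for rows in by_label.values())

    out_texts, out_labels = [], []
    for label, rows in by_label.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for text in rows + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy training set: 4 extracts in one topic, 2 in another.
train_texts = ["a", "b", "c", "d", "e", "f"]
train_labels = ["Personal Data"] * 4 + ["Governance"] * 2

balanced_texts, balanced_labels = oversample(train_texts, train_labels)
print(Counter(balanced_labels))
```

Resampling is applied to the training set only, so duplicated rows never leak into the test set.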
Regression hyperparameters
Improved test accuracy: 0.641
The sklearn 'try everything' approach...
...optimised with GridSearch
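A 'try everything' search might look like the sketch below, which swaps several scikit-learn classifiers into one pipeline and lets `GridSearchCV` pick the best by cross-validated accuracy. The choice of classifiers, the parameter values and the toy corpus are all assumptions for illustration; the slides do not list what was actually tried.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 10 extracts for each of 3 topics.
texts, topics = [], []
for i in range(10):
    texts.append(f"we use cookies to track visit {i} for ads")
    topics.append("Tracking")
    texts.append(f"you may delete account {i} and leave the service")
    topics.append("Right to leave")
    texts.append(f"we may change terms {i} without notice")
    topics.append("Changes to Terms")

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),  # placeholder, replaced by the grid
])

# Each dict is one family of models with its own hyperparameters.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1, 10]},
    {"clf": [LinearSVC()], "clf__C": [0.1, 1, 10]},
    {"clf": [MultinomialNB()], "clf__alpha": [0.1, 1.0]},
]

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, topics)

print(search.best_params_)
print(round(search.best_score_, 3))
```

In a real run the search is fitted on the training portion only, with the held-out test set used once at the end.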
word2vec
Accuracy score: 0.613
Principal Component Analysis / SVD
Latent Dirichlet Allocation (LDA)
"a technique to extract the hidden topics from large volumes of text... The challenge is how to extract good quality of topics that are clear, segregated and meaningful"
Some themes:
Heatmap comparing unsupervised sorting into 19 topics, versus human-classified topics
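The matrix behind such a heatmap is a contingency table: human topic on one axis, unsupervised cluster on the other, counts in the cells. A minimal sketch with made-up labels follows; the `contingency` helper and the five example rows are hypothetical.

```python
from collections import Counter

# Hypothetical labels: human-assigned topic and unsupervised cluster id per extract.
human = ["Tracking", "Tracking", "Personal Data", "Personal Data", "Governance"]
cluster = [0, 0, 1, 0, 2]


def contingency(rows, cols):
    """Count how often each (row label, column label) pair occurs."""
    counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


row_labels, col_labels, table = contingency(human, cluster)
for label, row in zip(row_labels, table):
    print(f"{label:15s} {row}")
```

The resulting `table` can be handed to a plotting function such as `seaborn.heatmap` to draw the figure; a clean diagonal-like pattern would mean the unsupervised clusters line up with the human topics.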
You agree to provide Grammarly with accurate and complete registration information and to promptly notify Grammarly in the event of any changes to any such information.
Anonymity & Tracking or Personal Data?
Human: Anonymity & Tracking
Model: Personal Data
Nothing here should be considered legal advice. We express our
any way. Please refer to a qualified attorney for legal advice.
Governance or Guarantee?
Human: Guarantee
Model: Governance
For revisions to this Privacy Policy that may be materially less restrictive on our use or disclosure of personal information you have provided to us, we will make reasonable efforts to notify you and obtain your consent before implementing revisions with respect to such information.
Personal Data or Changes to Terms?
Human: Changes to Terms
Model: Personal Data
Same approach as before
Best performer:
What if we focus solely on unfavourable terms?
unfavourable terms Reclassify:
Improved performance
Best performers:
Additional benefit: the model can be tuned to correctly predict more warning statements, at the expense of more 'false' warnings.
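This trade-off is usually controlled by the decision threshold on the model's predicted probability. The sketch below uses invented scores to show that lowering the threshold catches more unfavourable terms (higher recall) while raising more false warnings (lower precision); the `precision_recall` helper and all numbers are illustrative assumptions.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when flagging every score at or above the threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, t in zip(predicted, y_true) if p and t)
    fp = sum(1 for p, t in zip(predicted, y_true) if p and not t)
    fn = sum(1 for p, t in zip(predicted, y_true) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical data: 1 = unfavourable term, scores = predicted probability of 1.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold} precision={precision:.2f} recall={recall:.2f}")
```

With a scikit-learn classifier the scores would typically come from `predict_proba`, and the threshold would be chosen on validation data rather than the test set.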
There are three areas for next steps:
tool