Spam Detection in Voice-over-IP Calls through Semi-Supervised Clustering

Yu-Sung Wu, Saurabh Bagchi (Purdue University, USA)
Ratsameetip Wita (Chulalongkorn University, Thailand)
Navjot Singh (Avaya Labs, USA)
Slide 1/29

Voice-over-IP (VoIP) Overview
• Session Initiation Protocol (SIP) or H.323 for signaling
• Real-time Transport Protocol (RTP) for media
• Media flow happens after a successful call setup, which is achieved through signaling
• RTP Control Protocol (RTCP) for feedback
• Other supporting protocols: DNS, DHCP, ICMP
Slide 2/29

Sample Call Flow in VoIP
Participants: A (Phone), S1 (Proxy), S2 (Proxy), B (Phone)
Messages, in order:
• INVITE F1, INVITE F2, 100 Trying F3, INVITE F4, 100 Trying F5
• 180 Ringing F6, 180 Ringing F7, 180 Ringing F8
• 200 OK F9, 200 OK F10, 200 OK F11
• ACK F12
• Media Session
• BYE F13, 200 OK F14
Slide 3/29

Outline
1. VoIP Overview
2. Challenges in VoIP Spam Detection
3. System Architecture
4. Semi-supervised Clustering
5. Efficient Clustering for Spam Detection: eMPCK-Means, pMPCK-Means
6. Call Trace and Experiments
7. Conclusions
Slide 4/29
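As a rough illustration of the sample call flow above, the following Python sketch replays the message sequence F1 to F14 and checks that the media session can only start after call setup has completed. The `CallState` names and the `replay` helper are hypothetical and do not come from the slides; the state machine is a deliberate simplification, not a full SIP implementation.

```python
from enum import Enum, auto


class CallState(Enum):
    IDLE = auto()
    INVITED = auto()
    RINGING = auto()
    ANSWERED = auto()
    ESTABLISHED = auto()   # media session may flow in this state
    CLOSING = auto()
    TERMINATED = auto()


# Message labels and types taken from the sample call flow slide.
SAMPLE_FLOW = [
    ("F1", "INVITE"), ("F2", "INVITE"), ("F3", "100 Trying"),
    ("F4", "INVITE"), ("F5", "100 Trying"),
    ("F6", "180 Ringing"), ("F7", "180 Ringing"), ("F8", "180 Ringing"),
    ("F9", "200 OK"), ("F10", "200 OK"), ("F11", "200 OK"),
    ("F12", "ACK"),
    ("F13", "BYE"), ("F14", "200 OK"),
]


def replay(flow):
    """Return the list of distinct states visited while replaying `flow`."""
    s = CallState
    state = s.IDLE
    history = [state]
    for label, msg in flow:
        if msg == "INVITE" and state in (s.IDLE, s.INVITED):
            state = s.INVITED
        elif msg == "100 Trying" and state is s.INVITED:
            pass  # provisional response, no state change
        elif msg == "180 Ringing" and state in (s.INVITED, s.RINGING):
            state = s.RINGING
        elif msg == "200 OK" and state in (s.INVITED, s.RINGING, s.ANSWERED):
            state = s.ANSWERED
        elif msg == "ACK" and state is s.ANSWERED:
            state = s.ESTABLISHED
        elif msg == "BYE" and state is s.ESTABLISHED:
            state = s.CLOSING
        elif msg == "200 OK" and state is s.CLOSING:
            state = s.TERMINATED
        else:
            raise ValueError(f"{label}: unexpected {msg!r} in state {state.name}")
        if history[-1] is not state:
            history.append(state)
    return history


if __name__ == "__main__":
    print([st.name for st in replay(SAMPLE_FLOW)])
```

A spam call would pass through exactly the same states, which is why signaling conformance alone cannot separate spam from legitimate calls.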

Spam Calls in VoIP Systems
• SPam over Internet Telephony (SPIT)
• Unsolicited and unwanted phone calls from (malicious) parties
  – Telemarketing calls
  – Harassing calls
  – Survey / polling calls
• Why is this a growing phenomenon?
  – VoIP calls are cheap to make
  – SPIT is very easy to automate
• Comparison with e-mail spam:
  – Motives and impacts are analogous
  – But, more disruptively, VoIP spam intrudes in real time
Slide 5/29

Challenges for Dealing with VoIP Spam
• A spam call in many ways appears like a normal (non-SPIT) call
  – Both follow the same protocols (SIP, H.323, RTP, RTCP)
  – No malformed packets
  – No exploitation of protocol vulnerabilities
  – Existing NIDS systems (Snort, SCIDIVE [1], …) do not apply
• VoIP is a real-time system
  – Before you pick up the call, can you tell if it's going to be a spam call?
[1] Y-S. Wu, S. Bagchi, S. Garg, N. Singh, T. Tsai, "SCIDIVE: A Stateful and Cross Protocol Intrusion Detection Architecture for Voice-over-IP Environments," DSN 05, pp. 401-410.
Slide 6/29

Challenges for Dealing with VoIP Spam
• A VoIP system is a dynamic environment
  – Call duration, call frequency, the words you say, … can all change from one deployment to another
  – Different persons have different perspectives on what constitutes a SPIT call
    • Some might be interested in buying merchandise from telemarketers while disliking other, harassing phone calls
  – Therefore, fixed threshold-based detection rules are not suitable for filtering spam calls
Slide 7/29

Contribution
• Identify features from a VoIP call for spam detection
• Clustering of VoIP calls to identify spam calls
• Use of user feedback and a semi-supervised clustering technique to differentiate between spam and legitimate calls
• Adapting the original MPCK-Means [2] algorithm into:
  – eMPCK-Means: an O(N) algorithm for clustering a batch of VoIP calls
  – pMPCK-Means: a real-time algorithm for detecting VoIP spam
[2] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering," in ICML, 2004, pp. 81-88.
Slide 8/29

System Architecture
[Figure: system architecture. Recoverable labels: SIP-based VoIP Proxy Server #1 and Server #2; Server-side Detector (Spit Detector); Client-side Detector (three instances); endpoints A, B, C, E, F and two spitters (S); an "Our Contribution" marker. Legend: normal user; S: spitter.]
Slide 10/29

VoIP Call Features
17 call features, extracted from VoIP signaling and media traffic, are used here for clustering.
Call phases covered: A. Call Establishment; B. Media Stream (RTP/RTCP) / Call Maintenance; C. Call Tear Down
1-2. From/To URI
3. Start time
4. Duration
5. # of SIP INVITE messages
6. # of SIP ACK messages
7-8. # of SIP BYE messages from caller/callee
9. Time since the last call from the originator of the current call
10-15. # of 1xx, 2xx, 3xx, 4xx, 5xx, and 6xx SIP response messages
16. Call frequency of the originator of the current call
17. Ratio of non-silence duration of the callee to the caller media streams
Slide 11/29
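The 17 call features listed above could be held in a record like the one sketched below. This is only an illustration: the class name `CallFeatures`, the field names, the units, and the choice to leave the two URIs out of the numeric vector are all assumptions, since the slides do not say how the features are encoded.

```python
from dataclasses import dataclass
from typing import List


def count_responses(status_codes: List[int]) -> List[int]:
    """Count SIP responses per class: [1xx, 2xx, 3xx, 4xx, 5xx, 6xx]."""
    counts = [0] * 6
    for code in status_codes:
        if 100 <= code <= 699:
            counts[code // 100 - 1] += 1
    return counts


@dataclass
class CallFeatures:
    from_uri: str                 # feature 1
    to_uri: str                   # feature 2
    start_time: float             # feature 3 (seconds; unit assumed)
    duration: float               # feature 4 (seconds; unit assumed)
    n_invite: int                 # feature 5
    n_ack: int                    # feature 6
    n_bye_caller: int             # feature 7
    n_bye_callee: int             # feature 8
    time_since_last_call: float   # feature 9
    n_responses: List[int]        # features 10-15, from count_responses()
    call_frequency: float         # feature 16
    non_silence_ratio: float      # feature 17 (callee / caller)

    def to_vector(self) -> List[float]:
        """Numeric features 3-17 only; URI encoding is left open here."""
        return [
            float(self.start_time), float(self.duration),
            float(self.n_invite), float(self.n_ack),
            float(self.n_bye_caller), float(self.n_bye_callee),
            float(self.time_since_last_call),
            *[float(n) for n in self.n_responses],
            float(self.call_frequency), float(self.non_silence_ratio),
        ]


if __name__ == "__main__":
    call = CallFeatures(
        from_uri="sip:a@example.com", to_uri="sip:b@example.com",
        start_time=0.0, duration=42.0, n_invite=1, n_ack=1,
        n_bye_caller=0, n_bye_callee=1, time_since_last_call=3600.0,
        n_responses=count_responses([100, 180, 200, 200]),
        call_frequency=0.5, non_silence_ratio=0.1,
    )
    print(call.to_vector())
```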

Basic Clustering
• Objective: Cluster calls into legitimate and spam calls
• Classic K-Means clustering: the following quantity is minimized
  \[ \sum_{j=1}^{K} \sum_{x_i \in X_j} \lVert x_i - \mu_j \rVert^2 \]
• The objective function puts weight on each feature evenly
• However, there may be only a few call features that can distinguish between the different clusters
• Putting equal weight on all the selected features can drown out the influence of these distinguishing features
Slide 13/29

Semi-supervised clustering
• MPCK-Means: \tau_{mpckm} is minimized, where
  \[ \tau_{mpckm} = \sum_{x_i \in \chi} \left( \lVert x_i - \mu_{l_i} \rVert^2_{A_{l_i}} - \log\det A_{l_i} \right)
     + \sum_{(x_i, x_j) \in M} w_{ij} \, f_M(x_i, x_j) \, \mathbb{1}[l_i \neq l_j]
     + \sum_{(x_i, x_j) \in C} w_{ij} \, f_C(x_i, x_j) \, \mathbb{1}[l_i = l_j] \]
  and
  \[ \lVert x_i - \mu_{l_i} \rVert^2_{A_{l_i}} = (x_i - \mu_{l_i})^T A_{l_i} (x_i - \mu_{l_i}) \]
• First term: distance from centroids (reweighted by the A matrix)
• Second term: cost from violating must-link constraints (pairs of data points which should be put in the same cluster)
• Third term: cost from violating cannot-link constraints (pairs of data points which should be put in different clusters)
Slide 14/29
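To make the three terms of the MPCK-Means objective above concrete, here is a minimal Python sketch that evaluates it for a given cluster assignment. Several things are assumptions rather than content from the slides: each A matrix is restricted to a diagonal (stored as a list of per-feature weights), all constraint weights w_ij share one value `weight`, and the penalty functions f_M and f_C are assumed to follow the definitions in [2], with the maximally separated pair found by brute force. The function names are hypothetical.

```python
import math
from itertools import combinations


def sq_dist(x, y, a):
    """Squared distance under a diagonal metric: sum_d a_d * (x_d - y_d)^2."""
    return sum(w * (p - q) ** 2 for w, p, q in zip(a, x, y))


def mpckm_objective(points, labels, centroids, metrics,
                    must_link, cannot_link, weight=1.0):
    total = 0.0
    # Term 1: reweighted distance to the centroid minus log det(A).
    # For a diagonal A, log det(A) is the sum of the log weights.
    for x, l in zip(points, labels):
        a = metrics[l]
        total += sq_dist(x, centroids[l], a) - sum(math.log(v) for v in a)
    # Term 2: must-link pairs that ended up in different clusters.
    for i, j in must_link:
        if labels[i] != labels[j]:
            f_m = (0.5 * sq_dist(points[i], points[j], metrics[labels[i]])
                   + 0.5 * sq_dist(points[i], points[j], metrics[labels[j]]))
            total += weight * f_m
    # Term 3: cannot-link pairs that ended up in the same cluster.
    for i, j in cannot_link:
        if labels[i] == labels[j]:
            a = metrics[labels[i]]
            # Brute-force maximally separated pair (quadratic; sketch only).
            max_sep = max(sq_dist(p, q, a) for p, q in combinations(points, 2))
            total += weight * (max_sep - sq_dist(points[i], points[j], a))
    return total


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    cents = [(0.0, 0.5), (10.0, 0.5)]
    mets = [(1.0, 1.0), (1.0, 1.0)]
    print(mpckm_objective(pts, [0, 0, 1, 1], cents, mets, [(0, 1)], [(0, 2)]))
    print(mpckm_objective(pts, [0, 1, 1, 1], cents, mets, [(0, 1)], [(0, 2)]))
```

In the demo, the second assignment breaks the must-link pair (0, 1) and so scores higher (worse) than the first.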

How to Update A matrix
• The A matrix A_h for cluster h is acquired by solving
  \[ \frac{\partial \tau_{mpckm}}{\partial A_h} = 0 \]
  which gives
  \[ A_h = |X_h| \Bigg( \sum_{x_i \in X_h} (x_i - \mu_h)(x_i - \mu_h)^T
     + \sum_{(x_i, x_j) \in M_h} \frac{1}{2} w_{ij} (x_i - x_j)(x_i - x_j)^T \, \mathbb{1}[l_i \neq l_j]
     + \sum_{(x_i, x_j) \in C_h} w_{ij} \Big( (x'_h - x''_h)(x'_h - x''_h)^T - (x_i - x_j)(x_i - x_j)^T \Big) \mathbb{1}[l_i = l_j] \Bigg)^{-1} \]
• First term: covariance of data points in cluster h
• Second term: cost from violating must-link constraints related to cluster h
• Third term: cost from violating cannot-link constraints related to cluster h
Slide 15/29
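The A matrix update above can be sketched in a few lines if A_h is restricted to a diagonal matrix, so that each feature d gets one weight and the matrix inverse becomes a per-feature reciprocal. That restriction is an assumption made to keep the example dependency-free; the slides use a full matrix. The function name, the single shared constraint weight, the caller-supplied `max_sep_pair` (standing in for x'_h and x''_h), and the `eps` guard against a non-positive denominator are likewise hypothetical.

```python
def update_diagonal_metric(h, points, labels, centroid,
                           must_link, cannot_link, max_sep_pair,
                           weight=1.0, eps=1e-9):
    """Return per-feature weights a_d for cluster h (diagonal A_h)."""
    members = [x for x, l in zip(points, labels) if l == h]
    far_a, far_b = max_sep_pair
    weights = []
    for d in range(len(centroid)):
        # Scatter of the cluster's own points along feature d.
        denom = sum((x[d] - centroid[d]) ** 2 for x in members)
        # Violated must-link constraints that touch cluster h.
        for i, j in must_link:
            if labels[i] != labels[j] and h in (labels[i], labels[j]):
                denom += 0.5 * weight * (points[i][d] - points[j][d]) ** 2
        # Violated cannot-link constraints inside cluster h.
        for i, j in cannot_link:
            if labels[i] == labels[j] == h:
                denom += weight * ((far_a[d] - far_b[d]) ** 2
                                   - (points[i][d] - points[j][d]) ** 2)
        # eps guard is an assumption of this sketch, not from the slides.
        weights.append(len(members) / max(denom, eps))
    return weights


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 10.0), (2.0, 20.0)]
    print(update_diagonal_metric(
        0, pts, [0, 0, 0], (1.0, 10.0), [], [], (pts[0], pts[2])))
```

With no violated constraints, the weight of a feature is inversely proportional to its scatter inside the cluster, so a tightly grouped feature counts for more than a widely spread one.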

Our Contribution: eMPCK-Means
• Batch mode of operation
• Improvement in runtime:
  – An O(N) approximation version of MPCK-Means
    • MPCK-Means is O(N^3)
  – O(N) complexity cluster initialization
    • Skip the pair-wise constraints (O(N^2))
    • Use the set of flagged spam calls, flagged legitimate calls, and the set of the rest of calls directly for cluster initialization
  – Efficient estimation of maximally separated points
    • Embed the estimation in the distance calculation
  – Use a constant number of constraints in the cluster assignment step
    • Experiment results from [2] suggest that MPCK-Means can work reasonably well with only a few constraints
Slide 17/29

Our Contribution: eMPCK-Means
• Improvement in clustering quality:
  – Pre-metrics update on the starting cluster(s)
    • Update the A matrix once before entering the main loop of MPCK-Means
    • Results in an initial A matrix which better reflects the user feedback information
    • In comparison, an identity matrix is used as the initial A matrix in MPCK-Means
Slide 18/29
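One plausible reading of the O(N) cluster initialization described above is a single pass that groups calls by their feedback flag and averages each group into a starting centroid. The sketch below follows that reading; the function name `init_centroids`, the `feedback` mapping, and the group labels "spam", "legit", and "unlabeled" are assumptions, and the slides do not spell out the actual procedure.

```python
def init_centroids(points, feedback):
    """Single pass over the calls: one starting centroid per feedback group.

    points   -- list of numeric feature vectors, one per call
    feedback -- dict mapping call index to "spam" or "legit";
                calls without an entry fall into "unlabeled"
    """
    sums, counts = {}, {}
    for idx, x in enumerate(points):
        group = feedback.get(idx, "unlabeled")
        if group not in sums:
            sums[group] = [0.0] * len(x)
            counts[group] = 0
        for d, value in enumerate(x):
            sums[group][d] += value
        counts[group] += 1
    # No pair-wise constraint processing is needed, so the cost is linear in N.
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


if __name__ == "__main__":
    calls = [(1.0, 1.0), (3.0, 3.0), (10.0, 10.0), (20.0, 20.0), (5.0, 5.0)]
    flags = {0: "legit", 1: "legit", 2: "spam"}
    print(init_centroids(calls, flags))
```

The same flagged groups could also feed the one-time A matrix update before the main loop, so that the starting metric reflects the user feedback instead of the identity matrix.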
