Longtime Behavior of Harvesting Spam Bots (Oliver Hohlfeld, TU Berlin)



slide-1
SLIDE 1

Longtime Behavior of Harvesting Spam Bots

Oliver Hohlfeld

TU Berlin / DT Labs

Thomas Graf

Modas GmbH

Florin Ciucu

TU Berlin / DT Labs

Oliver Hohlfeld (TU Berlin / DT Labs) Longtime Behavior of Harvesting Spam Bots IMC’12 1 / 10

slide-4
SLIDE 4

Why you?

Scope: Address harvesting from public web sites

Image source: http://www.flickr.com/photos/twistermc/3382403844/ (CC BY-SA 2.0)

slide-8
SLIDE 8

Approach

Figure: data flow between harvesters, spammers, and our measurement infrastructure

Our infrastructure: 9 web sites (1 US), SMTP servers, database

Address harvester: a web crawler fetches addresses from the web sites over HTTP

Harvester and spammer: addresses are traded for money

Spammer: might pay a botmaster to send spam e-mail through a botnet; the spam arrives at our SMTP servers

Database: [...] years [...] data
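The slides do not spell out how a received spam e-mail is linked back to the page request that exposed the address. One common honeypot design, assumed here purely for illustration, is to embed a unique trap address in every served page and log who requested it. The domain `trap.example.org`, the function names, and the in-memory `issued` dictionary (a stand-in for the "Database" box) are all hypothetical.

```python
import secrets
import time

# Hypothetical in-memory stand-in for the "Database" box in the diagram.
issued = {}


def issue_address(client_ip, user_agent, domain="trap.example.org"):
    """Create a unique trap address for one page request and log the requester."""
    # 16 random hex characters; collisions are practically impossible.
    local_part = secrets.token_hex(8)
    address = f"{local_part}@{domain}"
    issued[address] = {
        "ip": client_ip,
        "user_agent": user_agent,
        "served_at": time.time(),
    }
    return address


def lookup_harvester(recipient):
    """Map a spam recipient address back to the logged page request, if any."""
    return issued.get(recipient)
```

With such a log, every spam message arriving at the SMTP servers can be attributed to a requesting IP, a user agent string, and a harvest time, which is the kind of data the later slides summarize.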

slide-17
SLIDE 17

Host Properties

How many harvesting hosts? > 1k

Geolocation?

Figure: by requesting IPs (y-axis: # distinct IPs; countries: DE, US, GB, CN, NL, ES, CI, RO, TW, MY; bar annotations: 2%, 60.6%)

Figure: by spam volume (y-axis: spam e-mails; countries: RO, BG, DE, NL, CN, US, PL, VN, PT, CH; bar annotations: 46%, 26%, 10%, "1 Host")

24 massive harvesting hosts in Romania (≈ 10k page requests / day)

How are they connected? 73% hosted in ADSL / cable networks

Using Tor Anonymity Service? No
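The two figures rank countries in different ways: once by the number of distinct requesting IPs and once by spam volume. A minimal sketch of both aggregations follows; the record layout and the sample values are hypothetical and are not taken from the study's data set.

```python
from collections import Counter

# Hypothetical log records:
# (requesting IP, country code, spam e-mails attributed to this request)
records = [
    ("192.0.2.10", "DE", 3),
    ("192.0.2.11", "DE", 1),
    ("198.51.100.7", "RO", 250),
    ("203.0.113.5", "US", 12),
    ("192.0.2.10", "DE", 2),  # same IP seen again, counted once as a host
]


def distinct_ips_by_country(rows):
    """Count each requesting IP once, grouped by country."""
    seen = set()
    counts = Counter()
    for ip, country, _spam in rows:
        if ip not in seen:
            seen.add(ip)
            counts[country] += 1
    return counts


def spam_volume_by_country(rows):
    """Sum the spam e-mails attributed to each country."""
    volume = Counter()
    for _ip, country, spam in rows:
        volume[country] += spam
    return volume


def shares(counter):
    """Convert absolute counts into fractions of the total."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}
```

In the sample records, DE leads by distinct IPs while RO leads by spam volume, which mirrors how a few hosts can dominate one ranking without dominating the other.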

slide-23
SLIDE 23

Blocking

Does blacklisting help? → Yes (26% of hosts blacklisted at access time)

HTTP User Agent String Fingerprinting?

Variability might imply only a few active parties

“Java/1.6.0_17” UA:

3% of harvesting hosts

88% of harvesting page requests

55% of total spam volume

99.9% of Romanian harvesting bots

→ Blocking certain user agent strings currently helps
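The two blocking options on this slide can be sketched as follows. The slides do not name the blacklist that was consulted, so the zone `dnsbl.example.org` is a placeholder, and blocking every user agent that starts with `Java/` is an illustrative policy rather than the authors' recommendation. The query-name construction follows the usual DNSBL convention of reversing the IPv4 octets and appending the list's zone.

```python
import ipaddress

# Assumption: block the default user agent of Java HTTP clients by prefix.
BLOCKED_UA_PREFIXES = ("Java/",)


def dnsbl_query_name(ip, zone="dnsbl.example.org"):
    """Build the DNS name to resolve when checking an IPv4 address against a DNSBL."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_blocked_user_agent(user_agent):
    """Return True if the user agent string matches a blocked prefix."""
    return (user_agent or "").startswith(BLOCKED_UA_PREFIXES)
```

A web server would resolve the returned name at request time; by convention a listed address yields an A record and an unlisted one yields NXDOMAIN. Prefix matching on the user agent is cheap but trivially evaded, which fits the slide's wording that it "currently" helps.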

slide-25
SLIDE 25

Proxies Revisited: Search Engines

Search engines exploited for malicious activities

Also used by harvesters?

slide-29
SLIDE 29

Proxies Revisited: Search Engines

Figure: the address harvester's web crawler (ECrawl, ...) obtains addresses from our infrastructure (9 web sites, 1 US) over HTTP, with a search engine in between

ECrawl v2.63: “Access to the Google cache (VERY fast harvesting)”

Fast Email Harvester 1.2: “collector supports all major search engines, such as Google, Yahoo, MSN”

slide-31
SLIDE 31

Proxies Revisited: Search Engines

Figure: the address harvester's web crawler (ECrawl, ...) obtains addresses from our infrastructure (9 web sites, 1 US) over HTTP, with a search engine in between

0.5% of addresses spammed

0.2% of total spam

→ You don’t want to block Google!

slide-33
SLIDE 33

Address Usage

Figure: CDF of address turnaround time in days (x-axis ticks from 0.1 to 1000) for the full data set and for search engines; annotation: 70% < 11 days

50% spammed < 4 days (general), 11 days (search engines)

Usage period:

< 1 second: 11% (general), 16% (search engines)

< 1 day: 17% (general), 40% (search engines)

< 1 week: 78% (general), 53% (search engines)
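A statement such as "50% spammed < 4 days" is a point on an empirical CDF. The sketch below assumes that turnaround time means the delay between an address being harvested and the first spam sent to it; the slide does not define the term, and the sample timestamps are hypothetical.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400


def turnaround_days(harvested_at, first_spam_at):
    """Delay between harvesting an address and its first spam, in days."""
    return (first_spam_at - harvested_at).total_seconds() / SECONDS_PER_DAY


def fraction_below(values, threshold):
    """Empirical CDF at `threshold`: share of values strictly below it."""
    if not values:
        raise ValueError("need at least one value")
    return sum(v < threshold for v in values) / len(values)


# Hypothetical (harvest time, first spam time) pairs.
samples = [
    (datetime(2012, 1, 1), datetime(2012, 1, 2)),   # 1 day
    (datetime(2012, 1, 1), datetime(2012, 1, 4)),   # 3 days
    (datetime(2012, 1, 1), datetime(2012, 1, 11)),  # 10 days
    (datetime(2012, 1, 1), datetime(2012, 3, 1)),   # 60 days (leap year)
]
days = [turnaround_days(harvested, spammed) for harvested, spammed in samples]
```

Evaluating `fraction_below(days, 4)` and `fraction_below(days, 11)` on the sample data gives the two kinds of figures quoted on the slide, here for made-up values only.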

slide-40
SLIDE 40

Webmaster's Dilemma: Address Presentation

Data set           MTO     TXT    OBF     JS    FRM     CMT
General data set   40.5%   31%    7%      -     2.5%    19%
Search engines     61%     38%    0.6%    -     0.4%    0%

(No value is shown for JS; the shares listed in each row already sum to 100%.)

MTO: user-friendly mailto link: mailto:john.doe@imc.conf

TXT: plain text: john.doe@imc.conf

OBF: obfuscated text: john [dot] doe [at] imc [dot] conf

JS: JavaScript code

FRM: HTML form

CMT: HTML comment

→ Simple obfuscation methods (OBF, JS) still suffice
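The OBF presentation can be produced mechanically from a plain address. The helper below is a minimal sketch that reproduces the slide's example format; the function names are illustrative, and the MTO helper is included only for contrast with the most heavily harvested form.

```python
def obfuscate(address):
    """Render an address in the OBF style: 'john [dot] doe [at] imc [dot] conf'."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an e-mail address: {address!r}")
    # Replace '@' first, then every dot in both the local part and the domain.
    return f"{local} [at] {domain}".replace(".", " [dot] ")


def mailto_link(address):
    """Render an address in the MTO style, the form harvested most often."""
    return f'<a href="mailto:{address}">{address}</a>'
```

For example, `obfuscate("john.doe@imc.conf")` yields the string shown under OBF above. The trade-off named in the slide title remains: the obfuscated form is harder for harvesters to parse, but visitors must rewrite it by hand.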

slide-43
SLIDE 43

Conclusions

Obfuscate your e-mail addresses!

User agent filtering can help

Search engines used as proxies

Possibly only a few active harvesters operating at different spam volumes

Future Work

Campaign analysis

How many harvesting parties exist?

We thank all the anonymous spammers and harvesters for making this study possible.

Need more stats? Download the data: http://ohohlfeld.com/harvesting.html