Enabling Access to Old Wu-Tang Clan Fan Sites: Facilitating Interdisciplinary Web Archive Collaboration



SLIDE 1

Enabling Access to Old Wu-Tang Clan Fan Sites

Facilitating Interdisciplinary Web Archive Collaboration

Nick Ruest (@ruebot) Ian Milligan (@ianmilligan1)

SLIDE 2

Why should we even care about web archives?

SLIDE 3

First, more data than ever before is being preserved...

SLIDE 4

Second, it’ll be saved and delivered to us in very different ways

SLIDE 5

WARC (ISO 28500:2009)

SLIDE 6
SLIDE 7
SLIDE 8
SLIDE 9

Scarcity Abundance

SLIDE 10
SLIDE 11

Could one study the 1990s or beyond without web archives?

SLIDE 12

And the 1990s are history (as painful as it is to say...)

SLIDE 13

But right now you have to use the Wayback Machine, which requires that you already know the URL!

SLIDE 14
SLIDE 15
SLIDE 16
SLIDE 17

We need interdisciplinary collaboration to tackle this problem!

SLIDE 18
SLIDE 19

Team(s)

We form like Voltron

SLIDE 20

WARCS RULE EVERYTHING AROUND ME (US!)

SLIDE 21

Ian Milligan

History Faculty Member

SLIDE 22

Jimmy Lin

Computer Science Faculty Member

SLIDE 23

Jeremy Wiebe

History PhD Candidate

SLIDE 24

Alice Zhou

Computer Science Undergraduate

SLIDE 25

Nick Ruest

Digital Assets Librarian

SLIDE 26

Collaboration

My beats travel like a vortex, through your spine to the top of your cerebrum cortex #Slack & GitHub

SLIDE 27

Platforms

Every time the horn blows, the Wu's signal's back on Transform, pack form a whole another platform

SLIDE 28

Shine

https://github.com/ukwa/shine/

SLIDE 29

Shine

SLIDE 30

webarchives.ca

SLIDE 31

CLI tools

awk, sed, grep, parallel, sort, uniq, wc, jq

SLIDE 32

Geocities

SLIDE 33
SLIDE 34

Warcbase

SLIDE 35

Warcbase

  • An open-source platform for managing web archives
  • Two main components
      ○ A flexible data store: your own Wayback Machine
      ○ Scriptable analytics and data processing

SLIDE 36
SLIDE 37

Warcbase

  • Scalable
      ○ From Raspberry Pi to desktop computer to server to cluster, all with the same scripts and commands
  • Potentially very powerful
      ○ Trantor: 1.2PB of disk, 25 compute nodes (each with 128GB memory and 2×6-core Intel Xeon E5 v3), for a total of 3.2TB memory and 300 current-generation Intel cores
  • In active development, led by Jimmy Lin, collaborator with the Web Archives Historical Research Group

SLIDE 38

You can Warcbase Too! (...and Twarcbase soon!)

warcbase.org docs.warcbase.org

SLIDE 39

Let’s do a quick walkthrough of how we’ve used it on GeoCities

SLIDE 40
SLIDE 41
SLIDE 42
SLIDE 43

Extracting all URLs

Results = 186,761,346 URLs, 9.9GB text file
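A URL list like this can be summarized with the CLI tools named earlier. The following is only a sketch: it assumes the text file holds one URL per line, the sample URLs and street addresses are made up for illustration, and `count_neighbourhoods` is a hypothetical helper name, not part of Warcbase.

```shell
# Count URLs per GeoCities neighbourhood; reads one URL per line on stdin.
# Splitting "http://www.geocities.com/EnchantedForest/..." on "/" puts the
# neighbourhood name in field 4 (fields 1-3 are "http:", "", and the host).
count_neighbourhoods() {
  awk -F/ 'NF >= 4 && $4 != "" { print $4 }' | sort | uniq -c | sort -rn
}

# Illustrative input only; the real list is the 9.9GB text file.
printf '%s\n' \
  'http://www.geocities.com/EnchantedForest/Grove/1234/index.html' \
  'http://www.geocities.com/EnchantedForest/Glade/3891/awards.html' \
  'http://www.geocities.com/Heartland/Plains/5678/index.html' |
  count_neighbourhoods
```

On the sample input this lists EnchantedForest with a count of 2 ahead of Heartland with a count of 1. On the full file, `sort` is likely to be the slow step.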

SLIDE 44

Extracting a Link Graph

SLIDE 45

Results

SLIDE 46

Creating Entities

403GB of link graph data.

  • http://www.geocities.com/EnchantedForest/Grove/1234/index.html
  • http://www.geocities.com/EnchantedForest/Grove/1234/pets/cats.html
  • http://www.geocities.com/EnchantedForest/Grove/1234/pets/dogs.html
  • http://www.geocities.com/EnchantedForest/Grove/1234/pets/rabbits.html
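The idea in the list above is that all four pages belong to one entity, EnchantedForest/Grove/1234. A minimal sketch of that collapse follows; `to_entity` is a hypothetical helper name, and the pattern assumes every page URL has the shape host/neighbourhood/sub-neighbourhood/four-digit address.

```shell
# Collapse page URLs to a "Neighbourhood/SubNeighbourhood/NNNN" entity.
# Reads one URL per line on stdin; lines that do not fit the assumed
# shape pass through unchanged. sort -u leaves one line per entity.
to_entity() {
  sed -E 's#^https?://[^/]+/([^/]+/[^/]+/[0-9]{4})(/.*)?$#\1#' | sort -u
}

printf '%s\n' \
  'http://www.geocities.com/EnchantedForest/Grove/1234/index.html' \
  'http://www.geocities.com/EnchantedForest/Grove/1234/pets/cats.html' \
  'http://www.geocities.com/EnchantedForest/Grove/1234/pets/dogs.html' \
  'http://www.geocities.com/EnchantedForest/Grove/1234/pets/rabbits.html' |
  to_entity
```

The four sample URLs reduce to the single line `EnchantedForest/Grove/1234`.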
SLIDE 47

Bash-Fu

Find all four-digit numbers:

    sed 's/[()]*//g; s/^[^,]*,//; s/\([0-9]\{4\}\)[^,]*/\1/g' enchantedforest-links.txt > enchantedforest-entities-cleaned1.txt

Then find internal:

    grep -P '(.*/[0-9]{4}){2}' enchantedforest-entities-cleaned1.txt > enchantedforest-entities-internal.txt
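To see what the two commands do, here is a worked run on two made-up records. The input shape `((date,source,target),count)` is an assumption inferred from what the sed expression strips; the sample URLs, `clean_links`, and `keep_internal` are hypothetical. The sketch swaps `grep -P` for `grep -E`, which accepts the same pattern and is more portable.

```shell
# Step 1: drop parentheses, drop the leading date field, then cut each
# field back to its first four-digit run (the street address).
clean_links() {
  sed 's/[()]*//g; s/^[^,]*,//; s/\([0-9]\{4\}\)[^,]*/\1/g'
}

# Step 2: keep only lines where "/NNNN" occurs twice, i.e. both source
# and target resolve to a neighbourhood address.
keep_internal() {
  grep -E '(.*/[0-9]{4}){2}'
}

# The first record links two neighbourhood pages; the second links out
# to an external site and should be filtered away.
printf '%s\n' \
  '((200910,http://www.geocities.com/EnchantedForest/Grove/1234/pets/cats.html,http://www.geocities.com/EnchantedForest/Glade/3891/awards.html),1)' \
  '((200910,http://www.geocities.com/EnchantedForest/Grove/1234/index.html,http://www.example.com/),1)' |
  clean_links | keep_internal
```

Only the first record survives, reduced to `http://www.geocities.com/EnchantedForest/Grove/1234,http://www.geocities.com/EnchantedForest/Glade/3891,1`. One caveat of the original expression: a trailing count of four or more digits would itself be cut to its first four digits.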

SLIDE 48

Link Structure

SLIDE 49
SLIDE 50
SLIDE 51

EnchantedForest/Glade/3891

SLIDE 52

Historical Uses

  • The prevalence of awards pages and awards hubs within this neighbourhood;
  • A protest movement that may have emerged when Yahoo! decided to shut down the neighbourhood;
  • We can begin to follow links from this awards page, by highlighting it in Gephi, to find pages that hosted awards in connection with it;

We could do Shine indexing, but metadata might be the best way forward. Also lets us share datasets!

SLIDE 53

Datasets

SLIDE 54

Links!

  • https://uwaterloo.ca/web-archive-group/
  • https://github.com/web-archive-group/
  • https://github.com/ianmilligan1/
  • https://github.com/ruebot
  • http://dataverse.scholarsportal.info/dvn/dv/wahr
SLIDE 55

By Napalm filled tires (Wu Tang Clan) [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

SLIDE 56

Contact

Nick Ruest: @ruebot ruestn@yorku.ca
Ian Milligan: @ianmilligan1 i2milligan@uwaterloo.ca