What Are Networks? Network Science: Challenges of Big Data - PowerPoint PPT Presentation



slide-1
SLIDE 1

UNIVERSITY OF CALIFORNIA, SANTA BARBARA

AND

THE NATIONAL SCIENCE FOUNDATION

INTEGRATIVE GRADUATE EDUCATION AND RESEARCH TRAINING PROGRAM IN NETWORK SCIENCE

slide-2
SLIDE 2

OUTLINE

1. Overview of Network Science
2. My research in Network Science
3. IGERT Network Science at UCSB
4. My background and experiences
5. Questions?

slide-3
SLIDE 3

WHAT ARE NETWORKS?


slide-4
SLIDE 4

NETWORK SCIENCE

Challenges of Big Data Networks
• Analysis
• Modeling

Two significant paradigm shifts
• ‘Big Data’-driven discovery
  • ranging from biological and engineering to social sciences and psychology
• Holistic study of systems
  • interconnections between individual units (the network) affect the behavior of a system much more than the individual components

slide-5
SLIDE 5

MY RESEARCH IN NETWORK SCIENCE

Fast Clustering Methods for Genetic Mapping

in collaboration with:

Aydin Buluc, Leonid Oliker, Joseph Gonzalez, Stefanie Jegelka, Jarrod Chapman, Daniel Rokhsar, John Gilbert


slide-6
SLIDE 6

CLUSTERING

• Finding groups of similar/highly connected vertices in a network

[Figure: a network with clusters $C_1$, $C_2$, $C_3$]

slide-7
SLIDE 7

GENETIC MAPPING OVERVIEW

[Figure labels: Chromosome 1, Chromosome 2, Chromosome 3; marker1, marker2, marker3, marker4]

A genetic map is a list of genetic markers ordered according to their co-segregation patterns

slide-8
SLIDE 8

GENETIC MAPPING OVERVIEW

[Figure labels: Chromosome 1, Chromosome 2, Chromosome 3; marker1, marker2, marker3, marker4]

A genetic map is a list of genetic elements ordered according to their co-segregation patterns

[Figure: Genetic Map with Linkage Group 1, Linkage Group 2, Linkage Group 3; marker1, marker2, marker3, marker4]

slide-9
SLIDE 9

GENETIC MAPPING OVERVIEW

The problem of genetic mapping can essentially be divided into three parts: (1) grouping, (2) ordering, and (3) spacing.

slide-10
SLIDE 10

GENETIC MAPPING: (1) GROUPING PHASE

[Figure: genotype matrix with markers $m_1$ to $m_{15}$ as rows and individuals $i_1$ to $i_6$ as columns. Each entry is A, B, or a dot (missing data). The markers are clustered into linkage groups $LG_1$ and $LG_2$.]

Data model as a network

slide-11
SLIDE 11

PROBLEMS IN LARGE SCALE GENETIC MAPPING

• State-of-the-art mapping tools don’t scale well
  • Hundreds of thousands of genetic markers available, but current software can only handle up to ~10,000 markers
• Bottleneck is the linkage-group-finding phase
  • Popular mapping tools all handle this phase the same way, with an $O(m^2)$ clustering algorithm for $m$ markers
• Our solution: A fast, scalable clustering algorithm tailored to genetic marker data
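To make the quadratic cost concrete, here is a minimal sketch of an all-pairs grouping step. It is not taken from any of the mapping tools mentioned in the slides. The function names, the `-` code for missing data, and the mismatch fraction used as a distance are all illustrative assumptions; real tools typically score marker pairs with LOD scores instead.

```python
from itertools import combinations


def mismatch_fraction(a, b, missing="-"):
    """Fraction of individuals whose genotype calls differ, skipping missing data.

    Hypothetical stand-in for a linkage score such as a LOD score.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not pairs:
        return 1.0  # no shared observations: treat the markers as unlinked
    return sum(x != y for x, y in pairs) / len(pairs)


def group_markers_quadratic(genotypes, threshold):
    """Group markers by comparing every pair: O(m^2) comparisons for m markers."""
    parent = list(range(len(genotypes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(genotypes)), 2):
        if mismatch_fraction(genotypes[i], genotypes[j]) <= threshold:
            parent[find(i)] = find(j)  # link the two groups

    groups = {}
    for i in range(len(genotypes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Four toy markers scored on six individuals ("-" marks missing data).
genotypes = ["AABBAB", "AAB-AB", "BBAABA", "BBA-BA"]
print(group_markers_quadratic(genotypes, threshold=0.2))
```

With 100,000 markers the loop above would visit roughly 5 billion pairs, which is the scaling problem the slide describes.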

slide-12
SLIDE 12

OUR APPROACH: BUBBLECLUSTER ALGORITHM

• Assume: clusters have a “linear” structure

slide-13
SLIDE 13

OUR APPROACH: BUBBLECLUSTER ALGORITHM

• Assume: clusters have a “linear” structure
• Key idea: Maintain a set of representative points per cluster, the union of which “spans” an entire cluster

[Figure labels: representative points, threshold distance]

slide-14
SLIDE 14

BUBBLECLUSTER ALGORITHM

Iteration $i$: find $r_{MIN} := r_j$ for which $d(m, r_j)$ is minimal; set $C_{MIN} := C_K \in C$ containing $r_{MIN}$

[Figure labels: $m$, $r_j$, $d(m, r_j)$, $C_1$, $C_2$]

slide-15
SLIDE 15

BUBBLECLUSTER ALGORITHM

If ( $d(m, r_{MIN}) >$ threshold )

[Figure labels: $m$, $r_{MIN}$, $C_1$, $C_2$]

slide-16
SLIDE 16

BUBBLECLUSTER ALGORITHM

If ( $d(m, r_{MIN}) >$ threshold )
    $C = C \cup \{m\}$

[Figure labels: $m$, $C_1$, $C_2$, $C_3$]

slide-17
SLIDE 17

Our Approach: “Lod score bubbles” algorithm

Else If ( IS_INTERIOR($r_{MIN}$) )

[Figure labels: $m$, $r_{MIN}$, $C_{MAX} = C_1$, $C_2$]

slide-18
SLIDE 18

Our Approach: “Lod score bubbles” algorithm

Else If ( IS_INTERIOR($r_{MIN}$) )
    $C_{MIN} = C_{MIN} \cup \{m\}$

[Figure labels: $m$, $r_{MIN}$, $C_{MAX} = C_1$, $C_2$]

slide-19
SLIDE 19

Our Approach: “Lod score bubbles” algorithm

Else If ( IS_EXTERIOR($m$, $r_{MIN}$) )

[Figure labels: $m$, $r_{MIN}$, $C_{MAX} = C_1$, $C_2$]

slide-20
SLIDE 20

Our Approach: “Lod score bubbles” algorithm

Else If ( IS_EXTERIOR($m$, $r_{MIN}$) )
    Add $m$ to representative points of $C_{MIN}$
    Add $m$ to $C_{MIN}$

[Figure labels: $m$, $r_{MIN}$, $C_{MAX} = C_1$, $C_2$]

slide-21
SLIDE 21

Our Approach: “Lod score bubbles” algorithm

Else // $m$ is interior to the outer point $r_{MIN}$

[Figure labels: $m$, $r_{MIN}$, $C_{MAX} = C_1$, $C_2$]

slide-22
SLIDE 22

Our Approach: “Lod score bubbles” algorithm

Else
    Add $m$ to $C_{MIN}$

[Figure labels: $m$, $r_{MIN}$, $C_{MIN} = C_1$, $C_2$]

slide-23
SLIDE 23

Our Approach: “Lod score bubbles” algorithm

If the marker has a distance below the threshold to two clusters

[Figure labels: $m$, $C_{MIN} = C_1$, $C_2$]

slide-24
SLIDE 24

Our Approach: “Lod score bubbles” algorithm

If the marker has a distance below the threshold to two clusters, merge the clusters and add $m$ to the merged cluster

[Figure labels: $m$, $C_{NEW} = C_1 \cup C_2$]
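The branches walked through on the preceding slides can be sketched as a toy program. This is only an illustration under strong assumptions, not the authors' implementation. Markers are plain numbers on a line, the distance is the absolute difference rather than a LOD-based score, and the IS_INTERIOR / IS_EXTERIOR tests (not defined in the slides) are replaced by a hypothetical check of whether the new point falls outside the span of the cluster's current representatives.

```python
def bubble_cluster_1d(points, threshold):
    """Toy BubbleCluster-style pass over points on a line (illustrative only)."""
    clusters = []  # each cluster: {"members": [...], "reps": [...]}
    for m in points:
        # Distance from m to the nearest representative of every cluster.
        near = [(min(abs(m - r) for r in c["reps"]), idx)
                for idx, c in enumerate(clusters)]
        close = sorted((d, idx) for d, idx in near if d <= threshold)

        if not close:
            # d(m, r_MIN) > threshold: m starts a new cluster and represents it.
            clusters.append({"members": [m], "reps": [m]})
            continue

        c_min = clusters[close[0][1]]  # cluster holding the nearest representative

        # m is within the threshold of several clusters: merge them into c_min.
        merged_idx = {idx for _, idx in close[1:]}
        for idx in merged_idx:
            c_min["members"] += clusters[idx]["members"]
            c_min["reps"] += clusters[idx]["reps"]
        clusters = [c for idx, c in enumerate(clusters) if idx not in merged_idx]

        # m always joins the cluster.
        c_min["members"].append(m)

        # Assumed stand-in for IS_EXTERIOR: m lies beyond the current end
        # representatives, so it becomes a representative and extends the span.
        lo, hi = min(c_min["reps"]), max(c_min["reps"])
        if m < lo or m > hi:
            c_min["reps"].append(m)
    return clusters


result = bubble_cluster_1d([0.0, 1.0, 2.0, 10.0, 11.0, 1.5, 6.0], threshold=1.5)
for c in result:
    print(sorted(c["members"]), "representatives:", sorted(c["reps"]))
```

Each new point is compared only against representatives, not against every earlier point, which is where the speed advantage over all-pairs clustering would come from when clusters need few representatives.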

slide-25
SLIDE 25

Evaluation Metric for Cluster Quality

• F-score
  • range: 0 to 1
  • Given a “golden standard clustering”, the F-score measures the quality of another clustering by comparing it to the golden standard
  • The F-score is a harmonic mean of precision and recall
  • An F-score of 1 indicates perfect precision and perfect recall for EVERY golden standard cluster
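As a rough illustration, one common way to turn per-cluster precision and recall into a single clustering F-score is to match each reference cluster with its best-scoring predicted cluster and average by cluster size. The slides do not say which variant was used, so the weighting and the function names below are assumptions.

```python
def f1(gold, pred):
    """Harmonic mean of precision and recall for one gold/predicted cluster pair."""
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def clustering_f_score(gold_clusters, pred_clusters):
    """Size-weighted average of each gold cluster's best F1 (assumed variant)."""
    total = sum(len(g) for g in gold_clusters)
    return sum(len(g) * max(f1(g, p) for p in pred_clusters)
               for g in gold_clusters) / total


gold = [{1, 2, 3}, {4, 5}]
print(clustering_f_score(gold, [{1, 2, 3}, {4, 5}]))  # identical clusterings
print(clustering_f_score(gold, [{1, 2}, {3, 4, 5}]))  # one marker misplaced
```

Under this definition the score reaches 1 only when every reference cluster is reproduced exactly, matching the last bullet above.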

slide-26
SLIDE 26

Data

• Real dataset: switchgrass
  • ~100,000 markers
  • Jarrod Chapman at JGI provided us with his clustering results
• Simulated data: SPAGHETTI [1]
  • In our simulations, marker numbers varied from 1,000 to 100,000 markers
  • Missing data experiments performed on 100,000 marker dataset, with missing data rate varying from 5% to 50%
  • Error rate was ~1%

[1] SPAGHETTI: Simulation Software to Test Genetic Mapping Programs. Nicholas A. Tinker

slide-27
SLIDE 27

Results: Our Algorithm Applied to Real Data

• Switchgrass dataset
  • ~113,000 markers
  • ~37% missing data
  • No existing mapping tools could handle this much data
  • “golden standard clusters” are those provided by Jarrod Chapman
• Overall F-score: 0.989806

slide-28
SLIDE 28

Results: Clustering Algorithm Comparison

Simulated Data

Clustering       12.5 K Markers         25 K Markers
                 F-score    Time        F-score    Time
JoinMap          0.99964    14 min      0.99982    46 min
MSTMap           0.99964    4.5 min     0.99982    20 min
BubbleCluster    0.99944    6 sec       0.99972    15 sec
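The runtimes in the table can be converted to speedup factors with a few lines of arithmetic. The dictionary layout and the `speedup` helper are just for illustration; the numbers are the ones reported on the slide.

```python
# Runtimes from the comparison table, converted to seconds.
timings_sec = {
    "JoinMap": {"12.5K": 14 * 60, "25K": 46 * 60},
    "MSTMap": {"12.5K": 4.5 * 60, "25K": 20 * 60},
    "BubbleCluster": {"12.5K": 6, "25K": 15},
}


def speedup(tool, size):
    """How many times faster BubbleCluster ran than `tool` at this marker count."""
    return timings_sec[tool][size] / timings_sec["BubbleCluster"][size]


for tool in ("JoinMap", "MSTMap"):
    for size in ("12.5K", "25K"):
        print(f"{tool} at {size} markers: {speedup(tool, size):.0f}x")
```

The factors grow with the marker count for both tools, which is consistent with a quadratic method falling further behind as the input gets larger.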

slide-29
SLIDE 29
slide-30
SLIDE 30

Conclusions

• By exploiting the structure underlying genetic marker clusters, we were able to design a fast algorithm tailored to genetic data
• While remaining highly accurate, we outperform existing mapping tools in runtime and scalability
• I think this is a good example of an interdisciplinary project in network science!

slide-31
SLIDE 31

IGERT PROGRAM IN NETWORK SCIENCE

• Prepare students to
  • engineer and control large networks
  • measure and predict the dynamics
  • design algorithms to operate at high scales
  • make such networks robust
• Growing demand from science, commerce, and national security
  • analysis of gene networks to find new therapies
  • intervention strategies in social networks to counter the spread of misinformation
  • discovery of clandestine terrorism activity

slide-32
SLIDE 32

IGERT PROGRAM IN NETWORK SCIENCE

• Funded by the NSF (2013-2018)
• Recruit 5-7 students per year for 4 years
  • Total of 25 students
  • First cohort begins in Fall 2014
• Fellowship for first two years
  • $90,890 financial package for CA-residents
  • $105,992 financial package for non-residents (tuition)
  • Fellowships include a $30,000 stipend
  • Departmental RA/TA or campus fellowship for remaining years
• Students enter any of these seven departments
  • Communication
  • Computer Science
  • Ecology, Evolution, and Marine Biology
  • Electrical and Computer Engineering
  • Geography
  • Sociology
  • Mechanical Engineering

slide-33
SLIDE 33

UCSB GRADUATE EDUCATION COSTS

Education Costs

Tuition                                  $12,192
Campus Based Fees                        $800
Health Insurance                         $2,453
Non-Resident Tuition Fee                 $14,694
Additional Non-Resident Education Fee    $408
Total                                    $30,547

Other Expenses

Personal Expenses       $1,543
Loan Fees               $122
Rent                    $13,468
Utilities               $431
Transportation          $1,239
Telephone/Cell Phone    $287
Books and Supplies      $1,444
Food                    $2,560
Total                   $21,094

* Source is the UCSB Graduate Division Website. These numbers, totaling $51,641, are used to determine financial aid for incoming graduate students.

IGERT fellowships cover 100% of the Education Costs, and provide an additional annual stipend of $30,000 for the Other Expenses.

Departmental support after first two years (RA or TA): provides monthly salary and covers Education Costs. Salaries vary by department and research experience.
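The totals on this slide, and the two-year fellowship packages quoted earlier in the deck ($90,890 and $105,992), can be cross-checked with a short calculation. The variable names are illustrative. The $105,992 figure is reproduced if the non-resident fees are charged in the first year only; the slides do not state this, so treat it as an assumption.

```python
# Annual figures copied from the cost slide (USD).
education = {
    "Tuition": 12192,
    "Campus Based Fees": 800,
    "Health Insurance": 2453,
    "Non-Resident Tuition Fee": 14694,
    "Additional Non-Resident Education Fee": 408,
}
other = {
    "Personal Expenses": 1543, "Loan Fees": 122, "Rent": 13468, "Utilities": 431,
    "Transportation": 1239, "Telephone/Cell Phone": 287,
    "Books and Supplies": 1444, "Food": 2560,
}
NON_RESIDENT_ITEMS = {"Non-Resident Tuition Fee",
                      "Additional Non-Resident Education Fee"}
STIPEND = 30000  # annual IGERT stipend

education_total = sum(education.values())
other_total = sum(other.values())
resident_education = sum(v for k, v in education.items()
                         if k not in NON_RESIDENT_ITEMS)

# Two fellowship years for a California resident.
resident_package = 2 * (resident_education + STIPEND)

# Assumption: non-resident fees apply in year one only.
non_resident_package = (education_total + STIPEND) + (resident_education + STIPEND)

print(education_total, other_total, education_total + other_total)
print(resident_package, non_resident_package)
```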

slide-34
SLIDE 34

POSSIBLE IGERT STUDY-PLAN (CS)

Year 1: Summer 2014, Fall 14, Winter 15, Spring 15, Summer 15

2-week boot camp; NS 201 (4 units), Theory; NS 202 (4 units), Applications; NSL module 1 (4 units); Internship; NS seminar (2 units); NS seminar (2 units, focus on innovation and entrepreneurship); NS seminar (2 units); Graduate course 3; Professional Development Workshop (1-day); Innovation program; Graduate course 4; Graduate course 5

Year 2: Fall 15, Winter 16, Spring 16, Summer 16

NS module 2 (4 units); NS module 3 (4 units); NS seminar (2 units); Internship; NS seminar (2 units); Professional Development Workshop; Innovation program; Graduate course 6; NS seminar (2 units, focus on innovation and entrepreneurship); Graduate course 8; Graduate course 7

For complete requirements: http://www.cs.ucsb.edu/graduate/phd/

Legend: Required, Optional, Dept

slide-35
SLIDE 35

DEPARTMENTAL REQUIREMENTS (CS)

GRE (main focus is on the quantitative score)
GPA (usually at least 3.5)
Letters (3)
Statement of purpose
Campus awards central fellowships that can be combined with IGERT

slide-36
SLIDE 36

IGERT RESEARCH AREAS

Algorithms, Models, and Mining
Biological Networks
Cyberinfrastructure
Dynamics and Control
Human Networks: Users, Content and Social Organization

slide-37
SLIDE 37

NETWORK SCIENCE IGERT FACULTY

Communication

Andrew Flanagin

Cynthia Stohl

Computer Science

Divy Agrawal

Elizabeth Belding

Tevfik Bultan

Amr El Abbadi

John Gilbert

Tobias Höllerer

Linda Petzold

Ambuj K. Singh

Subhash Suri

Xifeng Yan

Ecology, Evolution, and Marine Biology

Todd Oakley

Stephen Proulx

Electrical and Computer Engineering

João Hespanha

Yasamin Mostofi

Geography

Rick Church

Mechanical Engineering

Francesco Bullo

Jeff Moehlis

Sociology

John Mohr

slide-38
SLIDE 38

PROGRAM RECRUITING

• PhD Training Program
• Applications for Fall 2014 due by Dec 15, 2013
• Apply to one of the seven departments and submit an IGERT-specific application
• Contact one of the IGERT faculty early
• See website for details

slide-39
SLIDE 39

DIVERSITY

The Network Science IGERT program is committed to promoting diversity within STEM fields.

Women and underrepresented minorities are especially encouraged to apply.

slide-40
SLIDE 40

SUMMER INTERNSHIP

• 8-week undergraduate research experience
• June 23 – August 15
• Students paired with graduate mentors and faculty
• Program culminates in a poster presentation
• Applications due in March 2014

slide-41
SLIDE 41

THE UCSB CAMPUS

• Ranks 10th among all US public universities
  • U.S. News and World Report’s guide, “America’s Best Colleges”
• Ranks 35th in the world
  • Academic Ranking of World Universities (ARWU) and by the Times Higher Education World University Rankings
• Fourteen doctoral departments ranked in the top 10 by the National Research Council Assessment of Research Doctoral Programs
  • Six of these are IGERT departments: Communication, Comp Sci, ECE, Geography, ME and Ecology, Evolution, & Marine Biology
• Faculty include five Nobel Prize winners and scores of elected members of national and international academies and societies

Photos courtesy of UCSB Admissions 

989-acre campus atop bluffs, bordered by Pacific Ocean and Santa Ynez Mountains, Los Padres National Forest

50 graduate departments; 2,950 total graduate students

On-campus housing, RecCen, Library

Visit http://www.graddiv.ucsb.edu/ for more information

slide-42
SLIDE 42

COMPUTER SCIENCE AT UCSB

• Ranked Top 10 in the nation
• Over 30 distinguished faculty members in 11 different research areas
• External research funding grown by a factor of 6 over the last 10 years
• 108 PhD and 40 Masters students
  • 23% female
  • Hispanic/Latino 8%
  • American Indian/Alaskan Native 0.5%
  • Asian 8.2%
  • Black/African American 1.6%
  • Native Hawaiian/Pacific Islander 0.2%
  • White/Caucasian 47.4%
  • Race/Ethnicity Unknown 14.5%
  • International Students 19.6%
• 40-50 new graduate students each year

PhDs Awarded

2009-10    2010-11    2011-12
10         14         12

Median Time-to-Degree

UCSB         National
5.5 years    7.3 years

Graduation Rate

UCSB     National
67.7%    41.5%

PhD Enroll (3-year avg)

Applied    Admit    Enroll
322        46       21

slide-43
SLIDE 43

APPLY BY DECEMBER 15

HTTP://NETWORKSCIENCE.IGERT.UCSB.EDU


Email: info@networkscience.igert.ucsb.edu

slide-44
SLIDE 44

MY BACKGROUND

Grew up in Las Cruces, NM
Took 5 years to graduate high school
Took 4.5 years to graduate college
Played competitive soccer from age 8 to 22
B.S. in Applied Mathematics from UNM with a minor in Computer Science

slide-45
SLIDE 45

WHAT I THINK HELPED ME GET TO GRAD SCHOOL

• 4 summer workshops/undergraduate research experiences:
  • Bioinformatics Summer Camp at NMSU (2007)
  • MCTP Workshop in Mathematics at UNM (2009)
  • VIGRE program in Computational Photonics at U. of Arizona (2009)
  • REU at UNM (2010)
• Extremely helpful faculty at UNM
• Extremely helpful mentor Roya
• Good grades, for the most part
• Applying for fellowships/graduate schools even when I thought there wasn’t a snowball’s chance in hell that I’d be selected

slide-46
SLIDE 46

WHAT I SHOULD HAVE DONE BETTER

• Applied earlier and prepared applications for grad school
• Studied for the GRE
  • and, after bombing the subject test, I should NOT have sent it to any schools!
• Applied for fellowships
• Better grades in the Spring of 2010
• Applied for more summer programs in undergraduate research

slide-47
SLIDE 47

WHAT I’VE LEARNED ABOUT GRADUATE SCHOOL

• There are lots of fellowship opportunities out there – start applying now! (even if you are unsure of your research area)
• Some graduate schools consider only the GPA you have in your upper-division classes in your major
• You are much more likely to get funding if you apply to a Ph.D. program as opposed to a Master’s program
• Finding the right adviser is crucial
• At UCSB, if you are a teaching or research assistant, your first paycheck won’t arrive until October!
• Your professors are extremely willing to help you
• It may take lots of time before you find the research area where you belong

slide-48
SLIDE 48

THANK YOU


http://networkscience.igert.ucsb.edu

slide-49
SLIDE 49

IGERT TRAINING

Network Science Laboratory
Network Science Training Modules
New Coursework
Summer Boot Camp
Innovation Program
Professional Development

slide-50
SLIDE 50

IGERT TRAINING

Network Science Laboratory (NSL)

• Hands-on, centrally-located laboratory for exposure to large datasets, to software for analyzing and visualizing networks, and to the fundamental principles of Network Science
• Multi-disciplinary lab built around collaboration, especially involving multi-disciplinary teams
• Features a dedicated library

slide-51
SLIDE 51

IGERT TRAINING

Network Science Training Modules

• Trainees take one team-based, four-credit training module per quarter
• Modules:
  • M1: International Media Coverage of Terrorist Acts: A Comparative Module
  • M2: Discovery of the Emerging Dynamical Phenomena in Distributed Systems
  • M3: Network Reliability
  • M4: Discovery of Disease Controlling Modules from Network Samples
  • M5: Model Reduction of Biochemical Networks
  • M6: Discovering Coordinated Trends in Online Social Networks

slide-52
SLIDE 52

IGERT TRAINING

New Coursework

• NS 201: Fundamental theory and algorithms for working with Big Data and networks.
  • Topics include: embeddings, spanning trees, network flow, random graph models, network formation and evolution, structure and attribute-based search, clustering, partitioning, and distributed dynamical systems
• NS 202: Network Science concepts in the context of social and biological networks.
  • Topics include: social structure theory, evolution of biological networks, management of Big Data, and visualization of networks
• Numerous existing courses from participating departments add a breadth component

slide-53
SLIDE 53

IGERT TRAINING

Summer boot camp

• A two-week boot camp for new trainees right before the beginning of the academic year
• Introduces and refreshes skills around programming, software, and data.
  • Week 1: Introduction to large text file processing
  • Week 2: Targeted hands-on mini-labs
    (1) Network algorithms lab
    (2) Cluster/distributed network processing lab
    (3) Visualization and exploration lab

slide-54
SLIDE 54

IGERT TRAINING

Innovation program

• Continuation of the Training Modules to tackle problems beyond the laboratory
• Interaction with IGERT faculty through independent research units
• Weekly seminars focusing on topics of innovation and entrepreneurship
• Access to the Technology Management Program (TMP) at UCSB, including the annual New Venture Competition.

slide-55
SLIDE 55

IGERT TRAINING

Professional development

• Annual workshop focused on development of professional skills
  • Features lectures, panels, and hands-on workshops on exploration of post-graduate career paths
  • Other sessions focus on leadership skills, ethics in research, research dissemination, and presentation skills.
• Access to other UCSB programs through
  • The Graduate Division’s Career and Professional Development Workshops
  • Center for Science and Engineering Partnerships weekly Professional Development Series
  • Instructional Development’s Teach Pedagogy Workshops
  • And many other campus groups

slide-56
SLIDE 56

IGERT RESEARCH AREAS

Algorithms, Models, and Mining

• Big Data is forcing a rethink in the algorithmic theory of networks. This research thrust will build algorithmic foundations and theories that can scale to networks with millions of nodes and billions of edges, and will analyze specific characteristics of social and biological networks so that we can engineer algorithms for real networks.
• Proposed Thesis Topics
  • Geometric Embedding of Networks
  • Models for Information Cascades
  • Frameworks and Libraries for High-Performance Parallel Graph Computations
  • Scalable Network Searching and Mining

slide-57
SLIDE 57

IGERT RESEARCH AREAS

Biological Networks

• Biological systems can be represented as networks of interacting units: gene regulatory networks; protein-protein interaction networks; signaling networks; metabolic networks; neuronal networks; and food webs. Understanding how they evolve and function is a core question in modern biology and requires approaches that bring together data and scientific approaches from disparate fields.
• Proposed Thesis Topics
  • Inferring the Control Structure and Modularity of Gene Networks
  • Constraints on Network Dynamics

slide-58
SLIDE 58

IGERT RESEARCH AREAS

Cyberinfrastructure

• Storage, retrieval, analysis and visualization of Network Science data pose formidable research and development challenges. In particular, the underlying data model for Network Science applications is based on graph-structured data. Queries on such data sets are based on the structural properties of graphs, in addition to the values of attributes. Managing, programming, and visualizing complex networked datasets is another significant research problem.
• Proposed Thesis Topics
  • Big Data Management
  • Programming Big Data and Large Networks
  • Visualizing Networks

slide-59
SLIDE 59

IGERT RESEARCH AREAS

Dynamics and Control

• Networks provide the natural framework to model the dynamical processes and interactions arising in large-scale multi-agent systems. We are developing a Network Science of dynamical systems, while targeting natural large-scale networks, artificial networks, and societal networks.
• Proposed Thesis Topics
  • Data-driven Models for Agents
  • Parameter Estimation for Dynamical Networks

slide-60
SLIDE 60

IGERT RESEARCH AREAS

Human Networks: Users, Content, and Social Organization

• While the roots of formal graph modeling of social behavior go back half a century or more, most of the early work was devoted to developing formalisms and tool building. A generation ago network analysis was applied to a much wider variety of practical problems. Now the rise of the Internet and the increasing availability of Big Data promise to transform the scientific study of social networks yet again.
• Proposed Thesis Topics
  • Dynamics of Opinion Propagation
  • Networked Narratives and Social Influence
  • Modeling Persuasion over Social Networks