SLIDE 1

SUGAR Geometry Based Data Generation

  • O. Lindenbaum, J.S. Stanley, G. Wolf, S. Krishnaswamy

Yale University

2018

Lindenbaum et al. (Yale) SUGAR 2018 1 / 14

SLIDE 2

Acknowledgements

This work was done in collaboration with Jay Stanley, Guy Wolf, and Smita Krishnaswamy.

Research partially funded by a grant from the CZI.

SLIDE 3

Introduction & motivation

Traditional models: density based data generation

Generative models typically infer a distribution from collected data, then sample from it to generate more data.

  • Biased by sampling density
  • May miss rare populations
  • Does not preserve the geometry

SLIDE 6

Introduction & motivation

New approach: geometry based data generation

SLIDE 11

Diffusion geometry

Manifold learning with random walks

Local affinities g(x, y) ⇒ transition probabilities: $\Pr[x \rightsquigarrow y] = \frac{g(x, y)}{\|g(x, \cdot)\|_1}$

Markov chain/process ⇒ random walks on data manifold
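The row normalization above can be sketched in a few lines of NumPy. This is only an illustration: the Gaussian form of g(x, y), the toy data, the value of `sigma`, and the function names are assumptions made for the example, not details taken from the slides.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    """Pairwise affinities g(x, y) = exp(-||x - y||^2 / (2 sigma^2)) (assumed form)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def transition_matrix(G):
    """Row-normalize affinities: P[i, j] = g(x_i, x_j) / ||g(x_i, .)||_1."""
    return G / G.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))   # toy data, purely illustrative
P = transition_matrix(gaussian_affinities(X, sigma=0.5))
```

Each row of `P` is a probability distribution over the data points, so `P` defines a Markov chain whose steps stay close to the sampled points.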

SLIDE 12

Diffusion geometry

Random walks reveal intrinsic neighborhoods

$p_t(x, y) = \Pr[x \overset{t\ \text{steps}}{\rightsquigarrow} y]$
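For a finite Markov chain, the t-step probabilities are the entries of the t-th power of the one-step transition matrix. A minimal sketch, using a hypothetical three-state chain chosen only for illustration:

```python
import numpy as np

def t_step_transitions(P, t):
    """Entry [i, j] is the probability of reaching j from i in exactly t steps."""
    return np.linalg.matrix_power(P, t)

# Hypothetical chain: state 0 cannot reach state 2 in a single step.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])

P1 = t_step_transitions(P, 1)
P8 = t_step_transitions(P, 8)
```

After several steps, `P8[0, 2]` becomes positive even though `P[0, 2]` is zero, which is the sense in which longer walks reveal neighborhoods beyond the immediate local affinities.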

SLIDE 14

Data generation with diffusion

Walk toward the data manifold from randomly generated points

Generate random points.

Walk towards the data manifold with diffusion: $x \mapsto \sum_{y \in \text{data}} y \cdot p_t(x, y)$
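The map above replaces each generated point by a diffusion-weighted average of the data points. The sketch below is one way to write it, assuming Gaussian affinities and extending the walk to points outside the data by taking the first step from the new point into the data; the function names, toy data, and parameter values are illustrative assumptions.

```python
import numpy as np

def affinities(A, B, sigma):
    """Gaussian affinities between rows of A and rows of B (assumed form of g)."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def row_normalize(K):
    return K / K.sum(axis=1, keepdims=True)

def diffuse_to_data(new_points, data, sigma=0.5, t=1):
    """Map each new point x to sum_y y * p_t(x, y) over the data points y."""
    # First step: from each new point into the data.
    P_new = row_normalize(affinities(new_points, data, sigma))
    if t > 1:
        # Remaining t - 1 steps: random walk among the data points.
        P_data = row_normalize(affinities(data, data, sigma))
        P_new = P_new @ np.linalg.matrix_power(P_data, t - 1)
    return P_new @ data

# Toy example: data on the unit circle, new points are noisy copies.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
data = np.column_stack([np.cos(angles), np.sin(angles)])
new_points = data[:20] + rng.normal(scale=0.3, size=(20, 2))
pulled = diffuse_to_data(new_points, data, sigma=0.2, t=1)
```

Because every output is a convex combination of data points, the pulled points cannot leave the convex hull of the data.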

SLIDE 15

Data generation with diffusion

Correct density with MGC kernel (Bermanis et al., ACHA 2016)

Separate density/geometry with new kernel: $k(x, y) = \sum_{r \in \text{data}} \frac{g(x, r)\, g(y, r)}{\text{density}(r)}$

Use new diffusion process $p(x, y) = \frac{k(x, y)}{\|k(x, \cdot)\|_1}$ to walk to the manifold
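In matrix form the kernel is a product of affinity matrices with an inverse-density weighting in between. The following sketch assumes Gaussian affinities and uses the row sums of g as the density estimate; the two-cluster toy data and all names are hypothetical.

```python
import numpy as np

def gaussian_affinities(A, B, sigma):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mgc_kernel(G_x, G_y, density):
    """k(x, y) = sum_r g(x, r) * g(y, r) / density(r).

    G_x, G_y: affinities from the x / y points to the reference data r.
    density:  one density estimate per reference point r.
    """
    return (G_x / density) @ G_y.T

# Toy data: one dense cluster and one sparse cluster (illustrative only).
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(80, 2))
sparse = rng.normal(loc=2.0, scale=0.3, size=(10, 2))
data = np.vstack([dense, sparse])

G = gaussian_affinities(data, data, sigma=0.5)
density = G.sum(axis=1)                      # assumed estimate: ||g(r, .)||_1
K = mgc_kernel(G, G, density)
P = K / K.sum(axis=1, keepdims=True)         # p(x, y) = k(x, y) / ||k(x, .)||_1
```

Dividing by density(r) down-weights reference points in crowded regions, so the resulting walk depends less on how densely each part of the manifold was sampled.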

SLIDE 17

Data generation with diffusion

Fill sparse areas to create uniform distribution

Question: How should we initialize new points to end up with uniform sampling from the data manifold?

Answer: For each x ∈ data, initialize $\hat{\ell}(x)$ points sampled from $N(x, \Sigma_x)$; set $\hat{\ell}$ as the mid-point between the upper & lower bounds in the following proposition.

Proposition

The generation level $\hat{\ell}(x)$ required to equalize density is bounded by

$$\det\left(I + \frac{\Sigma_x}{2\sigma^2}\right)^{\frac{1}{2}} \frac{\max(\hat{d}(\cdot)) - \hat{d}(x)}{\hat{d}(x) + 1} - 1 \;\le\; \hat{\ell}(x) \;\le\; \det\left(I + \frac{\Sigma_x}{2\sigma^2}\right)^{\frac{1}{2}} \left[\max(\hat{d}(\cdot)) - \hat{d}(x)\right],$$

where σ is a scale used when defining Gaussian neighborhoods g(x, y) for the diffusion geometry, and $\hat{d}(x) = \|g(x, \cdot)\|_1$ estimates local density.
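The bounds and the mid-point rule can be evaluated directly once density estimates and local covariances are available. In the sketch below, the rounding of the mid-point to a non-negative integer, the way the covariances are supplied, and the toy numbers are assumptions for illustration; the slides only specify the bounds and the mid-point choice.

```python
import numpy as np

def generation_level_bounds(d_hat, Sigmas, sigma):
    """Lower and upper bounds on the generation level for every data point.

    d_hat:  (n,) density estimates, d_hat[i] = ||g(x_i, .)||_1
    Sigmas: (n, dim, dim) local covariance matrices Sigma_x
    sigma:  Gaussian neighborhood scale
    """
    dim = Sigmas.shape[-1]
    scale = np.sqrt(np.linalg.det(np.eye(dim) + Sigmas / (2.0 * sigma ** 2)))
    gap = d_hat.max() - d_hat
    lower = scale * gap / (d_hat + 1.0) - 1.0
    upper = scale * gap
    return lower, upper

def generation_levels(d_hat, Sigmas, sigma):
    """Mid-point of the bounds, rounded and clipped at zero (assumed convention)."""
    lower, upper = generation_level_bounds(d_hat, Sigmas, sigma)
    return np.maximum(np.rint(0.5 * (lower + upper)), 0).astype(int)

def sample_new_points(data, Sigmas, levels, rng):
    """Draw levels[i] points from N(x_i, Sigma_i) around each data point."""
    draws = [rng.multivariate_normal(x, S, size=int(n))
             for x, S, n in zip(data, Sigmas, levels) if n > 0]
    return np.vstack(draws) if draws else np.empty((0, data.shape[1]))

# Hypothetical inputs: three points with decreasing density estimates.
data = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
d_hat = np.array([10.0, 5.0, 1.0])
Sigmas = np.tile(0.1 * np.eye(2), (3, 1, 1))

lower, upper = generation_level_bounds(d_hat, Sigmas, sigma=1.0)
levels = generation_levels(d_hat, Sigmas, sigma=1.0)
new_points = sample_new_points(data, Sigmas, levels, np.random.default_rng(0))
```

The densest point receives no new points, and sparser points receive progressively more, which is how the initialization fills sparse areas before the diffusion step.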

SLIDE 18

Applications & results

Alleviating class imbalance in classification

        k-NN                      SVM                       RUSBoost
        Orig    SMOTE   SUGAR     Orig    SMOTE   SUGAR
ACP     0.67    0.76    0.78      0.77    0.77    0.78      0.75
ACR     0.64    0.73    0.77      0.78    0.78    0.84      0.81
MCC     0.66    0.74    0.78      0.78    0.78    0.84      0.80

Average class precision/recall (ACP/ACR), and Matthews correlation coefficient (MCC) over 61 imbalanced datasets (10-fold cross validation).
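For readers unfamiliar with these metrics, the sketch below computes them from a confusion matrix. It assumes that ACP and ACR are unweighted (macro) averages over classes and uses the standard multiclass generalization of MCC; the labels are made up, and the code assumes every class is both present and predicted at least once.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """C[i, j] counts samples of true class i predicted as class j."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

def average_class_precision(C):
    return float(np.mean(np.diag(C) / C.sum(axis=0)))

def average_class_recall(C):
    return float(np.mean(np.diag(C) / C.sum(axis=1)))

def matthews_corrcoef(C):
    s = C.sum()            # total number of samples
    c = np.trace(C)        # correctly classified samples
    t = C.sum(axis=1)      # true count per class
    p = C.sum(axis=0)      # predicted count per class
    num = c * s - np.dot(p, t)
    den = np.sqrt((s ** 2 - np.dot(p, p)) * (s ** 2 - np.dot(t, t)))
    return float(num / den)

# Made-up imbalanced example: six samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]
C = confusion_matrix(y_true, y_pred, n_classes=2)
acp = average_class_precision(C)
acr = average_class_recall(C)
mcc = matthews_corrcoef(C)
```

Plain accuracy on this example is 0.75, while the class-averaged scores are lower because the minority class is only half right, which is why such metrics are preferred for imbalanced data.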

SLIDE 19

Applications & results

Density correction improves clustering

[Figure panels: Spectral Clustering; Rand index of k-Means. Based on 115 datasets.]
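The Rand index mentioned here measures agreement between two clusterings as the fraction of point pairs that both clusterings treat the same way (together in both, or apart in both). A small sketch with made-up labels:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two labelings agree (same/different cluster)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

same_up_to_renaming = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The score ignores how clusters are named, so a clustering that matches the ground truth up to relabeling scores 1.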

SLIDE 20

Applications & results

Illuminate hypothetical cell types in single-cell data from Velten et al. 2017

Recovering originally-undersampled lineage in early hematopoiesis:

  • B-cell maturation trajectory enhanced by SUGAR
  • SUGAR equalizes the total cell distribution

SLIDE 21

Applications & results

Recover gene-gene relationships in single-cell data from Velten et al. 2017

SUGAR improves correlation and mutual information (MI) within the gene modules identified by Velten et al.

Velten et al., Nature Cell Biology, 19 (2017)

SLIDE 22

Applications & results

Recover gene-gene relationships in single-cell data from Velten et al. 2017

Generated cells also follow canonical marker correlations

Li et al., Nature Communications 7 (2016)

SLIDE 23

Conclusion

  • Generate data over intrinsic geometry rather than distribution
  • Alleviate sampling bias in supervised & unsupervised learning
  • Enable exploration of sparse (or “hypothetical”) data regions
