Geometry and Divergence of High-dimensional Point Clouds
Peng Qiu
Department of Bioinformatics and Computational Biology University of Texas MD Anderson Cancer Center
Outline: Challenge 1 (Predicting Manual Gates); Challenge 3
2D plots from one training file
2D plots from another training file
contours and density distributions can be quite different from file to file
To predict a test file, it might be better to use training files that are similar to it.
Hellinger divergence of probability densities
Gaussian kernel based density estimator
Faithful downsampling (Zare et al., 2010)
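A minimal sketch of the faithful-downsampling idea (in the spirit of Zare et al., 2010, not their exact algorithm): repeatedly keep one point and absorb all points within a distance threshold of it, recording how many points each kept point absorbed, so the weighted sample preserves local density. The function name and threshold parameter are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def faithful_downsample(X, threshold):
    """Keep one representative per local neighborhood; its weight is the
    number of points it absorbed, so weighted density is preserved."""
    remaining = np.arange(len(X))
    kept, weights = [], []
    while len(remaining):
        i = remaining[0]
        # distances from the current point to all not-yet-absorbed points
        d = cdist(X[[i]], X[remaining])[0]
        absorbed = d <= threshold
        kept.append(i)
        weights.append(int(absorbed.sum()))
        remaining = remaining[~absorbed]
    return X[kept], np.array(weights)
```

The kept points and their weights then feed the weighted density estimator in the next step.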
Analysis pipeline
‒ arcsinh (cofactor = 100), then per-channel 0-mean-1-var normalization
‒ faithful downsampling
‒ estimate density using downsampled points with weights
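The preprocessing step above can be sketched directly (the function name is illustrative; the cofactor of 100 is from the pipeline):

```python
import numpy as np

def preprocess(raw):
    """arcsinh transform with cofactor 100, then per-channel
    zero-mean, unit-variance normalization."""
    x = np.arcsinh(raw / 100.0)
    return (x - x.mean(axis=0)) / x.std(axis=0)
```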
‒ evaluate the probability of downsampled testing points w.r.t. p(x)
‒ evaluate the probability of downsampled testing points w.r.t. q(x)
‒ estimate the Hellinger distance: weighted difference between p and q
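A minimal sketch of this divergence step, assuming SciPy's weighted `gaussian_kde` as the Gaussian-kernel density estimator. The importance-sampling form of the Hellinger estimator (H²(p, q) = 1 − ∫√(pq) dx ≈ 1 − Σᵢ wᵢ √(p(xᵢ)/q(xᵢ)) for xᵢ drawn from q with weights wᵢ) is one reasonable reading of "weighted difference between p and q", not necessarily the exact implementation used:

```python
import numpy as np
from scipy.stats import gaussian_kde

def hellinger_distance(train_pts, train_w, test_pts, test_w):
    """Estimate the Hellinger distance between p (training-file density)
    and q (testing-file density), each a weighted Gaussian KDE built on
    downsampled points; test_w should sum to 1."""
    p = gaussian_kde(train_pts.T, weights=train_w)  # density of training file
    q = gaussian_kde(test_pts.T, weights=test_w)    # density of testing file
    px = p(test_pts.T)   # p evaluated at downsampled testing points
    qx = q(test_pts.T)   # q evaluated at the same points
    # H^2 = 1 - sum_i w_i * sqrt(p(x_i) / q(x_i)), clipped to [0, 1]
    h2 = 1.0 - np.sum(test_w * np.sqrt(px / np.maximum(qx, 1e-300)))
    return np.sqrt(max(h2, 0.0))
```

Identical point clouds give a distance near 0; well-separated clouds give a distance near 1.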
‒ rank-order the training files
‒ pick the 50 most similar
‒ build two SVMs from each selected training file (one for each gate)
‒ apply the SVMs to the testing file
‒ predict by majority vote
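The selection-and-voting steps above can be sketched as follows; `hellinger_to_test` and `svm_predict` are hypothetical callables standing in for the divergence computation and the per-file SVM step described elsewhere on these slides:

```python
import numpy as np

def predict_gate(test_X, train_files, hellinger_to_test, svm_predict, k=50):
    """Rank training files by Hellinger distance to the testing file,
    keep the k most similar, and combine their per-file 0/1 SVM
    predictions by per-cell majority vote."""
    order = np.argsort([hellinger_to_test(f) for f in train_files])
    selected = [train_files[i] for i in order[:k]]
    votes = np.stack([svm_predict(f, test_X) for f in selected])  # (k, n_cells)
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote per cell
```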
‒ When an SVM is trained from a selected training file, not all cells are used: only the cells in the gate, plus nearby cells not in the gate, are used to train the SVM.
‒ Because of how the SVM is trained, when applied to the testing file it only classifies testing cells that are near the gate in the training file.
‒ Advantages of this idea:
‒ Cell counts in a gate are on the order of tens or hundreds. I chose to use 20000 nearby cells not in the gate. The cell counts for the two classes are still unbalanced, but better than using all cells.
‒ The SVM package I used runs faster with a smaller number of cells.
‒ The trained SVM is more accurate in the local region near the gate and less accurate for faraway cells. Intuitively, this leads to high recall and low precision on the training data. However, the final prediction is good because of how the SVM is applied to the testing file, plus the majority vote.
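A sketch of this local-SVM idea with scikit-learn. "Nearby" non-gate cells are approximated here by distance to the gate centroid, which is an assumption (the slides do not say how nearness was defined), and the function names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def train_local_gate_svm(train_X, in_gate, n_near=20000):
    """Train an SVM on gate cells vs. the nearest non-gate cells.
    Returns the classifier plus a center/radius describing the local
    region where the SVM should be trusted."""
    gate_X = train_X[in_gate]
    out_X = train_X[~in_gate]
    center = gate_X.mean(axis=0)
    # nearest non-gate cells, measured to the gate centroid (assumption)
    d = np.linalg.norm(out_X - center, axis=1)
    near_idx = np.argsort(d)[:n_near]
    X = np.vstack([gate_X, out_X[near_idx]])
    y = np.concatenate([np.ones(len(gate_X)), np.zeros(len(near_idx))])
    clf = SVC(kernel="rbf").fit(X, y)
    radius = d[near_idx].max()
    return clf, center, radius

def predict_local(clf, center, radius, test_X):
    """Apply the SVM only to testing cells near the gate region;
    cells outside the local region default to 'not in gate' (0)."""
    pred = np.zeros(len(test_X))
    near = np.linalg.norm(test_X - center, axis=1) <= radius
    pred[near] = clf.predict(test_X[near])
    return pred
```

Restricting training to the gate plus 20000 nearby cells keeps the classes less unbalanced and the SVM fast, at the cost of accuracy far from the gate, which the local-application rule and the majority vote compensate for.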
Training files (left) vs. testing files (right): the two axes are cell counts in the given two gates in the training files and in the predicted gates in the testing files.
Training files (left) vs. testing files (right): the two axes are cell counts in the given two gates in the training files and in the predicted gates in the testing files. The additional panels stratify the samples by stimulation conditions 1, 2, and 3.
‒ In terms of cell counts in the two gates, the distributions of the training data and the testing data appear to be consistent.
‒ After obtaining the metadata, it can be observed that for one testing file, its most similar training files are generated by the same lab. This reflects a batch effect by lab.
‒ In all the training files that are generated by lab 20, both gates are always empty.
‒ These observations motivated a minor change in the analysis pipeline for phase 2.
Phase 2 results: training files (left), testing files (right).
Phase 1 results: training files (left), testing files (right).
‒ Compared to phase 1, the distribution of cell counts in the two gates becomes tighter; the phase 2 predictions appear more consistent.
‒ Comparing the plots from the training data with the phase 2 predictions of the testing data, the distribution of cell counts in the phase 2 prediction looks cleaner than that of the training data. This may indicate that the predictions contain less variation than the training data.
‒ 2 before vaccination, unstimulated
‒ 1 before vaccination, stimulated
‒ 2 after vaccination, unstimulated
‒ 1 after vaccination, stimulated
Qiu et al., Nature Biotechnology, 2011
Pipeline parameter settings
respect to the SPADE tree.
which show extremely high consistency.
see that stimulation does induce some change.
individual node level. To derive a meaningful interpretation, we need to select subtrees, which is the next step (next slide).
Two axes: (% in stim - % in ctrl) for visit 2, and (% in stim - % in ctrl) for visit 12.
The percent change between stim and ctrl differs between the two visits.
Using clustering analysis, I removed the redundancy and distilled a small number of subtrees that show significance, shown in the next slide.
plot of the cell frequencies derived from the selected subtrees.
samples using red dots, we see that training and testing data are well aligned.
not likely to be high.
SPADE tree using the markers that were measured.
patterns in the SPADE tree, which are shown in the next slide.