Some Success Stories in Bridging Theory and Practice
Anima Anandkumar
Bren Professor at Caltech; Director of ML Research at NVIDIA
signSGD: Compressed Optimization for Non-Convex Problems
Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Yu-Xiang Wang, Anima Anandkumar
[Figure: training curves showing signSGD matching the baseline while communicating with 1/2 the data.]
Assumptions:
➤ Objective function lower bound f_*
➤ Coordinate-wise variance bound σ⃗
➤ Coordinate-wise gradient Lipschitz constants L⃗
➤ Number of iterations K
➤ Number of backpropagations N

SGD gets rate
\[
\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\lVert g_k\rVert_2^2\right] \le \frac{1}{\sqrt{N}}\left[2\lVert \vec{L}\rVert_\infty (f_0 - f_*) + \lVert \vec{\sigma}\rVert_2^2\right]
\]

signSGD gets rate
\[
\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\lVert g_k\rVert_1\right]^2 \le \frac{1}{\sqrt{N}}\left[\sqrt{\lVert \vec{L}\rVert_1}\left(f_0 - f_* + \tfrac{1}{2}\right) + 2\lVert \vec{\sigma}\rVert_1\right]^2
\]
Define a natural measure of density for v ∈ ℝ^d:
\[
\phi(v) := \frac{\lVert v\rVert_1^2}{d\,\lVert v\rVert_2^2}
\]
➤ φ(v) = 1 for a fully dense v (e.g. a sign vector)
➤ φ(v) ≈ 0 for a fully sparse v

The comparison between the two rates is then governed by the densities of the gradient g_k, the Lipschitz vector L⃗, and the variance vector σ⃗.
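The signSGD update and the density measure above can be sketched in a few lines of NumPy (the toy quadratic objective, step size, and example vectors are assumptions of this illustration, not from the talk):

```python
import numpy as np

def density(v):
    """phi(v) = ||v||_1^2 / (d * ||v||_2^2): natural measure of density,
    equal to 1 for a fully dense (sign) vector and 1/d for a 1-sparse one."""
    d = v.size
    return np.linalg.norm(v, 1) ** 2 / (d * np.linalg.norm(v, 2) ** 2)

def signsgd(grad_fn, x0, lr=0.01, steps=500):
    """signSGD: step along the sign of the stochastic gradient, so each
    coordinate of the gradient is communicated with a single bit."""
    x = x0.astype(float).copy()
    for _ in range(steps):
        x -= lr * np.sign(grad_fn(x))
    return x

# Toy objective f(x) = 0.5 ||x||^2 with exact gradient x (an assumption
# chosen only to keep the sketch self-contained).
x_final = signsgd(lambda x: x, x0=np.ones(10))

sign_vec = np.sign(np.random.default_rng(0).standard_normal(8))  # fully dense
sparse_vec = np.zeros(8)
sparse_vec[0] = 3.0                                              # fully sparse
print(density(sign_vec), density(sparse_vec))  # 1.0 0.125
```

Because the step size is fixed, the iterate oscillates within one step of the optimum rather than converging exactly; the theory above handles this with a decaying learning-rate schedule.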
If gradients are unimodal and symmetric (reasonable by the central limit theorem), majority vote with M workers converges at a rate with the same variance reduction as distributed SGD.
Under the symmetric noise assumption: experiments with p3.2xlarge machines on AWS, ResNet-50 on ImageNet.
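A minimal sketch of the majority-vote scheme (the toy quadratic, noise level, and worker count are assumptions of this illustration): each of the M workers sends only the sign of its stochastic gradient, 1 bit per coordinate, and the server broadcasts the elementwise majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)

def majority_vote_step(x, grad_fn, n_workers=5, lr=0.01, noise=0.5):
    """One round of majority-vote signSGD: each worker sends the sign of a
    noisy gradient; the server applies the elementwise majority vote,
    i.e. the sign of the summed sign vectors."""
    votes = np.stack([np.sign(grad_fn(x) + noise * rng.standard_normal(x.size))
                      for _ in range(n_workers)])
    return x - lr * np.sign(votes.sum(axis=0))

# Toy quadratic f(x) = 0.5 ||x||^2 with symmetric Gaussian gradient noise,
# matching the unimodal-and-symmetric assumption above.
x = np.full(10, 5.0)
for _ in range(1000):
    x = majority_vote_step(x, grad_fn=lambda v: v)
print(np.abs(x).max())  # far below the starting value of 5.0
```

Note that the aggregated update is itself a sign vector, so communication is compressed in both directions between workers and server.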
Ashish Khetan, Zachary C. Lipton, Anima Anandkumar
Annotator quality model (probability of correctness). Repeat:
➤ Compute the posterior over ground-truth labels, given the annotator quality model and the noisy crowdsourced annotations.
➤ MLE: update annotator quality using the labels inferred from the model.
➤ Train the model with a weighted loss, using the posterior as weights.
➤ Use the trained model to infer ground-truth labels.
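The loop above can be sketched as a small EM procedure (a simplified, Dawid-Skene-style stand-in with one probability-of-correctness parameter per annotator; the weighted-loss model-training step is omitted to keep the sketch self-contained):

```python
import numpy as np

def em_annotator_quality(votes, n_classes, n_iters=20):
    """Minimal EM sketch of the slide's loop.
    votes: (n_items, n_annotators) matrix of integer labels.
    Returns the posterior over ground-truth labels and annotator qualities."""
    n_items, n_annot = votes.shape
    q = np.full(n_annot, 0.7)  # initial annotator-quality guess
    for _ in range(n_iters):
        # E-step: posterior over each item's ground-truth label, given q.
        log_post = np.zeros((n_items, n_classes))
        for c in range(n_classes):
            correct = votes == c
            log_post[:, c] = np.where(correct, np.log(q),
                                      np.log((1 - q) / (n_classes - 1))).sum(axis=1)
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: MLE of each annotator's probability of correctness,
        # weighted by the inferred label posterior.
        for a in range(n_annot):
            q[a] = post[np.arange(n_items), votes[:, a]].mean()
        q = np.clip(q, 1e-3, 1 - 1e-3)  # keep the logs finite
    return post, q

# Two reliable annotators and one noisy one, 6 items, binary labels.
votes = np.array([[0, 0, 1], [1, 1, 0], [0, 0, 0],
                  [1, 1, 1], [0, 0, 1], [1, 1, 0]])
post, q = em_annotator_quality(votes, n_classes=2)
print(np.round(q, 2))  # high quality for annotators 0 and 1, low for 2
```

In the paper's full pipeline, the posterior computed in the E-step becomes the per-example weight for training the prediction model, and the trained model's outputs feed back into label inference.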
MS-COCO dataset, fixed budget of 35k annotations: ~5% gain w.r.t. majority rule.
ImageNet dataset: simulated workers and a fixed budget.
Generating natural images is a different problem from prediction: a category y is rendered into an image x through intermediate latent variables.
[Figure: NRM generation vs. CNN inference over a shared architecture. Generation: choose class template → masked template → upsampled template (upsample, select location) → rendered image. Inference: image → unpooled feature map → pooled feature map → rectified feature map. Example class posteriors: 0.5 dog, 0.2 cat, 0.1 horse → 1.0 dog.]
Cross-Entropy Loss for Training the CNNs with Labeled Data
\[
\min_{\theta \in A_\gamma} H_{p,q}(y \mid x, z_{\max}) \;\ge\; \min_{(z_i)_{i=1}^{n},\, \theta} \frac{1}{n}\sum_{i=1}^{n} -\log p(y_i \mid x_i, z_i; \theta)
\]
Max-Min Loss for Training the CNNs with Labeled Data
\[
\alpha_{\max} H_{p,q}(y \mid x, z_{\max}) + \alpha_{\min} H_{p,q}(y \mid x, z_{\min})
\]
[Figure: one input image passed through two branches with shared weights; the max cross-entropy and min cross-entropy terms combine into the max-min cross-entropy.]
Min cross-entropy minimizes the posteriors of incorrect labels.
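The two branches can be sketched as follows (a NumPy illustration only; how the latent z ranges over rendering configurations, and the exact form of the min branch, are assumptions of this sketch rather than the paper's definition):

```python
import numpy as np

def softmax_xent(logits, label):
    """Stable cross-entropy -log p(label | logits)."""
    z = logits - logits.max()
    return np.log(np.exp(z).sum()) - z[label]

def max_min_xent(logits_per_z, label, a_max=1.0, a_min=1.0):
    """Sketch of the max-min cross-entropy: evaluate the shared-weight
    classifier at the latent configuration maximizing the true-label
    posterior (z_max) and at the one minimizing it (z_min), then combine
    a_max * H(y|x, z_max) + a_min * H(y|x, z_min). Minimizing the second
    term pushes down the posteriors of incorrect labels even at the
    worst-case latent configuration."""
    probs = np.exp(logits_per_z - logits_per_z.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    z_max = probs[:, label].argmax()
    z_min = probs[:, label].argmin()
    return (a_max * softmax_xent(logits_per_z[z_max], label)
            + a_min * softmax_xent(logits_per_z[z_min], label))

# Logits for 3 hypothetical latent configurations of a 2-class problem.
logits_per_z = np.array([[2.0, 0.1], [0.5, 1.5], [1.0, 1.0]])
loss = max_min_xent(logits_per_z, label=0)
```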
Rendering Path
[Table: classification error (%) under semi-supervised and supervised learning.]
Images: 3 dimensions; videos: 4 dimensions.
Pairwise correlations vs. triplet correlations.
Tensor Contraction
Extends the notion of matrix product.
Matrix product:
\[
Mv = \sum_j v_j M_{:,j}
\]
Tensor contraction:
\[
T(u, v, \cdot) = \sum_{i,j} u_i v_j\, T_{i,j,:}
\]
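Both contractions can be checked numerically; a short NumPy verification (the shapes are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix product as a contraction: Mv = sum_j v_j M[:, j]
M, v = rng.standard_normal((3, 4)), rng.standard_normal(4)
assert np.allclose(M @ v, sum(v[j] * M[:, j] for j in range(4)))

# Tensor contraction along two modes: T(u, v, .) = sum_{i,j} u_i v_j T[i, j, :]
T, u = rng.standard_normal((3, 4, 5)), rng.standard_normal(3)
contracted = np.einsum('ijk,i,j->k', T, u, v)
expected = sum(u[i] * v[j] * T[i, j, :] for i in range(3) for j in range(4))
assert np.allclose(contracted, expected)
print(contracted.shape)  # (5,)
```

Contracting a 3-way tensor along two modes leaves a vector, just as contracting a matrix along one mode does.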
Example recovered topics: Justice, Education, Sports.
[Charts: training time (minutes) vs. number of topics for the spectral method and Mallet, on NYTimes (300,000 documents) and PubMed (8 million documents). The spectral method is 12x to 22x faster on average.]
Flexible + scalable repository.
➤ Disturbance forces near the ground
➤ Wind generation in the Caltech CAST wind tunnel
➤ Learn the unknown dynamics and design a nonlinear controller to cancel them (unknown moments are very limited in landing)
➤ Based on the learned dynamics, design a stable nonlinear controller (Neural-Lander)
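The idea can be sketched on a 1-D toy system (the placeholder residual force, the gains, and the double-integrator dynamics are all assumptions for illustration; Neural-Lander uses a spectrally normalized DNN and a full quadrotor model):

```python
import numpy as np

def learned_residual(state):
    """Stand-in for the learned aerodynamic residual (e.g. ground effect);
    this decaying exponential is purely a placeholder for the trained
    network."""
    height = state[0]
    return 0.5 * np.exp(-2.0 * height)  # disturbance grows near the ground

def controller(state, target, kp=4.0, kd=3.0):
    """Nonlinear controller sketch: PD feedback on a double integrator,
    plus cancellation of the learned residual force."""
    pos, vel = state
    u_nominal = -kp * (pos - target) - kd * vel
    return u_nominal - learned_residual(state)  # cancel learned dynamics

# Simulate a 1-D "landing": true dynamics x'' = u + residual(x).
dt, state = 0.01, np.array([2.0, 0.0])
for _ in range(2000):
    u = controller(state, target=0.0)
    acc = u + learned_residual(state)
    state = state + dt * np.array([state[1], acc])
print(abs(state[0]))  # settles near the target height
```

With the residual canceled exactly, the closed loop reduces to a stable linear system; in practice a Lipschitz bound on the learned network (via spectral normalization) is what makes the stability guarantee possible.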
NVIDIA research areas and leads:
➤ Robotics: Dieter Fox
➤ Learning & Perception: Jan Kautz
➤ Graphics: Dave Luebke, Alex Keller, Aaron Lefohn
➤ Architecture: Steve Keckler, Dave Nellans, Mike O'Connor
➤ Programming: Michael Garland
➤ VLSI: Brucek Khailany
➤ Circuits: Tom Gray
➤ Networks: Larry Dennison
➤ Chief Scientist: Bill Dally
➤ Computer vision: Sanja Fidler
➤ Core ML: Me!
➤ Applied research: Bryan Catanzaro
Full stack for ML: algorithms, data, infrastructure.