ADVANCES IN TRINITY OF AI: DATA, ALGORITHMS & COMPUTE (PowerPoint presentation)




SLIDE 1

ADVANCES IN TRINITY OF AI: DATA, ALGORITHMS & COMPUTE

Anima Anandkumar

Bren Professor at Caltech; Director of ML Research at NVIDIA

SLIDE 2

ALGORITHMS

  • OPTIMIZATION
  • SCALABILITY
  • MULTI-DIMENSIONALITY

DATA

  • COLLECTION
  • AGGREGATION
  • AUGMENTATION

INFRASTRUCTURE: FULL STACK FOR ML

  • APPLICATION SERVICES
  • ML PLATFORM
  • GPUS

TRINITY FUELING ARTIFICIAL INTELLIGENCE

SLIDE 3
DATA

  • COLLECTION: ACTIVE LEARNING, PARTIAL LABELS, ...
  • AGGREGATION: CROWDSOURCING MODELS, ...
  • AUGMENTATION: GENERATIVE MODELS, SYMBOLIC EXPRESSIONS, ...

SLIDE 4

ACTIVE LEARNING

[Diagram: small labeled set, large unlabeled pool]

Goal:

  • Reach SOTA with a smaller labeled dataset.
  • Active learning has been analyzed in theory.
  • In practice, it has been applied only to small classical models.

Can it work at scale with deep learning?
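The question above can be made concrete with a pool-based loop. Below is a minimal sketch of least-confidence active learning on toy data; the centroid classifier and all names are ours, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs (classes 0 and 1).
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

labeled = [0, 1, 100, 101]  # tiny balanced seed set
unlabeled = [i for i in range(len(X)) if i not in labeled]

def predict_proba(X_train, y_train, X_query):
    """Toy centroid classifier: softmax over negative distances to class means."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    d = -np.linalg.norm(X_query[:, None, :] - centroids[None], axis=-1)
    e = np.exp(d - d.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for _ in range(20):  # label 20 more points, one per round
    probs = predict_proba(X[labeled], y[labeled], X[unlabeled])
    # Least confidence: query the point whose top class probability is smallest.
    pick = unlabeled[int(np.argmin(probs.max(axis=1)))]
    labeled.append(pick)
    unlabeled.remove(pick)

probs = predict_proba(X[labeled], y[labeled], X)
accuracy = float((probs.argmax(axis=1) == y).mean())
```

The slides' point is that the same acquisition loop, run with a deep model instead of a toy classifier, is what needed to be demonstrated at scale.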

SLIDE 5

TASK: NAMED ENTITY RECOGNITION

SLIDE 6

RESULTS

NER task on the largest open benchmark (OntoNotes)

Active learning heuristics:

  • Least confidence (LC)
  • Maximum normalized log-probability (MNLP)

Deep active learning matches:

  • SOTA with just 25% of the data on English, 30% on Chinese.
  • The best shallow model (trained on full data) with 12% of the data on English, 17% on Chinese.

[Plots: test F1 score vs. percent of words annotated, for English and Chinese; curves for MNLP, LC, RAND, the best deep model, and the best shallow model]
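Given per-token probabilities from the tagger, the two heuristics can be scored as follows (a simplified sketch; the greedy per-token approximation and the variable names are ours):

```python
import numpy as np

def least_confidence(token_logprobs):
    """LC: 1 minus the probability of the most likely sequence
    (approximated here by the product of per-token max probabilities)."""
    return 1.0 - float(np.exp(np.sum(token_logprobs)))

def mnlp(token_logprobs):
    """MNLP: normalized log-probability. Dividing by length keeps long
    sentences from being selected merely for being long."""
    return -float(np.mean(token_logprobs))

# A confident short prediction vs. an unsure long one (made-up numbers).
short_sure = np.log([0.99, 0.98])
long_unsure = np.log([0.7, 0.6, 0.8, 0.65, 0.7])

# Both heuristics rank the unsure sentence as the better query.
assert least_confidence(long_unsure) > least_confidence(short_sure)
assert mnlp(long_unsure) > mnlp(short_sure)
```

Higher scores mark more uncertain sentences; the active learner queries labels for the highest-scoring ones first.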

SLIDE 7
TAKE-AWAY

  • Uncertainty sampling works; normalizing for length helps when data is scarce.
  • With active learning, deep models beat shallow ones even in the low-data regime.
  • With active learning, SOTA is reached with far fewer labeled samples.

SLIDE 8

ACTIVE LEARNING WITH PARTIAL FEEDBACK

[Diagram: images, binary questions ("dog?"), and the resulting partial labels (dog / non-dog)]

  • Hierarchical class labeling: labeling effort is proportional to the number of binary questions asked.
  • Can we actively pick informative questions?
SLIDE 9

RESULTS ON TINY IMAGENET (100K SAMPLES)

  • Yields 8% higher accuracy at 30% of the questions (w.r.t. Uniform).
  • Obtains full annotation with 40% fewer binary questions.

Method     Data selection   Question selection
ALPF-ERC   active           active
AL-ME      active           inactive
AQ-ERC     inactive         active
Uniform    inactive         inactive

[Plot: accuracy vs. number of questions asked, for Uniform, AL-ME, AQ-ERC, and ALPF-ERC; ALPF-ERC gains +8% accuracy and needs 40% fewer questions]
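Here "ERC" scores a binary question by the expected number of candidate classes it leaves. A minimal sketch, under the assumption that a question asks about a subset of classes (names ours):

```python
import numpy as np

def expected_remaining_classes(posterior, question_classes):
    """posterior: model's class probabilities for one image.
    question_classes: class indices covered by a binary question (e.g. 'dog?').
    Returns the expected number of candidate classes left after the answer."""
    posterior = np.asarray(posterior, dtype=float)
    p_yes = posterior[question_classes].sum()
    n_yes = len(question_classes)
    n_no = len(posterior) - n_yes
    # With prob p_yes the answer is "yes" and n_yes classes remain; else n_no.
    return p_yes * n_yes + (1.0 - p_yes) * n_no

# 4 classes; the model strongly suspects class 0.
posterior = [0.7, 0.1, 0.1, 0.1]
erc_single = expected_remaining_classes(posterior, [0])     # ask "class 0?"
erc_pair = expected_remaining_classes(posterior, [0, 1])    # ask "class 0 or 1?"
# The confident singleton question is expected to leave fewer candidates.
assert erc_single < erc_pair
```

The active question-selection strategies pick, for each image, the question minimizing this expected count.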

SLIDE 10
TWO TAKE-AWAYS

  • Don’t annotate from scratch: select questions actively based on the learned model.
  • Don’t sleep on partial labels: re-train the model from partial labels.

SLIDE 11

CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS

Majority rule

  • Simple and common.
  • Wasteful: ignores the differing quality of annotators.

Annotator-quality models

  • Can improve accuracy.
  • Hard: quality must be estimated without ground truth.

SLIDE 12

PROPOSED CROWDSOURCING ALGORITHM

Repeat:

  • Compute the posterior of the ground-truth labels given the annotator-quality model and the noisy crowdsourced annotations.
  • MLE: update annotator quality using the labels inferred from the model.
  • Train the model with a weighted loss, using the posterior as weights.
  • Use the trained model to infer ground-truth labels.
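The loop is EM-like. Assuming a single accuracy parameter per annotator and binary labels (a simplification of the talk's model; all names are ours), one round can be sketched as:

```python
import numpy as np

def em_round(annotations, quality):
    """annotations: (n_items, n_workers) array of 0/1 labels.
    quality: per-worker probability of being correct.
    Returns (posterior that the true label is 1, updated quality)."""
    annotations = np.asarray(annotations)
    quality = np.asarray(quality, dtype=float)
    # E-step: posterior of the ground-truth label given annotator quality.
    like1 = np.prod(np.where(annotations == 1, quality, 1 - quality), axis=1)
    like0 = np.prod(np.where(annotations == 0, quality, 1 - quality), axis=1)
    posterior = like1 / (like1 + like0)
    # M-step (MLE): update each worker's quality from the inferred labels.
    agree = annotations * posterior[:, None] + (1 - annotations) * (1 - posterior[:, None])
    quality = agree.mean(axis=0)
    return posterior, quality

# Three workers; worker 2 disagrees with the other two on half the items.
annotations = np.array([[1, 1, 0],
                        [1, 1, 0],
                        [0, 0, 0],
                        [1, 1, 1]])
posterior, quality = em_round(annotations, np.array([0.6, 0.6, 0.6]))
for _ in range(10):
    posterior, quality = em_round(annotations, quality)
# The two agreeing workers end up with higher estimated quality.
assert quality[0] > quality[2] and quality[1] > quality[2]
```

In the full method the "infer labels" step uses a trained predictor rather than the raw posterior, so model and annotator-quality estimates improve together.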

SLIDE 13

LABELING ONCE IS OPTIMAL: BOTH IN THEORY AND PRACTICE

MS-COCO dataset. Fixed budget: 35k annotations.

Theorem: Under a fixed budget, generalization error is minimized with a single annotation per sample.

Assumptions:

  • The best predictor is accurate enough (under no label noise).
  • Simplified case: all workers have the same quality.
  • Each worker's probability of being correct is > 83%.

[Plot on MS-COCO: performance vs. number of workers per sample; single annotation is 5% better w.r.t. majority rule]

SLIDE 14

DATA AUGMENTATION 1: GENERATIVE MODELING

GAN

Merits

  • Captures the statistics of natural images.
  • Learnable.

Perils

  • Feedback is real vs. fake: different from the prediction task.
  • Introduces artifacts.

SLIDE 15

PREDICTIVE VS GENERATIVE MODELS

[Diagram: predictive model P(y|x) vs. generative model P(x|y)]

One model to do both?

  • SOTA prediction comes from CNN models.
  • What class of p(x|y) yields CNN models for p(y|x)?
SLIDE 16

NEURAL DEEP RENDERING MODEL (NRM)

[Diagram: object category y → latent variables → intermediate renderings → image x]

Design joint priors for latent variables based on reverse-engineering CNN predictive architectures

SLIDE 17

NEURAL RENDERING MODEL (NRM)

[Diagram: NRM generation vs. CNN inference. Generation: choose a class template ("1.0 dog"), decide whether to render, upsample and select locations, down to the rendered image. Inference: the CNN maps the image through rectified, pooled, and unpooled feature maps up to class probabilities ("0.5 dog, 0.2 cat, 0.1 horse, ...")]

SLIDE 18

MAX-MIN CROSS-ENTROPY ➡ MAX-MIN NETWORKS

Cross-entropy loss for training the CNNs with labeled data:

$$\min_{\theta \in A_\gamma} H_{p,q}(y \mid x, z_{\max}) \;\ge\; \min_{(z_i)_{i=1}^{n},\,\theta}\; \frac{1}{n} \sum_{i=1}^{n} -\log p(y_i \mid x_i, z_i; \theta)$$

Max-Min loss for training the CNNs with labeled data:

$$\alpha_{\max} H_{p,q}(y \mid x, z_{\max}) + \alpha_{\min} H_{p,q}(y \mid x, z_{\min})$$

[Diagram: the input image passes through Max-Xentropy and Min-Xentropy networks with shared weights, combined into the Max-Min Xentropy]

  • Max cross-entropy maximizes the posteriors of correct labels; min cross-entropy minimizes the posteriors of incorrect labels.
  • Co-learning: the Max and Min networks try to learn from each other.
SLIDE 19

STATISTICAL GUARANTEES FOR THE NRM

Bound on the generalization error:

$$\text{Risk} \;\le\; \frac{\text{number of active rendering paths}}{\sqrt{n}}$$

  • Rendering path normalization: a new form of regularization.
  • The training loss in the CNN is equivalent to the likelihood in the NRM.
  • Max-Min NRM with RPN achieves SOTA on benchmarks.

SLIDE 20

DATA AUGMENTATION 2: SYMBOLIC EXPRESSIONS

Goal: Learn a domain of functions (sin, cos, log, add…)

  • Training on numerical input-output does not generalize.

Data Augmentation with Symbolic Expressions

  • Efficiently encode relationships between functions.

Solution:

  • Design networks that use both symbolic and numerical data.
SLIDE 21

ARCHITECTURE: TREE LSTM

$\sin^2\theta + \cos^2\theta = 1$    $\sin(-2.5) \approx -0.6$

  • Symbolic expression trees; function evaluation trees.
  • Decimal trees: encode numbers by their decimal representation (numerical).
  • Can encode any expression, function evaluation, and number.

Decimal Tree for 2.5
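As a toy illustration of expression trees and function evaluation (our own minimal evaluator, not the paper's TreeLSTM encoder):

```python
import math

# An expression tree is either a leaf number or a tuple (op, children...).
def evaluate(node):
    if isinstance(node, (int, float)):
        return float(node)
    op, *children = node
    args = [evaluate(c) for c in children]
    ops = {'add': lambda a, b: a + b,
           'mul': lambda a, b: a * b,
           'sin': math.sin,
           'cos': math.cos}
    return ops[op](*args)

# The identity sin^2(x) + cos^2(x) = 1, encoded as a tree, at x = -2.5.
x = -2.5
identity = ('add', ('mul', ('sin', x), ('sin', x)),
                   ('mul', ('cos', x), ('cos', x)))
assert abs(evaluate(identity) - 1.0) < 1e-12
# The numerical evaluation example from the slide: sin(-2.5) ≈ -0.6.
assert abs(evaluate(('sin', x)) - (-0.5985)) < 1e-3
```

The TreeLSTM consumes the same tree structure but learns vector representations at each node instead of computing exact values.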

SLIDE 22

RESULTS

  • Vastly improved numerical evaluation: 90% over the function-fitting baseline.
  • Generalizes to verifying symbolic equations of greater depth.
  • Combining symbolic + numerical data improves generalization on both tasks: symbolic and numerical evaluation.

Model                           Accuracy
LSTM: symbolic                  76.40%
TreeLSTM: symbolic              93.27%
TreeLSTM: symbolic + numeric    96.17%

SLIDE 23
ALGORITHMS

  • OPTIMIZATION: ANALYSIS OF CONVERGENCE
  • SCALABILITY: GRADIENT QUANTIZATION
  • MULTI-DIMENSIONALITY: TENSOR ALGEBRA

SLIDE 24

DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION

[Diagram: a parameter server connected to GPU 1 and GPU 2, each training on half the data]

SLIDE 25

DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION

[Diagram: the same setup, asking whether the gradients exchanged between each GPU and the parameter server can be compressed]

SLIDE 26

DISTRIBUTED TRAINING BY MAJORITY VOTE

[Diagram: each of GPU 1-3 sends sign(g) to the parameter server, which broadcasts sign[sum(sign(g))], i.e. an elementwise majority vote]
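The two communication rounds can be sketched in a few lines (an illustrative NumPy sketch; function and variable names are ours):

```python
import numpy as np

def majority_vote_step(params, worker_grads, lr=0.01):
    """One signSGD-with-majority-vote step: each worker sends only the sign
    of its gradient; the server returns the sign of the sign-sum."""
    signs = np.sign(worker_grads)         # each worker: 1 bit per weight
    update = np.sign(signs.sum(axis=0))   # server: elementwise majority vote
    return params - lr * update

params = np.zeros(4)
# Three workers: they agree on the sign of all but the last coordinate.
grads = np.array([[ 1.0, -2.0,  0.5,  3.0],
                  [ 2.0, -0.1,  0.2, -1.0],
                  [ 0.3, -1.0,  0.1, -0.2]])
params = majority_vote_step(params, grads)
# Majority signs are [+, -, +, -], so params move by [-lr, +lr, -lr, +lr].
assert np.allclose(params, [-0.01, 0.01, -0.01, 0.01])
```

Both directions of communication carry one bit per coordinate, which is where the throughput gain on the next slides comes from.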

SLIDE 27

SIGNSGD PROVIDES A “FREE LUNCH”

Throughput gain with only a tiny accuracy loss

p3.2xlarge machines on AWS; ResNet-50 on ImageNet

SLIDE 28

SIGNSGD ACROSS DOMAINS AND ARCHITECTURES

Huge throughput gain!

SLIDE 29

TAKE-AWAYS FOR SIGN-SGD

  • Converges even under biased gradients and noise.
  • Faster than SGD in theory and in practice.
  • For distributed training, achieves similar variance reduction to SGD.
  • In practice, similar accuracy with far less communication.
SLIDE 30

TENSORS FOR LEARNING IN MANY DIMENSIONS

SLIDE 31

TENSORS FOR MULTI-DIMENSIONAL DATA AND HIGHER-ORDER MOMENTS

  • Images: 3 dimensions. Videos: 4 dimensions.
  • Pairwise correlations (matrices); triplet correlations (third-order tensors).

SLIDE 32

OPERATIONS ON TENSORS: TENSOR CONTRACTION

Tensor contraction extends the notion of matrix product.

Matrix product ($M_j$: the $j$-th column of $M$):

$$Mv = \sum_j v_j M_j$$

Tensor contraction:

$$T(u, v, \cdot) = \sum_{i,j} u_i v_j T_{i,j,:}$$
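Both contractions can be checked numerically with einsum (an illustrative sketch; the shapes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 4))
v = rng.normal(size=4)
# Matrix product as a contraction: (Mv)_i = sum_j v_j M_ij.
assert np.allclose(np.einsum('ij,j->i', M, v), M @ v)

T = rng.normal(size=(3, 4, 5))
u = rng.normal(size=3)
# Tensor contraction T(u, v, ·)_k = sum_{i,j} u_i v_j T_ijk:
# contracting two modes of a third-order tensor leaves a vector.
result = np.einsum('i,j,ijk->k', u, v, T)
assert result.shape == (5,)

# The same thing written as an explicit sum over slices.
manual = sum(u[i] * v[j] * T[i, j, :] for i in range(3) for j in range(4))
assert np.allclose(result, manual)
```

Contracting only one mode, e.g. `np.einsum('i,ijk->jk', u, T)`, returns a matrix instead; contraction lowers the order by the number of modes contracted.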

SLIDE 33

DEEP NEURAL NETS: TRANSFORMING TENSORS

SLIDE 34

DEEP TENSORIZED NETWORKS

SLIDE 35

SPACE SAVING IN DEEP TENSORIZED NETWORKS
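The space saving can be made concrete with a parameter count. The following assumes a tensor-train factorization of a 4096 x 4096 layer with rank-4 cores (our illustrative numbers, not the slide's):

```python
# A dense 4096 x 4096 layer, reshaped as an (8,8,8,8) x (8,8,8,8) tensor
# and stored in tensor-train (TT) format.
dense_params = 4096 * 4096

modes_in, modes_out, rank = [8, 8, 8, 8], [8, 8, 8, 8], 4
# TT core k has shape (r_{k-1}, m_k, n_k, r_k), with boundary ranks 1.
ranks = [1, rank, rank, rank, 1]
tt_params = sum(ranks[k] * modes_in[k] * modes_out[k] * ranks[k + 1]
                for k in range(4))

assert dense_params == 16_777_216
assert tt_params == 2_560  # roughly 6500x fewer parameters
```

The compression ratio grows with the number of modes and shrinks as the TT ranks increase, so rank choice trades accuracy against space.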

SLIDE 36

TENSORS FOR LONG-TERM FORECASTING

Tensor-Train RNNs and LSTMs

Challenges:

  • Long-term dependencies
  • High-order correlations
  • Error propagation
SLIDE 37

TENSOR LSTM FOR LONG-TERM FORECASTING

[Plots: forecasting results on climate and traffic datasets]

SLIDE 38

TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA

  • Python programming
  • User-friendly API
  • Multiple backends: flexible + scalable
  • Example notebooks in the repository

SLIDE 39

A New Vision for Autonomy

Center for Autonomous Systems and Technologies

SLIDE 40

CAST: BRINGING ROBOTICS AND AI TOGETHER

SLIDE 41

FIRST SET OF RESULTS: LEARNING TO LAND

SLIDE 42
SLIDE 43

SOME RESEARCH LEADERS AT NVIDIA

Robotics: Dieter Fox

Learning & Perception: Jan Kautz

Graphics: Dave Luebke, Alex Keller, Aaron Lefohn

Architecture: Steve Keckler, Dave Nellans, Mike O’Connor

Programming: Michael Garland

VLSI: Brucek Khailany

Circuits: Tom Gray

Networks: Larry Dennison

Chief Scientist: Bill Dally

Computer vision: Sanja Fidler

Core ML: Me!

Applied research: Bryan Catanzaro

SLIDE 44
CONCLUSION

AI needs integration of data, algorithms and infrastructure.

  • DATA
    • Collection: active learning and partial feedback
    • Aggregation: crowdsourcing models
    • Augmentation: graphics rendering + GANs, symbolic expressions
  • ALGORITHMS
    • Convergence: SignSGD has good rates in theory and practice
    • Scalability: SignSGD has the same variance reduction as SGD for multi-machine training
    • Multi-dimensionality: tensor algebra for neural networks and probabilistic models
  • INFRASTRUCTURE
    • Frameworks: TensorLy is a high-level API for deep tensorized networks

SLIDE 45

COLLABORATORS (LIMITED LIST)

SLIDE 46

Thank you