Accelerate Framework & the Armadillo Library




  1. Accelerate Framework & the Armadillo Library Instructor - Simon Lucey 16-623 - Designing Computer Vision Apps

  2. Today • Motivation • Accelerate Framework • BLAS & LAPACK • Armadillo Library

  3. Algorithm → Software → Architecture → SOC Hardware

  4. Example: “Correlation Filters with Limited Boundaries”, Hamed Kiani Galoogahi (Istituto Italiano di Tecnologia, Genova, Italy), Terence Sim (National University of Singapore), Simon Lucey (Carnegie Mellon University, Pittsburgh, USA).
Abstract: Correlation filters take advantage of specific properties in the Fourier domain that allow them to be estimated efficiently: O(ND log D) in the frequency domain versus O(D^3 + ND^2) spatially, where D is the signal length and N is the number of signals. Recent extensions to correlation filters, such as MOSSE, have reignited interest in their use in the vision community due to their robustness and attractive computational properties. In this paper we demonstrate, however, that this computational efficiency comes at a cost: only a 1/D proportion of shifted examples are unaffected by boundary effects, which has a dramatic effect on detection/tracking performance. We propose a novel approach to correlation filter estimation that (i) takes advantage of inherent computational redundancies in the frequency domain, (ii) dramatically reduces boundary effects, and (iii) implicitly exploits all possible patches densely extracted from the training examples during learning. Impressive object tracking and detection results are presented in terms of both accuracy and computational efficiency.
Figure 1: (a) the fixed spatial support within the image from which the peak correlation output should occur; (b) the desired output response, based on (a), of the correlation filter when applied to the entire image; (c) a subset of patch examples used in a canonical correlation filter, where green denotes a non-zero correlation output and red denotes a zero correlation output, in direct accordance with (b); (d) a subset of patch examples used in the proposed correlation filter, which uses all possible patches stemming from different parts of the image, whereas the canonical correlation filter simply employs circularly shifted versions of the same single patch. The last two patches of (d) show that D − 1 patches near the image border are affected by circular shift, which can be greatly diminished by choosing D ≪ T, where D and T indicate the length of the vectorized face patch in (a) and the whole image in (a), respectively. The central dilemma of the paper is how to perform (d) efficiently in the Fourier domain.

  5. Algorithm → Software → Architecture → Hardware

  6. SIMD (Single Instruction, Multiple Data): one instruction operates on short vectors (length 2, 4, 8, …) of integers or floats, e.g. a 4-way add. Instruction-set names: MMX, SSE, SSE2, …

  7. Reminder: the CPU clock is stuck! • CPU clock speeds have been stuck at about 3 GHz since 2006 due to high power consumption (up to 130 W per chip) • chip circuitry is still doubling every 18-24 months ⇒ more on-chip memory and MMUs (memory management units) ⇒ specialised hardware (e.g. multimedia, encryption) ⇒ multi-core (multiple CPUs on one chip) • peak chip performance is still doubling every 18-24 months. Taken from http://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec0.pdf

  8. (Figure: comparison between 2010 and 2015.)

  9. (Taken from http://bgr.com/2016/08/22/galaxy-note-7-vs-iphone-6-speed-test/)

  10. Architecture Considerations • Memory hierarchy. • Vector instructions: SIMD (Single Instruction, Multiple Data), e.g. a 4-way add on vectors (length 2, 4, 8, …) of integers or floats; names: MMX, SSE, SSE2, … • Multiple threads. • Branch prediction.

  11. Writing fast vision code • In general you should NOT be trying to do these optimizations yourself. • BUT you should use tools to find where the biggest performance losses are coming from. • Xcode comes with an excellent tool for this called “Instruments”. • Ray Wenderlich has a useful tutorial (see link) on using Instruments in Xcode. • More on this in later lectures.

  12. Emerging Alternatives to OpenCV • FastCV (https://developer.qualcomm.com/software/fastcv-sdk) • OpenVX (https://www.khronos.org/openvx/) • Itseez accelerated CV library (http://opencv.org/itseez-announces-release-of-accelerated-cv-library.html) • GPUImage (https://github.com/BradLarson/GPUImage)

  13. OpenVX versus OpenCV
• Implementation: OpenCV is a community-driven open source library; OpenVX is an open standard API designed to be implemented by hardware vendors.
• Conformance: OpenCV has an extensive test suite but no formal adopters program; OpenVX implementations must pass a defined conformance test suite to use the trademark.
• Consistency: OpenCV's available functions can vary depending on implementation/platform; in OpenVX, all core functions must be available in all conformant implementations.
• Scope: OpenCV is very wide (1000s of imaging and vision functions) with multiple camera APIs/interfaces; OpenVX has a tight focus on core hardware-accelerated functions for mobile vision (but is extensible) and uses the external/native camera API.
• Efficiency: OpenCV has a memory-based architecture (each operation reads and writes to memory); OpenVX uses graph-based execution (optimizable computation and data transfer).
• Typical use case: OpenCV suits rapid experimentation and prototyping, especially on desktop; OpenVX suits production development and deployment on mobile and embedded devices.
• Embedded deployment: OpenCV as re-usable code; OpenVX as a callable library.
(Taken from https://www.khronos.org/openvx/)

  14. Today • Motivation • Accelerate Framework • BLAS & LAPACK • Armadillo Library

  15. Accelerate Framework

  16. Accelerate Framework: a timeline from OS X Jaguar through Tiger on the Mac, and from iOS 4 to iOS 5 (which added vForce). Taken from: http://www.mactech.com/sites/default/files/Biggus-Accelerate_IV.pdf

  17. Accelerate Framework timeline, 1980-2010: LAPACK and BLAS (the underlying open libraries), plus Apple's vForce, vMathLib, vDSP, vBasicOps, vBigNum, and vImage. Taken from: http://www.mactech.com/sites/default/files/Biggus-Accelerate_IV.pdf

  18. Accelerate Framework: vImage (“image operations”), BLAS/LAPACK (“matrix operations”), vDSP (“signal processing”), vForce/vMathLib (“misc math”), and, new in 2016, BNNS: “basic neural network subroutines”. (Taken from https://www.bignerdranch.com/blog/neural-networks-in-ios-10-and-macos/ )

  19. Today • Motivation • Accelerate Framework • BLAS & LAPACK • Armadillo Library

  20. Matrix-Matrix Multiplication (MMM) on a 2 x Core 2 Duo, 3 GHz: performance [Gflop/s] of MMM kernel functions versus matrix size (up to ~9,000). Starting from a naive implementation, exploiting the memory hierarchy buys ~20x, vector instructions a further ~4x, and multiple threads another ~4x (approaching ~45 Gflop/s). The compiler doesn't do this job for you; >> A*B (in MATLAB) reaches the fast curve because it calls tuned library code. Taken from Markus Püschel - “How to Write Fast Numerical Code”.

  21. BLAS • Basic Linear Algebra Subprograms • Level 1: vector-vector operations (70s) • Level 2: matrix-vector operations (mid 80s) • Level 3: matrix-matrix operations (late 80s) • BLAS was originally used to implement the linear algebra subroutine library LINPACK.

  22. The Path to LAPACK • EISPACK and LINPACK (early 70s) • Libraries for linear algebra algorithms • Jack Dongarra, Jim Bunch, Cleve Moler, Gilbert Stewart • LINPACK is still the name of the benchmark for the TOP500 (Wiki) list of the most powerful supercomputers • Problem: the implementation was vector-based, giving low operational intensity (e.g., MMM as a double loop over scalar products of vectors) and hence low performance on computers with a deep memory hierarchy (in the 80s) • Solution: LAPACK • Reimplement the algorithms “block-based,” i.e., with locality • Developed late 1980s, early 1990s • Jim Demmel, Jack Dongarra et al. Taken from Markus Püschel - “How to Write Fast Numerical Code”.

  23. Availability of LAPACK • LAPACK is available on nearly all platforms, via numerous implementations: • Intel MKL (Windows, Linux, OS X) • AMD ACML • OpenBLAS (Windows, Linux, Android, OS X) • Apple Accelerate (OS X, iOS)

  24. Which is Easier to Follow?
