Harris Corner Detection on a NUMA Manycore, Claude TADONKI


SLIDE 1

Harris Corner Detection on a NUMA Manycore

Claude TADONKI

Centre de Recherche en Informatique (CRI)

Joint work with Olfa HAGGUI and Lionel LACASSAGNE

Sousse National School of Engineering - Laboratoire d’Informatique de Paris 6 (LIP6)

Mines ParisTech - PSL

SLIDE 2

Claude TADONKI, Harris Corner Detection on a NUMA Manycore
Seminar at Centre de Recherche en Informatique - April 16, 2018 - FONTAINEBLEAU - FRANCE

From the intensity I (color is not needed), we need to compute (approximated) derivatives and combine them.

Corner points are used for motion detection, for instance.

SLIDE 3

The procedure applies a series of convolution kernels to the input intensity matrix.

Each convolution is a stencil computation.

The whole computation can be fully serial, fully pipelined, or hybrid.

Memory access patterns are the main focus with respect to performance.
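As an illustration of "a series of convolution kernels", here is a minimal sketch of a fully serial schedule in plain Python, where every stage writes a complete intermediate image. The 3x3 Sobel and binomial Gauss kernels, the response det(M) - k * trace(M)^2 with k = 0.04, and all function names are assumptions made for the example; the slides do not specify them.

```python
# Fully serial Harris pipeline on a small grayscale image (list of rows).
# Assumed, not taken from the slides: 3x3 Sobel and binomial Gauss kernels,
# the response det(M) - k * trace(M)^2 with k = 0.04, borders left at 0.
# The stencils are applied in correlation form (no kernel flip).

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
GAUSS = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def stencil3x3(img, kernel):
    """Apply a 3x3 stencil to every interior point; border points stay 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = sum(
                kernel[di + 1][dj + 1] * img[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
            )
    return out


def multiply(a, b):
    """Point-wise product of two images of the same shape."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def harris_serial(img, k=0.04):
    """Sobel, point-wise products, Gauss smoothing, then the corner response."""
    ix, iy = stencil3x3(img, SOBEL_X), stencil3x3(img, SOBEL_Y)
    sxx = stencil3x3(multiply(ix, ix), GAUSS)
    sxy = stencil3x3(multiply(ix, iy), GAUSS)
    syy = stencil3x3(multiply(iy, iy), GAUSS)
    h, w = len(img), len(img[0])
    return [
        [
            sxx[i][j] * syy[i][j] - sxy[i][j] ** 2
            - k * (sxx[i][j] + syy[i][j]) ** 2
            for j in range(w)
        ]
        for i in range(h)
    ]
```

With this formulation a straight edge yields a non-positive response, while a corner of a bright quadrant yields a positive one. Each call to `stencil3x3` reads and writes a whole image, which is the intermediate traffic a pipelined schedule tries to avoid.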

SLIDE 4

Stencil computation: redundant memory accesses, cache misses, and misalignment

Scheduling of the convolutions: intermediate reads/writes (space and access time)

SIMD: not efficient in its standard form (what we get from the compiler)

Shared-memory (SM) parallelism: bus contention, thread synchronization, NUMA

We are going to explain our approach for each of the aforementioned aspects.

SLIDE 5

Separability Half-Pipe Clustering
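"Separability" presumably refers to factoring a 2D kernel into two 1D passes. The sketch below assumes the Sobel-x kernel, which is the product of the column [1, 2, 1] and the row [-1, 0, 1], and checks that the two 1D passes give the same result as the 3x3 stencil. The kernel choice and the function names are assumptions made for the example.

```python
# Separable 3x3 filtering: a vertical 1D pass followed by a horizontal 1D pass.
# Assumed for illustration: the Sobel-x kernel, which factors as
# column [1, 2, 1] times row [-1, 0, 1]. Border points are left at 0.

COL = [1, 2, 1]
ROW = [-1, 0, 1]
SOBEL_X = [[c * r for r in ROW] for c in COL]


def stencil2d(img, kernel):
    """Direct 3x3 stencil: 9 multiply-adds per interior point."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = sum(
                kernel[a][b] * img[i + a - 1][j + b - 1]
                for a in range(3)
                for b in range(3)
            )
    return out


def separable(img, col, row):
    """Two 1D passes: 3 + 3 multiply-adds per interior point."""
    h, w = len(img), len(img[0])
    # Vertical pass: tmp[i][j] combines rows i-1, i, i+1 of column j.
    tmp = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(w):
            tmp[i][j] = sum(col[a] * img[i + a - 1][j] for a in range(3))
    # Horizontal pass: out[i][j] combines columns j-1, j, j+1 of tmp.
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = sum(row[b] * tmp[i][j + b - 1] for b in range(3))
    return out
```

The vertical sums in `tmp` are shared by neighbouring output points, which is where the saving in arithmetic and memory accesses comes from.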

SLIDE 6

Vectorization with the original memory access pattern leads to unaligned accesses.

We propose a diagonal shift to keep all accesses aligned.

The last vector register contains 2 dirty values, but the whole vector is stored (4-component vector).

SLIDE 7

The goal here is to load data into vector registers once, and then perform all dependent calculations (optimal data consumption and savings in memory accesses).
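A scalar analogue of this idea can be sketched in Python: the 3x3 neighbourhood lives in local variables (standing in for vector registers), only one new column is loaded per step, and every dependent quantity is computed before the window moves. The Sobel derivatives, the `loads` counter, and the function name are assumptions made for the example; this is not the vectorized code of the talk.

```python
# Scalar analogue of an in-register strategy: the 3x3 neighbourhood is kept in
# local variables, only one new column (3 values) is loaded per step, and every
# dependent quantity (Ix, Iy and their products) is computed before moving on.
# Assumed for illustration: Sobel derivatives; `loads` counts image reads.

def sobel_products_row(img, i):
    """Return (Ixx, Ixy, Iyy, loads) for the interior points of row i."""
    w = len(img[0])
    loads = 0

    def load_column(j):
        nonlocal loads
        loads += 3
        return img[i - 1][j], img[i][j], img[i + 1][j]

    left, mid = load_column(0), load_column(1)
    ixx, ixy, iyy = [], [], []
    for j in range(1, w - 1):
        right = load_column(j + 1)
        # Sobel-x: weighted right column minus weighted left column.
        gx = (right[0] + 2 * right[1] + right[2]) - (left[0] + 2 * left[1] + left[2])
        # Sobel-y: weighted bottom row minus weighted top row.
        gy = (left[2] + 2 * mid[2] + right[2]) - (left[0] + 2 * mid[0] + right[0])
        ixx.append(gx * gx)
        ixy.append(gx * gy)
        iyy.append(gy * gy)
        left, mid = mid, right  # rotate the window instead of reloading it
    return ixx, ixy, iyy, loads
```

For a row of width w this performs 3 * w image reads in total, whereas evaluating the two Sobel stencils independently would read 6 values each per interior point.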

SLIDE 8

Thus, for the computation of an entire row, the typical steps at each iteration are:

SLIDE 9

SLIDE 10

We consider the half-pipe clustering.

We pipeline the two cluster steps (SOBEL+MUL and GAUSS+COARSITY) through a loop fusion.

We apply an array contraction (mod 3) for the intermediate storage.
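The combination of loop fusion and array contraction can be sketched as follows: both stages run inside one row loop, and the intermediate image shrinks to 3 rows addressed modulo 3. The stage bodies are simplified stand-ins (a squared Sobel-x derivative for SOBEL+MUL, a binomial smoothing for GAUSS+COARSITY), and all names are assumptions made for the example.

```python
# Two pipelined stages fused into one row loop, with the intermediate image
# contracted to 3 rows addressed modulo 3.
# Assumed for illustration: stage 1 is a squared Sobel-x derivative (standing
# in for SOBEL+MUL), stage 2 is a 3x3 binomial smoothing (standing in for
# GAUSS+COARSITY). Border points are left at 0.

def stage1_row(img, i):
    """Squared Sobel-x derivative for row i (0 on the image border)."""
    h, w = len(img), len(img[0])
    row = [0] * w
    if 0 < i < h - 1:
        for j in range(1, w - 1):
            gx = sum(c * (img[i + d][j + 1] - img[i + d][j - 1])
                     for d, c in ((-1, 1), (0, 2), (1, 1)))
            row[j] = gx * gx
    return row


def stage2_row(top, mid, bot):
    """3x3 binomial smoothing of one row from three intermediate rows."""
    w = len(mid)
    out = [0] * w
    for j in range(1, w - 1):
        out[j] = sum(c * (t[j - 1] + 2 * t[j] + t[j + 1])
                     for t, c in ((top, 1), (mid, 2), (bot, 1)))
    return out


def fused(img):
    """Fused loop: produce intermediate row i, then consume rows i-2..i."""
    h, w = len(img), len(img[0])
    ring = [None, None, None]  # contracted intermediate: 3 rows only
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        ring[i % 3] = stage1_row(img, i)  # overwrites row i-3, no longer needed
        if i >= 2:
            out[i - 1] = stage2_row(ring[(i - 2) % 3], ring[(i - 1) % 3], ring[i % 3])
    return out


def unfused(img):
    """Reference: materialise the full intermediate image, then smooth it."""
    h, w = len(img), len(img[0])
    tmp = [stage1_row(img, i) for i in range(h)]
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        out[i] = stage2_row(tmp[i - 1], tmp[i], tmp[i + 1])
    return out
```

Three rows suffice because the second stage is a 3-row stencil: by the time row i overwrites slot i mod 3, row i-3 has already been consumed.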

SLIDE 11

Both input and output images are stored on NUMA node 0.

Each NUMA node locally computes its chunk (block of lines) of the final result.

Within each NUMA node, the work is equally distributed by block among its threads.

The expected memory allocation on the NUMA nodes is done by explicit binding routines.
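The two-level distribution described above can be illustrated by its index arithmetic alone. The sketch below maps each (node, thread) pair to a block of image rows; it does not perform any memory or thread binding, and the function names as well as the node and thread counts are hypothetical.

```python
# Two-level row partition: image rows -> one chunk per NUMA node -> one block
# per thread inside the node. Only the index arithmetic is shown; the actual
# memory and thread binding is not performed here.
# The function names and the node/thread counts are hypothetical.

def split(start, stop, parts):
    """Split [start, stop) into contiguous ranges whose sizes differ by at most 1."""
    size, extra = divmod(stop - start, parts)
    ranges = []
    for p in range(parts):
        length = size + (1 if p < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def numa_partition(height, nodes, threads_per_node):
    """Map (node, thread) to the half-open row range it computes."""
    plan = {}
    for node, (lo, hi) in enumerate(split(0, height, nodes)):
        for thread, block in enumerate(split(lo, hi, threads_per_node)):
            plan[(node, thread)] = block
    return plan
```

For example, `numa_partition(10, 2, 2)` assigns rows 0-2 and 3-4 to the two threads of node 0, and rows 5-7 and 8-9 to those of node 1. With 3x3 stencils, each block also has to read one neighbouring row on each side of its range.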

SLIDE 12

(1) Original SIMD without the in-register strategy

(2) Optimized SIMD with the in-register strategy

The in-register strategy doubles the overall performance, and our sequential implementation outperforms the state of the art in absolute performance.

SLIDE 13

SLIDE 14

SLIDE 15