Itanium Power Programming

Itanium Power Programming (Sverre Jarp, CERN openlab) - PowerPoint PPT Presentation



  1. “Itanium Power Programming”, Sverre Jarp (S. Jarp), CERN openlab, Summer 2005

  2. Agenda
     - Lesson 1: a) Introduction; b) Overview of Architecture and Conventions
     - Lesson 2: a) Standard Instruction Set; b) Our first “real” example
     - Lesson 3: a) Secrets of Speed; b) An improved version of our example
     - Lesson 4: a) Multimedia Instructions; b) A top-notch version of our example
     - Lesson 5: a) Floating-point Instructions; b) Changing our example to handle floating-point
     - Lesson 6: a) Compilers and Assemblers: Peaceful coexistence?; b) Conclusions
     - Appendices

  3. Part 1a: Introduction

  4. Presentation Objectives
     - Offer programmers:
       - Comprehension of the architecture
         - Instruction set and other features
       - Working understanding of Itanium machine code
         - Compiler-generated code
         - Hand-written assembler code
       - Inspiration for writing code
         - Well-targeted assembler routines
         - Highly optimized routines
         - In-line assembly code
         - Full control of architectural features

  5. Part 1b: Overview of Architecture and Conventions

  6. Architectural Highlights
     - (Some of the) main innovations:
       - Rich Instruction Set
       - Bundled Execution
       - Predicated Instructions
       - Large Register Files
       - Register Stack
       - Rotating Registers
       - Software Pipelined Loops
       - Control/Data Speculation
       - Cache Control Instructions
       - High-precision Floating-Point

  7. A simple example
     - Lots of details
     - Many questions

             .proc getval
             getval:
                   alloc r3=ar.pfs,R_input,R_local,R_output,R_rotating
             (p0)  movl r2=Table             // Base table address
             (p0)  and in0=7,in0             // Choice is 0 – 7
                   ;;
             (p0)  shladd r2=in0,3,r2        // Index table
                   ;;
             (p0)  ldfd f8=[r2]              // Load value
             (p0)  br.ret.sptk.few rp        // return

     - Callouts on the slide: application registers (ar.pfs), register allocation (alloc), enforced instruction separation (;;), predicated execution ((p0)), branch return (br.ret)
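The data flow of `getval` can be followed in a small Python model. This is only a sketch of what the three data instructions compute, not Itanium code: the table contents, `TABLE_BASE`, and the function names are made up for illustration, and predication, `alloc`, and the return branch are not modelled.

```python
# Hypothetical stand-ins for the slide's `Table`: eight doubles at an assumed address.
TABLE = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
TABLE_BASE = 0x1000   # assumed base address, for illustration only
ENTRY_SIZE = 8        # ldfd loads an 8-byte double
MASK64 = 0xFFFFFFFFFFFFFFFF


def shladd(r2, count, r3):
    """Model of `shladd r1=r2,count,r3`: shift r2 left by count, then add r3."""
    return ((r2 << count) + r3) & MASK64


def getval(in0):
    index = in0 & 7                          # and in0=7,in0   (choice is 0-7)
    address = shladd(index, 3, TABLE_BASE)   # shladd r2=in0,3,r2 (index * 8 + base)
    return TABLE[(address - TABLE_BASE) // ENTRY_SIZE]  # ldfd f8=[r2]
```

Under these assumptions `getval(10)` returns the same entry as `getval(2)`, because the `and` masks the argument to the range 0 to 7 before it is used as an index.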

  8. User Register Overview
     - 128 Integer Registers
     - 128 Floating Point Registers
     - 64 Predicate Registers
     - 8 Branch Registers
     - 128 Application Registers
     - 5 CPUID Registers
     - Instruction Pointer
     - 16 Kernel Backup Registers
     - 8 Region Registers
     - 128 Control Registers
     - NN Debug Breakpoint Registers
     - NN Perf. Mon. Data Registers

  9. IA64 Common Registers
     - Integer registers
       - 128 in total; width is 64 bits + 1 bit (NaT); r0 = 0
       - Integer, logical and multimedia data
     - Floating-point registers
       - 128 in total; 82 bits wide
       - 17-bit exponent, 64-bit significand
       - f0 = 0.0; f1 = 1.0
       - Significand also used for two SIMD floats
     - Predicate registers
       - 64 in total; 1 bit each (fire / do not fire)
       - p0 = 1 (default value)
     - Branch registers
       - 8 in total; 64 bits wide (for address)

  10. Rotating Registers
      - Upper 75% rotate (when activated):
        - General registers (r32-r127)
        - Floating-point registers (f32-f127)
        - Predicate registers (p16-p63)
      - Formula:
        - Virtual Register = Physical Register – Register Rotation Base (RRB)
      - [Figure: floating-point register file, showing f28-f35 … f124-f127]
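The formula on the rotating-registers slide can be tried out with a short Python sketch for the floating-point file. Two things are assumptions here rather than statements from the slide: the mapping wraps around modulo the size of the rotating region, and each rotation decrements RRB by one. The function name is illustrative.

```python
ROT_FIRST = 32   # first rotating floating-point register (f32)
ROT_SIZE = 96    # f32-f127, the upper 75% of the file


def physical_register(virtual, rrb):
    """Map a virtual register number to a physical one.

    Rearranges the slide's formula (virtual = physical - RRB) and wraps
    inside the rotating region; static registers (below f32) never move.
    """
    if virtual < ROT_FIRST:
        return virtual
    return ROT_FIRST + (virtual - ROT_FIRST + rrb) % ROT_SIZE
```

With `rrb = 0`, virtual f32 is physical register 32. After one rotation (`rrb = -1` under the assumption above), the same physical register is reached under the name f33, which is what lets a value "move" to the next register name without being copied.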

  11. Register Convention
      - Run-time:
        - Branch registers:
          - B0: Call register [rp]
          - B1-B5: Must be preserved
          - B6-B7: Scratch
        - General registers:
          - R1: Global Data Pointer [gp]
          - R2-R3: Scratch
          - R4-R7: Must be preserved
          - R8-R11: Procedure Return Values [ret0, ret1, ret2, ..]
          - R12: Stack Pointer [sp]
          - R13: (Reserved as) Thread Pointer
          - R14-R31: Scratch
          - R32-Rxx: Argument Registers [in0, in1, in2, ..]

  12. Register Convention (2)
      - Run-time convention:
        - Floating-point:
          - F2-F5: Must be preserved
          - F6-F7: Scratch
          - F8-F15: Argument/Return Registers
          - F16-F31: Must be preserved
          - F32-F127: Scratch
        - Predicates:
          - P1-P5: Must be preserved
          - P6-P15: Scratch
          - P16-P63: Must be preserved
        - Additionally:
          - ar.lc: Must be preserved

  13. Register Stack Rules
      - The rotating integer registers serve as a stack
      - Each routine allocates via the “alloc” instruction:
        - Input + Local + Output
        - “R_rotate” <= “R_input + R_local” may rotate (in a multiple of 8 registers)
      - [Figure: register stack frames across calls. Proc A: Local A, Output A. Proc B: Input B + Local B, Output B. Proc C: further calls.]

  14. Instruction Types
      - M: Memory/Move Operations
      - I: Complex Integer/Multimedia Operations
      - A: Simple Integer/Logic/Multimedia Operations
      - F: Floating Point Operations (Normal/SIMD)
      - B: Branch Operations
      - L: Special instructions with 64-bit immediate

  15. Instruction Bundle
      - Bundle as “packaging entity”:
        - 3 * 41-bit Instruction Slots
        - 5 bits for Template (of instruction types)
          - Typical examples: MFI or MIB
          - Including bit for Instruction Group Separation “S”
      - A bundle is 16B:
        - Basic unit for expressing parallelism
        - The unit that the Instruction Pointer points to
        - The unit you branch to
        - What is actually executed may be less than, equal to, or more than one bundle
      - [Figure: bundle layout, from high bits to low bits: Slot 2, Slot 1, Slot 0, Template (T)]
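The bundle arithmetic (3 × 41 + 5 = 128 bits = 16 bytes) can be checked with a small Python sketch that packs and unpacks a bundle as one integer. The field order follows the slide's figure (template in the lowest bits, then Slot 0, Slot 1, Slot 2); the function names are illustrative and the slot contents are treated as opaque numbers, not decoded instructions.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
BUNDLE_BITS = 3 * SLOT_BITS + TEMPLATE_BITS   # 128
BUNDLE_BYTES = BUNDLE_BITS // 8               # 16

SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three opaque 41-bit slots into a 128-bit integer."""
    if template & ~TEMPLATE_MASK:
        raise ValueError("template does not fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if slot & ~SLOT_MASK:
            raise ValueError("slot does not fit in 41 bits")
    return (template
            | (slot0 << TEMPLATE_BITS)
            | (slot1 << (TEMPLATE_BITS + SLOT_BITS))
            | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)))


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, slot0, slot1, slot2)."""
    template = bundle & TEMPLATE_MASK
    slot0 = (bundle >> TEMPLATE_BITS) & SLOT_MASK
    slot1 = (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK
    slot2 = (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK
    return template, slot0, slot1, slot2
```

Because every bundle is exactly 16 bytes, bundle addresses are multiples of 16, which fits the slide's point that the bundle is the unit the Instruction Pointer points to.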

  16. Instruction Group Separation (Stop bit)
      - Necessary to avoid “Dependency Violations”
        - For ALL registers: Integer, FP, Predicate, Branch, App., etc.
      - Two out of four possibilities (forbidden):
        - Read-After-Write (RAW):
          - add r22=1,r21 ; add r23=1,r22 ;;
        - Write-After-Write (WAW):
          - add r22=1,r21 ; add r22=1,r23 ;;
        - Good assemblers will issue the necessary warnings!
      - Two out of four (OK):
        - Read-After-Read (RAR):
          - add r22=1,r21 ; add r23=1,r21 ;;
        - Write-After-Read (WAR):
          - add r23=1,r22 ; add r22=1,r21 ;;
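The four cases on the stop-bit slide can be expressed as a small Python check of the kind an assembler might perform before warning. This is a simplified sketch: each instruction is reduced to a hypothetical `(destination, sources)` pair, and architectural special cases are ignored.

```python
def classify(first, second):
    """Return the dependency kinds between two instructions in program order.

    Each instruction is a (destination, sources) pair, e.g. ("r22", ["r21"])
    for `add r22=1,r21`.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 in srcs2:
        kinds.add("RAW")   # second reads what first writes
    if dest1 == dest2:
        kinds.add("WAW")   # both write the same register
    if dest2 in srcs1:
        kinds.add("WAR")   # second overwrites what first reads
    if set(srcs1) & set(srcs2):
        kinds.add("RAR")   # both read the same register
    return kinds


def needs_stop(first, second):
    """RAW and WAW are the forbidden cases inside one instruction group."""
    return bool(classify(first, second) & {"RAW", "WAW"})
```

Applied to the slide's pairs, `needs_stop` is true for the RAW and WAW examples and false for the RAR and WAR examples.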

  17. Conventions
      - Instruction syntax
        - (qp) ops[.comp1] r1 = r2, r3
        - Execution is always right-to-left
        - Result(s) on left-hand side of equal-sign
        - Almost all instructions have a qualifying predicate
        - Many have further completers: unsigned, left, double, etc.
      - Numbering
        - Also right-to-left: bits 7 6 5 4 3 2 1 0 within a byte, 63 down to 0 within a register
      - Immediates
        - Various sizes exist
        - Imm8 (signed immediate: 7 bits plus sign)
        - At execution time, the sign bit is extended all the way to bit 63
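The sign-extension rule for immediates can be illustrated in Python. The helper below is a generic sketch (the name `sign_extend` is illustrative); it shows how an imm8 value, 7 bits plus sign, becomes a full 64-bit register value.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits):
    """Extend the top bit of a `bits`-wide field all the way to bit 63."""
    value &= (1 << bits) - 1           # keep only the immediate field
    sign = 1 << (bits - 1)             # position of the sign bit
    return ((value ^ sign) - sign) & MASK64
```

For example, the imm8 pattern `0xFF` (that is, -1) extends to `0xFFFFFFFFFFFFFFFF`, while `0x7F` stays `0x7F`.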

  18. Part 2a: Standard Instruction Set

  19. The Total Instruction Set
      - Many instruction categories:
        - Logical operations (e.g. and)
        - Arithmetic operations (e.g. add)
        - Compare operations
        - Shift operations
        - Branches, including loop control
        - Memory and cache operations
        - Move operations
        - Multimedia operations (e.g. padd)
        - Floating Point operations (e.g. fma)
        - SIMD Floating Point operations (e.g. fpma)
      - See documentation for complete reference set

  20. Arithmetic Operations
      - Instruction format:
        - (qp) ops1 r1 = r2, r3 [,1]
        - (qp) ops2 r1 = imm_x, r3
        - (qp) ops3 r1 = r2, count2, r3
      - Valid operations:
        - ops1: add, sub
        - ops2: sub, adds/addl (imm14, imm22)
        - ops3: shladd
      - NB: Integer multiply is an FLP operation
      - Callouts on the slide:
        - X86 Inc/Dec replaced with: (qp) ops r1 = r2,r0,1
        - Z = Y - imm becomes: (qp) add r1 = -imm, r3
        - Loading an immediate value: (qp) add r1 = imm, r0

  21. Compare Operations
      - Instruction format:
        - (qp) cmp.crel.ctype p1, p2 = r2, r3
        - (qp) cmp.crel.ctype p1, p2 = imm8, r3
        - (qp) cmp.crel.ctype p1, p2 = r0, r3 (parallel inequality form)
      - Valid relationships:
        - eq, ne, lt, le, gt, ge, ltu, leu, gtu, geu
      - Types:
        - none, unc, and, or, or.andcm, orcm, andcm, and.orcm
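How a compare fills its two target predicates can be sketched in Python for the `eq` relationship. Only the `none` and `unc` types are modelled; the parallel types (`and`, `or`, and so on) are left out, and the function name and argument order are illustrative.

```python
def cmp_eq(qp, r2, r3, p1, p2, ctype="none"):
    """Model `(qp) cmp.eq[.unc] p1, p2 = r2, r3`; returns the new (p1, p2).

    p1 and p2 are the current predicate values, needed when the
    instruction does not fire.
    """
    if ctype not in ("none", "unc"):
        raise ValueError("only the 'none' and 'unc' types are modelled")
    if qp:
        result = (r2 == r3)
        return result, not result   # p1 gets the result, p2 its complement
    if ctype == "unc":
        return False, False         # unc clears both targets when qp is 0
    return p1, p2                   # normal type: targets left unchanged
```

The complementary pair is what makes if/else predication cheap: instructions guarded by p1 form the "then" path and those guarded by p2 the "else" path.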

  22. Load Operations
      - Standard instructions (always post-modify):
        - (qp) ldsz.ldtype.ldhint r1 = [r3], r2
        - (qp) ldsz.ldtype.ldhint r1 = [r3], imm9
        - (qp) ldffsz.fldtype.ldhint f1 = [r3], r2
        - (qp) ldffsz.fldtype.ldhint f1 = [r3], imm9
      - Valid sizes:
        - sz: 1/2/4/8 [bytes] (the sign bit is NOT extended for 1/2/4 bytes)
        - fsz: s(ingle)/d(ouble)/e(xtended)/8 (as integer; used for integer multiply, for instance)
      - Types:
        - s/a/sa/c.nc/c.clr/c.clr.acq/acq/bias
        - Advanced options (not discussed here!); more complex usage (see manuals)
      - Also “fill” variants
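Two points from the load slide, post-modify addressing and the missing sign extension for 1/2/4-byte loads, can be shown with a Python sketch. Memory is modelled as a `bytes` object and little-endian byte order is assumed; the function names are illustrative. The `sxt` helper mirrors the separate sign-extend instructions (`sxt1`, `sxt2`, `sxt4`) that are needed when a signed value is wanted.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def load(memory, address, size):
    """Model `ld<size> r1=[r3]`: the value is zero-extended to 64 bits."""
    return int.from_bytes(memory[address:address + size], "little")


def load_post_modify(memory, r3, size, increment):
    """Model `ld<size> r1=[r3],increment`: load first, then update r3."""
    value = load(memory, r3, size)
    return value, (r3 + increment) & MASK64


def sxt(value, size):
    """Model `sxt<size>`: sign-extend the low `size` bytes to 64 bits."""
    bits = size * 8
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK64
```

Loading the byte `0xFF` gives 255, not -1; applying `sxt(value, 1)` afterwards yields the 64-bit pattern for -1. The post-modify form returns the address register already advanced, which suits walking through an array.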

  23. Branch Operations
      - Several different types:
        - Conditional or Call branches
          - Relative offset (IP-relative) or Indirect (via branch registers)
          - Triggered by predication
        - Return branches
          - Indirect + Qualifying Predicate (QP)
        - Loop controlling branches:
          - Simple Counted Loops (br.cloop)
            - IP-relative with AR.LC
          - Software-pipelined Counted Loop (br.ctop)
            - IP-relative with AR.LC and AR.EC
          - Software-pipelined While Loops (br.wtop)
            - IP-relative with QP and AR.EC

  24. Simple Counted Loop
      - Works as ‘expected’
        - ar.lc counts down the loop (automatically)
        - No need to use a general register

              mov ar.lc=5 ;;              // NB: 6 iterations
        loop: { work }
              …….
              { much more work }
              br.cloop.sptk.few loop ;;

      - Software-pipelined loops are more advanced
        - Use Epilogue Count (as well as Loop Count)
        - … and Rotating Registers
      - We will deal with such loops later
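The "NB: 6 iterations" remark follows from how `br.cloop` tests the counter: the branch is taken while ar.lc is non-zero and decrements it, so the body runs once more after the counter reaches zero. A minimal Python model (the function name is illustrative):

```python
def run_counted_loop(initial_lc):
    """Return (iterations executed, final ar.lc) for a br.cloop loop."""
    lc = initial_lc
    iterations = 0
    while True:
        iterations += 1      # { work } ... { much more work }
        if lc != 0:          # br.cloop: taken while ar.lc != 0
            lc -= 1          # ...and ar.lc is decremented
            continue
        break                # ar.lc == 0: fall through, loop ends
    return iterations, lc
```

So a loop meant to run N times is set up with `ar.lc = N - 1`.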
