
Advances in Loop Analysis Frameworks and Optimizations

Adam Nemet & Michael Zolotukhin, Apple

Loop Unrolling: for (x = 0; x < 6; x++) { foo(x); }

Loop Unrolling (unrolled by 2): for (x = 0; x < 6; x += 2) { foo(x); foo(x+1); }
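
A runnable sketch of the same transformation, assuming foo is an arbitrary per-iteration body (the printf body here is made up); a remainder loop is added for trip counts that are not a multiple of two, which the slide's trip count of 6 lets it omit:

    #include <stdio.h>

    /* Placeholder body; the slide only assumes some per-iteration work. */
    void foo(int x) { printf("%d\n", x); }

    /* Unrolled by 2, with a remainder loop for odd trip counts. */
    void unrolled(int n) {
        int x;
        for (x = 0; x + 1 < n; x += 2) {
            foo(x);
            foo(x + 1);
        }
        for (; x < n; x++)
            foo(x);
    }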


  1. Can We Vectorize It? Iteration K: t = dc[k-1] + tpdd[k-1]; dc[k] = t; Iteration K+1: t2 = dc[k] + tpdd[k]; dc[k+1] = t2;

  2. Can We Vectorize It? Iteration K: t = dc[k-1] + tpdd[k-1]; Iteration K+1: t2 = dc[k] + tpdd[k]; dc[k] = t; dc[k+1] = t2;

  3. Can We Vectorize It? dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY;
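
A self-contained sketch (the array contents are made-up values, not from the talk) of why this recurrence resists vectorization: iteration k+1 reads the element iteration k just stored, so packing two iterations into one vector step would read a stale dc[k].

    #include <stdio.h>

    int main(void) {
        double dc[6]   = {1.0, 0, 0, 0, 0, 0};
        double tpdd[6] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5};

        /* Scalar order: dc[k] must be fully written before dc[k+1] reads it. */
        for (int k = 1; k < 6; k++)
            dc[k] = dc[k-1] + tpdd[k-1];

        /* A 2-wide vector step would load dc[k-1] and dc[k] together, but the
           correct dc[k] is only produced by the first lane of that same step. */
        for (int k = 0; k < 6; k++)
            printf("dc[%d] = %g\n", k, dc[k]);
        return 0;
    }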

  4. Case Study for (k = 1; k <= M; k++) { mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; /* non-vectorizable: */ dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  5. Case Study for (k = 1; k <= M; k++) { /* vectorizable: */ mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; /* non-vectorizable: */ dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  6. Case Study if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; }

  7. Case Study if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; }

  8. Case Study for (k = 1; k <= M; k++) { /* vectorizable: */ mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; /* non-vectorizable: */ dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; /* non-vectorizable: */ if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  9. Case Study for (k = 1; k <= M; k++) { /* vectorizable: */ mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; } for (k = 1; k <= M; k++) { /* non-vectorizable: */ dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; /* non-vectorizable: */ if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  10. Plan • Distribute loop • Let LoopVectorizer vectorize top loop -> Partial Loop Vectorization

  11. Loop Distribution
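
As a source-level sketch of what loop distribution does, using generic arrays a, b and c rather than the hmmer code: a loop that mixes a vectorizable statement with a recurrence is split so the vectorizable part stands alone. This is only legal if the arrays do not overlap, which the legality slide's run-time checks establish.

    /* Before: one loop mixing a vectorizable statement with a recurrence. */
    void fused(double *a, const double *b, double *c, int n) {
        for (int i = 1; i < n; i++) {
            a[i] = b[i] * 2.0;        /* no loop-carried dependence   */
            c[i] = c[i-1] + b[i];     /* distance-1 recurrence on c[] */
        }
    }

    /* After distribution: the first loop can be vectorized (given the arrays
       do not alias); the second stays scalar. */
    void distributed(double *a, const double *b, double *c, int n) {
        for (int i = 1; i < n; i++)
            a[i] = b[i] * 2.0;
        for (int i = 1; i < n; i++)
            c[i] = c[i-1] + b[i];
    }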

  12. Pros and Cons + Partial loop vectorization + Improved memory access pattern: • Cache associativity • Number of HW prefetcher streams + Reduced spilling - Loop overhead - Instructions duplicated across the new loops - Less instruction-level parallelism within each loop

  13. Legality (Loop Dependence Analysis + Run-time Alias Checks) for (k = 1; k <= M; k++) { mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; } for (k = 1; k <= M; k++) { dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }
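
A rough illustration of what the run-time alias checks amount to at the source level, reusing the generic arrays from the distribution sketch above. ranges_overlap is a hypothetical helper; the real pass emits the equivalent checks in IR and keeps both loop versions.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: do the byte ranges [p, p+n) and [q, q+m) overlap? */
    static bool ranges_overlap(const void *p, size_t n, const void *q, size_t m) {
        uintptr_t ps = (uintptr_t)p, qs = (uintptr_t)q;
        return ps < qs + m && qs < ps + n;
    }

    void versioned(double *a, const double *b, double *c, int n) {
        bool no_alias =
            !ranges_overlap(a, n * sizeof *a, b, n * sizeof *b) &&
            !ranges_overlap(a, n * sizeof *a, c, n * sizeof *c) &&
            !ranges_overlap(b, n * sizeof *b, c, n * sizeof *c);

        if (no_alias) {
            /* Checks passed: run the distributed, vectorizable version. */
            for (int i = 1; i < n; i++) a[i] = b[i] * 2.0;
            for (int i = 1; i < n; i++) c[i] = c[i-1] + b[i];
        } else {
            /* Possible overlap: fall back to the original fused loop. */
            for (int i = 1; i < n; i++) {
                a[i] = b[i] * 2.0;
                c[i] = c[i-1] + b[i];
            }
        }
    }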

  14. Loop Access Analysis • Born from the Loop Vectorizer • Generalized as new analysis pass • Computed on-demand and cached • New Loop Versioning utility

  15. Algorithm • Light-weight • Uses only LoopAccessAnalysis • No Program Dependence Graph • No Control Dependence • Inner loops only • Different from textbook algorithm • No reordering of memory operations

  16.-35. Algorithm (animated walkthrough) The running example is the instruction sequence mul 1, st 2, ld 3, st 4, ld 5, add 6, st 7, ld 8, mul 9, st 10. The memory operations are grouped into partitions without reordering them, and instructions whose results are needed in more than one of the new loops are duplicated into each (the slides label these copies "dup of mul 1" and "dup of ld 3").
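
The duplication steps above ("dup of mul 1", "dup of ld 3") correspond at the source level to recomputing a value that feeds memory operations placed in different partitions. A simplified sketch with generic names, not the slides' instruction numbering:

    /* Before distribution: t feeds both the store to a[] (one partition)
       and the recurrence on c[] (the other partition). */
    void before(double *a, const double *b, double *c, int n) {
        for (int i = 1; i < n; i++) {
            double t = b[i] * 2.0;   /* used by both partitions */
            a[i] = t;
            c[i] = c[i-1] + t;
        }
    }

    /* After distribution: no values flow between the new loops, so the
       multiply is duplicated into each of them. */
    void after(double *a, const double *b, double *c, int n) {
        for (int i = 1; i < n; i++) {
            double t = b[i] * 2.0;
            a[i] = t;
        }
        for (int i = 1; i < n; i++) {
            double t = b[i] * 2.0;   /* duplicated computation */
            c[i] = c[i-1] + t;
        }
    }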

  36. Recap • Distributed loop • Versioned with run-time alias checks • Top loop vectorized

  37. Case Study /* vectorized: */ for (k = 1; k <= M; k++) { mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; } for (k = 1; k <= M; k++) { dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  38. Case Study for (k = 1; k <= M; k++) { dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  39. Case Study dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY;

  40. Case Study [dependence-chain diagram: Load, Load, Load, Load -> Add, Add -> Cmp, Csel -> Cmp, Csel -> Store, with the loop-carried edge dc[k-1] -> dc[k]]

  41. Case Study [same diagram, annotated "HW st -> ld forwarding" on the dc[k-1] -> dc[k] edge]

  42. Case Study [same diagram, additionally annotated "SW st -> ld forwarding"]

  43. Case Study dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY;

  44. Loop Load Elimination

  45. Algorithm 1. Find loop-carried dependences with an iteration distance of one 2. Is the dependence from a store to a load? 3. Check that there is no (may-)intervening store 4. Propagate the stored value to the uses of the load (a source-level sketch follows)
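
A source-level sketch of these four steps on a minimal recurrence (generic arrays, assuming no other store in the loop can alias a[]); the next slides apply the same rewrite to the hmmer loop.

    /* Before: each iteration reloads the value the previous iteration stored. */
    void before(double *a, const double *b, int n) {
        for (int k = 1; k < n; k++)
            a[k] = a[k-1] + b[k-1];
    }

    /* After loop load elimination: the stored value is carried in a scalar. */
    void after(double *a, const double *b, int n) {
        double t = a[0];            /* initial value fed into the recurrence */
        for (int k = 1; k < n; k++) {
            t = t + b[k-1];
            a[k] = t;               /* the store remains; the load is gone */
        }
    }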

  46. Algorithm for (k = 1; k <= M; k++) { dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  47. Algorithm for (k = 1; k <= M; k++) { dc[k] = T = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = T = sc; if (dc[k] < -INFTY) dc[k] = T = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  48. Algorithm for (k = 1; k <= M; k++) { dc[k] = T = T + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = T = sc; if (dc[k] < -INFTY) dc[k] = T = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  49. Algorithm T = dc[0]; for (k = 1; k <= M; k++) { dc[k] = T = T + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = T = sc; if (dc[k] < -INFTY) dc[k] = T = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  50. Algorithm T = dc[0]; for (k = 1; k <= M; k++) { dc[k] = T = T + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = T = sc; if (dc[k] < -INFTY) dc[k] = T = -INFTY; if (k < M) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } }

  51. Loop Load Elimination • Simple and cheap using Loop Access Analysis • With Loop Versioning can optimize more loops • GVN Load-PRE can be simplified to not worry about loop cases

  52. Recap • Distributed loop into two loops • Versioned with run-time alias checks • Vectorized top loop • Store-to-load forwarding in bottom loop • Versioned with run-time alias checks

  53. Results • 20-30% gain on 456.hmmer on ARM64 and x86 • Loop Access Analysis pass • Loop Versioning utility • Loop Distribution pass • Loop Load Elimination pass

  54. Future Work • Commit Loop Load Elimination • Tune Loop Distribution and turn it on by default • Loop Distribution with Program Dependence Graph

  55. Acknowledgements • Chandler Carruth • Hal Finkel • Arnold Schwaighofer • Daniel Berlin

  56. Q&A
