
CS252 Lecture Notes: Multithreaded Architectures

  • Concept

Tolerate or mask long and often unpredictable latency operations by switching to another context that is able to do useful work.

  • Situation Today: why is this topic relevant?
      • ILP has been exhausted, which means thread-level parallelism must be utilized
      • The gap between processor performance and memory performance is still large
      • There is ample real estate for implementation
      • More applications are being written with the use of threads, and multitasking is ubiquitous
      • Multiprocessors are more common
      • Network latency is analogous to memory latency
      • Complex scheduling is already being done in hardware
  • Classical Problem
      • 1960s and 1970s
          • I/O latency prompted multitasking
          • IBM mainframes
          • Multitasking
          • I/O processors
          • Caches within disk controllers
  • Requirements of Multithreading
      • Storage is needed to hold each context's PC, registers, status word, etc.
      • Coordination to match an event with a saved context
      • A way to switch contexts
      • Long-latency operations must use resources that are not otherwise in use
  • To visualize the effect of latency on processor utilization, let R be the run length to a long-latency event and let L be the amount of latency. Then:

Util = R / (R + L)

[Figure: Util (maximum 1) plotted against L]
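As a quick numeric check of the utilization formula above, the following Python sketch evaluates Util = R / (R + L) for a few latencies. The function name and the sample values (a run length of 10 cycles) are illustrative assumptions, not figures from the lecture.

```python
def single_context_utilization(run_length: float, latency: float) -> float:
    """Util = R / (R + L) for a processor with one context.

    run_length (R): cycles of useful work before a long-latency event.
    latency (L): cycles spent waiting for that event to complete.
    """
    if run_length <= 0 or latency < 0:
        raise ValueError("run_length must be positive and latency non-negative")
    return run_length / (run_length + latency)


if __name__ == "__main__":
    # Hypothetical sample values: R fixed at 10 cycles, L swept upward.
    for latency in (0, 10, 50, 100):
        util = single_context_utilization(10, latency)
        print(f"L = {latency:3d}  Util = {util:.3f}")
```

With R held fixed, utilization falls toward zero as L grows, which is the motivation for giving the processor other contexts to run during the wait.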

      • 1980s
          • The problem was revisited due to the advent of graphics workstations
          • Xerox Alto, TI Explorer
          • Concurrent processes are interleaved to allow the workstations to be more responsive.
          • These processes could drive or monitor the display, input, file system, network, and user processing
          • Process switching was slow, so the subsystems were microprogrammed to support multiple contexts

  • Scalable Multiprocessor
      • Dancehall: a shared interconnect with memory on one side and processors on the other.
      • Or processors may have local memory
      • How do the processors communicate?
          • Shared Memory
              • Potential long latency on every load
              • Cache coherency becomes an issue
              • Examples include NYU's Ultracomputer, IBM's RP3, BBN's Butterfly, MIT's Alewife, and later Stanford's DASH.
              • Synchronization occurs through shared variables, locks, flags, and semaphores.
          • Message Passing
              • The programmer deals with latency. This lets the programmer minimize the number of messages while maximizing their size, and it allows delay to be minimized by sending a message so that it reaches the receiver at the time the receiver expects it.
              • Examples include Intel's iPSC and Paragon, Caltech's Cosmic Cube, and Thinking Machines' CM-5
              • Synchronization occurs through send and receive
  • Cycle-by-Cycle Interleaved Multithreading
      • Burton Smith
          • Currently chief scientist at Cray
          • Denelcor HEP1 (1982), HEP2
          • Horizon, which was never built
          • Tera, MTA

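[Figure: two multiprocessor organizations built around a high-speed interconnect. In one, processors (P) sit on one side and memory (M) and I/O on the other; in the other, each node combines a processor with local memory (P/M).]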

      • Features of this architecture
          • An instruction from a different context is launched at each clock cycle
          • No interlocks or bypasses thanks to a non-blocking pipeline
          • Optimizations:
              • Leaving context state in the processor (PC, register #, status)
              • Assigning tags to remote requests and then matching them on completion
          • Additional optimizations:
              • A full/empty bit on every memory word, allowing for automatic and efficient synchronization. This obviates the need for semaphores, locks, etc., and it may reduce the time the processor spends polling by moving that job to a controller.

[Figure: pipeline diagram with the labels PCs, Thread Scheduler, I-Fetch, RF, Mem, and WB]
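The full/empty bit described above can be illustrated in software. The sketch below is only a model, written under several assumptions: the class and method names are hypothetical, and it uses Python's `threading.Condition` to emulate the blocking behavior that the hardware provides per memory word without any explicit lock. A store waits until the word is empty and marks it full; a load waits until the word is full and marks it empty.

```python
import threading


class FullEmptyWord:
    """Software model of one memory word with a full/empty bit (hypothetical API)."""

    def __init__(self):
        self._value = None
        self._full = False  # the full/empty bit; False means "empty"
        self._cond = threading.Condition()

    def store_when_empty(self, value):
        """Wait until the word is empty, write it, and set the bit to full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def load_when_full(self):
        """Wait until the word is full, read it, and reset the bit to empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def run_demo(n_items=5):
    """Pass n_items values from a producer thread to a consumer thread."""
    word = FullEmptyWord()
    received = []

    def producer():
        for i in range(n_items):
            word.store_when_empty(i)

    def consumer():
        for _ in range(n_items):
            received.append(word.load_when_full())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_demo())
```

Because each store must wait for the previous value to be consumed, the producer and consumer stay in step through the word itself, with no separate semaphore in the program that uses it.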

      • Challenges with this approach
          • Instruction bandwidth
              • Since instructions are being fetched from many different contexts, instruction locality is degraded and the I-cache miss rate rises.
          • Register file access time increases because the register file had to grow significantly to accommodate many separate contexts. In fact, the HEP and Tera use SRAM to implement the register file, which means longer access times. Some of this may be alleviated by increasing the pipeline depth, at the cost of additional latency.
          • Single-thread performance is significantly degraded, since the processor is forced to switch to a new thread every cycle even when no other thread is available.
          • Insufficient pipelining (bandwidth bottleneck)
              • Unpipelined FP unit: must stall or reflect up into the thread scheduler
          • Very high bandwidth network, which is fast and wide
          • Retries on a load from an empty word or a store to a full word
  • Improving Single Thread Performance
      • Do more operations per instruction (VLIW)
      • Allow multiple instructions to issue into the pipeline from each context. This could lead to pipeline hazards, so other safe instructions could be interleaved into the execution. For Horizon and Tera, the compiler detects such data dependencies and the hardware enforces them by switching to another context when one is detected. This is implemented by inserting into each instruction a field that indicates its minimum number of independent successors over all possible control flows.
      • Switch on load
      • Switch on miss
      • Switching on load or miss will increase the context switch time. Consider:
      Type of switch    R        C
      Cycle             1
      Load              5-10     1-2
      Miss              5-100    5-10 (+ hit time)

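  • Max Utilization = R / (R + C)
  • Where does saturation occur? Nsat = L / (R + C) + 1

[Figure: Util (maximum 1) plotted against N, with curves for stochastic R and deterministic R. The marked levels are R / (R + L) and the saturation level R / (R + C), reached at Nsat.]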
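The saturation point and maximum utilization given above can be tried out numerically. The sketch below takes Nsat = L / (R + C) + 1 and Max Utilization = R / (R + C) from the notes. It additionally assumes that N is the number of contexts, that C is the context switch cost, and that below saturation utilization grows linearly as N · R / (R + C + L); that linear form is an assumption chosen because it meets the saturation level exactly at Nsat, and it is not stated in the notes. All function names and sample values are illustrative.

```python
def saturation_contexts(run_length: float, latency: float, switch_cost: float) -> float:
    """Nsat = L / (R + C) + 1, as given in the notes."""
    return latency / (run_length + switch_cost) + 1


def max_utilization(run_length: float, switch_cost: float) -> float:
    """Saturated utilization R / (R + C), as given in the notes."""
    return run_length / (run_length + switch_cost)


def utilization(n_contexts: int, run_length: float, latency: float,
                switch_cost: float) -> float:
    """Deterministic-R estimate of utilization with n_contexts contexts.

    ASSUMPTION: the linear region N * R / (R + C + L) is not from the notes;
    it is a common simplification that reaches R / (R + C) at N = Nsat.
    """
    linear = n_contexts * run_length / (run_length + switch_cost + latency)
    return min(linear, max_utilization(run_length, switch_cost))


if __name__ == "__main__":
    # Hypothetical values: R = 8, C = 2, L = 40 cycles.
    R, C, L = 8, 2, 40
    print("Nsat =", saturation_contexts(R, L, C))
    for n in range(1, 8):
        print(f"N = {n}  Util = {utilization(n, R, L, C):.2f}")
```

Under these assumptions, adding contexts beyond Nsat does not raise utilization further; only a lower switch cost C or a longer run length R does.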

  • Cautions
      • Pipeline bottlenecks are more apparent with effective multithreading. For example, an unpipelined FP unit needs to reflect its reservation up to the thread scheduler.
      • Architected delay slots complicate multithreaded control logic. For example, an exception occurs on an instruction in a branch delay slot while fetch and execute are already at the branch target.
      • Register file delay and bandwidth
  • Other Concepts
      • Tagged memory is another concept that is often circulated, in which memory is thought of as a set of objects instead of homogeneous bits. Examples of this are Lisp machines, dataflow machines, and J-Machines.
      • Physical vs. Virtual Parallelism