Usage Of dCache Resilient Pools for User Code Distribution Recap - - PowerPoint PPT Presentation



SLIDE 1

Usage Of dCache Resilient Pools for User Code Distribution

SLIDE 2

Recap

  • Analyzers and some production groups lost access to a subset of their frequently changing code located on BlueArc and have to download files from the dCache scratch area. The thousands of jobs they started on the Grid tried to access the same file simultaneously. BlueArc access was controlled by locks (5 per experiment).

  • After we unmounted BlueArc (1/18/2018), users could run anywhere. Jobs started on the OSG download the same files but at much lower transfer rates, creating even more complications for jobs that are waiting for access to the same pool.

  • Temporary Solution (part 1)

    ○ We attempted to parallelize file access by distributing file replicas across many pools, using the dCache resilient pool feature.
    ○ Each file under /pnfs/<experiment>/resilient was set to be replicated 20 times across the existing readWritePools.
    ○ Users were asked to store their files in the dCache resilient pool group.

  • Temporary Solution (part 2)

    ○ Implemented a JobSub feature that lets users upload tar files to a special area on the resilient pools and handles clean-up.
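The reasoning behind replication can be sketched with simple arithmetic: if many jobs request the same file at once, spreading replicas across pools divides the readers per copy. The snippet below is a minimal illustration, assuming jobs are spread evenly across replicas (an idealization; actual dCache pool selection is not described in these slides). The job count of 5000 and the function name are hypothetical.

```python
import math


def readers_per_replica(concurrent_jobs: int, replicas: int) -> int:
    """Upper bound on simultaneous readers per copy, assuming an even spread."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return math.ceil(concurrent_jobs / replicas)


if __name__ == "__main__":
    jobs = 5000  # hypothetical number of jobs fetching the same tar file
    for replicas in (1, 10, 20):
        print(f"{replicas:2d} replica(s): "
              f"up to {readers_per_replica(jobs, replicas)} readers per copy")
```

With a single copy, every job queues on the same pool; with 20 copies, the same load is at most one twentieth per pool under this even-spread assumption.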

SLIDE 3

Current Status

[Chart: Number of tar files (code) pulled from dCache during the last 7 days, by experiment (scratch vs. resilient)]

Values labeled on the slide: 220k, 180k

[Chart: Number of tar files (code) pulled from dCache during the last 7 days, by experiment (direct upload vs. upload via jobsub)]

SLIDE 4

Clean Up Issues

  • Most of the experiments don’t use the jobsub feature and don’t clean up the area.
  • Most of the users keep their code, but some users keep LOG.TAR files on the resilient pools!

Factor of 20!

[Chart: Tar files in resilient areas by experiment (old = not accessed during the last 3 months)]
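A list of "old" tar files of the kind charted here could be produced along the following lines. This is only a sketch under stated assumptions: it assumes the resilient area is visible as an ordinary POSIX directory tree and that the file access time (atime) reflects the last read, neither of which the slides confirm. The function name and the set of tar suffixes are hypothetical, and the code only lists files; it deletes nothing.

```python
import os
import time
from pathlib import Path
from typing import Iterator, Optional

SECONDS_PER_DAY = 86400
TAR_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2")  # assumed naming


def find_old_tar_files(root: str, max_age_days: int = 90,
                       now: Optional[float] = None) -> Iterator[Path]:
    """Yield tar files under root not accessed within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(TAR_SUFFIXES):
                continue
            path = Path(dirpath) / name
            try:
                if path.stat().st_atime < cutoff:
                    yield path
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
```

Calling `find_old_tar_files(some_directory, max_age_days=90)` yields candidates matching the three-month definition above; the resulting list is what would be handed to the owners for review.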

SLIDE 5

Moving Forward Proposal

  • Contact each experiment’s liaison with the list of users and old tar files, and request clean-up.
  • Continue to push users to use the jobsub feature.
  • Start deleting files that are older than a month. (We cannot really do it ourselves.)
  • Reduce the replication factor from 20 to 10.
  • The ultimate goal is to move to the Rapid Code Distribution Service.
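The storage effect of lowering the replication factor is straightforward to estimate: each file occupies its size multiplied by the number of replicas. The sketch below uses a hypothetical 1 TB of tar files (not a figure from these slides) and a hypothetical function name to compare factors of 20 and 10.

```python
def replicated_bytes(logical_bytes: int, replication_factor: int) -> int:
    """Disk space consumed when every file is stored replication_factor times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return logical_bytes * replication_factor


if __name__ == "__main__":
    TB = 10**12
    logical = 1 * TB  # hypothetical total size of user tar files
    before = replicated_bytes(logical, 20)
    after = replicated_bytes(logical, 10)
    print(f"factor 20: {before / TB:.0f} TB on disk")
    print(f"factor 10: {after / TB:.0f} TB on disk")
    print(f"space freed: {(before - after) / TB:.0f} TB")
```

Halving the factor halves the disk footprint, at the price of roughly doubling the number of readers each copy has to serve.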