Usage Of dCache Resilient Pools for User Code Distribution
Recap
- Analyzers and some production groups lost access to a subset of their frequently changing code
located on BlueArc and have to download files from the dCache scratch area. The thousands of jobs they started on the Grid tried to access the same file simultaneously. BlueArc access was controlled by locks (5 per experiment).
- After we unmounted BlueArc (1/18/2018), users could run anywhere. Jobs started on the OSG are
downloading the same files but at much lower transfer rates, creating even more complications for jobs that are waiting for access to the same pool.
- Temporary Solution (part 1)
○ We attempted to parallelize file access by distributing file replicas across many pools, using the dCache resilient pool feature.
○ Each file under /pnfs/<experiment>/resilient was set to be replicated 20 times across the existing readWritePools.
○ We asked users to store their files in the dCache resilient pool group.
- Temporary Solution (part 2)
○ Implemented a JobSub feature that lets users upload tar files to a special area on the resilient pools and handles the clean up.
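
The reasoning behind replication can be sketched with simple arithmetic. The snippet below is only an illustration, assuming reads are spread evenly over the pools that hold a copy; the job count of 5000 is a hypothetical stand-in for "thousands of jobs", and `readers_per_replica` is a name made up for this example.

```python
import math


def readers_per_replica(concurrent_jobs: int, replicas: int) -> int:
    """Average number of jobs reading from each pool that holds a copy.

    Assumes (hypothetically) that reads are spread evenly across replicas.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return math.ceil(concurrent_jobs / replicas)


# Hypothetical load: 5000 jobs requesting the same tar file at once.
for replicas in (1, 10, 20):
    print(f"{replicas:2d} replicas: about {readers_per_replica(5000, replicas)} readers per pool")
```

With a single copy every job queues on the same pool; with 20 copies the same load is, under the even-spread assumption, divided across 20 pools.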
Current Status
[Figure: Number of tar files (code) pulled from dCache during the last 7 days, by experiment (scratch vs. resilient). Values shown on the plot: 220k, 180k.]
[Figure: Number of tar files (code) pulled from dCache during the last 7 days, by experiment (direct upload vs. upload via jobsub).]
Clean Up Issues
- Most of the experiments don’t use the jobsub feature and don’t clean up the area.
- Most of the users keep their code there, but some users keep LOG.TAR files on resilient pools!
Factor of 20!
[Figure: Tar files in resilient areas by experiment (old == not accessed during the last 3 months).]
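
One possible way to build a list of "old" tar files is sketched below. It is a minimal example, not the tool used for the plot: it assumes the resilient area is visible as a mounted directory tree and that the file access time reported there reflects real reads, which may not hold for a /pnfs mount. The function name, the suffix list, and the 90-day threshold (standing in for "3 months") are assumptions.

```python
import os
import time

OLD_AFTER_DAYS = 90  # assumption: "3 months" approximated as 90 days
TAR_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2")  # assumed naming


def find_old_tar_files(root, now=None, max_age_days=OLD_AFTER_DAYS):
    """Return (path, size_in_bytes) for tar files not accessed recently."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(TAR_SUFFIXES):
                continue
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime < cutoff:
                old.append((path, st.st_size))
    return old
```

Calling `find_old_tar_files` on one experiment's resilient directory would return the candidates to pass on to that experiment; the sketch only reports files and never deletes anything.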
Moving Forward Proposal
- Contact each experiment’s liaison with the list of users and old tar files, and request clean up.
- Continue to push users to use the jobsub feature.
- Start deleting files that are older than a month. (We cannot really do it ourselves)
- Reduce the replication factor from 20 to 10.
- The ultimate goal is to move to the Rapid Code Distribution Service.
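
The disk-space effect of lowering the replication factor can be estimated with a short calculation. This is a rough sketch: the 500 GB of unique tar-file data is a hypothetical figure, not a measurement from the slides, and `replicated_footprint_gb` is a name made up for this example.

```python
def replicated_footprint_gb(logical_gb: float, replicas: int) -> float:
    """Disk space used when every file is stored `replicas` times."""
    return logical_gb * replicas


logical_gb = 500.0  # hypothetical amount of unique tar-file data
before = replicated_footprint_gb(logical_gb, 20)
after = replicated_footprint_gb(logical_gb, 10)
print(f"20 replicas: {before:.0f} GB, 10 replicas: {after:.0f} GB, freed: {before - after:.0f} GB")
```

Halving the replication factor halves the disk footprint, at the cost of each remaining copy serving roughly twice as many readers.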