
Some Ideas for Best Practice in Scientific Computing


  1. Some Ideas for “Best Practice” in Scientific Computing Dr Owain Kenway, (@owainkenway) UCL/ISD/RITS/RCAS/Team Leader

  2. “Scientific Computing?” ● “Doing science with computers” – Generating data → Simulation – Analysing data → Filtering, statistical analysis… – Theorising about data → Machine learning/AI? ● Not just science – Arts/humanities → “Research Computing”

  3. About Me ● Been at UCL since 2005 (Computational Chemistry PhD) ● Spent the last 8 or so years working in Research Computing in ISD – Team Lead of Research Computing Applications and Support – Look after users and applications on UCL ISD managed resources + design those services

  4. Contents ● Overview of HPC/HTC services at UCL ● Version Control ● Publishing Code (+ Data) ● Pitfalls ● 4 contentious statements

  5. UCL Research Computing Resources ● Parallel – Single job spans multiple nodes – Tightly coupled parallelisation, usually in MPI – Sensitive to network performance – Currently primarily chemistry, physics, engineering ● High throughput – Lots (tens of thousands) of independent jobs on different data – High I/O – Currently, primarily biosciences and physics – In the future, digital humanities ● UCL only services: – Grace → High Performance Computing (HPC) – Myriad, Legion → High Throughput Computing (HTC) – Aristotle → Interactive teaching Linux service ● National services: – Thomas (Tier 2 MMM hub) – Michael (Faraday Battery Institute)

  6. HPC Many processes on many processors work simultaneously + communicate with each other. [Diagram: Input Data → Output Data]

  7. HTC Many processes operate independently of each other and in any order. [Diagram: Input Data → Output Data]
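The HTC pattern on this slide can be sketched in a few lines of Python. This is only an illustration, not how jobs are submitted to Myriad or Legion: the `analyse` function and its inputs are hypothetical, and a thread pool on one machine stands in for a scheduler running many independent jobs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyse(sample_id: int) -> int:
    """Hypothetical independent task: depends only on its own input."""
    return sample_id * sample_id


def run_independent(sample_ids):
    """Run every task with no communication between tasks.

    Results are collected in completion order, which may differ from
    submission order, so each result is keyed by its input.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(analyse, s): s for s in sample_ids}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


results = run_independent(range(10))
print(results)
```

Because no task waits on another, the same work could be split into thousands of separate jobs. An HPC code, by contrast, needs its processes to exchange data while they run.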

  8. The what + why of version control ● Version control systems are tools that let you keep track of who changed a file or set of files, when, and what they changed. – If you are collaborating, they let you all work on a project and share changes in a structured way. – If you are working on a long term project (e.g. your PhD thesis!), they help you keep a record of what you did and when (and get old versions back). ● Many available, many types – from very basic (e.g. “track changes”) to very advanced decentralised systems.

  9. Git and Github ● Git is an Open Source (GPL) command line tool originally written by Linus Torvalds. – But there are lots of graphical tools available that “talk git” – “Decentralised” - i.e. every person working on a repository has their own copy ● Github is a centralised service for hosting, sharing and contributing to git repositories of open source code – A sort of “social network” for coding – Free for public repositories – Recently bought by Microsoft! [Image: “Octocat”, Github’s cute mascot]

  10. Github is an interesting place to explore ● It’s the default for RITS (including RSD) at UCL, e.g. – https://github.com/UCL/i_newspaper_rods - software to run queries over the British Museum’s Times Digital archive. – https://github.com/UCL-RITS/rcps-buildscripts/ - all the installation management for UCL RC services (and where you can request new software). ● Code for all sorts and scales of projects, inc. big companies like Microsoft, Valve...

  11. Setting up git/Github ● Depends on whether you are using Linux, Mac or Windows! – Linux: often already installed, or install from your package manager – Mac: install from the Xcode developer tools – Windows: a lot more complicated, pick an option from the command-line tools (https://git-scm.com/downloads) or the GUI choices (https://git-scm.com/downloads/guis) ● Set up name and email in the client ● IF you want to use Github, register a Github account – More detail on linking this to git on your local machine here: https://help.github.com/articles/set-up-git/

  12. But overall... ● You don’t have to like git or even use it: – Other version control systems are available (SVN, CVS...) – Anything is better than nothing – what is important is to have a good automated way of tracking what you did when and getting back “that” version of the code. – Find out if your research group already uses a version control tool and use that. – Similarly there are Github alternatives for collaboration like BitBucket. ● Anything that’s a text file(*) can go into version control – this includes LaTeX source if you use that for your thesis/papers. (*) Binary files can go in but you can’t see the difference between versions as easily
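To see why text files suit version control, here is a small sketch using Python's standard `difflib` module. The two versions of a LaTeX paragraph and the file names are made up for illustration; version control tools such as git present changes in a similar line-by-line form.

```python
import difflib

# Two hypothetical versions of the same LaTeX source.
old = ["\\section{Results}\n", "The energy was 1.23 eV.\n", "See Figure 2.\n"]
new = ["\\section{Results}\n", "The energy was 1.32 eV.\n", "See Figure 2.\n"]

# unified_diff yields header lines, then lines prefixed with
# ' ' (unchanged), '-' (removed) or '+' (added).
diff = list(
    difflib.unified_diff(old, new, fromfile="thesis_v1.tex", tofile="thesis_v2.tex")
)

# Keep only the changed lines, skipping the '---' / '+++' file headers.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("".join(diff))
```

A binary file has no meaningful lines to compare, which is why the difference between versions is harder to inspect.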

  13. Aside: Code: Application vs Method ● Applications – Packaged as “ready for other people” – Works on machines other than the developer’s: no hard coding of paths, “sensible” install process, works on arbitrary dataset – Used directly by other people for work ● Method – “What I did” – Really a part of the write-up: probably hard-coded to work with one dataset, in the few environments available to the user; Jupyter notebooks etc. – Inspire other people’s work
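One step from “Method” towards “Application” is removing hard-coded paths. The sketch below assumes a hypothetical script that counts records in a data file; the argument names and the default output file are illustrative only.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Take the dataset location from the command line, not the source."""
    parser = argparse.ArgumentParser(description="Count records in a dataset.")
    parser.add_argument("input", type=Path, help="path to the input data file")
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("counts.txt"),
        help="where to write the result (default: counts.txt)",
    )
    return parser


def count_records(path: Path) -> int:
    """Count the non-empty lines in the given file."""
    with path.open() as handle:
        return sum(1 for line in handle if line.strip())


def main(argv=None) -> int:
    """Parse arguments, count records, write the count and return it."""
    args = build_parser().parse_args(argv)
    n = count_records(args.input)
    args.output.write_text(f"{n}\n")
    return n
```

Calling `main(["mydata.txt", "--output", "result.txt"])` then works on any dataset a user supplies, rather than only on the developer's own files.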

  14. Publishing Code + Data First contentious statement: IF your research is publicly funded it is your moral obligation to make your code and data available to outsiders under a reasonable license. Increasingly, funding councils agree.

  15. Publishing Code + Data: Motivations ● Citations: IF you license appropriately, get citations for free! ● Collaborations: IF you are willing to, potential collaborators will come to you! ● “Reproducibility” + finding errors IF you are not evil, this is good.

  16. Reproducibility Second contentious statement: “Reproducibility” in scientific computing has been hijacked by software engineers. ● Overwhelming focus on bit-perfect reproduction of results: – Containers/VMs to exactly reproduce environment. → Doesn’t work anyway because of hardware. – Only actually of use forensically (of course useful for moving your software about, which is a separate issue)

  17. Reproducibility ● Relatively little focus on whether the general method is stable: – If your method only works with a particular compiler/MKL/whatever version then it may be a bug, not a valid result. – (Related) If your code stops working because a language feature is deprecated then expecting that old version to be available for the lifetime of your research is a bad idea – update your working version of your code.
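One reason bit-perfect reproduction is fragile is that floating-point addition is not associative: a different summation order (for example from a different compiler, library or number of processes) can change the last bits of a result. The sketch below shows this with made-up values, and contrasts an exact comparison with a tolerance-based one. The tolerance of 1e-12 is chosen only for illustration.

```python
import math

values = [0.1, 0.2, 0.3]

# The same numbers summed in two different orders.
forward = (values[0] + values[1]) + values[2]
backward = values[0] + (values[1] + values[2])

# Exact comparison: fails even though nothing is "wrong".
bit_identical = forward == backward

# Tolerance-based comparison: asks whether the two results agree to
# within an error that is acceptable for the method.
agree_within_tol = math.isclose(forward, backward, rel_tol=1e-12)

print(forward, backward, bit_identical, agree_within_tol)
```

If a result only survives the exact comparison, that points to the kind of instability this slide warns about.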

  18. Publishing Code: “Do” Things to think about: ● What you are publishing – is it an “Application” or is it a “Method”? – Set expectations for users in the documentation ● License – in order for people to actually use your software! ● Versions: – Keep old versions online and distinguish them i.e. “myprg-1.2.3.tar.gz” not “myprg-current.tar.gz” – Tag releases (if on Github etc.). – When publishing results say which version/tag you used! ● DO THE SAME FOR DATA SETS!
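Recording the version next to each result can be automated. The following is a rough sketch, assuming the code lives in a git repository with tagged releases. The function names are hypothetical, and `describe_version` falls back to "unknown" when git or a repository is not available.

```python
import json
import platform
import subprocess
import sys


def describe_version() -> str:
    """Return `git describe` output for the current directory, or 'unknown'."""
    try:
        completed = subprocess.run(
            ["git", "describe", "--tags", "--always", "--dirty"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or this is not a git repository.
        return "unknown"
    return completed.stdout.strip() or "unknown"


def provenance() -> dict:
    """Collect the details worth quoting next to a published result."""
    return {
        "code_version": describe_version(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


record = provenance()
print(json.dumps(record, indent=2))
```

Saving this record alongside the output files makes it straightforward to state which version or tag produced a given result.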

  19. Pitfalls ● Ritual ● The “things I did” explosion ● Obsessing over performance/not caring enough ● Designing experiments based on the contents of slide decks

  20. Ritual Third contentious statement: – IF you are publishing research you should know how your results are generated. – i.e. it’s not enough just to plug some data into a black box whose workings you do not understand and then publish the results. THIS DOES NOT MEAN YOU NEED TO UNDERSTAND THE COMPUTER DOWN TO THE MICROCODE!

  21. Avoiding Ritual ● Read up on the software you are using. ● Think about its limitations: – Is its output deterministic or is there a random element? – Where does that algorithm break down? – What sort of machine can I run this on? ● Think about how it might be applied to your problem: – Am I actually using this software appropriately? – What data requirements do I need to think about? ● Think about your results: – Are they reasonable?
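The question about random elements can be checked directly. The sketch below uses a Monte Carlo estimate of π as a stand-in for any stochastic method. The sample count and the seeds are arbitrary choices for illustration.

```python
import random


def estimate_pi(n_samples: int, seed: int) -> float:
    """Monte Carlo estimate of pi using a private, seeded generator."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


# Same seed: the run is repeatable.
repeatable = estimate_pi(10_000, seed=1) == estimate_pi(10_000, seed=1)

# Different seeds: the spread shows how much of the answer is noise.
estimates = [estimate_pi(10_000, seed=s) for s in range(5)]
spread = max(estimates) - min(estimates)
print(repeatable, estimates, spread)
```

Comparing the spread across seeds with the size of the effect being reported is one quick test of whether a result is reasonable.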

  22. It’s just a model... This all dovetails neatly together into a larger problem: When we simulate things, we are just building a model. Models have limits! Computer models are not the only models: ● Animal models ● Building an actual model ● Theoretical models JUST BECAUSE THE COMPUTER MODEL SAYS IT IS TRUE DOESN’T MAKE IT SO!

  23. It’s just a model... [Images: Real life... Physical scale model... Computer model...]

  24. The “things I did” explosion ● This is not unique to scientific computing but is encouraged by the way we use computers. ● It can be tempting to try a lot of unplanned things on our input data and see what “works”: “I’ll just run it through X and see...” – This can be difficult to track. – This can be dangerously close to “p-hacking” when analysing data... ● Always record what you did even if you didn’t plan it. – Version control helps (particularly if you are modifying code)
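A lightweight way to record unplanned runs is an append-only log. This is a minimal sketch: the function names and the fields in each entry are assumptions for illustration, not a standard.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_run(log_path: Path, description: str, parameters: dict) -> dict:
    """Append one timestamped entry describing what was just run."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "what": description,
        "parameters": parameters,
    }
    # Append mode: earlier entries are never overwritten.
    with log_path.open("a") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_path: Path) -> list:
    """Return every recorded entry, oldest first."""
    with log_path.open() as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

For example, `log_run(Path("runs.jsonl"), "ran data through X", {"threshold": 0.5})` leaves a record even of an experiment that was tried once and abandoned.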

  25. Performance Performance is important. But what is important is the time to get to a meaningful solution, not the performance of code alone. ● There’s no point in learning C to make one job that takes 48 hours run in 4 hours. – But maybe if you have to run 10000 of them? ● Obsessive optimisation is madness. ● It’s completely worth slightly modifying your code to make it run 10x as fast.
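The trade-off can be made concrete with the numbers on this slide. The 48-hour and 4-hour run times and the 10000 jobs come from the slide; the 200 hours of effort to learn C and rewrite the code is a made-up figure for illustration.

```python
def hours_saved(old_runtime_h: float, new_runtime_h: float, n_jobs: int) -> float:
    """Total compute time saved by the faster version."""
    return (old_runtime_h - new_runtime_h) * n_jobs


def net_gain(old_runtime_h: float, new_runtime_h: float,
             n_jobs: int, effort_h: float) -> float:
    """Hours saved minus hours spent optimising (positive means worth it).

    Simplification: assumes jobs run one after another; jobs that run
    concurrently would shrink the wall-clock saving.
    """
    return hours_saved(old_runtime_h, new_runtime_h, n_jobs) - effort_h


REWRITE_EFFORT_H = 200  # hypothetical time to learn C and rewrite

one_job = net_gain(48, 4, 1, REWRITE_EFFORT_H)         # 44 - 200
many_jobs = net_gain(48, 4, 10_000, REWRITE_EFFORT_H)  # 440000 - 200
print(one_job, many_jobs)
```

Under these assumptions the rewrite loses time for a single job and pays off heavily for 10000 of them, which is the point about time to a meaningful solution.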

  26. Experiment design by slide deck Supervisor: “Hey, I went to this conference and saw a really interesting presentation by Prof. X’s group on this application they have developed and you should try using it on our problem”
