eScience in the Netherlands, Rob van Nieuwpoort



slide-1
SLIDE 1

eScience in the Netherlands

Rob van Nieuwpoort R.vanNieuwpoort@esciencecenter.nl

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6

We work demand-driven

slide-7
SLIDE 7
slide-8
SLIDE 8

35

slide-9
SLIDE 9

Career paths

Tracks: Research, Technical, Managerial

Scale …
Scale 13: eScience top researcher (Research), eScience top specialist (Technical), eScience manager (Managerial)
Scale 12: eScience researcher (Research), eScience specialist (Technical), eScience coordinator (Managerial)
Scale 11: eScience research engineer
Scale 10: eScience research engineer

slide-10
SLIDE 10

Lessons learnt

  • Demand-driven: start from the science
  • Collaboration, not competition (connected projects, calls)
  • Good is good enough
  • Generalization (10% ring-fenced)
  • Communication, communication, communication

– Hiring people
– Internal communication
– Project kickoffs

  • IP, work place, generalization, co-authorships

– Coordinators
– Web sites, demos

  • Keep challenging the RSEs

– Courses, hackathons, sprints, …
– Switch disciplines

slide-11
SLIDE 11

eStep

The eScience technology platform

A coherent set of technologies to tackle the grand challenges in eScience

slide-12
SLIDE 12

Cross-cutting basic skills

  • Code quality and best practices
  • Integration of software
  • Scaling of software
  • Analytics and statistics
  • Visualization
slide-13
SLIDE 13

NLeSC eScience competences applied in research

  • 1. Optimized data handling

Data integration, database optimization, structured & unstructured data, real-time data

  • 2. Big data analytics

Statistics, machine learning, visualization, text mining

  • 3. Efficient computing

Distributed & accelerated computing, efficient algorithms

[Figure: eStep, the three competences (optimized data handling, big data analytics, efficient computing), and the expertise areas: distributed computing, accelerated computing, low power computing, orchestrated computing, high-performance computing, natural language processing, machine learning, information visualization, scientific visualization, information retrieval, computer vision, handling sensor data, linked data, information integration, databases, data assimilation]

slide-14
SLIDE 14
  • Key areas of expertise are used in many projects
  • Projects often use quite a number of different competences and technologies

slide-15
SLIDE 15
slide-16
SLIDE 16

eStep Goals

  • Prevent fragmentation and duplication
  • Promote the exchange and re-use of best practices
  • Represent NLeSC’s expertise and knowledge base
  • Improve the eScience state of the art with a fundamental eScience research line
slide-17
SLIDE 17

NLeSC projects eStep

Tailor Generalize Develop Adopt

slide-18
SLIDE 18

eStep

project-specific software

discipline-specific software

enhanced science

overarching software

e-infrastructure

generic libraries, tools, and algorithms

NLeSC projects

  • Main criteria for integrating technology in eStep:

– State-of-the-art / best-of-breed?
– Generic and overarching?
– Match with our expertise areas?
– Includes externally developed software

Open platform!

slide-19
SLIDE 19

Our sustainability approach

  • Prevent duplication, fragmentation
  • Build something that is worth sustaining!

– Sufficiently generic
– Modular
– High quality
– Must be taken into account from the start

  • Enforce software engineering guidelines and best practices
  • Educate partners with software carpentry and data carpentry
  • Open source / open access, open standards, unless…
  • Community coding
  • Standardization for software and data formats
  • eStep is an open platform
slide-20
SLIDE 20

GitHub

Travis CI

Test and Deploy with Confidence. Easily sync your GitHub projects with Travis CI and you’ll be testing your code in minutes!

We run a Jenkins CI instance locally. Used for private repositories and repositories requiring HPC middleware.

A Common Workflow @ NLeSC

Open platform for building, shipping and running distributed applications.

deploy

slide-21
SLIDE 21

software

eScience software

  • technology.esciencecenter.nl
  • Non-technical, targets general audience

slide-22
SLIDE 22
  • estep.esciencecenter.nl
  • All eScience software and knowledge you need, in one place
  • Technical, targets developers, PIs

slide-23
SLIDE 23
slide-24
SLIDE 24

Knowledge base

  • knowledge.esciencecenter.nl
  • training and education
  • best practices
  • tutorials
  • white papers
  • training resources
  • Software development

Checklist available

slide-25
SLIDE 25
slide-26
SLIDE 26

More info on eStep

technology.esciencecenter.nl
estep.esciencecenter.nl
R.vanNieuwpoort@esciencecenter.nl

slide-27
SLIDE 27

Logo Bingo

Osmium Semanticizer xtas EDAL NLTK CommonSense AHN2 viewer

slide-28
SLIDE 28

Optimized data handling

slide-29
SLIDE 29

Cities poorly protected against heat
Three persons in elderly house died due to heat
End of 16 day heat wave
Heat protection plan abandoned
Elderly use heat info call desk massively

Summer in the city example: human thermal comfort

Courtesy Bert Holtslag

slide-30
SLIDE 30

Summer in the city example

  • Kilometer scale: elevation (AHN2) and land-use data (Kadaster), imagery for assessing the green vegetation coverage and the soil moisture content
  • Street scale: sky view factor, the building height to street width ratio, the reflectivity and thermal characteristics of buildings and streets, the abundance of vegetation

  • Network of weather stations and crowd sourcing: wunderground.com
  • Special measuring campaigns
  • Social media?
  • Combine with fine-grained models

Novel hourly forecasting system for human thermal comfort in urban areas on street level

Courtesy Bert Holtslag

slide-31
SLIDE 31

Via Appia

slide-32
SLIDE 32
slide-33
SLIDE 33
slide-34
SLIDE 34

Optimized Data Handling Technology

  • Distributed sensor networks, multi-model and multi-scale simulations, data assimilation, data integration, multi-scale pattern recognition, geographic information systems, databases, …
  • Xenon, NetCDF, HDF5, ROOT, XNAT, OpenDA, Hadoop, MapReduce, Oracle, MySQL, Postgres, MonetDB, ElasticSearch, DataVault, JSON, Spark, …

slide-35
SLIDE 35

Big Data Analytics

slide-36
SLIDE 36

eEcology example

Courtesy Willem Bouten

slide-37
SLIDE 37

Accelerometer and Behaviour

Static acceleration: standing, sitting, floating

Dynamic acceleration: X-flapping, gliding, flapping flight

Axes: heave (vertical), surge (forward), sway (sideward)

Courtesy Willem Bouten
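
The split between static and dynamic acceleration on this slide can be sketched in a few lines. This is a minimal illustration, not the eEcology pipeline: it assumes the static component of one axis is simply the mean over a short window and the dynamic component is the residual. The function names and the sample values are hypothetical.

```python
from statistics import mean


def split_acceleration(samples):
    """Split one axis of accelerometer samples into static and dynamic parts.

    Assumption: the static component is the window mean (posture/gravity),
    and the dynamic component is what remains after subtracting it.
    """
    static = mean(samples)
    dynamic = [s - static for s in samples]
    return static, dynamic


def dynamic_amplitude(samples):
    """Mean absolute dynamic acceleration for one axis in one window."""
    _, dynamic = split_acceleration(samples)
    return mean(abs(d) for d in dynamic)


# Hypothetical heave (vertical) windows, in units of g.
standing = [1.00, 1.01, 0.99, 1.00]   # nearly constant: little movement
flapping = [0.20, 1.80, 0.30, 1.70]   # strong oscillation around 1 g

print("standing:", dynamic_amplitude(standing))
print("flapping:", dynamic_amplitude(flapping))
```

Features like these (static level and dynamic amplitude per axis) are the kind of input a behaviour classifier could be trained on.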

slide-38
SLIDE 38

Machine learning / annotation interface

slide-39
SLIDE 39

Routes and Geology

Courtesy Willem Bouten

slide-40
SLIDE 40

Detours and Climate

Modis satellite image of dust Courtesy Willem Bouten

slide-41
SLIDE 41

Embodied Emotion Project

  • Mapping bodily expression of emotions

– To be downhearted
– Clenching fists
– My heart fills with joy
– My blood is boiling

  • Test case: Dutch theatre texts 1600-1830

– Shift in experienced emotions
– Shift in embodiment of emotions

  • Approach

– Establishing corpus & standardizing text
– Establishing emotional and bodily vocabularies
– Emotion mining
– Visualizing results

Source: Nummenmaa et al., 2013
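
The vocabulary-based step of the approach can be sketched as counting which body parts co-occur with which emotions in a sentence. This is an illustrative sketch only: the tiny English vocabularies and the function name are hypothetical, and the project itself works on Dutch theatre texts with its own vocabularies and methods.

```python
import re
from collections import Counter

# Hypothetical mini-vocabularies, for illustration only.
EMOTION_WORDS = {"joy": "joy", "boiling": "anger", "downhearted": "sadness"}
BODY_WORDS = {"heart", "blood", "fists"}


def mine_embodied_emotions(text):
    """Count (body part, emotion) pairs that occur in the same sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        emotions = {EMOTION_WORDS[t] for t in tokens if t in EMOTION_WORDS}
        body_parts = {t for t in tokens if t in BODY_WORDS}
        for emotion in emotions:
            for part in body_parts:
                pairs[(part, emotion)] += 1
    return pairs


print(mine_embodied_emotions("My heart fills with joy. My blood is boiling!"))
```

Counting such pairs per period would give the raw material for showing a shift in the embodiment of emotions over time.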

slide-42
SLIDE 42

eMetabolomics example

  • Use reaction rules to identify compounds in mass-spectrometry datasets
  • Online at http://www.emetabolomics.org/
  • Public and private version, private allows bigger/longer calculations
  • Lars Ridder, Laboratory of Biochemistry, Wageningen University

slide-43
SLIDE 43

Courtesy Lars Ridder

slide-44
SLIDE 44

Courtesy Lars Ridder

slide-45
SLIDE 45
slide-46
SLIDE 46

forecast.ewatercycle.org

slide-47
SLIDE 47

Big Data Analytics Technology

  • Natural language processing, machine learning, information and scientific visualization, information retrieval, computer vision
  • Matlab, R, NumPy, SciPy, scikit-learn, Pandas, Weka, Xtas, Twiqs.nl, D3, ExtJS, Cesium, Leaflet, OpenLayers, GeoExt, X3Dom, X3DomExt, Mapnik, CommonSense, …

slide-48
SLIDE 48

Efficient Computing

slide-49
SLIDE 49

Efficient Computing

[Figure: arithmetic intensity of common kernels]

O(1): SpMV, BLAS1,2, stencils (PDEs), lattice methods
O(log(N)): FFTs
O(N): dense linear algebra (BLAS3), particle methods

  • Smart algorithms can improve performance dramatically
  • Power consumption is becoming the bottleneck
  • Legacy codes are inefficient on modern architectures

– Need completely different optimizations, algorithms
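
Arithmetic intensity is the number of floating-point operations per byte moved to or from memory. The sketch below works this out for a BLAS1 kernel (dot product) and a BLAS3 kernel (dense matrix multiply). It assumes an idealized model in which every operand is moved exactly once and all values are 8-byte doubles; the function names are hypothetical.

```python
BYTES_PER_DOUBLE = 8


def dot_intensity(n):
    """Dot product of two length-n vectors: 2n flops, 2n doubles read."""
    flops = 2 * n
    bytes_moved = 2 * n * BYTES_PER_DOUBLE
    return flops / bytes_moved


def matmul_intensity(n):
    """n x n matrix multiply: 2n^3 flops, three n x n matrices moved once."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n ** 2 * BYTES_PER_DOUBLE
    return flops / bytes_moved


for n in (100, 1000, 10000):
    print(n, dot_intensity(n), matmul_intensity(n))
```

Under this model the dot product stays at a constant 0.125 flops per byte, so it is limited by memory bandwidth, while the matrix multiply grows as n/12 and can keep the arithmetic units busy.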

slide-50
SLIDE 50

Efficient Computing Example

  • Radio Frequency Interference (RFI) is a huge problem for many radio astronomy observations
  • Caused by

– Lightning, vehicles, airplanes, satellites, electrical equipment, GSM, FM radio, fences, reflection of wind turbines, …

  • Best removed offline

– Complete dataset available
– Good overview / statistics / model
– Can spend compute cycles

  • Partner: Astron
slide-51
SLIDE 51

Real-time RFI mitigation

  • Some pipelines need to run in real time today

– Image-based transient detection (LOFAR/AARTFAAC)
– Pulsar searching (WSRT/Apertif)

  • SKA will be entirely real-time

– Data rates simply too high to store

  • Novel algorithms with linear computational complexity

– Only very little loss in quality
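
To give a feel for what a linear-time, streaming flagger looks like, here is a minimal sketch. It is not the algorithm from this talk: it is a plain threshold flagger that keeps running statistics with Welford's method, so each sample is seen once and nothing has to be stored. The function name, threshold, and warm-up length are assumptions.

```python
import math


def flag_rfi(samples, threshold=5.0, warmup=8):
    """Flag outliers in one pass using running mean and variance (Welford).

    Flagged samples are left out of the statistics, so strong interference
    does not inflate the estimate of the noise level.
    """
    count, mean, m2 = 0, 0.0, 0.0
    flags = []
    for x in samples:
        std = math.sqrt(m2 / count) if count > 1 else 0.0
        is_rfi = count >= warmup and std > 0 and abs(x - mean) > threshold * std
        flags.append(is_rfi)
        if not is_rfi:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)
    return flags


# Hypothetical power samples with one strong burst of interference.
power = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0, 50.0, 1.0]
print(flag_rfi(power))
```

The trade-off named on the slide shows up here as well: a one-pass method cannot look at the complete dataset, so it gives up some quality in exchange for running at the data rate.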

slide-52
SLIDE 52

RFI mitigation on accelerators

  • Accelerator-based computing

– GPUs, Xeon Phi, …
– Astronomy, ocean modeling, digital forensics, radar systems, high-energy physics

  • Auto-tuning & runtime compilation

– Generate many codes at run-time, select most efficient
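
The auto-tuning idea (generate many variants, measure, keep the fastest) can be sketched without any accelerator. The example below is a toy under stated assumptions: the "generated codes" are Python closures that differ only in a chunk size, whereas a real auto-tuner would compile GPU kernels at run time. All names are hypothetical.

```python
import time


def make_chunked_sum(chunk):
    """Generate one code variant: sum a list in chunks of the given size."""
    def chunked_sum(values):
        total = 0
        for start in range(0, len(values), chunk):
            total += sum(values[start:start + chunk])
        return total
    return chunked_sum


def time_once(func, data):
    """Wall-clock time of a single call."""
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start


def autotune(variants, data, repeats=3):
    """Time every variant on the data and return the fastest (name, function)."""
    best_name, best_func, best_time = None, None, float("inf")
    for name, func in variants.items():
        elapsed = min(time_once(func, data) for _ in range(repeats))
        if elapsed < best_time:
            best_name, best_func, best_time = name, func, elapsed
    return best_name, best_func


data = list(range(100_000))
variants = {chunk: make_chunked_sum(chunk) for chunk in (16, 256, 4096)}
best_chunk, best_func = autotune(variants, data)
print("fastest chunk size on this machine:", best_chunk)
```

The winner depends on the machine, which is the point: the selection is made by measurement on the target hardware rather than fixed in advance.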

Pulsar B1919+21 in the Vulpecula nebula. Pulse profile created with folding and the LOFAR software telescope.

Background picture courtesy European Southern Observatory.

[Charts: performance compared to CPU and power usage compared to CPU, for Xeon Phi, NVIDIA GTX Titan GPU, and AMD HD7990 GPU]

slide-53
SLIDE 53

Performance profile of Ocean simulation

POP has a very large codebase written in Fortran 90
Call graph obtained using gprof
3 kernels GPU-optimized, 20% improvement

Henk Dijkstra, Ben van Werkhoven, et al.

slide-54
SLIDE 54

Efficient Computing: Distributed

  • Water management and climate modeling
  • Simulations in astrophysics (AMUSE)
  • Digital forensics (NFI)
  • Astronomy (LOFAR)
  • Text mining (xtas)
  • Computational chemistry (Noodles)
  • High-energy physics (ROOT & pandas)
slide-55
SLIDE 55

Novel domain-specific algorithm for work distribution

Towards 2 km resolution. Better than space-filling curves for distributed runs (36% improvement).

Courtesy Jason Maassen
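
For context, the space-filling-curve baseline mentioned on this slide can be sketched as follows. This shows only the baseline, using a Morton (Z-order) curve; it is not the novel domain-specific algorithm, whose details are not given here. The grid size and function names are hypothetical.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y to get a position on the Z-order curve."""
    index = 0
    for bit in range(bits):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index


def partition_blocks(width, height, workers):
    """Order grid blocks along the curve and cut it into at most `workers` parts."""
    blocks = sorted(
        ((x, y) for y in range(height) for x in range(width)),
        key=lambda block: morton_index(*block),
    )
    size = -(-len(blocks) // workers)  # ceiling division
    return [blocks[i:i + size] for i in range(0, len(blocks), size)]


for worker, part in enumerate(partition_blocks(4, 4, 4)):
    print(worker, part)
```

Blocks that are close on the curve tend to be close on the grid, which keeps neighbour communication local. A curve knows nothing about the domain, though (for an ocean model, for example, which blocks are land), and that is the kind of knowledge a domain-specific distribution can exploit.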

slide-56
SLIDE 56

Efficient Computing Technology (1)

  • Smart algorithms:
slide-57
SLIDE 57

Efficient Computing Technology (2)

  • Distributed computing, accelerators (GPUs), low-power, orchestrated computing, HPC
  • Ibis, Aether, Xenon, Magnesium, Osmium, SmartSockets, MPI, UDT, Galaxy, Knime, Sockets, Cuda, OpenCL, OpenMP, Thrift, Protobuffers, ZeroMQ, SSH, GridFTP, Globus, SLURM, PBS, GridEngine, P2P, OpenFlow, bandwidth on demand, Docker, Celery, …

slide-58
SLIDE 58

eStep: Example Libraries

slide-59
SLIDE 59

Several NLeSC applications require access to distributed compute and storage resources. Xenon provides a simple API for this, allowing rapid development of such applications.

Middleware independent: portable & reusable

slide-60
SLIDE 60

Xenon: current users

  • Osmium: web service on top of Xenon

– Magma: eMetabolomics mass spectrometry analysis
– SIMCity: decision support for urban social economic complexity

  • eSalsa: distributed climate simulation deployment
  • AMUSE: multi-model distributed astrophysics
  • Via Appia: point clouds in archaeology
  • Vbrowser: distributed file management
  • Biomarker: medical data upload tool
  • Noodles: workflow tool in Python (Computational chemistry)
slide-61
SLIDE 61

Xtas: distributed text analysis

  • Natural language processing and text mining

– Named entity recognition, sentiment detection, document clustering, topic modeling

  • Use as web service or Python library
  • Integrates with Elasticsearch for document storage and retrieval
  • Developed by Intelligent Systems Lab Amsterdam (ISLA, UvA) and NLeSC
  • Tight integration of external and internal software
  • Searching Public Discourse: text analytics pipeline for historical research
  • Texcavator: supporting large-scale text mining in the field of digital humanities
  • Beyond the book: how international is a work of fiction?
  • Semanticizer: entity linking
slide-62
SLIDE 62

More info on eStep

technology.esciencecenter.nl
estep.esciencecenter.nl
R.vanNieuwpoort@esciencecenter.nl

slide-63
SLIDE 63
slide-64
SLIDE 64

Backup slides

slide-65
SLIDE 65

Generalized software from NLeSC portfolio

eStep

Externally developed software

you are here ASDI

slide-66
SLIDE 66

Generalized software from NLeSC portfolio

eStep

Externally developed software

you are here DTEC

slide-67
SLIDE 67

NLeSC Training

Essential Skills in Data-Intensive Research: Enabling your Research
25-29 January 2016, SURF Academy, Utrecht (for Life Sciences)
NLeSC, SURFsara, SURFnet and DTL
5-day workshop, with both hands-on and taught components, for 1st-year PhD students without programming background
Days 1-2 = Dealing with Data (Data Carpentry)
Days 3-4 = Software and Programming (Software Carpentry)
Day 5 = Introducing the e-Infrastructure
Future courses requested with other domain focus

  • Humanities & Social Sciences
  • Environment & Sustainability
  • Life Sciences
  • Physics & Beyond

Based on SoftwareCarpentry.org model (NLeSC & SURFsara have trained course leaders)

Dealing with Data

  • Data Stewardship and FAIR best practice
  • From Excel to databases
  • SQL practical
  • R practical

Computation & Automation

  • Using the Shell
  • Introduction to programming (Python)
  • GitHub and version control
  • Unit testing
  • Debugging
  • Documentation

Introduction to the e-Infrastructure

  • SURF to introduce the national e-infrastructure
  • Real world examples of e-infrastructure enhanced research from NLeSC

slide-68
SLIDE 68
slide-69
SLIDE 69
slide-70
SLIDE 70
slide-71
SLIDE 71
  • Make researchers more productive by teaching them basic lab skills for scientific computing
  • All lessons are freely available
  • Workshops, teacher trainings
  • Example lessons

– Version Control and Unit Testing for Scientific Software
– Shell, Git, Scientific Python
– Testing and Continuous Integration with Python
– From Excel to a Database
– Data Management in the Ocean, Weather and Climate Sciences
– Visualizing Your Data on the Web Using D3
– Working With Data on the Web
– Intermediate/Advanced R Lessons
– Programming with GAP

slide-72
SLIDE 72
  • Develop and teach workshops on the fundamental data skills for research in all domains
  • Covering the full lifecycle of data-driven research
  • Introductory computational skills for data management and analysis
  • Domain-specific lessons, from life and physical sciences to social sciences
  • Build on existing knowledge, enabling quick application of new skills to own research
  • Examples:

– Ecology: Data Organization in Spreadsheets, Data Cleaning with OpenRefine, Data Management with SQL, Data Analysis and Visualization in R, Data Analysis and Visualization in Python
– Genomics: Introduction to cloud computing for genomics, Introduction to the command line, Data wrangling and processing, Data analysis in R, Data visualization in R
– Social sciences: Social sciences text mining
– Biology
– Geospatial data

slide-73
SLIDE 73

Coding Style

  • Nicholas C. Zakas: Why Coding Style Matters
  • http://coding.smashingmagazine.com/2012/10/25/why-coding-style-matters
  • Use is mandatory
  • We provide editor configuration
  • http://editorconfig.org/

EditorConfig

slide-74
SLIDE 74

Conventions & Guidelines

  • Web development

– General frontend guidelines: https://github.com/bendc/frontend-guidelines
– AngularJS: https://github.com/johnpapa/angular-styleguide
– Airbnb JavaScript Style Guide: https://github.com/airbnb/javascript

  • Python

– PEP8: https://www.python.org/dev/peps/pep-0008/

  • Java

– Code Conventions for the Java™ Programming Language (Oracle)

  • Google Style Guides: https://github.com/google/styleguide
  • Wikipedia: https://en.wikipedia.org/wiki/Coding_conventions
slide-75
SLIDE 75

Quality Improvement Tools

  • SonarQube: http://www.sonarqube.org
  • Code climate: https://codeclimate.com
  • Codacy: https://www.codacy.com
  • Scrutinizer: https://scrutinizer-ci.com
  • Landscape: https://landscape.io
  • Coveralls: https://coveralls.io
  • See also

– https://github.com/ripienaar/free-for-dev#code-quality
– http://shields.io/

Article about good development practices: The Joel Test: 12 Steps to Better Code.

slide-76
SLIDE 76

Unit & Integration Testing

  • Guide: Writing Testable Code
  • 'Unit Testing Best Practices' and other presentations on http://artofunittesting.com/.
  • Continuous integration testing with Travis-CI and Jenkins-CI
  • We require at least 70% code coverage
  • Java: JUnit
  • JavaScript

– Jasmine, a behavior-driven development framework for testing JavaScript code.
– Karma, runs tests in web browser with code coverage.
– PhantomJS, headless web browser on CI servers.

  • Python

– unittest, nose and pytest.

  • R

– testthat

  • Web development

– To interact with web browsers use Selenium.
– Sauce Labs hosts a matrix of web browsers and operating systems for testing.
– AngularJS applications can be tested with Protractor.
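
As a concrete illustration of the Python option above, here is a minimal unit test written with the standard-library unittest module. The function under test and its name are hypothetical and chosen only to keep the example small.

```python
import unittest


def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to kelvin."""
    if celsius < -273.15:
        raise ValueError("temperature below absolute zero")
    return celsius + 273.15


class CelsiusToKelvinTest(unittest.TestCase):
    def test_freezing_point(self):
        self.assertAlmostEqual(celsius_to_kelvin(0.0), 273.15)

    def test_rejects_impossible_temperature(self):
        # The error path needs a test too, otherwise it stays uncovered.
        with self.assertRaises(ValueError):
            celsius_to_kelvin(-300.0)
```

Saved as, say, test_temperature.py (a hypothetical file name), the tests run with `python -m unittest test_temperature`. One way to check a coverage threshold is coverage.py: `coverage run -m unittest` followed by `coverage report --fail-under=70`.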

slide-77
SLIDE 77

Documentation

  • Document at multiple levels

– Source code comments
– API documentation
– Installation and usage documentation

  • Comments at each level should take into account the different target audiences
  • Use Markdown, a readable lightweight markup language that can be converted to many formats

slide-78
SLIDE 78

Version Control

  • Git and GitHub
  • A successful and simple Git branching model: GitHub Flow
  • https://guides.github.com/introduction/flow/
  • Commit messages are formatted and formulated in a readable way

  • http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html
  • http://who-t.blogspot.nl/2009/12/on-commit-messages.html
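
A readable commit message can be checked mechanically. The sketch below assumes the widely used convention of a short subject line (50 characters or fewer), a blank second line, and body lines wrapped at 72 characters. The limits and the function name are assumptions, not an NLeSC tool.

```python
def check_commit_message(message, subject_limit=50, body_limit=72):
    """Return a list of formatting problems; an empty list means the message passes."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["missing subject line"]
    problems = []
    if len(lines[0]) > subject_limit:
        problems.append("subject longer than %d characters" % subject_limit)
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    for number, line in enumerate(lines[2:], start=3):
        if len(line) > body_limit:
            problems.append("line %d longer than %d characters" % (number, body_limit))
    return problems


good = "Fix off-by-one error in tile index\n\nThe loop skipped the last tile."
print(check_commit_message(good))
```

A check like this could run in a Git commit-msg hook, though whether to enforce it automatically is a team decision.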
slide-79
SLIDE 79

Releases and packaging

  • Tag versions, use GitHub releases
  • Semantic versioning
  • Keep changelogs
  • Packaging is important

– Use packaging that is well known and appropriate for user community: PyPI, npm, Maven, Docker

  • Make your code and data citable: get a DOI (Zenodo)
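
Semantic versioning makes version tags comparable by machine. The sketch below handles only the core MAJOR.MINOR.PATCH form, with an optional leading "v" as often used in Git tags. Pre-release and build metadata are deliberately left out, and the function names are hypothetical.

```python
import re

SEMVER_CORE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag):
    """Turn 'v1.10.2' or '1.10.2' into the tuple (1, 10, 2)."""
    match = SEMVER_CORE.match(tag.lstrip("v"))
    if match is None:
        raise ValueError("not a MAJOR.MINOR.PATCH version: %r" % tag)
    return tuple(int(part) for part in match.groups())


def is_breaking_change(old, new):
    """A higher MAJOR number signals an incompatible API change."""
    return parse_version(new)[0] > parse_version(old)[0]


print(parse_version("v1.10.2"))
print(parse_version("1.10.0") > parse_version("1.9.3"))
print(is_breaking_change("1.4.0", "2.0.0"))
```

Comparing tuples of integers avoids the classic mistake of comparing version strings, where "1.10.0" would sort before "1.9.3".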
slide-80
SLIDE 80

Support levels

  • S0: generic software or hibernating software that is currently not used in the NLeSC project portfolio

– Fortran, Python, vBrowser, TwiNL, XNAT, …
– No support, not disseminated

  • S1: software on which NLeSC maintains expertise and that is used in projects, as well as external software that NLeSC extends and improves

– Potree, OpenDA, ElasticSearch
– Support for project partners only
– Contribute improvements back to community

  • S2: software developed in-house, where NLeSC is the specialist

– Xenon, Magnesium, Osmium, xtas, esiBayes, Aether, …
– Full support for project partners, limited support for Dutch scientific community, best effort for international community

slide-81
SLIDE 81

IP & Software Licenses

  • NLeSC does not develop an IP portfolio
  • Ownership of research results is shared property of partners, including NLeSC

  • IP protection is possible
  • Software is open source / open access unless agreed otherwise
  • Default software license: Apache 2.0
  • Deviations possible, discuss with NLeSC MT