SLIDE 5 Challenges
1.Store and organize 100’s of TB’s of large files with multiple
meta-data schema
2.Run informatics analyses that do distributed computations on
very large datasets
3.Do real-time high-performance queries on compact genome
data (e.g. variants)
4.Ensure validity and maintain provenance on all data in the
system over time
5.Make it easy to reproduce pipelines exactly as they were done
in the past
6.Protect all data with flexible access control rules and strong
encryption
7.Share large data sets between data centers and organizations
without physically moving data
4