Software Heritage: Building the Universal Software Archive


slide-1
SLIDE 1

Software Heritage

Building the Universal Software Archive
Nicolas Dandrimont
July 4th, 2016

Nicolas Dandrimont Software Heritage July 4th, 2016 1 / 22

slide-2
SLIDE 2

Outline

1. The need for Software Preservation
   - Software all around us
   - Software is Fragile
2. The Software Heritage project
   - Our mission
   - Our vision
3. Software Heritage in depth
   - Our current work
   - Our roadmap
4. How to contribute to Software Heritage
   - Developer information
   - Sponsoring opportunities
5. Conclusion

slide-4
SLIDE 4

Software is Pervasive

At the heart of our society
- communication, entertainment
- administration, finance
- health, energy, transportation
- education, research, politics
- ...

At the heart of technology
- house appliances ≈ 10M SLOC
- phones ≈ 20M SLOC, cars ≈ 100M SLOC
- Internet of things, ...

slide-7
SLIDE 7

Software is Knowledge

Key mediator for accessing all information
(c) Banksy

Information is a main pillar of our modern societies.

Absent an ability to correctly interpret digital information, we are left with [...] "rotting bits" [...] of no value.
Vinton G. Cerf, IEEE 2011

Software is an essential component of modern scientific research

[...] the vast majority describe experimental methods or software that have become essential in their fields.
Top 100 papers (Nature, October 2014)

Bottom line: Software embodies our Knowledge and Cultural Heritage
It must be collected, preserved, referenced and made accessible!

slide-8
SLIDE 8

Software is Fragile

Bits rot, hosters shut down
- Have you tested your backups recently? How about git fsck?
- Gitorious
- Google Code

Software is scattered all around
- GitHub, GitLab, BitBucket, SourceForge, alioth, ...
- ... your personal home page, ...

No uniformity or stability whatsoever
- Software migrates from hoster to hoster, URIs aren’t perennial

slide-10
SLIDE 10

Our mission

Collect, organise, preserve and share all the software source code that lies at the heart of our culture and our society.
https://www.softwareheritage.org/

slide-11
SLIDE 11

Software Source Code is different

“Programs must be written for people to read, and only incidentally for machines to execute.”
Harold Abelson, Structure and Interpretation of Computer Programs

Distinguishing features
- executable and human readable knowledge (an all-time first)
  - even hardware is... software! (VHDL, FPGA, ...)
  - text files are forever
- naturally evolves over time
  - the development history is key to its understanding
- complex: large web of dependencies, millions of SLOCs

In a word
- software is not just another sequence of bits
- a software archive is not just another digital archive

slide-12
SLIDE 12

We are working on the foundations

One infrastructure to build them all


slide-13
SLIDE 13

Preserving the world’s software heritage

A structured archive of all of the world’s software
- preserve humanity’s technological and scientific knowledge
- enable continued access to all digital documents and information
- building block for thematic portals and collections

slide-14
SLIDE 14

Better software for industry

A unique reference catalog of all industrial software components
- ensures long term preservation of critical software
- eases vulnerability tracking for more secure software
- simplifies traceability for better software integration

slide-15
SLIDE 15

Supporting more accessible and reproducible science

A global library referencing all software used in all research fields
- completes the infrastructure for Open Access in science
- provides intrinsic persistent identifiers needed for scientific reproducibility
- enables large scale, verifiable software studies

slide-17
SLIDE 17

Meet the team

- Roberto Di Cosmo, CEO
- Stefano Zacchiroli, CTO
- Antoine Dumont and Nicolas Dandrimont, Engineers
- Jordi Bertran de Balanda and Quentin Campos, Interns
- Guillaume Rousseau, Visiting Scientist

slide-18
SLIDE 18

Our stack

Hardware
- Hosted by Inria
- One big hypervisor with a dozen virtual machines
- One high density storage array (60 * 6TB hard drives => 300TB usable)
- Another copy of the data in another server room; logical leader/follower mirroring
- Soon to enable a mirror network to duplicate our contents

Software
- Debian for all our machines
- PostgreSQL for metadata storage
- Python3 and psycopg2 for the backend
- Flask for the web apps
- RabbitMQ for task scheduling

slide-19
SLIDE 19

Our values are those of Debian

100% FOSS licenses
- GPLv3 for the backend code
- AGPLv3 for the frontend
- Apache2 for the Puppet manifests

Community-minded
We encourage bug reports and code contributions from everyone interested in pursuing our software preservation mission.

slide-20
SLIDE 20

Source Code

Our forge opens today
https://forge.softwareheritage.org/

We’ve timed the opening of our forge for DebConf, as a thank you for what the community has given to us.

slide-21
SLIDE 21

Current data sources

Ingest all the software
- all the "non-fork" GitHub repositories
- all the Debian packages from snapshot.debian.org
- the GNU project FTP archive

Preserve all the software
- Google Code
- Gitorious

slide-22
SLIDE 22

The structure of the archive

On-disk storage
- flat file storage for contents
- PostgreSQL database for the metadata

Data model: one big Merkle DAG, inspired by the git model
- Origins (= repositories)
- Occurrences (= branches)
- Releases (= tags)
- Revisions (= commits)
- Directories (= trees)
- Contents (= blobs)
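
To make the Merkle DAG idea concrete, here is a minimal Python sketch of git-style identifiers for contents and directories. The function names (git_object_id, content_id, directory_id) are made up for this illustration and are not the archive's API. The hashing follows git's object format (SHA-1 over a "<type> <size>\0" header plus the payload); the slide only names git as an inspiration, so the archive's real identifier scheme may differ.

```python
import hashlib
from typing import Dict, Tuple


def git_object_id(kind: str, payload: bytes) -> str:
    """Hash a payload the way git does: SHA-1 over '<kind> <size>' + NUL + payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def content_id(data: bytes) -> str:
    """Identifier of a content (blob): depends only on the file's bytes."""
    return git_object_id("blob", data)


def directory_id(entries: Dict[str, Tuple[str, str]]) -> str:
    """Identifier of a directory (tree).

    `entries` maps a name to (mode, hex id of the child object). The id covers
    the ids of the children, so any change below propagates up to the root.
    Simplification (assumption of this sketch): entries are sorted by plain
    name, whereas git sorts sub-directories as if their name ended with '/'.
    """
    payload = b"".join(
        f"{mode} {name}".encode() + b"\0" + bytes.fromhex(child)
        for name, (mode, child) in sorted(entries.items())
    )
    return git_object_id("tree", payload)


# Two versions of a one-file directory: changing the file changes the root id.
readme_v1 = content_id(b"hello world\n")
readme_v2 = content_id(b"hello, world\n")
root_v1 = directory_id({"README": ("100644", readme_v1)})
root_v2 = directory_id({"README": ("100644", readme_v2)})
print(root_v1 != root_v2)
```

Because a directory id covers the ids of its children, editing one file changes every id on the path up to the root, while unchanged subtrees keep their ids and only need to be stored once.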


slide-24
SLIDE 24

Software Heritage in numbers

Volume
- 120TB used by (gzipped) files on disk
- 3.1TB PostgreSQL database for the metadata

Counts
- 2.7 billion files
- 2.2 billion directories
- 600 million revisions
- 12 million people
- 5 million releases

By far, the biggest DVCS tree in existence

slide-26
SLIDE 26

The road ahead

Planned features...
- lookup by hashes for contents (done)
- provenance information for all the content
- browsing: wayback machine for software source code
- full text search: dive into the Software Heritage archive
- download: git clone from Software Heritage

... and many more one could imagine
- all the world’s software development history in a single graph!
- that makes a 3.1TB database already...
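
As an illustration of lookup by hash, the following Python sketch shows a toy content-addressed store: each distinct content is written once, gzipped, under a path derived from its SHA-1, and can be fetched back by that hash. The class name ContentStore, the two-character directory sharding, and the use of a plain SHA-1 of the bytes are assumptions of this sketch; the talk does not describe the archive's actual on-disk layout or API.

```python
import gzip
import hashlib
from pathlib import Path
from typing import Optional


class ContentStore:
    """Toy content-addressed store: one gzipped file per distinct content."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, sha1_hex: str) -> Path:
        # Hypothetical layout: shard files by the first two hex digits.
        return self.root / sha1_hex[:2] / sha1_hex

    def add(self, data: bytes) -> str:
        """Store `data` (if not already present) and return its hash."""
        key = hashlib.sha1(data).hexdigest()
        path = self._path(key)
        if not path.exists():  # identical contents are stored only once
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(gzip.compress(data))
        return key

    def lookup(self, sha1_hex: str) -> Optional[bytes]:
        """Return the content for a hash, or None when it is not in the store."""
        path = self._path(sha1_hex)
        if not path.exists():
            return None
        return gzip.decompress(path.read_bytes())
```

With this design, answering "is this file already archived?" only requires hashing the file locally and checking whether the corresponding path exists; the file itself never has to be uploaded or compared byte by byte.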


slide-27
SLIDE 27

Technical Challenges for data sources

Handling the backlog
- ingesting all the data saved one-shot

Staying current
- new repositories and commits on GitHub
- sync with snapshot.debian.org
- ...

We need reliable, standardised event feeds

Expanding the archive
- discover and classify all the software sources
- importers for other VCSs (SVN, Hg, ...)

Wonderful playground if you have time to help!
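
One way to picture "staying current" is as an incremental walk over revision history: start from the current branch heads of a repository and collect only the revisions that are not archived yet. The Python sketch below is a hypothetical illustration, not the project's loader. The name missing_revisions and its parameters are invented here, and it assumes that an archived revision was stored together with its whole ancestry, so the walk can stop there.

```python
from typing import Callable, Iterable, List, Set


def missing_revisions(
    heads: Iterable[str],
    parents_of: Callable[[str], List[str]],
    archived: Set[str],
) -> List[str]:
    """Return the revisions reachable from `heads` that are not archived yet.

    The walk stops at any archived revision (assumption of this sketch: an
    archived revision implies its ancestors were archived with it).
    """
    todo = list(heads)
    seen: Set[str] = set()
    missing: List[str] = []
    while todo:
        rev = todo.pop()
        if rev in seen or rev in archived:
            continue
        seen.add(rev)
        missing.append(rev)
        todo.extend(parents_of(rev))
    return missing


# Toy history c1 <- c2 <- c3, where only c1 is already archived.
history = {"c3": ["c2"], "c2": ["c1"], "c1": []}
print(missing_revisions(["c3"], lambda rev: history[rev], {"c1"}))  # ['c3', 'c2']
```

The cost of a sync then depends on how much is new since the last visit rather than on the total size of the repository, which is what makes frequent re-visits of a large number of origins plausible.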


slide-29
SLIDE 29

Join our community of developers

- Our forge is open at https://forge.softwareheritage.org/
- Subscribe to our mailing list: https://deb.li/swhdevelml
- Have a look at our wiki: https://wiki.softwareheritage.org/

slide-30
SLIDE 30

Join our sponsors

Inria as initiator
- French national institute for research in Computer Science
- Contributed to the birth of W3C
- 4500 people, many prestigious scientists
- In the news: Freak and Logjam TLS vulnerabilities

Inria is fully supporting the bootstrap phase of Software Heritage.

More info for sponsors on our website: https://www.softwareheritage.org/support/sponsors/

slide-33
SLIDE 33

Conclusion

Software Heritage is
- a revolutionary reference archive of all software ever written
- a unique complement for development platforms like GitHub
- an international, open, nonprofit, mutualized infrastructure
- ready to work with you!

Questions?

Contact us
- email: olasd@softwareheritage.org / info@softwareheritage.org
- website: https://www.softwareheritage.org