

SLIDE 1

Tools for large-scale collection & analysis of source code repositories

OPEN SOURCE GIT REPOSITORY COLLECTION PIPELINE

SLIDE 2

Alexander Bezzubov source{d}

➔ committer & PMC @ Apache Zeppelin
➔ engineer @ source{d}, a startup in Madrid
➔ builds the open-source components that enable large-scale code analysis and machine learning on source code

Intro

SLIDE 3

motivation & vision

MOTIVATION: WHY COLLECT SOURCE CODE

➔ Academia: material for research in the IR/ML/PL communities
➔ Industry: fuel for building data-driven products (e.g. for sourcing candidates for hiring)

VISION:

➔ An OSS collection pipeline; use it to build public datasets for industry and academia
➔ Use Git, the most popular VCS, as the “source of truth”
➔ A crawler (find URLs, git clone them): custom, in Go
➔ Distributed storage (FS, DB): standard, Apache HDFS + Postgres
➔ A parallel processing framework: custom, a library for Apache Spark

SLIDE 4

Tech Stack

SLIDE 5

tech stack

infrastructure: CoreOS, K8s
collection: Rovers, Borges, go-git
storage: HDFS, śiva
processing: Apache Spark, source{d} Engine
analysis: Bblfsh, Enry

SLIDE 6

tech stack

infrastructure: CoreOS, K8s
collection: Rovers, Borges, go-git
storage: HDFS, śiva
processing: Apache Spark, source{d} Engine
analysis: Bblfsh, Enry

SLIDE 7

infrastructure

➔ Dedicated cluster (cloud becomes prohibitively expensive for storing ~100s of TB)
➔ CoreOS provisioned on bare metal with Terraform
➔ Booting and OS configuration via Matchbox and Ignition
➔ K8s deployed on top of that

More details in the CfgMgmtCamp talk: http://cfgmgmtcamp.eu/schedule/terraform/CoreOS.html

SLIDE 8

tech stack

infrastructure: CoreOS, K8s
collection: Rovers, Borges, go-git
storage: HDFS, śiva
processing: Apache Spark, source{d} Engine
analysis: Bblfsh, Enry

SLIDE 9

collection

➔ Rovers: search for Git repository URLs
➔ Borges: fetch repositories with “git pull”
➔ go-git: Git storage format & protocol implementation
➔ Optimized for on-disk size: forks that share history are saved together

go-git is used to talk Git; it had a dedicated talk at FOSDEM last year: https://archive.fosdem.org/2017/schedule/event/go_git/

SLIDE 10

GIT LIBRARY FOR GO

  • need to clone and analyze tens of millions of repositories with our core language, Go
  • be able to do so in memory, and by using custom filesystem implementations
  • easy-to-use and stable API for the Go community
  • used in production by companies, e.g. keybase.io

motivation

go-git A HIGHLY EXTENSIBLE IMPLEMENTATION OF GIT IN GO

GO-GIT IN ACTION

example

  • the most complete git library for any language after libgit2 and jgit
  • highly extensible by design
  • idiomatic API for plumbing and porcelain commands
  • 2+ years of continuous development
  • used by a significant number of open source projects

PURE GO SOURCE CODE

features

example mimicking `git clone` using go-git:

usage

  • https://github.com/src-d/go-git
  • go-git presentation at FOSDEM 2017
  • go-git presentation at Git Merge 2017
  • compatibility table of git vs. go-git
  • comparing git trees in go

resources

YOUR NEXT STEPS TRY IT YOURSELF

# installation
$ go get -u gopkg.in/src-d/go-git.v4/...

  • list of more go-git usage examples

// Clone the repo to the given directory
url := "https://github.com/src-d/go-git"
_, err := git.PlainClone("/tmp/foo", false, &git.CloneOptions{
    URL:      url,
    Progress: os.Stdout,
})
CheckIfError(err)

  • output:

Counting objects: 4924, done.
Compressing objects: 100% (1333/1333), done.
Total 4924 (delta 530), reused 6 (delta 6), pack-reused 3533

SLIDE 11

CODE COLLECTION AT SCALE

  • collection and storage of repositories at large scale
  • automated process
  • optimal usage of storage
  • optimal to keep repositories up-to-date with the origin

motivation

rovers & borges LARGE SCALE CODE REPOSITORY COLLECTION AND STORAGE

KEY CONCEPT

  • set up and run rovers
  • set up borges
  • run borges producer
  • run borges consumer

architecture

  • distributed system similar to a search engine
  • src-d/rovers retrieves URLs from git hosting providers via API, plus self-hosted git repositories
  • src-d/borges producer reads the URL list and schedules fetching
  • borges consumer fetches repositories and pushes them to storage
  • borges packer is also available as a standalone command, transforming repository URLs into śiva files
  • stores repositories using the src-d/śiva storage file format
  • optimized for storage and for keeping repos up-to-date
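The producer/consumer split above can be sketched with plain Go channels. This is a toy illustration of the queueing pattern only, not borges code; `fetchRepo` and the URL list are made up:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchRepo stands in for what a consumer does with one URL:
// clone the repository and push it to storage. (Hypothetical helper.)
func fetchRepo(url string) string {
	return "stored:" + url
}

func main() {
	urls := []string{
		"https://github.com/src-d/go-git",
		"https://github.com/src-d/borges",
		"https://github.com/src-d/rovers",
	}

	// The producer reads the URL list and schedules fetch jobs on a queue.
	jobs := make(chan string)
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Several consumers drain the queue concurrently.
	results := make(chan string, len(urls))
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fetchRepo(u)
			}
		}()
	}
	wg.Wait()
	close(results)

	for r := range results {
		fmt.Println(r)
	}
}
```

In the real system the "queue" is durable and shared between machines, but the shape of the pipeline is the same: one scheduler, many stateless workers.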

SEEK, FETCH, STORE

architecture

  • rooted repositories are standard git repositories that store all objects from all repositories sharing a common history, identified by the same initial commit
  • a rooted repository is saved in a single śiva file
  • updates are stored as concatenated śiva files: no need to rewrite the whole repository file
  • distributed-file-system backed; supports GCS & HDFS
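The rooted-repository idea above, bucketing forks by the hash of their initial commit, can be sketched in a few lines of Go. The hashes and helper names here are invented for illustration; the real logic lives in src-d/borges:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// repo pairs a clone URL with the hash of its initial (root) commit.
type repo struct {
	URL        string
	RootCommit string
}

// groupByRoot buckets repositories that share the same initial commit,
// i.e. forks of each other, into one "rooted repository"; each bucket
// would be stored as a single śiva file.
func groupByRoot(repos []repo) map[string][]string {
	rooted := make(map[string][]string)
	for _, r := range repos {
		rooted[r.RootCommit] = append(rooted[r.RootCommit], r.URL)
	}
	return rooted
}

func main() {
	// simulated root-commit hashes
	root := fmt.Sprintf("%x", sha1.Sum([]byte("initial commit")))
	other := fmt.Sprintf("%x", sha1.Sum([]byte("unrelated initial commit")))

	repos := []repo{
		{"https://github.com/upstream/project", root},
		{"https://github.com/fork/project", root}, // fork shares history
		{"https://github.com/someone/else", other},
	}

	for rootHash, urls := range groupByRoot(repos) {
		fmt.Printf("%s.siva <- %d repo(s)\n", rootHash[:7], len(urls))
	}
}
```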

usage

  • https://github.com/src-d/rovers
  • https://github.com/src-d/borges
  • https://github.com/src-d/go-siva
  • śiva: Why We Created Yet Another Archive Format

resources

YOUR NEXT STEPS SETUP & RUN

SLIDE 12

tech stack

infrastructure: CoreOS, K8s
collection: Rovers, Borges, go-git
storage: HDFS, śiva
processing: Apache Spark, source{d} Engine
analysis: Bblfsh, Enry

SLIDE 13

storage

➔ Metadata: PostgreSQL
➔ Built a small type-safe ORM for Go <-> Postgres: https://github.com/src-d/go-kallax
➔ Data: Apache Hadoop HDFS
➔ Custom (seekable, appendable) archive format: śiva; 1 rooted repository <-> 1 śiva file

SLIDE 14

SMART REPO STORAGE

  • store a git repository in a single file
  • updates possible without rewriting the whole file
  • friendly to distributed file systems
  • seekable to allow random access to any file position

motivation

śiva SEEKABLE INDEXED BLOCK ARCHIVER FILE FORMAT

SIVA FILE BLOCK SCHEMA

architecture

CHARACTERISTICS

  • src-d/go-siva is an archiving format similar to tar or zip
  • allows constant-time random file access
  • allows seekable read access to the contained files
  • allows file concatenation given the block-based design
  • command-line tool + implementations in Go and Java

architecture usage

  • https://github.com/src-d/go-siva
  • śiva: Why We Created Yet Another Archive Format

resources

YOUR NEXT STEPS APPENDING FILES

# pack into a siva file
$ siva pack example.siva qux
# append into the siva file
$ siva pack --append example.siva bar
# list siva file contents
$ siva list example.siva
Sep 20 13:04 4 B qux -rw-r--r--
Sep 20 13:07 4 B bar -rw-r--r--

SLIDE 15

tech stack

infrastructure: CoreOS, K8s
collection: Rovers, Borges, go-git
storage: HDFS, śiva
processing: Apache Spark, source{d} Engine
analysis: Bblfsh, Enry

SLIDE 16

processing

Apache Spark

➔ For batch processing, SparkSQL

Engine

➔ A library with a custom DataSource implementation, GitDataSource
➔ Reads repositories from śiva archives in HDFS, exposes them through DataFrames
➔ API for accessing refs/commits/files/blobs
➔ Talks to external services through gRPC for parsing/lexing and other analysis

SLIDE 17

YOUR NEXT STEPS UNIFIED SCALABLE PIPELINE

  • easy-to-use pipeline for git repository analysis
  • integrated with standard tools for large scale data analysis
  • avoid custom code in operations across millions of repos

motivation

engine UNIFIED SCALABLE CODE ANALYSIS PIPELINE ON SPARK

APACHE SPARK DATAFRAME

architecture

  • listing and retrieval of git repositories
  • Apache Spark datasource on top of git repositories
  • iterators over any git object, references
  • code exploration and querying using XPath expressions
  • language identification and source code parsing
  • feature extraction for machine learning at scale

PREPARATION

architecture usage sample

  • https://github.com/src-d/engine
  • early example Jupyter notebook: https://github.com/src-d/spark-api/blob/master/examples/notebooks/Example.ipynb

resources

EngineAPI(spark, 'siva', '/path/to/siva-files')
  .repositories
  .references.head_ref
  .files
  .classify_languages()
  .extract_uasts()
  .query_uast('//*[@roleImport and @roleDeclaration]', 'imports')
  .filter("lang = 'java'")
  .select('imports', 'path', 'repository_id')
  .write
  .parquet("hdfs://...")

  • extends Apache SparkSQL
  • git repositories stored as śiva files or standard repositories in HDFS
  • metadata caching for faster lookups over the whole dataset
  • fetches repositories in batches and on demand
  • available APIs for Spark and PySpark
  • can run either locally or in a distributed cluster
SLIDE 18

tech stack

infrastructure: CoreOS, K8s
collection: Rovers, Borges, go-git
storage: HDFS, śiva
processing: Apache Spark, source{d} Engine
analysis: Bblfsh, Enry

SLIDE 19

analysis

Enry

➔ Programming language identification
➔ A re-write of github/linguist in Go, ~370 languages

Project Babelfish

➔ Distributed parser infrastructure for source code analysis
➔ Unified interface through gRPC to native parsers in containers: src -> UAST

Talk in the Source Code Analysis devroom, Room UD2.119, Sunday 12:40:
https://fosdem.org/2018/schedule/event/code_babelfish_a_universal_code_parser_for_source_code_analysis/
SLIDE 20
  • enry speed improvement over linguist, measured on the file samples in the linguist/samples folder
  • usable in Go as a native library, in Java as a shared library, and as a CLI tool

LANG DETECTION AT SCALE

  • need to detect the programming language of every file in a git repository
  • initially used github/linguist, but needed more performance for large-scale applications
  • keeps compatibility with the original linguist project

motivation

enry A FASTER FILE PROGRAMMING LANGUAGE DETECTOR

benchmarks

COMPATIBLE AND FLEXIBLE

  • linguist as source of information on language detection
  • ignores binary and vendored files
  • command line tool mimics the original linguist one
  • can be used in Go (native library) or Java (shared library)
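As a rough illustration of the cheapest detection strategy (extension matching, which linguist-style detectors try before falling back to content heuristics and a classifier), here is a stdlib-only sketch; the extension table and the percentage summary are simplified stand-ins, not the real enry API:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// byExtension is a tiny first-pass lookup table. Real detectors carry
// hundreds of languages and resolve ambiguous extensions with content
// heuristics, both omitted here.
var byExtension = map[string]string{
	".go":  "Go",
	".sh":  "Shell",
	".md":  "Markdown",
	".txt": "Text",
}

// detect returns the language for a path, or "Unknown" when the
// extension-only strategy cannot decide.
func detect(path string) string {
	if lang, ok := byExtension[filepath.Ext(path)]; ok {
		return lang
	}
	return "Unknown"
}

func main() {
	files := []string{"main.go", "worktree.go", "install.sh", "README.md"}

	// tally per-language counts, like the CLI summary output
	counts := map[string]int{}
	for _, f := range files {
		counts[detect(f)]++
	}
	for lang, n := range counts {
		fmt.Printf("%6.2f%% %s\n", 100*float64(n)/float64(len(files)), lang)
	}
}
```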

architecture usage

  • https://github.com/src-d/enry
  • enry: detecting languages
  • benchmark methodology and results

resources

YOUR NEXT STEPS

  • src-d/enry is at least 4x faster than linguist
  • 5x (larger repos) to 20x faster (smaller repos)

GO FASTER

  • additional info on benchmarking enry

$ enry /path/to/src-d/go-git
98.28% Go
0.69%  Shell
0.34%  Makefile
0.34%  Markdown
0.34%  Text

SLIDE 21

UNIVERSAL CODE ANALYSIS

  • was born as a solution for massive code analysis
  • parsing single files in any programming language
  • analyze all source code from all repositories in the world
  • analyze many languages using a shared structure/format

motivation

babelfish A SELF-HOSTED SERVER FOR UNIVERSAL SOURCE CODE PARSING

CONTAINER-BASED

architecture

  • AST-based diffing: understanding changes made to code at a finer granularity
  • extract features for machine learning on source code
  • statistics of language features
  • detecting similar coding patterns across languages

POWERFUL OPPORTUNITIES

use cases

  • language drivers as the main building blocks
  • parsing service via one driver per language
  • language drivers can be written in any language and are packaged as standard Docker containers
  • containers are executed by the babelfish server in a specific runtime built on top of libcontainer

usage

  • https://github.com/bblfsh
  • Babelfish documentation
  • announcing Babelfish
  • Babelfish presentation
  • join the Babelfish community

resources

YOUR NEXT STEPS UNIVERSAL AST

architecture

  • UAST is a universal (normalized and annotated) form of the Abstract Syntax Tree (AST)
  • language-independent annotations (roles) such as Expression, Statement, Operator, Arithmetic, etc.
  • can be easily ported to many languages using gogo/protobuf
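The role-based query from the Engine slide (`//*[@roleImport and @roleDeclaration]`) boils down to selecting nodes that carry both roles. A toy Go sketch of that tree walk, with a made-up `Node` type standing in for real UAST nodes:

```go
package main

import "fmt"

// Node is a toy stand-in for a UAST node: a language-specific type
// plus language-independent role annotations.
type Node struct {
	Type     string
	Roles    []string
	Token    string
	Children []*Node
}

// hasRoles reports whether the node carries every requested role.
func (n *Node) hasRoles(want ...string) bool {
	set := make(map[string]bool, len(n.Roles))
	for _, r := range n.Roles {
		set[r] = true
	}
	for _, w := range want {
		if !set[w] {
			return false
		}
	}
	return true
}

// find walks the tree collecting nodes that carry all the given roles:
// the tree-walking equivalent of //*[@roleImport and @roleDeclaration].
func find(n *Node, roles ...string) []*Node {
	var out []*Node
	if n.hasRoles(roles...) {
		out = append(out, n)
	}
	for _, c := range n.Children {
		out = append(out, find(c, roles...)...)
	}
	return out
}

func main() {
	// a fragment of what a parsed Java file might annotate to
	tree := &Node{Type: "CompilationUnit", Roles: []string{"File"}, Children: []*Node{
		{Type: "ImportDeclaration", Roles: []string{"Import", "Declaration"}, Token: "java.util.List"},
		{Type: "ClassDeclaration", Roles: []string{"Type", "Declaration"}, Token: "Main"},
	}}

	for _, n := range find(tree, "Import", "Declaration") {
		fmt.Println(n.Token)
	}
}
```

Because roles are language-independent, the same query works unchanged on a tree produced from Python, Java, or Go source.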

TRY BABELFISH ONLINE

  • or run the babelfish server & dashboard locally:

$ docker run --privileged -d -p 9432:9432 --name bblfsh bblfsh/server
$ docker run -p 8080:80 --link bblfsh bblfsh/dashboard -bblfsh-addr bblfsh:9432
SLIDE 22

tech stack

infrastructure: CoreOS, K8s
collection: Rovers, Borges, go-git
storage: HDFS, śiva
processing: Apache Spark, source{d} Engine
analysis: Bblfsh, Enry

SLIDE 23

Further directions

SLIDE 24

further directions

INFRASTRUCTURE

➔ Persistent storage in k8s on bare-metal cluster

➔ Collection: explore a SEDA architecture to dynamically saturate throughput
➔ Storage: a better splittable Git object storage format (with delta-encoding, etc.)
➔ Processing: distributed indexes to speed up common Apache Spark queries
➔ Analysis: AST-diff, cross-language abstractions on top of ASTs

SLIDE 25

thank you.