STAT 605 Data Science Computing
Introduction to Version Control: git
Some materials adapted from Pro Git by Scott Chacon and Ben Straub
It is useful to record and track the changes to a project over time
We want to do this locally (i.e., stored on our own machine, not in the cloud), but in a distributed manner (i.e., with multiple people working on the project at once).
For a more thorough discussion of why you should use version control, the problems that git seeks to solve, and how it solves them, see https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
Created by Linus Torvalds (also the creator of Linux)
Free and open-source, available at https://git-scm.com/
Installation
  Ubuntu: apt install git (you may need to use sudo)
  Windows/macOS: https://git-scm.com/downloads
Please see the lecture video for a demonstration
Image credit: S. Chacon and B. Straub. Pro Git
As we make changes to a project, git keeps track of those changes
This allows us to go back to an earlier version, if necessary
Option 1: create a git repository
  Take a directory on your machine
  Start tracking files in that directory
Option 2: clone an existing repository from elsewhere
  Take an existing git repo (e.g., an R package that you like)
  Create a copy of it on your local machine
  This also allows you to contribute back to a project, if you wish to do so
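The two options above can be sketched as follows. This is a minimal illustration, not the lecture demonstration: the directory names myproject and myproject-copy are made up, and the clone uses a local path so the example runs offline (a real clone would usually be given a URL).

```shell
# Work in a scratch directory so nothing on your machine is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Option 1: turn an ordinary directory into a git repository.
mkdir myproject            # hypothetical project directory
cd myproject
git init                   # creates the hidden .git directory
cd ..

# Option 2: clone an existing repository.
# Here we clone the (still empty) repository created above;
# git warns that it is empty but still creates the copy.
git clone myproject myproject-copy
```

After running this, both myproject and myproject-copy contain a .git directory, which is where git keeps the project history.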
Please see the lecture video for a demonstration
All files that git tracks are in one of three states at a given time
[Figure: file states (Untracked, Modified, Staged, Committed)]
A file in your git repository can be in one of three states:
Modified: the file has been changed, but is not yet committed to the database
Staged: a modified file that is ready to be included in the next snapshot
Committed: the file is stored in the database (i.e., a snapshot has been taken).
Note: not every file in a project directory has to be part of the repo. Thus, there may be files in a directory that are in none of these states, because they are not being tracked at all.
Please see the lecture video for a demonstration of adding files to the git repo and tracking changes.
1) Modify one or more files in your repository
2) Stage the changes that you wish to add to the next snapshot
3) Commit your changes. A snapshot of the staged files is stored.
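The modify, stage, commit cycle might look like the sketch below. The file name notes.txt and the user name and email are placeholders chosen for illustration; git status --short is used to watch the file move between states.

```shell
# Scratch repository for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity

echo "first draft" > notes.txt    # 1) modify (here: create) a file
git status --short                # "?? notes.txt": untracked
git add notes.txt                 # 2) stage the change
git status --short                # "A  notes.txt": staged
git commit -q -m "Add notes"      # 3) commit: the snapshot is stored
git status --short                # prints nothing: working tree is clean

echo "second draft" >> notes.txt  # change a tracked file
git status --short                # " M notes.txt": modified, not staged
```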
When you commit to the repo:
  Git stores a commit object with a pointer to the snapshot of the files you staged
  The object also includes additional information, e.g., commit author, message, etc.
  The commit object stores a pointer to its parent(s) (the commit(s) that came directly before)
Add three files for staging, and commit.
Now git has created a commit object, which includes a pointer to the root tree object of the project. A blob object is created for each newly committed file. Roughly speaking, tree objects correspond to UNIX/Linux directories, while blob objects correspond to files.
Commit object: created by git commit
Tree object: written by git commit to record the directory contents (git init only creates the empty repository)
Blob objects: correspond to the three files, created by git add
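To see these objects for yourself, one option is git cat-file, which prints the type or contents of a stored object. The sketch below assumes three small placeholder files (README, test.rb, LICENSE) and a placeholder identity; none of these names come from the lecture.

```shell
# Scratch repository for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity

# Add three files for staging, and commit.
echo "readme"  > README
echo "puts 1"  > test.rb
echo "license" > LICENSE
git add README test.rb LICENSE
git commit -q -m "Initial commit"

git cat-file -t HEAD            # prints the object type: commit
git cat-file -p HEAD            # tree pointer, author, committer, message
git cat-file -p 'HEAD^{tree}'   # root tree: one blob entry per file
```

The first commit has no parent line in the git cat-file -p HEAD output; every later commit lists at least one.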
Please see the lecture video for a demonstration of examining the git commit history
As we make additional changes and commit them, each commit points back to the commit immediately before it. A branch is simply a pointer to one of these commit objects.
Two different branches, both pointing to the same commit. HEAD points to the current branch, that is, the branch that we are currently working on.
Note: the master branch is not special; it is just the default name for the first branch created by git init (some installations are configured to use main instead).
HEAD points to the current (and only) branch, master.
Create a new branch called testing, pointing to the current commit. The new branch is created by git branch. HEAD still points to the current branch, master.
Please see the lecture video for a demonstration of creating a new branch with git branch and viewing the branch pointers using git log
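As a rough stand-in for that demonstration, the sketch below creates the testing branch and lists the branch pointers. The file name, commit message, and identity are placeholders, and git init -b (used to fix the first branch name to master) assumes Git 2.28 or newer.

```shell
# Scratch repository for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q -b master                         # -b requires Git 2.28+
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity

echo "v1" > file.txt
git add file.txt
git commit -q -m "First commit"

git branch testing             # new pointer to the current commit
git log --oneline --decorate   # the commit is labelled (HEAD -> master, testing)
git branch                     # the * marks the current branch, still master
```

Note that git branch only creates the pointer; it does not switch HEAD to the new branch.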
Here is a project with three commits, and a single branch.
Create a new branch called iss53, and switch HEAD to that branch.
Now we have a new branch, iss53, pointed to by HEAD (not shown). Any commits we make will be made to iss53, rather than master.
If we make changes and commit them, the current branch moves forward, while master remains unchanged.
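A small sketch of this situation: three commits on master, then a new branch iss53 that receives one more commit. The file name index.html, the commit messages, and the identity are placeholders; git init -b assumes Git 2.28 or newer.

```shell
# Scratch repository for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q -b master                         # -b requires Git 2.28+
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity

# A project with three commits and a single branch.
for i in 1 2 3; do
    echo "change $i" >> index.html
    git add index.html
    git commit -q -m "Commit $i"
done

# Create iss53 and switch HEAD to it in one step
# (shorthand for: git branch iss53, then git checkout iss53).
git checkout -q -b iss53

echo "fix for issue 53" >> index.html
git commit -q -a -m "Work on issue 53"

# iss53 has moved forward by one commit; master has not moved.
git log --oneline --decorate --all
```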
Please see the lecture video for a demonstration of switching between branches with git checkout and viewing the commit history of multiple branches.
If we make changes in both of our branches, then they will have divergent histories. The changes in the two branches are isolated from one another. Eventually, we may want to merge them.
Merge the changes made in branch iss53 into branch master.
New commit created by merge operation.
After a merge like this, we can typically delete the branch that we merged: git branch -d iss53
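The merge of two divergent branches can be sketched as below. The two branches touch different files, so git can combine them automatically. File names, commit messages, and the identity are placeholders; git init -b assumes Git 2.28 or newer.

```shell
# Scratch repository for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q -b master                         # -b requires Git 2.28+
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity

echo "home page" > index.html
git add index.html
git commit -q -m "Initial commit"

# Work on the issue in its own branch.
git checkout -q -b iss53
echo "footer" > footer.html
git add footer.html
git commit -q -m "Add footer for issue 53"

# Meanwhile, master also gets a new commit: the histories diverge.
git checkout -q master
echo "hotfix" > hotfix.txt
git add hotfix.txt
git commit -q -m "Apply hotfix"

# Merge iss53 into the current branch (master).
# --no-edit accepts the default merge message instead of opening an editor.
git merge -q --no-edit iss53

git log --oneline --graph    # the newest commit has two parents
git branch -d iss53          # safe to delete: its work is now in master
```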
Please see the lecture video for a demonstration of merging branches with git merge
What if we make changes to the same part of the same file in two branches? git may not know how to merge them, and we’ll get an error like...
Note: You can use git status to get more information about what went wrong. Files with merge conflicts will contain marked sections: one showing the contents of index.html in the HEAD branch, and one showing the contents of index.html in the branch being merged. We have to fix these sections before we can complete the merge!
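One way to reproduce a conflict and resolve it is sketched below. Both branches change the same line of index.html, so git cannot choose between them. The file contents, commit messages, and identity are placeholders; git init -b assumes Git 2.28 or newer.

```shell
# Scratch repository for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q -b master                         # -b requires Git 2.28+
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity

echo "<h1>Home</h1>" > index.html
git add index.html
git commit -q -m "Initial commit"

# Each branch rewrites the same line in a different way.
git checkout -q -b iss53
echo "<h1>Contact us</h1>" > index.html
git commit -q -a -m "Change heading in iss53"

git checkout -q master
echo "<h1>Welcome</h1>" > index.html
git commit -q -a -m "Change heading in master"

# The merge stops and reports:
#   CONFLICT (content): Merge conflict in index.html
# "|| true" only keeps a script going after the expected failure.
git merge iss53 || true

git status --short   # "UU index.html": unmerged in both branches
cat index.html       # HEAD version, then =======, then iss53 version,
                     # wrapped in <<<<<<< HEAD and >>>>>>> iss53 markers

# Fix the file by hand (here: overwrite it), then stage and commit.
echo "<h1>Welcome, contact us</h1>" > index.html
git add index.html
git commit -q -m "Merge iss53, resolving heading conflict"
```

Staging the fixed file with git add is how you tell git that the conflict in that file is resolved.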
Please see the lecture video for a demonstration
Your git repo can be hosted remotely on a server (e.g., on GitHub)
Useful for collaboration (e.g., with a research group, company, etc.)
Once a remote is set up, you and your team can push/pull data to/from it
Basic commands
git fetch <remote> : retrieves data from <remote>
  Note: this only downloads data. You still have to merge it.
git pull : automatically fetch and merge data from a remote branch
  Note: your repo must be tracking the remote branch. See here for more information
git push <remote> <branch> : upload changes to the remote repo
  Uploads the changes in your branch <branch> to the remote <remote>
Note: the term “remote” does not necessarily mean that the repo is not on your local machine! It is just a repo that is not the one you are currently working in.
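Because a remote can live on the same machine, the commands above can be tried without a server. The sketch below uses a bare repository shared.git as the remote and two working copies, alice and bob; all of these names, the file contents, and the identities are made up for illustration, and git init -b assumes Git 2.28 or newer.

```shell
base=$(mktemp -d)
cd "$base"

# A bare repository (no working files) plays the role of the remote.
git init -q --bare -b master shared.git       # -b requires Git 2.28+

# First collaborator: create a repo, register the remote, and push.
git init -q -b master alice
cd alice
git config user.name "Alice Example"          # placeholder identity
git config user.email "alice@example.com"     # placeholder identity
echo "analysis v1" > analysis.txt
git add analysis.txt
git commit -q -m "Start analysis"
git remote add origin "$base/shared.git"
git push -q -u origin master    # -u makes master track origin/master
cd ..

# Second collaborator: clone, commit, and push back.
git clone -q shared.git bob
cd bob
git config user.name "Bob Example"            # placeholder identity
git config user.email "bob@example.com"       # placeholder identity
echo "analysis v2" >> analysis.txt
git commit -q -a -m "Extend analysis"
git push -q origin master
cd ..

# Back in the first repo: fetch only downloads, pull also merges.
cd alice
git fetch -q origin
git log --oneline master..origin/master   # the commit not yet merged
git pull -q                               # fetch + merge into master
```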
Please see the lecture video for a demonstration
When in doubt:
1. git add
2. git commit
3. git pull
4. git push
Use git status liberally and at any time.
Sticking to these commands and this order will keep you out of trouble, but you’re better off reading the documentation and making sure you understand what’s going on under the hood.
Of course, we have only scratched the surface of the tools available in git
But you now know more than enough to work on basic projects
To learn more, I recommend Pro Git by Scott Chacon and Ben Straub
  Available for free at https://git-scm.com/book/en/v2
Other resources:
  “Everyday git” quick guide: https://git-scm.com/docs/giteveryday
  Documentation (also available through man git): https://git-scm.com/docs
  Version Control with Git by J. Loeliger (2009), O’Reilly