Unix commands for beginners (D. Puthier, TAGC/Inserm, U1090)


slide-1
SLIDE 1

Unix commands for beginners

  • D. Puthier, TAGC/Inserm, U1090, denis.puthier@univ-amu.fr
  • Matthieu Defrance, ULB, matthieu.dc.defrance@ulb.ac.be
  • Stéphanie Le gras, Igbmc, slegras@igbmc.fr
  • Christophe Blanchet, IFB, Christophe.BLANCHET@france-bioinformatique.fr

slide-2
SLIDE 2

MATE Desktop

Demo: quick overview.

Installation: http://www.france-bioinformatique.fr/?q=fr/core/cellule-infrastructure/documentation-cloud
Dashboard: https://cloud.france-bioinformatique.fr/cloud/instance/

slide-3
SLIDE 3

The terminal…

Demo: type ‘ls’ in the terminal (list files).

# list files
root@vm: ls

slide-4
SLIDE 4
How can I speak to the terminal?

  • Answer: you can speak in BASH (Bourne Again Shell) *

○ BASH is one of numerous shell dialects (ksh, csh, zsh, ...).
○ All these shell languages are extremely similar.
○ These languages are based on commands.
○ These modular commands allow one to perform tasks.

* A reference (and a pun) to the Bourne shell, written by Stephen Bourne :)

slide-5
SLIDE 5

Command prototype(s) (1)

  • One command performs one task (sort, select, open, align reads, ...).
  • A command has arguments, which may be optional and modify the way it works.
  • These arguments may take values.
  • Most of the time an instruction (command line) starts with a command name (or the path to the command).
  • In the example below, ‘-v’ is read ‘minus v’.

# An argument without any associated value.
# Depending on the command, -v means verbose, version (or something else).
fastqc -v
# An argument with an associated value.
man -k jpeg
slide-6
SLIDE 6

Command prototype(s) (2)

  • Most of the time arguments can be written in their short or long form (the long form is more explicit and more readable).
  • Long forms are generally preceded by ‘--’ (for instance, ‘--apropos’ is read ‘minus minus apropos’).

# Long form without any associated value.
fastqc --version
# Long form with an associated value.
man --apropos jpeg
slide-7
SLIDE 7

Getting help!

Call your friends or, better, use man (manual).

# Demo
root@vm: man ls    # getting help about ls
root@vm: man man   # getting help about man
...

Help shortcuts:
/foo : search for ‘foo’.
n : (next) next occurrence of ‘foo’.
N : (previous) previous occurrence of ‘foo’.
q : quit the help page.

slide-8
SLIDE 8

Our first command: ls

slide-9
SLIDE 9
The ls command and some of its arguments

  • ls can take several arguments.
  • Main arguments:

○ -l : (long) show detailed information.
○ -a : (all) show all files, including hidden files*.
○ -1 : show results in one column.
○ -t : (time) sort results by date/time.
○ -r : (reverse) reverse the sort order.

  • One can combine arguments:

○ ls -l -a
○ ls -la

* Under Linux, hidden file names start with a ‘.’ (e.g. ‘.thehiddenfile.txt’).

slide-10
SLIDE 10

The ls command and some of its arguments

# Demo
root@vm: ls        # list files
root@vm: ls -a     # list files including hidden files *
root@vm: ls -l     # get detailed information about files
root@vm: ls -1     # list files (one column)
root@vm: ls -t     # list files by modification date **
# Combining arguments
root@vm: ls -rtl   # detailed info, sorted by date, in reverse order

* WARNING: mind the spaces. An instruction should start with a command name. The ‘ls-a’ command does not exist!

** Default sorting is case-sensitive.
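The flags above can be tried safely in a scratch directory. The sketch below is only an illustration: the directory comes from mktemp and the file names are made up, none of them belong to the course material.

```shell
# Work in a throwaway directory so nothing in your home is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create two visible files and one hidden file (hypothetical names).
touch b.txt a.txt .hidden.txt

# Plain ls skips hidden files; -1 prints one name per line.
visible=$(ls -1)
echo "$visible"

# -a adds hidden files, plus the special entries . and ..
all=$(ls -1 -a)
echo "$all"

# Short flags can be merged: -1a is the same as -1 -a.
merged=$(ls -1a)
echo "$merged"
```

Note that the space between ls and its flags is what lets the shell tell the command name from its arguments.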

slide-11
SLIDE 11

Create directories and files

slide-12
SLIDE 12

File system tree

  • The file system can be viewed as a tree in which nodes are directories or files.
  • This tree has a root: /
  • The root folder (/) contains:

○ A folder named root and various additional folders*
■ On the IFB machine, your root folder (/root) contains a Documents folder

* On the IFB VM, you are root (the sysadmin); this is a particular case.

slide-13
SLIDE 13
How should I refer to a file/directory?

  • 1) By specifying the path from the root: absolute path.

e.g. /root/Documents, /root/Music

  • 2) By referring to the current location (the working/current directory): relative path.

slide-14
SLIDE 14

Syntax for relative path

# The upper directory relative to the working directory
..
# Two directories up
../..
# Three
../../..
# The current working directory
./
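To see how these relative paths resolve, here is a small sketch that builds a two-folder tree in a temporary location and moves around in it. The tree lives under a mktemp directory, which is an assumption made for safety; on the course VM the same steps would apply under /root.

```shell
# Build a small tree in a throwaway directory (hypothetical layout).
base=$(mktemp -d)
mkdir "$base/Documents" "$base/Music"

cd "$base/Music" || exit 1
cd ..                      # one level up: back in $base
up_one=$(pwd)

cd ./Documents || exit 1   # ./ means "starting from here"
cd ../Music || exit 1      # up one level, then down into Music
sibling=$(pwd)

echo "$up_one"
echo "$sibling"
```

The last two cd commands show that a relative path is always interpreted from wherever you currently are, which is why pwd is worth checking first.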

slide-15
SLIDE 15

File system: Demo

root@vm: pwd                  # the current working directory (/root)
root@vm: cd /root/Documents   # we go into Documents
root@vm: pwd                  # /root/Documents
root@vm: cd ..                # go up one level (/root)
root@vm: cd /root/Music       # go to the Music folder
root@vm: pwd                  # /root/Music
root@vm: cd ../..             # go to the root of the file system
root@vm: ls                   # you should see the root directory
root@vm: cd /root/Music       # let’s go to the /root/Music directory
root@vm: cd ../Documents      # and to the Documents folder

pwd (print working directory); cd (change directory). *

* Use completion (tab key) for files, directories and commands.

slide-16
SLIDE 16
File system: some hints

  • If you are the root user, your data are stored in /root

○ i.e. your ‘user directory’ or home.

  • ~ (tilde) stands for the path to your home (same as $HOME).

root@vm: cd /                # at the root
root@vm: pwd                 # /
root@vm: cd ~/Documents      # the Documents directory of your home folder
root@vm: cd ~                # go to your home dir
root@vm: cd /usr/local/bin   # go to /usr/local/bin
root@vm: ls ~                # list files in your ‘home’ directory
root@vm: cd ~/Music          # go to the Music folder inside your home dir
root@vm: cd                  # == cd ~

slide-17
SLIDE 17
Make directories

  • We will use the mkdir (make directory) command.

root@vm: mkdir projet_roscoff       # create a directory
root@vm: cd ./projet_roscoff        # == cd projet_roscoff *
root@vm: mkdir rna-seq              # let’s create a folder
root@vm: mkdir chip-seq dna-seq     # and several sub-folders
root@vm: ls -1                      # list files and folders
root@vm: cd chip-seq                # == cd ./chip-seq
root@vm: pwd                        # the current working dir
root@vm: cd ../..                   # go back home

* ./ is optional most of the time
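As a complement to the step-by-step demo, mkdir also accepts -p, which creates missing parent directories in one go. This flag is not used in the slides; the sketch below is an optional shortcut, run in a mktemp directory rather than in the real home folder.

```shell
# Use a throwaway directory instead of the real home (assumption for safety).
base=$(mktemp -d)
cd "$base" || exit 1

# -p creates missing parent directories and does not fail if they already exist.
mkdir -p projet_roscoff/rna-seq projet_roscoff/chip-seq projet_roscoff/dna-seq

# List the sub-folders, one per line.
created=$(ls -1 projet_roscoff)
echo "$created"
```

Running the mkdir -p line a second time is harmless, whereas a plain mkdir would complain that the directory exists.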

slide-18
SLIDE 18

Hands on

  • 1) Go to ~/projet_roscoff/chip-seq
  • 2) From this directory, create a directory named annotations in ~/projet_roscoff/
  • Go inside the annotations directory.
  • Check that you are in the right place.
  • Go back home.
slide-19
SLIDE 19

Hands on

  • 1) Go to ~/projet_roscoff/chip-seq
  • 2) From this directory, create a directory named annotations in ~/projet_roscoff/
  • Go inside the annotations directory.
  • Check that you are in the right place.
  • Go back home.

# Solution
root@vm: cd ~/projet_roscoff/chip-seq
root@vm: mkdir ../annotations
root@vm: cd ../annotations
root@vm: pwd   # should print /root/projet_roscoff/annotations
root@vm: cd

slide-20
SLIDE 20

Manipulate files

slide-21
SLIDE 21
Download and uncompress files

  • We will use the wget command to download files.
  • To uncompress, we will use gunzip if the file was compressed with the gzip algorithm (extension .gz).

root@vm: cd ~/projet_roscoff/annotations   # move into annotations
# Download the file
root@vm: wget http://pedagogix-tagc.univ-mrs.fr/courses/data/roscoff/hg19_exons.bed.gz
root@vm: ls                                # the compressed file
root@vm: gunzip hg19_exons.bed.gz          # uncompress it
root@vm: ls                                # the file has lost its .gz extension

slide-22
SLIDE 22

The hg19_exons.bed file

Contains the coordinates (start/end) of human exons in BED format.

BED format (BED6) ( http://genome.ucsc.edu/FAQ/FAQformat.html#format1 ) *
Tab-delimited format (how can you check that?)
Columns: Chromosome, Start, End, Name, Score, Strand (others…)

* Start and End positions are always given relative to the 5’/3’ orientation of the + strand. Coordinates are ‘zero-based, half-open’.
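The ‘zero-based, half-open’ convention means that the length of a region is simply End minus Start. The sketch below illustrates this on a single made-up BED6 line (the exon name and coordinates are hypothetical) and also shows one possible way to check that a line is tab-delimited, by counting fields with a tab separator.

```shell
# A hypothetical BED6 line: chrom, start, end, name, score, strand (tab-separated).
line=$(printf 'chr1\t100\t200\texon_demo\t0\t+')

# With zero-based, half-open coordinates the length is end - start:
# the region covers bases 101 to 200 in 1-based counting, i.e. 100 bases.
length=$(printf '%s\n' "$line" | awk -F'\t' '{print $3 - $2}')
echo "$length"

# Splitting on tabs and counting the fields: 6 is expected for BED6.
nfields=$(printf '%s\n' "$line" | awk -F'\t' '{print NF}')
echo "$nfields"
```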

slide-23
SLIDE 23

Visualising file content

  • With a pager: less or more (they do more or less the same thing).
  • With head or tail, to display the first n or last n lines of a file.
  • With cat, which sends the file content to the screen (<ctrl> + c to cancel).
  • The shortcuts for less are the same as for the man command.

Shortcuts in less:
↑ : go up.
↓ : go down.
< : go to the first line.
> : go to the last line.
/foo : search for ‘foo’.
n : next occurrence of ‘foo’.
N : previous occurrence of ‘foo’.
q : quit.

slide-24
SLIDE 24

Hands on

  • 1) Look at the first ten lines of hg19_exons.bed with head.
  • 2) Look at the last ten lines of hg19_exons.bed with tail.
  • 3) Go through the hg19_exons.bed file with less.
  • 4) Send the file content to the screen with cat.
slide-25
SLIDE 25

Exercises

  • 1) Look at the first ten lines of hg19_exons.bed with head.
  • 2) Look at the last ten lines of hg19_exons.bed with tail.
  • 3) Go through the hg19_exons.bed file with less.
  • 4) Send the file content to the screen with cat.

# Solution
root@vm: head -n 10 hg19_exons.bed
root@vm: tail -n 10 hg19_exons.bed
root@vm: less hg19_exons.bed
root@vm: cat hg19_exons.bed

slide-26
SLIDE 26

Counting lines

This can be done with the wc (word count) command and its -l (line) argument.

root@vm: wc -l hg19_exons.bed   # 484127 exons

slide-27
SLIDE 27
Extract columns

  • Use the cut command with the -f (field) argument.
  • Columns are expected to be tab-separated; otherwise, use the -d (‘delimiter’) argument.

root@vm: cut -f1 hg19_exons.bed     # column 1
root@vm: cut -f1,2 hg19_exons.bed   # columns 1 and 2
root@vm: cut -f3-5 hg19_exons.bed   # columns 3 to 5
root@vm: cut -f3- hg19_exons.bed    # column 3 to the last column
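The -d argument matters as soon as the file is not tab-separated. As a minimal sketch, assuming a small comma-separated file written to a temporary location (the content is made up for the example):

```shell
# A hypothetical comma-separated file, so -d is needed to override the tab default.
tmpfile=$(mktemp)
printf 'chr1,100,200\nchr2,300,400\n' > "$tmpfile"

# -d sets the delimiter, -f selects the fields.
first=$(cut -d',' -f1 "$tmpfile")
echo "$first"

# Ranges work the same way as with tab-separated input.
coords=$(cut -d',' -f2-3 "$tmpfile")
echo "$coords"
```

The selected fields are printed with the same delimiter as the input, here a comma.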

slide-28
SLIDE 28
Sort a file

  • One should use the sort command (alphabetical sorting by default).
  • -k (key), e.g.:

○ -k1,1: sort by column 1.
○ -k2,2nr: sort by column 2, numerically, in reverse order.
○ -k2,2g: sort by column 2 (general numeric sorting, e.g. for decimal or scientific notation).

Example: sort hg19_exons.bed by chromosome, then by start coordinate (numerically, in decreasing order because of the r flag):

root@vm: sort -k1,1 -k2,2nr hg19_exons.bed
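The difference between alphabetical and numeric keys is easier to see on a tiny file. The sketch below uses four made-up lines and sorts start coordinates in increasing order (n instead of nr), which is an illustrative choice rather than the command from the slide.

```shell
# Four hypothetical tab-separated lines: chrom, start, end.
tmpbed=$(mktemp)
printf 'chr2\t50\t80\nchr1\t200\t300\nchr1\t1000\t1100\nchr1\t30\t40\n' > "$tmpbed"

# Chromosome alphabetically, then start numerically in increasing order.
sorted=$(sort -k1,1 -k2,2n "$tmpbed" | cut -f1,2)
echo "$sorted"

# Without -n, numbers are compared as text: 1000 comes before 200 and 30.
as_text=$(cut -f2 "$tmpbed" | sort)
as_number=$(cut -f2 "$tmpbed" | sort -n)
echo "$as_text"
echo "$as_number"
```

Restricting each key to a single column (1,1 and 2,2) keeps the second key from being ignored when the first one ties.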

slide-29
SLIDE 29

Redirections

slide-30
SLIDE 30

Command pipes

  • Standard input: a file or text stream.
  • Standard output: the screen by default.
  • Standard error: may be captured for logging purposes.

[Diagram: three ‘Command’ boxes, each with its Input, Output and Error streams.]
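The three streams can be observed by sending standard output and standard error to two different files. This is a sketch under assumptions: the file names are invented, and 2> is the usual shell syntax for redirecting standard error, which the slides do not show explicitly.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch exists.txt

# ls writes the name of the existing file to standard output and a complaint
# about the missing one to standard error; > and 2> capture them separately.
ls exists.txt missing.txt > out.log 2> err.log || echo "ls reported a problem (exit status $?)"

cat out.log    # the normal output: exists.txt
cat err.log    # the error message (exact wording depends on the system)
```

Keeping errors in their own file means a later command reading out.log never sees the error text mixed in with the data.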

slide-31
SLIDE 31

Demo: command pipes

# Get the list of chromosomes present in the file
root@vm: cut -f1 hg19_exons.bed | sort | uniq      # the non-redundant list of chromosomes
# Get the list of chromosomes present in the file and the number of lines for each
root@vm: cut -f1 hg19_exons.bed | sort | uniq -c   # -c for ‘count’
# Count the number of non-coding transcripts (names containing ‘NR_’)
root@vm: cut -f4 hg19_exons.bed | grep "NR_" | sort | uniq | wc -l   # 11675

Note: the uniq command removes duplicates from a sorted file.
Note: the grep command searches for a character string.
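The same pipes can be checked by hand on a file small enough to count by eye. The six lines below are invented for the example (they are not taken from hg19_exons.bed); the trailing tr only strips the padding that some wc implementations add.

```shell
# A hypothetical six-exon file: chrom, start, end, transcript name.
tmpbed=$(mktemp)
printf 'chr1\t10\t20\tNM_001\nchr1\t30\t40\tNM_001\nchr2\t5\t15\tNR_002\nchr2\t25\t35\tNR_002\nchr2\t50\t60\tNR_003\nchr10\t1\t9\tNM_004\n' > "$tmpbed"

# Non-redundant chromosome list: uniq only collapses adjacent duplicates, hence the sort.
chroms=$(cut -f1 "$tmpbed" | sort | uniq)
echo "$chroms"

# Number of lines per chromosome.
cut -f1 "$tmpbed" | sort | uniq -c

# Number of distinct non-coding transcripts (names containing NR_).
n_noncoding=$(cut -f4 "$tmpbed" | grep "NR_" | sort | uniq | wc -l | tr -d ' ')
echo "$n_noncoding"
```

Each command only sees the output of the previous one, so a pipe can be built and checked one stage at a time.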

slide-32
SLIDE 32

Exercises (graded)

  • How many exons are there on chromosome 22?
  • What is the most frequent chrom-start-end tuple?

○ i.e. the most frequent exon.

slide-33
SLIDE 33

Exercises (graded)

  • How many exons are there on chromosome 22?
  • What is the most frequent chrom-start-end tuple?

○ i.e. the most frequent exon.

# Solution
root@vm: grep -w chr22 hg19_exons.bed | wc -l   # n = 259
root@vm: cut -f1-3 hg19_exons.bed | sort | uniq -c | sort -n | tail -n 1   # 77 chrY

slide-34
SLIDE 34
Exercise

  • What fraction of the genome is covered by exons?

○ We must perform the operation below.

[Figure: schematic of exons along the genome.]

  • Let’s see how to do that...

slide-35
SLIDE 35

Using Bedtools

slide-36
SLIDE 36
Bedtools

  • A tool for performing arithmetic operations on genomic coordinates.

○ http://bedtools.readthedocs.org/en/latest/content/overview.html

  • Some example usages:

○ Extend/slop regions.
○ Compare regions (intersect).
○ Merge regions.
○ Format conversion.
○ …

  • The bedtools command comes with a set of sub-commands.

slide-37
SLIDE 37
Exercise with bedtools

  • Use bedtools with the -h argument.

○ What do you see?

  • Ask for help about the merge sub-command (bedtools merge -h).

○ Look at the arguments.
○ Read the note at the end of the help message. Why is it important?

slide-38
SLIDE 38
Exercise with bedtools

  • Use bedtools with the -h argument.

○ What do you see?

  • Ask for help about the merge sub-command (bedtools merge -h).

○ Look at the arguments.
○ Read the note at the end of the help message. Why is it important?

  • Solution

root@vm: bedtools -h         # the full set of sub-commands
root@vm: bedtools merge -h   # use the -i argument
# The note says that genomic regions must be sorted beforehand.

slide-39
SLIDE 39
Exercise

  • Use bedtools sort and bedtools merge to merge overlapping regions/exons.

slide-40
SLIDE 40
Exercise

  • Use bedtools sort and bedtools merge to merge overlapping regions/exons.

# Solution
root@vm: bedtools sort -i hg19_exons.bed | bedtools merge

slide-41
SLIDE 41
How to save results to a file?

  • Use the > redirection operator.

○ It overwrites the file if it already exists.

  • >> can be used to append lines to an existing file.

root@vm: bedtools sort -i hg19_exons.bed | bedtools merge > hg19_exons_merged.bed
root@vm: ls   # a new file was created
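The difference between > and >> is easy to verify with echo. In this sketch the file name results.txt is arbitrary and everything happens in a temporary directory.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "first"  > results.txt    # creates the file
echo "second" > results.txt    # > overwrites: "first" is gone
echo "third" >> results.txt    # >> appends a line

content=$(cat results.txt)
echo "$content"
```

Because > silently replaces existing content, it is worth checking with ls before redirecting to a file name that may already be in use.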

slide-42
SLIDE 42

Some arithmetic with awk

slide-43
SLIDE 43

Awk

  • Awk is a command available on most Linux systems.
  • Awk has its own language.
  • Awk lets you write one-liners (and more).
  • The prototype of an awk command is the following:

awk 'BEGIN{action} {action} END{action}' file

  • Each set of braces is associated with a particular task:

BEGIN{before opening the file} {for each line} END{after reading all lines}

slide-44
SLIDE 44

Awk

  • Awk has special variables.
  • Examples:

FS: Field Separator.
OFS: Output Field Separator.
NR: Number of Records (the number of the current line).
NF: Number of Fields (in the current line).
$0: the current line.
$1, $2, $3 (...): columns 1, 2 or 3 (...) of the current line.
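These variables can be seen in action on two made-up lines piped into awk. The semicolon used as OFS is an arbitrary choice, picked only to make the output separator visible.

```shell
# Two hypothetical tab-separated lines: chrom, start, end.
out=$(printf 'chr1\t10\t20\nchr2\t30\t45\n' |
  awk 'BEGIN{FS="\t"; OFS=";"} {print NR, NF, $1, $3 - $2}')

# Each output line holds: line number, field count, column 1, end - start.
echo "$out"
```

The commas in print are what insert OFS between values; writing print NR NF instead would glue the two numbers together.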

slide-45
SLIDE 45

Example

# Print columns 2 and 1
# \t is the tab character
root@vm: awk 'BEGIN{FS="\t"}{print $2,$1}' hg19_exons.bed
# Print columns 2 and 1 with tab-separated output
root@vm: awk 'BEGIN{FS=OFS="\t"}{print $2,$1}' hg19_exons.bed
# Print columns 2 and 1 with tab-separated output and the line number
root@vm: awk 'BEGIN{FS=OFS="\t"}{print NR,$2,$1}' hg19_exons.bed
# Compute end - start (the exon length) for each line
root@vm: awk 'BEGIN{FS=OFS="\t"}{print $3-$2}' hg19_exons.bed

slide-46
SLIDE 46

Exercise: compute the total length of the fragments (awk)

slide-47
SLIDE 47

Exercise: compute the total length of the fragments (awk)

# At each line, compute the cumulative sum of the fragment sizes
# Note that ‘;’ separates instructions
# s is a variable that we initialise to 0
root@vm: awk 'BEGIN{FS="\t"; s=0}{s=s+$3-$2; print s}' hg19_exons_merged.bed   # last value: 75861726
# Or, printing only the final sum
root@vm: awk 'BEGIN{FS="\t"; s=0}{s=s+$3-$2}END{print s}' hg19_exons_merged.bed
# Get your calculators out (you can use R).
# 75861726/3.2e9*100
# ~ 2.37 % of the genome is covered
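The summing logic can be verified on three invented merged regions whose total is easy to compute mentally, and the final percentage can also be obtained with awk itself instead of R. The genome size of 3.2e9 and the total of 75861726 are the figures quoted in the exercise; the toy regions are assumptions.

```shell
# Three hypothetical merged regions of 100, 250 and 50 bases.
total=$(printf 'chr1\t0\t100\nchr1\t150\t400\nchr2\t10\t60\n' |
  awk 'BEGIN{FS="\t"; s=0} {s = s + $3 - $2} END{print s}')
echo "$total"

# The percentage from the exercise, computed in a BEGIN block (no input file needed).
pct=$(awk 'BEGIN{printf "%.2f\n", 75861726 / 3.2e9 * 100}')
echo "$pct"
```

Summing only makes sense on the merged file: on the raw exon file, overlapping exons would be counted several times.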

slide-48
SLIDE 48

Going further with awk

awk 'BEGIN{} pattern {} END{}' file

  • The prototype of an awk command can be extended by adding ‘patterns’ (selectors or criteria).
  • The criterion can be a regular expression (see below) or a logical expression.

# Example: test whether a equals b; the action runs if true.
awk 'a == b {} END{}' file
# Example: print the line if column 1 matches a regular expression.
awk '$1 ~ /regExp/ {print}' file

slide-49
SLIDE 49

Examples with patterns

# The first line
root@vm: awk 'NR == 1 {print}' hg19_exons_merged.bed
# Lines 2 to 10
root@vm: awk 'BEGIN{OFS="\t"} NR >= 2 && NR <= 10 {print NR, $0}' hg19_exons_merged.bed
# Lines whose column 1 contains the string ‘chr19’
root@vm: awk '$1 ~ /chr19/ {print}' hg19_exons_merged.bed
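Both kinds of pattern, logical and regular expression, can be tried on a four-line toy file. The file content is made up, and counting matches in an END block is an illustrative variation on the print action used above.

```shell
# Four hypothetical tab-separated lines: chrom, start, end.
tmpbed=$(mktemp)
printf 'chr1\t10\t20\nchr19\t30\t40\nchr2\t50\t60\nchr19\t70\t90\n' > "$tmpbed"

# Logical pattern: lines 2 to 3 only, prefixed with their line number.
middle=$(awk 'BEGIN{OFS="\t"} NR >= 2 && NR <= 3 {print NR, $0}' "$tmpbed")
echo "$middle"

# Regular-expression pattern on column 1, counting the matching lines.
n_chr19=$(awk '$1 ~ /chr19/ {n++} END{print n}' "$tmpbed")
echo "$n_chr19"
```

The action in braces runs only for lines where the pattern is true, so no explicit if is needed.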

slide-50
SLIDE 50

Regular expressions

  • They describe a pattern in a character string.

.        any single character
[a-z]    a lowercase letter (range, e.g. [u-w])
[A-Z]    an uppercase letter (range, e.g. [U-W])
[ABc]    A or B or c
[^ABab]  any character other than A, B, a and b
^        start of line
$        end of line
x*       the character x, 0 to n times
x+       the character x, 1 to n times
x{n,m}   the character x repeated n to m times

slide-51
SLIDE 51

Examples

\.txt$            any string ending with ".txt"
^[A-Z]            a string starting with an uppercase letter
^.{4,6}\.txt$     four to six characters followed by ".txt"
^[A-Z].*\.txt$    a string starting with an uppercase letter and ending with ".txt"
^$                an empty string
^[^0-9]*\.sh$     a string containing no digit and ending with ".sh"
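These expressions can be tested against a short list of invented file names with grep -E (extended regular expressions). Setting LC_ALL=C is an assumption made here so that [A-Z] means exactly the ASCII uppercase letters, whatever the system language settings are.

```shell
# Make character ranges such as [A-Z] behave the same on every system.
export LC_ALL=C

# A hypothetical list of file names, one per line.
names=$(printf 'Report.txt\nnotes.txt\na.txt\nrun.sh\nrun2.sh')

# Ends with ".txt"
txt=$(printf '%s\n' "$names" | grep -E '\.txt$')

# Starts with an uppercase letter and ends with ".txt"
upper_txt=$(printf '%s\n' "$names" | grep -E '^[A-Z].*\.txt$')

# Four to six characters followed by ".txt"
short_txt=$(printf '%s\n' "$names" | grep -E '^.{4,6}\.txt$')

# No digit anywhere before the final ".sh"
nodigit_sh=$(printf '%s\n' "$names" | grep -E '^[^0-9]*\.sh$')

echo "$txt"; echo "$upper_txt"; echo "$short_txt"; echo "$nodigit_sh"
```

The backslash before the dot matters: without it, the dot would match any character and a name such as atxt would also be accepted by the first pattern.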

slide-52
SLIDE 52

Exercise

  • Using grep (global regular expression print), build a regular expression that retrieves, from the hg19_exons_merged.bed file, the lines whose column 1 contains chr1, chr2 or chr9 (and no other chromosome, whatever the file may contain).
  • Using awk and a pattern, build a regular expression that retrieves, from the hg19_exons_merged.bed file, the lines whose column 1 contains chr1, chr2 or chr9 (and nothing else, whatever the file may contain).

slide-53
SLIDE 53

Solutions

# grep: chr1, chr2 and chr9 (and nothing else!)
# Note the use of -P (Perl) to get an extended regular expression language,
# useful here to handle \t. Does not work on Mac.
root@vm: grep -P "^chr[129]\t" hg19_exons_merged.bed
# awk: chr1, chr2 and chr9 (and nothing else!)
root@vm: awk '$1 ~ /^chr[129]$/ {print}' hg19_exons_merged.bed
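Where -P is not available, one possible workaround is to build a literal tab with printf and use grep -E. The sketch below targets chr1, chr2 and chr9 as in the exercise statement, on a made-up file that also contains chr10 and chr19 to show why both ends of the name must be anchored.

```shell
# Five hypothetical lines; chr10 and chr19 must NOT be selected.
tmpbed=$(mktemp)
printf 'chr1\t10\t20\nchr10\t30\t40\nchr2\t50\t60\nchr9\t70\t80\nchr19\t90\t99\n' > "$tmpbed"

# A literal tab character, for grep versions without -P.
tab=$(printf '\t')
with_grep=$(grep -E "^chr[129]${tab}" "$tmpbed" | cut -f1)
echo "$with_grep"

# The awk form anchors both ends of column 1 instead.
with_awk=$(awk -F'\t' '$1 ~ /^chr[129]$/ {print $1}' "$tmpbed")
echo "$with_awk"
```

Dropping the tab (or the $ in awk) would let chr10 and chr19 through, since both start with chr1.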

slide-54
SLIDE 54

Thank you