UNIX Course Part II Working with files



slide-1
SLIDE 1

UNIX Course Part II Working with files

Andy Hauser LAFUGA & Chair of Animal Breeding and Husbandry Gene Center Munich LMU June, 2016


slide-2
SLIDE 2

Recall

ls            list files, information about files
cd            change working directory
mkdir         make directory
whoami; id    information about user
groups; id    information about group memberships
df -h         information about disks
man           manual pages

slide-3
SLIDE 3

What is an Operating System?

[Diagram: layers of an operating system]

Hardware: CPU, RAM, Disks, Net, Keyboard, Mouse

Communication

Kernel: Drivers, Resource Management, User ABI/API

User interface: Command Line Interface, Desktop Environment

slide-4
SLIDE 4

stdin, stdout, stderr

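A command has three standard streams: stdin (0), stdout (1), stderr (2)

>      redirect stdout to a file, create / overwrite
>>     redirect stdout to a file, append
2>     redirect stderr to a file, create / overwrite
2>>    redirect stderr to a file, append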
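A minimal sketch of the four redirections, run in a throwaway directory. The function speak and the file names out.log and err.log are hypothetical, chosen only for this illustration.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper: writes one line to stdout and one line to stderr.
speak() {
    echo "to stdout"
    echo "to stderr" >&2
}

speak > out.log 2> err.log      # create / overwrite both files
speak >> out.log 2>> err.log    # append a second line to each file

cat out.log    # two lines reading "to stdout"
cat err.log    # two lines reading "to stderr"
```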

slide-5
SLIDE 5

Connecting commands with pipes

With pipes:

$ command1 | command2 | command3

With temporary files:

$ command1 > /tmp/file1
$ command2 < /tmp/file1 > /tmp/file2
$ command3 < /tmp/file2

Advantages:

  • Less typing
  • No temporary files
  • Data is shared in memory (fast!)
  • Submit together and forget
  • Commands run in parallel

The pipe "|" connects stdout of one command to stdin of the next command.
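A small sketch comparing both variants on made-up input. The word list and the variable names are assumptions for illustration; both variants should produce the same result.

```shell
# Variant with temporary files: each step finishes before the next one starts.
tmp1=$(mktemp)
tmp2=$(mktemp)
tmp3=$(mktemp)
printf 'pear\napple\npear\nfig\n' > "$tmp1"
sort < "$tmp1" > "$tmp2"
uniq < "$tmp2" > "$tmp3"
with_files=$(cat "$tmp3")
rm -f "$tmp1" "$tmp2" "$tmp3"

# Variant with pipes: same three steps, no files to create or clean up.
with_pipes=$(printf 'pear\napple\npear\nfig\n' | sort | uniq)

echo "$with_pipes"
```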

slide-6
SLIDE 6

echo

$ echo foo
foo
$ echo foo bar
foo bar
$ echo foo    bar
foo bar
$ echo "foo    bar"
foo    bar
$ echo -n "foo bar"
foo bar$
$ echo hello > hello
$ echo world >> world

Arguments are separated by one or more spaces. An argument that contains spaces must be protected with quotes.

-n suppresses the newline at the end.

>  : stdout goes to a file, creating / overwriting it
>> : stdout goes to a file, appending to it
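A short sketch of the points above. The file names greeting and no_newline are made up. printf is used in place of echo -n for the last step, because the handling of -n is not the same in every shell.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo foo    bar        # two arguments, printed with a single space between them
echo "foo    bar"      # one argument, the inner spaces are preserved

echo hello > greeting      # creates greeting (or overwrites it)
echo world >> greeting     # appends a second line
echo again > greeting      # overwrites: only "again" remains

# printf '%s' prints its argument without a trailing newline.
printf '%s' "no newline" > no_newline
wc -l < no_newline         # 0, because the file contains no newline character
```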

slide-7
SLIDE 7

cat - concatenate files

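$ cat hello
hello
world
$ echo hello > hello
$ cat hello
hello
$ cat < hello
$ echo world > world
$ cat hello world
hello
world
$ cat < hello < world

Several input redirections on one command (cat < hello < world) are not necessarily supported by a shell: zsh reads both files, while bash and many other shells use only the last redirection.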
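A minimal sketch of the portable forms, with file names as on the slide plus the hypothetical output files both and copy:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo hello > hello
echo world > world

cat hello world > both     # concatenate two files given as arguments
cat < hello > copy         # read a single file from stdin
cat both
```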

slide-8
SLIDE 8

Creating files

$ touch touch.whatever
$ echo abc > abc.txt
$ cp abc.txt abc.copy.txt
$ ls -l abc.txt
-rw-rw-r-- 1 andy staff 4 13 Jun 14:32 abc.txt

Note that default permissions are influenced by umask.
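A sketch of the umask effect, assuming a file system without default ACLs. The file names are made up. touch asks for rw-rw-rw- and the umask removes bits from that request.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Each case runs in a subshell so the changed umask does not leak out.
( umask 022; touch open.txt )       # removes w for group and others
( umask 077; touch private.txt )    # removes everything for group and others

# Characters 1-10 of ls -l are the file type and the nine permission bits.
ls -l open.txt    | cut -c 1-10     # -rw-r--r--
ls -l private.txt | cut -c 1-10     # -rw-------
```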

Removing files

$ rm touch.whatever

slide-9
SLIDE 9

File Metadata

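$ ls -l abc.txt
-rw-rw-r-- 1 andy staff 4 13 Jun 14:32 abc.txt

-rw-rw-r--      Permissions (first character is the file type)
1               Number of hard links
andy            User (owner)
staff           Group
4               Size in bytes
13 Jun 14:32    Modification date
abc.txt         Filename

The nine permission characters form three groups:

rwx     rwx     rwx
User    Group   Others

The closest match applies: the owner is checked against the User bits, other members of the file's group against the Group bits, everyone else against the Others bits. E.g. rwx---rwx denies access to every group member (other than the owner) but not to others.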

slide-10
SLIDE 10

chmod - changing permissions

$ ls -l world
-rw-rw-r-- 1 andy staff 6 13 Jun 15:06 world
$ chmod a-rwx world
$ ls -l world
---------- 1 andy staff 6 13 Jun 15:06 world
$ chmod o+r world
$ ls -l world
-------r-- 1 andy staff 6 13 Jun 15:06 world
$ chmod ug+rwx world
$ ls -l world
-rwxrwxr-- 1 andy staff 6 13 Jun 15:06 world
$ chmod g-w world
$ ls -l world
-rwxr-xr-- 1 andy staff 6 13 Jun 15:06 world
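The same sequence as a script that can be re-run safely in a temporary directory. The helper perms is hypothetical. The last two lines use octal notation, which the slide does not cover: each class is the sum of r=4, w=2, x=1.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch world

# Hypothetical helper: print only the type and permission characters.
perms() { ls -l "$1" | cut -c 1-10; }

chmod a-rwx world;  perms world    # ----------
chmod o+r world;    perms world    # -------r--
chmod ug+rwx world; perms world    # -rwxrwxr--
chmod g-w world;    perms world    # -rwxr-xr--

# Octal notation sets all nine bits at once.
chmod 754 world;    perms world    # -rwxr-xr--
chmod 664 world;    perms world    # -rw-rw-r--
```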
slide-11
SLIDE 11

copying files

$ cp abc.txt abc.copy.txt
$ scp abc.txt housedata:abc.backup.txt
$ cp -r foo/ bar/
$ scp -r foo/ housedata:bar/
$ rsync -av foo/ housedata:bar/

Annotations on the slide:

bar/ needs to exist!
bar/ will be created
can copy over SSH (scp, rsync)
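Whether the target directory already exists changes the result of a recursive copy. A local-only sketch with cp follows; the remote host housedata is not needed for it. The source is written without a trailing slash, because cp implementations differ in how they treat foo/.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir foo
echo abc > foo/abc.txt

cp -r foo bar      # bar does not exist yet: bar becomes a copy of foo
cp -r foo bar      # bar exists now: foo is copied INTO it as bar/foo

find bar -type f | sort
```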

slide-12
SLIDE 12

shell globbing *

$ echo *
.CFUserTextEncoding .DS_Store .RData .Rapp.history .Rhistory .Rprofile .Trash .Xauthority .asoundrc .bash_history .cache .config .conkyrc .cordova .cups .cvsrc .devilspierc .emai .gem .gitconfig .gnupg .hgrc .ionic .lesshst .lldb .local .npm .plugman .rnd .rstudio-desktop .screenrc .ssh .subversion .toprc .vim .viminfo .vimrc .wmii .xbindkeysrc .xpdfrc .zcompdump .zshrc .zshrc.local AndroidStudioProjects Attachments Bout2 Desktop Documents Downloads HistTexte.pdf Library MA Movies Music Pictures Public URLS2 admix backup brew-install.rb bta4_bending_andy_filtered_10_logl.png bta4_bending_andy_filtered_logl.png bta4_bending_andy_logl.png bta4_bending_logl.png chess config dot emai github ivica lm.RData src titel-small.png tmp wrk
$ echo */
.Trash/ .cache/ .config/ .cordova/ .cups/ .emai/ .gem/ .gnupg/ .ionic/ .lldb/ .local/ .npm/ .plugman/ .rstudio-desktop/ .ssh/ .subversion/ .vim/ AndroidStudioProjects/ Attachments/ Bout2/ Desktop/ Documents/ Downloads/ Library/ MA/ Movies/ Music/ Pictures/ Public/ admix/ backup/ chess/ dot/ emai/ github/ ivica/ src/ tmp/ wrk/
$ echo .*
.CFUserTextEncoding .DS_Store .RData .Rapp.history .Rhistory .Rprofile .Trash .Xauthority .asoundrc .bash_history .cache .config .conkyrc .cordova .cups .cvsrc .devilspierc .emai .gem .gitconfig .gnupg .hgrc .ionic .lesshst .lldb .local .npm .plugman .rnd .rstudio-desktop .screenrc .ssh .subversion .toprc .vim .viminfo .vimrc .wmii .xbindkeysrc .xpdfrc .zcompdump .zshrc .zshrc.local
$ echo *.png
bta4_bending_andy_filtered_10_logl.png bta4_bending_andy_filtered_logl.png bta4_bending_andy_logl.png bta4_bending_logl.png titel-small.png
$ echo bta4_bending_*
bta4_bending_andy_filtered_10_logl.png bta4_bending_andy_filtered_logl.png bta4_bending_andy_logl.png bta4_bending_logl.png
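A reproducible sketch of the same patterns in an empty temporary directory with made-up file names. Note that in sh and bash with default settings * does not match names starting with a dot; the output above includes them, which suggests the zsh session on the slide had dotfile matching enabled.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch plot_a.png plot_b.png notes.txt .hidden
mkdir data

echo *.png       # plot_a.png plot_b.png
echo plot_*      # plot_a.png plot_b.png
echo */          # data/
echo .h*         # .hidden
echo *           # all names without a leading dot (sh and bash defaults)
```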

slide-13
SLIDE 13

more shell globbing

$ echo ?
zsh: no matches found: ?
$ touch a
$ echo ?
a
$ touch b
$ echo ?
a b
$ echo [a-z]
a b
$ echo [abc]
a b
$ echo [^abc]
zsh: no matches found: [^abc]
$ echo [^b-z]
a

?     matches any one character
[]    matches one character from a set, allowing for ranges like a-z
[^]   matches one character NOT in the set (the POSIX spelling is [!])
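A sketch of the single-character patterns with three made-up files. It uses [!...], the POSIX spelling of negation, so that it also runs in plain sh. When nothing matches, sh and bash pass the pattern through unchanged instead of reporting an error as zsh does above.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a b ab

echo ?          # a b   (ab has two characters, so it does not match)
echo ??         # ab
echo [a-z]      # a b
echo [!b-z]     # a
echo [!ab]      # nothing matches: sh and bash print the pattern itself
```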

slide-14
SLIDE 14

Download a tab-separated file you are familiar with

$ wget ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//gtf/bacteria_86_collection/escherichia_coli_gca_000770055/Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf.gz
--2016-06-14 21:22:13-- ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//gtf/bacteria_86_collection/escherichia_coli_gca_000770055/Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf.gz
=> 'Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf.gz'
Resolving ftp.ensemblgenomes.org... 193.62.197.94
Connecting to ftp.ensemblgenomes.org|193.62.197.94|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.
==> PWD ... done.
==> TYPE I ... done.
==> CWD (1) /pub/release-29/bacteria//gtf/bacteria_86_collection/escherichia_coli_gca_000770055 ... done.
==> SIZE Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf.gz ... 323321
==> PASV ... done.
==> RETR Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf.gz ... done.
Length: 323321 (316K) (unauthoritative)
Escherichia_coli_gca_0007700 100%[=============================================>] 315.74K 64.5KB/s in 4.9s
2016-06-14 21:22:19 (64.5 KB/s) - 'Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf.gz' saved [323321]
$ gunzip Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf.gz
$ ls Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf

slide-15
SLIDE 15

head - print the first lines from a file

$ head Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf
#!genome-build ASM77005v1
#!genome-version GCA_000770055.1
#!genome-date 2014-11
#!genome-build-accession GCA_000770055.1
#!genebuild-last-updated 2014-11
Contig0000020 ena gene 680 1846 . + . gene_id "JQ56_06920"; gene_version "1"; gene_name "nhaA"; gene_source "ena"; gene_biotype "protein_coding";
Contig0000020 ena transcript 680 1846 . + . gene_id "JQ56_06920"; gene_version "1"; transcript_id "KGP19944"; transcript_version "1"; gene_name "nhaA"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "nhaA-1"; transcript_source "ena"; transcript_biotype "protein_coding";
Contig0000020 ena exon 680 1846 . + . gene_id "JQ56_06920"; gene_version "1"; transcript_id "KGP19944"; transcript_version "1"; exon_number "1"; gene_name "nhaA"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "nhaA-1"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "KGP19944-1"; exon_version "1";
Contig0000020 ena CDS 680 1843 . + 0 gene_id "JQ56_06920"; gene_version "1"; transcript_id "KGP19944"; transcript_version "1"; exon_number "1"; gene_name "nhaA"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "nhaA-1"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "KGP19944"; protein_version "1";
Contig0000020 ena start_codon 680 682 . + 0 gene_id "JQ56_06920"; gene_version "1"; transcript_id "KGP19944"; transcript_version "1"; exon_number "1"; gene_name "nhaA"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "nhaA-1"; transcript_source "ena"; transcript_biotype "protein_coding";
$ head -5 Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf
#!genome-build ASM77005v1
#!genome-version GCA_000770055.1
#!genome-date 2014-11
#!genome-build-accession GCA_000770055.1
#!genebuild-last-updated 2014-11
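A small sketch on a made-up 12-line file, to make the default of 10 lines visible. It uses -n 5, the POSIX spelling of the -5 shown above.

```shell
sample=$(mktemp)
# printf repeats its format for every argument, giving "line 1" ... "line 12".
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$sample"

head "$sample"          # first 10 lines (the default)
head -n 5 "$sample"     # first 5 lines
head -n 1 "$sample"     # only the first line
```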

slide-16
SLIDE 16

cut - print columns

$ cut -f 1-7 Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf > ecoli_1-7.gtf
$ head ecoli_1-7.gtf
#!genome-build ASM77005v1
#!genome-version GCA_000770055.1
#!genome-date 2014-11
#!genome-build-accession GCA_000770055.1
#!genebuild-last-updated 2014-11
Contig0000020 ena gene 680 1846 . +
Contig0000020 ena transcript 680 1846 . +
Contig0000020 ena exon 680 1846 . +
Contig0000020 ena CDS 680 1843 . +
Contig0000020 ena start_codon 680 682 . +

Fields can be given as ranges like 1-7 or as a comma-separated list like 1,3,7. The delimiter is by default a TAB (\t); the -d option sets it to nearly any other single character.
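A sketch of ranges, lists and a custom delimiter on two made-up tab-separated lines:

```shell
sample=$(mktemp)
printf 'Contig1\tena\tgene\t680\t1846\n'  > "$sample"
printf 'Contig1\tena\texon\t680\t1846\n' >> "$sample"

cut -f 1-3 "$sample"               # range: columns 1 to 3
cut -f 3,5 "$sample"               # list: columns 3 and 5
echo 'a,b,c' | cut -d ',' -f 2     # comma as delimiter: prints b
```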

slide-17
SLIDE 17

grep - print lines of files matching expression

$ echo a > abc
$ echo b >> abc
$ echo c >> abc
$ grep b abc
b
$ grep c abc
c
$ grep -v b abc
a
c

$ cut -f 1-7 Escherichia_coli_gca_000770055.GCA_000770055.1.29.gtf | grep -v '^#' > ecoli_1-7.gtf
$ head ecoli_1-7.gtf
Contig0000020 ena gene 680 1846 . +
Contig0000020 ena transcript 680 1846 . +
Contig0000020 ena exon 680 1846 . +
Contig0000020 ena CDS 680 1843 . +
Contig0000020 ena start_codon 680 682 . +
Contig0000020 ena stop_codon 1844 1846 . +
Contig0000020 ena gene 1912 2811 . +
Contig0000020 ena transcript 1912 2811 . +
Contig0000020 ena exon 1912 2811 . +
Contig0000020 ena CDS 1912 2808 . +

The expression can be a simple string like "a". Certain characters are interpreted, though. E.g. ^ means match at the beginning of the line.

-v option to print lines NOT matching pattern
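A sketch on a made-up three-line file with one comment line. The -c option, which prints the number of matching lines instead of the lines, is not on the slide and is used here only to make the result easy to check.

```shell
sample=$(mktemp)
printf '#!build ASM\nContig1\tgene\nContig2\texon\n' > "$sample"

grep gene "$sample"          # lines containing "gene" anywhere
grep '^Contig' "$sample"     # lines starting with Contig
grep -v '^#' "$sample"       # everything except comment lines
grep -c -v '^#' "$sample"    # count of non-comment lines: 2
```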
slide-18
SLIDE 18

sort - print sorted lines

$ head ecoli_1-7.gtf | sort -k 3
Contig0000020 ena CDS 1912 2808 . +
Contig0000020 ena CDS 680 1843 . +
Contig0000020 ena exon 1912 2811 . +
Contig0000020 ena exon 680 1846 . +
Contig0000020 ena gene 1912 2811 . +
Contig0000020 ena gene 680 1846 . +
Contig0000020 ena start_codon 680 682 . +
Contig0000020 ena stop_codon 1844 1846 . +
Contig0000020 ena transcript 1912 2811 . +
Contig0000020 ena transcript 680 1846 . +
$ head ecoli_1-7.gtf | sort -n -k 5
Contig0000020 ena start_codon 680 682 . +
Contig0000020 ena CDS 680 1843 . +
Contig0000020 ena exon 680 1846 . +
Contig0000020 ena gene 680 1846 . +
Contig0000020 ena stop_codon 1844 1846 . +
Contig0000020 ena transcript 680 1846 . +
Contig0000020 ena CDS 1912 2808 . +
Contig0000020 ena exon 1912 2811 . +
Contig0000020 ena gene 1912 2811 . +
Contig0000020 ena transcript 1912 2811 . +

-k N sort on the key that starts at field N and runs to the end of the line
-n numeric sort
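Without -n the numbers are compared as text, which is why 1912 sorts before 680 in the first output. A sketch on three made-up lines, with LC_ALL=C set so that the text order does not depend on the locale:

```shell
sample=$(mktemp)
printf 'gene\t1912\nexon\t680\nCDS\t1844\n' > "$sample"

LC_ALL=C sort -k 2 "$sample"       # text order on column 2:    1844, 1912, 680
LC_ALL=C sort -n -k 2 "$sample"    # numeric order on column 2: 680, 1844, 1912
LC_ALL=C sort -k 1,1 "$sample"     # key limited to column 1:   CDS, exon, gene
```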
slide-19
SLIDE 19

uniq - print changing lines or count them

$ head ecoli_1-7.gtf | cut -f 5 | uniq
1846
1843
682
1846
2811
2808
$ head ecoli_1-7.gtf | cut -f 5 | sort -n | uniq
682
1843
1846
2808
2811
$ head ecoli_1-7.gtf | cut -f 5 | sort -n | uniq -c
   1 682
   1 1843
   4 1846
   1 2808
   3 2811
$ head ecoli_1-7.gtf | cut -f 5 | sort -n | uniq -c | sort -n
   1 1843
   1 2808
   1 682
   3 2811
   4 1846
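uniq only collapses repeats that are adjacent, which is why 1846 appears twice in the first output and why the input is sorted first in the later ones. A sketch with made-up values:

```shell
values=$(printf '1846\n1846\n682\n1846\n2811\n2811\n')

echo "$values" | uniq                 # adjacent repeats only: 1846 682 1846 2811
echo "$values" | sort -n | uniq       # sorted first: 682 1846 2811
echo "$values" | sort -n | uniq -c    # with counts: 1x 682, 3x 1846, 2x 2811

# Most frequent value first: sort the counts numerically in reverse order.
echo "$values" | sort -n | uniq -c | sort -rn | head -n 1
```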

slide-20
SLIDE 20

How many sense vs. antisense genes are in E. coli?

slide-21
SLIDE 21

$ cut -f 7 ecoli_1-7.gtf | sort | uniq -c
14313 +
13246 -
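Note that ecoli_1-7.gtf holds one line per feature (gene, transcript, exon, CDS, start_codon, stop_codon), so the pipeline above counts all feature lines per strand. A sketch of one way to restrict the count to gene lines, using only the tools from this course. The file content below is a made-up stand-in, and the approach assumes that no other feature name in column 3 starts with "gene".

```shell
# Toy stand-in for ecoli_1-7.gtf (hypothetical content): 5 feature lines, 3 of them genes.
sample=$(mktemp)
{
    printf 'Contig1\tena\tgene\t680\t1846\t.\t+\n'
    printf 'Contig1\tena\texon\t680\t1846\t.\t+\n'
    printf 'Contig1\tena\tgene\t1912\t2811\t.\t+\n'
    printf 'Contig1\tena\tgene\t3000\t3500\t.\t-\n'
    printf 'Contig1\tena\tCDS\t3000\t3497\t.\t-\n'
} > "$sample"

# All feature lines per strand.
cut -f 7 "$sample" | sort | uniq -c

# Gene lines only: keep feature type and strand, select lines starting with "gene".
cut -f 3,7 "$sample" | grep '^gene' | cut -f 2 | sort | uniq -c
```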

slide-22
SLIDE 22

More Command Line Editing

Keys as in the default (emacs-style) key bindings of bash and zsh:

Tab         Automatic completion. Most important key. Works for commands and paths in most shells.

Ctrl + R    Search in the history

Ctrl + W    Delete a word

Ctrl + A    Jump to beginning

Ctrl + E    Jump to end

slide-23
SLIDE 23

Literature

"The Unix Programming Environment", Brian W. Kernighan, Rob Pike. Year: 1983. ISBN: 978-0139376818
German edition: "Der UNIX-Werkzeugkasten. Programmieren mit UNIX". Year: 1986. ISBN: 3446142738

"UNIX Power Tools, 2nd Edition", Jerry Peek, Tim O'Reilly & Mike Loukides. Year: 1997. ISBN: 978-1565922600
