Barcode Sequence Alignment and Statistical Analysis (Barcas) tool


slide-1
SLIDE 1

Barcode Sequence Alignment and Statistical Analysis (Barcas) tool

2016.10.05 Mun, Jihyeob and Kim, Seon-Young Korea Research Institute of Bioscience and Biotechnology

slide-2
SLIDE 2

Barcode-Sequencing

Ø Genome-wide screening method that sequences and counts tens of thousands of individual tags (barcodes), each representing a gene, under a given condition
Ø Originally developed for yeast deletion libraries, such as those of Saccharomyces cerevisiae and Schizosaccharomyces pombe
Ø Now applied to genome-wide siRNA or shRNA screening to measure the effects of gene knock-down
Ø Also applied, using CRISPR-Cas9, to genome-wide sgRNA screening for the effects of gene knock-out

slide-3
SLIDE 3

Examples of genome-wide barcode-sequencing libraries

Contents | Organism | # of genes | # of barcodes | References
Yeast deletion consortium | S. cerevisiae | 6,343 | 2 (UP and DN) | www-sequence.Stanford.edu/group/
Bioneer pombe collection | S. pombe | 4,836 | 2 (UP and DN) | http://us.bioneer.com/
MISSION shRNA (human) | H. sapiens | 20,018 | 129,696 shRNA | http://sigmaaldrich.com
MISSION shRNA (mouse) | M. musculus | 21,171 | 118,072 shRNA | http://sigmaaldrich.com
TRC1 (human) shRNA | H. sapiens | 16,019 | 80,717 shRNA | https://portals.broadinstitute.org/gpp/trc1/
TRC1 (mouse) shRNA | M. musculus | 15,960 | 77,819 shRNA | https://portals.broadinstitute.org/gpp/trc1/
Human DECIPHER (shRNA) | H. sapiens | 15,377 | 5+ shRNAs | https://www.cellecta.com
Mouse DECIPHER (shRNA) | M. musculus | 9,145 | 5+ shRNAs | https://www.cellecta.com
Cellecta Genome-wide shRNA | H. sapiens | 19,276 | 8 shRNAs | https://www.cellecta.com
Cellecta Genome-wide CRISPR | H. sapiens | 19,001 | 8 sgRNAs | https://www.cellecta.com
Human GeCKO v2 | H. sapiens | 19,050 | 123,411 sgRNA | https://www.addgene.org/
Mouse GeCKO v2 | M. musculus | 20,611 | 130,209 sgRNA | https://www.addgene.org/
Mouse genome-wide v1 (yusa) | M. musculus | 19,150 | 87,897 sgRNA | https://www.addgene.org/
Oxford fly | Drosophila | 13,501 | 40,279 sgRNA | https://www.addgene.org/
CRISPRa | H. sapiens | 15,977 | 198,810 sgRNA | https://www.addgene.org/
CRISPRi | H. sapiens | 11,219 | 206,421 sgRNA | https://www.addgene.org/

slide-4
SLIDE 4

Workflow : barcoded yeast deletion strains

slide-5
SLIDE 5

Workflow : genome-wide shRNA screening

slide-6
SLIDE 6

Basic format of barcode-seq data

Read components: Universal Primer (20-25 bp), Barcode (20-30 bp), MID (Multiplexing Index, 4-6 bp)

slide-7
SLIDE 7

Steps of barcode-seq data analysis

Read components: Multiplex Index (4-6 bp), Universal Primer (20 bp), Barcode (20-30 bp)

Trim index => Trim primer => Map and count each TAG

Tag | sample1 | sample2 | sample3
tag1 | 3400 | 2500 | 2983
tag2 | 120 | 199 | 739
tag3 | 29920 | 3544 | 2232
tag4 | 4300 | 3433 | 3344
... | ... | ... | ...

Downstream steps: Pre-processing and QC => Normalization => Statistical Analyses => Visualization
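The three counting steps on this slide can be sketched in a few lines of Python. This is only an illustration, not Barcas code: the index sequences, the primer, the barcode library and the assumed read order (index, then primer, then barcode) are all made up for the example, and mapping is exact-match only.

```python
from collections import Counter

# All sequences below are hypothetical; real MIDs, primers and barcodes differ.
MIDS = {"ACGT": "sample1", "TGCA": "sample2"}                  # multiplex index -> sample
PRIMER = "GATTACAGATTACA"                                      # universal primer
BARCODES = {"AAAACCCCGGGG": "tag1", "TTTTGGGGCCCC": "tag2"}    # barcode -> tag
MID_LEN = 4

def count_tags(reads):
    """Trim the index, trim the primer, then map and count each tag per sample."""
    counts = {sample: Counter() for sample in MIDS.values()}
    for read in reads:
        sample = MIDS.get(read[:MID_LEN])        # step 1: read and trim the index
        rest = read[MID_LEN:]
        if sample is None or not rest.startswith(PRIMER):
            continue                             # unknown index or primer: skip read
        barcode = rest[len(PRIMER):]             # step 2: trim the primer
        tag = BARCODES.get(barcode)              # step 3: exact-match mapping
        if tag is not None:
            counts[sample][tag] += 1
    return counts

reads = [
    "ACGT" + PRIMER + "AAAACCCCGGGG",
    "ACGT" + PRIMER + "AAAACCCCGGGG",
    "TGCA" + PRIMER + "TTTTGGGGCCCC",
    "GGGG" + PRIMER + "AAAACCCCGGGG",   # unknown index, discarded
]
table = count_tags(reads)
```

The resulting per-sample counters correspond to the columns of the tag-by-sample count table that feeds normalization and statistics.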

slide-8
SLIDE 8

Current tools and methods for barcode-seq data analysis

(O: supported; X: not supported)

Tool (or method) | Pre-processing | QC | Normalization | Statistical Analysis | Visualization | Software format | Ref.
Barcas | O | O | O | O | O | Java GUI | Mun 2016 BMC Bioinfo
Barcode Deconvoluter | O | X | X | X | X | Windows or Mac GUI | www.decipherproject.net/software
BiNGS!LS-seq & edgeR | O | O | O | O | X | R package | Kim 2012 Methods Mol Biol
edgeR | O | X | O | O | X | R package | Dai 2014 F1000 Res
HiTSelect | X | X | X | Multi-objective ranking | O | Matlab runtime | Diaz 2015 Nuc Acids Res
MAGeCK | O | O | O | O | X | Python, C source code | Li 2014 Genome Bio
MAGeCK-VISPR | O | O | O | Robust rank aggregation | O | Python script | Li 2015 Genome Bio
RIGER | X | X | X | RNAi Gene Enrichment Ranking | O | GENE-E (=> Morpheus) Java GUI | Luo 2008 PNAS
RSA | X | X | X | Iterative hypergeometric P-value | X | Windows GUI (C#), R, Perl | Konig 2007 Nat Methods
ScreenBEAM | X | X | X | Pooled scoring | X | R package | Yu 2015 Bioinformatics
shALIGN & shRNAseq | O | O | O | O | X | Perl and R script | Sims 2011 Genome Bio

slide-9
SLIDE 9

Barcas (Barcode sequence Alignment and Statistical Analysis)

  • Barcas is an all-in-one program for the analysis of multiplexed barcode sequencing (barcode-seq) data
  • Available at http://medical-genome.kribb.re.kr/barseq/

Input: Barcode-seq data

  • Genome-wide shRNAs (Cellecta, TRC, Sigma-Aldrich, etc.)
  • Genome-wide sgRNAs (Addgene, Cellecta, etc.)
  • Barcoded yeast deletion strains: S. cerevisiae or S. pombe

Ø Preprocessing & Mapping

  • Filtering, trimming, and mapping with mismatches and indels

Ø Quality Control (of barcodes and samples)
Ø Normalization
Ø Statistical Analysis

  • Two-condition comparison, multiple time points

Ø Visualization

  • Various graphs and heatmaps
slide-10
SLIDE 10

All in one package with user-friendly GUI

Step 1: Pre-processing & Mapping
Step 2: QC of data quality
Step 3: Design experiment
Step 4: Statistical analysis

slide-11
SLIDE 11

Step 1: Data preprocessing and mapping

Ø De-multiplexing and trimming (universal primers)
Ø Mapping with imperfect matches (mismatches and indels)
Ø Searching for individual tag sequences

slide-12
SLIDE 12

Step 2: Data quality evaluation

Ø Sequence level: overall sequence quality
Ø Sample level: mapping counts and percentage, etc.
Ø Barcode (or tag) level: mapping counts and percentage, etc.

slide-13
SLIDE 13

Step 3: Experimental design

Ø Comparison of two conditions
Ø Across several different time points

slide-14
SLIDE 14

Step 4: Statistical analysis and Visualization

Ø Calculates z-score and p-value for each barcode
Ø Ranks each barcode by z-score
Ø Plots z-score graph
Ø Plots time-dependent intensity heat-map
Ø Allows searching for individual target gene
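The slide does not say how Barcas derives its z-scores, so the following is only one plausible scheme, written as a sketch: counts are scaled to counts per million, a log2 ratio between two conditions is standardized across barcodes, and a two-sided p-value is taken from the normal distribution. The function name, the pseudocount and the choice of sample1 and sample2 as "control" and "treated" are all assumptions made for illustration; the counts come from the example table on slide 7.

```python
import math
from statistics import mean, stdev

def barcode_zscores(control, treated, pseudocount=1.0):
    """Return {tag: (z, p)} for two count dictionaries (illustrative scheme only)."""
    total_c = sum(control.values())
    total_t = sum(treated.values())
    log_ratio = {}
    for tag in control:
        # Library-size normalization to counts per million, plus a pseudocount.
        c = control[tag] / total_c * 1e6 + pseudocount
        t = treated.get(tag, 0) / total_t * 1e6 + pseudocount
        log_ratio[tag] = math.log2(t / c)
    mu = mean(log_ratio.values())
    sd = stdev(log_ratio.values())
    result = {}
    for tag, value in log_ratio.items():
        z = (value - mu) / sd
        p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value under N(0, 1)
        result[tag] = (z, p)
    return result

control = {"tag1": 3400, "tag2": 120, "tag3": 29920, "tag4": 4300}   # sample1 column
treated = {"tag1": 2500, "tag2": 199, "tag3": 3544, "tag4": 3433}    # sample2 column
scores = barcode_zscores(control, treated)
ranked = sorted(scores, key=lambda tag: scores[tag][0])              # rank by z-score
```

With only four tags the statistics are not meaningful; the point is the order of operations (normalize, compare, standardize, rank).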

slide-15
SLIDE 15

Novel functions of Barcas for data pre-processing and QC

Ø Flexible mapping with support for both substitutions and indels
Ø Detection of erroneous barcodes in the library
Ø Checking similarity among barcodes in the library collection

slide-16
SLIDE 16

Existing tools for data preprocessing

Name | Mismatches | Shifts of the position | Indel | Backend tool | Ref.
BiNGS!LS-seq | O | X | X | bowtie | Kim (2012) Methods Mol Bio
shALIGN | O | X | X | Perl script (or bowtie) | Sims (2011) Genome Bio
edgeR | O | O | X | edgeR | Dai (2014) F1000Res
Barcas | O | O | O | Trie data structure | Mun (2016) BMC Bioinfo

Read structure: MID, Universal Primer, Barcode (shRNA)

Original barcode: TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Perfect match: TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Mismatches: TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Position shift: TCAAAGATAGTCACGCGACC-ATCGACGAGCTACC
Indel: TCAAAGATAGTCACGCGACCTCATCGA--AGCTACC

slide-17
SLIDE 17

Trie data structure

1:1 sequence matching

  • Algorithm: list based
  • Maximum time: N * M (N: read count, M: reference count)

1:M sequence matching

  • Algorithm: tree based
  • Maximum time: N (N: read count)

Figure: the read AGCT matched against a list of library references, and against a tree of library references starting from the root.

Ø Data structure based on a prefix tree
Ø Useful for storing a dynamic set or associative array in which the keys are usually strings
Ø More efficient than a hash table (or dictionary) or lists in terms of look-up speed and memory
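A minimal sketch of a trie for exact barcode look-up follows. It is not taken from Barcas; the class and function names and the reference labels ref1 to ref4 are hypothetical, and the four short sequences are the library references shown in the slide figure. Each read is matched with one walk from the root, so the cost per read depends on the read length rather than on the number of references.

```python
class TrieNode:
    """One node of the prefix tree; `tag` is set where a library barcode ends."""
    __slots__ = ("children", "tag")

    def __init__(self):
        self.children = {}   # base -> TrieNode
        self.tag = None

def build_trie(library):
    """Insert every reference sequence into a shared prefix tree."""
    root = TrieNode()
    for tag, seq in library.items():
        node = root
        for base in seq:
            node = node.children.setdefault(base, TrieNode())
        node.tag = tag
    return root

def lookup(root, read):
    """Walk the trie once along the read; return the tag or None."""
    node = root
    for base in read:
        node = node.children.get(base)
        if node is None:
            return None
    return node.tag

library = {"ref1": "AGCT", "ref2": "CGCT", "ref3": "GCCAA", "ref4": "TTAG"}
root = build_trie(library)
hit = lookup(root, "AGCT")
miss = lookup(root, "AGCA")
prefix_only = lookup(root, "GCC")   # a prefix of ref3, not a full barcode
```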

slide-18
SLIDE 18
1. Data structure of Barcas for mapping

  • Based on the trie data structure, Barcas supports imperfect matching, allowing mismatches, base shifting and indels
  • Dynamic sequence lengths
  • Dynamic start positions
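One way to get imperfect matching out of a trie is a depth-first walk that carries an edit budget. The sketch below is an assumption about how such a search could look, not the Barcas implementation: it handles substitutions, insertions and deletions up to `max_edits`, and it does not cover base shifting or dynamic start positions. The barcode names and sequences are invented.

```python
END = "$"   # key marking the end of a library barcode in the nested-dict trie

def build_trie(library):
    root = {}
    for tag, seq in library.items():
        node = root
        for base in seq:
            node = node.setdefault(base, {})
        node[END] = tag
    return root

def fuzzy_lookup(root, read, max_edits=1):
    """Return {tag: edits} for every barcode within max_edits of the read."""
    hits = {}

    def walk(node, i, edits):
        if edits > max_edits:
            return                                   # edit budget exhausted
        if i == len(read) and END in node:
            tag = node[END]
            hits[tag] = min(edits, hits.get(tag, edits))
        if i < len(read):
            walk(node, i + 1, edits + 1)             # extra base in the read
        for base, child in node.items():
            if base == END:
                continue
            walk(child, i, edits + 1)                # base missing from the read
            if i < len(read):
                cost = 0 if base == read[i] else 1   # match or substitution
                walk(child, i + 1, edits + cost)

    walk(root, 0, 0)
    return hits

library = {"bc1": "ACTGACTG", "bc2": "TTGACCAA"}     # hypothetical barcodes
root = build_trie(library)
exact = fuzzy_lookup(root, "ACTGACTG")
substitution = fuzzy_lookup(root, "ACTGCCTG")
deletion = fuzzy_lookup(root, "ACTACTG")
insertion = fuzzy_lookup(root, "ACTGAACTG")
```

The search cost grows quickly with `max_edits`, which is why small budgets (one or two edits) are the practical setting for short barcodes.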
slide-19
SLIDE 19

Comparison of speed and mapping rate of Barcas with bowtie and the edgeR package of R

Data

  • 215 million reads were mapped to 4,832 heterozygous diploid deletion strains of S. pombe.
  • 45-bp sequences were used as the barcode library.

Result

Barcas was 1.7 times faster than bowtie and 13 times faster than edgeR. Owing to indel mapping, Barcas mapped at least 8-12% more reads than the other two programs.
slide-20
SLIDE 20

2. Detection of erroneous barcodes from the genome-wide barcode library

Ø It is tempting to assume that the barcode sequences in the library match the original design without error
Ø However, errors can creep into the barcodes at many steps, including

  • barcode synthesis,
  • random mutations during library maintenance,
  • erroneous incorporation of barcodes into the genome, in the case of yeast strains.

slide-21
SLIDE 21

Erroneous barcodes in the yeast library

Eason et al (2004) Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains. PNAS 101(30):11046-51
Smith et al (2009) Quantitative phenotyping via deep barcode sequencing. Genome Res 19:1836-42

Measure | U1 | UpTag | U2 | D2 | DnTag | D1
# correct by Smith | 4,242 | 4,369 | 4,045 | 4,207 | 4,320 | 3,867
% correct by Smith | 80.1% | 82.5% | 82.9% | 80.9% | 83.1% | 83.7%
# correct by Eason | 4,185 | 3,764 | 4,057 | 4,343 | 3,807 | 4,095
% correct by Eason | 79.1% | 71.1% | 83.2% | 83.5% | 73.2% | 88.7%
% Agreed | 86% | 84.4% | 89.2% | 92.6% | 85.1% | 92%

slide-22
SLIDE 22

A simple method to detect erroneous barcodes

Measure the gain in counts between perfect matches only (PM) and perfect matches plus mismatches (PM + MM).

Case 1: dominant perfect match with minor mismatches

Read | Sequence | Counts
Original design | ACTGACTGACTGACTGACTG |
Perfect | ACTGACTGACTGACTGACTG | 50,000
Mismatch 1 | ACTGACTGACTGACTGCCTG | 10
Mismatch 2 | ACTCACTGACTGACTGACTG | 9
Mismatch 3 | ACTGACAGACTGACTGACTG | 20
Mismatch 4 | ACTGACTGACTTACTGACTG | 3
Mismatch 5 | AGTGACTGACTGACTGACTG | 7
Mismatch 6 | ACTGACTGACTGACTGTCTG | 12
Mismatch 7 | ACTGACTGACTAACTGACTG | 5

PM only: 50,000
PM + MM: 50,066
Gain: 50,066/50,000 = 1.0013 (0.13% gain)

Case 2: one dominant mismatch with minor perfect match and other mismatches

Read | Sequence | Counts
Original design | ACTGACTGACTGACTGACTG |
Perfect | ACTGACTGACTGACTGACTG | 200
Mismatch 1 | ACTGACTGACTGACTGCCTG | 40,000
Mismatch 2 | ACTCACTGACTGACTGACTG | 11
Mismatch 3 | ACTGACAGACTGACTGACTG | 12
Mismatch 4 | ACTGACTGACTTACTGACTG | 3
Mismatch 5 | AGTGACTGACTGACTGACTG | 12
Mismatch 6 | ACTGACTGACTGACTGTCTG | 9
Mismatch 7 | ACTGACTGACTAACTGACTG | 5

PM only: 200
PM + MM: 40,252
Gain: 40,252/200 = 201.26 (about 200-fold gain)
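The gain measure can be written as (PM + MM) / PM. The sketch below computes it from the perfect-match and mismatch counts listed in the two examples. The function name and the cut-off of 2.0 used to flag a barcode are assumptions for illustration; the slides do not state which threshold Barcas applies.

```python
def gain(pm_count, mm_counts):
    """Ratio of (perfect + mismatch) counts to perfect-match counts."""
    if pm_count == 0:
        return float("inf")      # no perfect matches at all
    return (pm_count + sum(mm_counts)) / pm_count

# Counts from the two examples on this slide.
case1 = gain(50_000, [10, 9, 20, 3, 7, 12, 5])        # dominant perfect match
case2 = gain(200, [40_000, 11, 12, 3, 12, 9, 5])      # one dominant mismatch

THRESHOLD = 2.0   # hypothetical cut-off, not taken from Barcas
flagged = [name for name, g in (("case1", case1), ("case2", case2)) if g > THRESHOLD]
```

A gain close to 1 means mismatched reads add little to the perfect-match count, while a large gain suggests that the barcode actually present in the library differs from the designed sequence.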

slide-23
SLIDE 23

Detection of erroneous barcodes

Ø Library: 1,230 shRNA sequences of the TRC library
Ø Data: control samples in neuroepithelial (NE), early radial glial (ERG) and mid radial glial (MRG) cells
Ø We found 25 erroneous barcodes (2.03%)

Ziller, MJ. et al., Nature 2015, 518, 355-9.

slide-24
SLIDE 24

Detection of erroneous barcodes (TRC)

Gene | ID | Original sequence | Major mapped (two mismatches/indels) | PM count | MM count
PBX2 | TRCN0000285144 | ATACTCCCACTTGCAACTATT | ATACTCCCACTTGTAACTATT | 10,785 | 34,084
SKI | TRCN0000010439 | GAATCTGCCACTCTCAGAATA | -AATCTGCCACTCTCAGAATA | 14 | 5,935
TERF2IP | TRCN0000010356 | GAGAGTTCTTGCATTGGAACT | -AGAGTTCTTGCATTGGAACT | 4 | 1,244
SKI | TRCN0000010437 | GATCGAAGACCTGCAGGTGAA | -ATCGAAGACCTGCAGGTGAA | 5 | 625
MYC | TRCN0000010390 | GAATGTCAAGAGGCGAACACA | -AATGTCAAGAGGCGAACACA | 3 | 393
JDP2 | TRCN0000019000 | CGGGAGAAGAACAAAGTCGCA | CGGGAGAAGAACAAAAACGCA | 46 | 508
TFAP2B | TRCN0000019659 | CGGTTCTTTCGAGTTTAGTAA | CGGTTCTTTTGAGTTTTGTAA | 87 | 522
NFFKB | TRCN0000014868 | CAGGGAGGTTGCATCATTGTT | CAGGGAGGGTGCATCATTGTT | 98 | 571
KLF13 | TRCN0000016925 | CGGGCGAGAAGAAGTTCAGCT | CGGGCGAGAAGAAGTTCATGGT | 124 |

slide-25
SLIDE 25
3. Check for sequence similarity among barcodes in a reference

Ø Erroneous barcodes can be generated during the production of many barcodes.
Ø If two barcodes were designed to be similar (e.g., only 1 bp difference) and mutations or sequencing errors occur, it is hard to distinguish errors from true differences.
Ø Thus, barcodes originally designed to be similar should be identified (and flagged) in advance.
Ø For this purpose, Barcas allows checking of sequence similarity among barcode sequences.
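A similarity check of this kind can be sketched without comparing every pair directly. The code below is an illustrative assumption, not the Barcas routine: it flags pairs of barcodes that differ by at most one substitution (by bucketing sequences with one position masked) or by one inserted or deleted base (by looking up each single-base deletion among the full sequences). The function name and the four example barcodes are invented.

```python
from collections import defaultdict
from itertools import combinations

def similar_pairs(barcodes):
    """Return sorted pairs of barcode names whose sequences are within one edit."""
    by_seq = defaultdict(set)    # full sequence -> names
    masked = defaultdict(set)    # sequence with one base masked -> names
    for name, seq in barcodes.items():
        by_seq[seq].add(name)
        for i in range(len(seq)):
            masked[seq[:i] + "*" + seq[i + 1:]].add(name)

    pairs = set()
    # Same length: identical everywhere except possibly the masked position.
    for names in masked.values():
        pairs.update(combinations(sorted(names), 2))
    # Lengths differ by one: deleting a base from one gives the other.
    for name, seq in barcodes.items():
        for i in range(len(seq)):
            shorter = seq[:i] + seq[i + 1:]
            for other in by_seq.get(shorter, ()):
                pairs.add(tuple(sorted((name, other))))
    return sorted(pairs)

barcodes = {
    "a": "ACTGACTG",
    "b": "ACTGACTC",   # one substitution away from "a"
    "c": "ACTGACT",    # "a" or "b" with the last base deleted
    "d": "TTTTCCCC",   # unrelated
}
flagged = similar_pairs(barcodes)
```

Pairs returned by such a check are the ones worth flagging in advance, since a single sequencing error could turn one barcode into the other.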

slide-26
SLIDE 26

Library reference QC

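Tested public library sets (11)

Screen | Library | Date | Species | Module | Barcode length | Barcode count | Gene count
shRNA | TRC | 05/Apr/11 | Human | | 21-bp | 61,621 | 15,435
shRNA | Cellecta | 15/Feb/12 | Human | Module1 | 18-bp | 27,500 | 5,046
shRNA | Cellecta | 15/Feb/12 | Human | Module2 | 18-bp | 27,500 | 5,421
shRNA | Cellecta | 15/Feb/12 | Human | Module3 | 18-bp | 27,500 | 4,923
sgRNA | yusa | | Mouse | | 19-bp | 87,437 | 19,149
sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library A | 20-bp | 63,950 | 21,669
sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library B | 20-bp | 56,869 | 19,834
sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library A | 20-bp | 65,959 | 22,486
sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library B | 20-bp | 61,139 | 21,263
Deletion mutant strains | Heterozygous diploid | | Saccharomyces cerevisiae | | 20-bp | 6,318/UP, 6,126/DN | 6,131
Deletion mutant strains | Heterozygous diploid | | Schizosaccharomyces pombe | | 20-bp | 4,832/UP, 4,832/DN | 4,832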

slide-27
SLIDE 27

Library reference QC

Barcode counts having similar pairs within one base

Library | Static sequence length comparison | Dynamic sequence length comparison (indels)
GeCKOv2.Human.A | 517 (0.81%) | 538 (0.84%)
GeCKOv2.Human.B | 437 (0.77%) | 441 (0.78%)
GeCKOv2.Mouse.A | 736 (1.12%) | 755 (1.14%)
GeCKOv2.Mouse.B | 850 (1.39%) | 860 (1.41%)
yusa | 517 (0.59%) | 3,944 (4.51%)
Cellecta.Human.M1 | 0 (0%) | 412 (1.5%)
Cellecta.Human.M2 | 0 (0%) | 398 (1.45%)
Cellecta.Human.M3 | 0 (0%) | 410 (1.49%)
TRC | 790 (1.28%) | 1,909 (3.10%)
S. cerevisiae | 0 (0%) | 0 (0%)
S. pombe | 0 (0%) | 0 (0%)

slide-28
SLIDE 28

Conclusions

Ø Barcas is all-in-one software for barcode-seq data analysis, with a user-friendly interface and a few useful new functions for data pre-processing and quality control of barcode libraries
Ø Future improvements

Support for diverse statistical analyses

  • Sophisticated gene-level summary statistics for shRNA and sgRNA
  • RSA, RIGER, MAGeCK, HiTSelect, ScreenBEAM, etc.
  • Multiple-condition comparison (MAGeCK-VISPR)
  • Utilization of metadata and gene-set level analysis (HiTSelect)

Ø We hope Barcas will be useful for barcode-seq data analysis by researchers with minimal bioinformatics skills

slide-29
SLIDE 29

Thank you for your attention

slide-30
SLIDE 30

Limits of the mapping of edgeR package

1. Indels in the barcode reads are not supported
2. Only shifts of the barcode position are allowed
3. Mismatches in the MID and universal primers are not allowed
4. Indels in the MID and universal primers are not allowed

Read format

  • MID, Universal Primer, Barcode (shRNA)
  • Universal Primer (sense), Barcode (shRNA), Universal Primer (anti-sense), Barcode (shRNA)

Example 1: TRC library

  • Different lengths of universal primers: forward 37 bp, reverse 42 bp

Example 2: Cellecta library

  • Different MID lengths: from 9 to 17 bp

Ø Loss of sequences with indels in any of the MID, primers and barcodes
Ø Loss of sequences with mismatches in the MID and primers