SAM and BAM formats - Institut Pasteur
Amel Ghouila, Claudia Chica, Emna Achouri & Fatma Guerfali
C3BI Hands-on NGS course – IPP – 23rd Nov 2016
SAM and BAM formats
SAM, BAM formats
Raw sequence data: FASTQ files → Mapping (Bowtie, BWA or others) → BAM/SAM files
• After mapping the FASTQ file to the reference genome, you will end up with a SAM or BAM alignment file.
• SAM stands for Sequence Alignment/Map format.
• A single SAM file can store mapped, unmapped, and even QC-failed reads from a sequencing run, and can be indexed to allow rapid access. This means that the raw sequencing data can be fully recapitulated from the SAM/BAM file.
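As a rough illustration of how the raw reads can be recovered from an alignment file, the sketch below rebuilds a FASTQ entry from a single SAM line. The helper name `sam_line_to_fastq` and the example record are hypothetical. It assumes that SEQ and QUAL are actually stored (not `*`) and that the read was not hard-clipped. Reads aligned to the reverse strand (FLAG bit 16) are stored reverse-complemented in SAM, so they are flipped back to their original orientation.

```python
# Hypothetical helper: rebuild a FASTQ entry from one SAM alignment line.
# Assumes SEQ/QUAL are present and the read was not hard-clipped.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_line_to_fastq(line):
    fields = line.rstrip("\n").split("\t")
    qname, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 16:
        # Read is stored on the reverse strand: restore the original orientation.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{qname}\n{seq}\n+\n{qual}\n"

# Made-up record, for illustration only.
example = "read1\t16\tchr1\t100\t30\t4M\t*\t0\t0\tAACG\tIIHG"
print(sam_line_to_fastq(example), end="")
```

In practice a dedicated tool would normally do this conversion for a whole file; the sketch only shows why the information needed is already in the record.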
SAM format
(Li Shen, 2014)
SAM, BAM formats
• A plain-text SAM file is rarely needed directly and takes up a lot of disk space, which is why, in practice, we usually work only with the BAM file.
• A BAM file (.bam) is the binary version of a SAM file: it saves storage and is faster to manipulate.
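One way to see the difference between the two formats is to look at the first bytes of a file. The sketch below relies on two details from the SAM/BAM specification: a BAM file is compressed in a gzip-compatible format (BGZF), and its decompressed content starts with the four magic bytes `BAM\1`. The function name `looks_like_bam` is hypothetical, and this is only a quick check, not a validation of the whole file.

```python
import gzip

def looks_like_bam(path):
    # True if the file decompresses as gzip/BGZF and starts with the BAM magic bytes.
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip-compressed (for example a plain-text SAM file) or truncated.
        return False
```

A plain-text SAM file fails the gzip step and is reported as not being a BAM file.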
SAM, BAM formats
• A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data.
• SAM files can be opened using a text editor or viewed using the UNIX "more" command.
• Most alignment programs will supply:
- a header: describes the format version, the sorting order of the reads, and the genomic sequences to which the reads were mapped
- an alignment section: contains, for each sequence, the information about where and how it aligns to the reference genome
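The two parts of a SAM file can be told apart line by line: header lines start with `@`, and every other non-empty line is an alignment. A minimal sketch, assuming the file content is already loaded as text; the function name `split_sam` and the three example lines are made up for illustration.

```python
def split_sam(text):
    # Header lines start with '@'; all other non-empty lines are alignments.
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        (header if line.startswith("@") else alignments).append(line)
    return header, alignments

example = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",   # format version and sorting order
    "@SQ\tSN:chr1\tLN:1000",        # a reference sequence the reads were mapped to
    "read1\t0\tchr1\t5\t30\t4M\t*\t0\t0\tACGT\tIIII",
])
header, alignments = split_sam(example)
print(len(header), "header lines,", len(alignments), "alignment line")
```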
SAM, BAM formats
Header
Alignment section: 11 columns (tab-separated)
SAM format
http://samtools.sourceforge.net/SAM1.pdf
http://genome.sph.umich.edu/wiki/SAM
QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
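The eleven columns can be read into named fields with a few lines of code. This is a minimal sketch: the names `SamRecord` and `parse_alignment` and the example line are hypothetical, and any columns after the eleventh are simply kept as raw strings in `tags`.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)

def parse_alignment(line):
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated columns")
    # FLAG, POS, MAPQ, PNEXT and TLEN are integers; the rest stay as strings.
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])

# Made-up alignment line, for illustration only.
record = parse_alignment(
    "read1\t163\tchr1\t7\t30\t8M\t=\t37\t39\tTTAGATAA\tIIIIIIII\tNM:i:0"
)
print(record.qname, record.flag, record.pos, record.cigar)
```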
SAM format
QNAME: Query template NAME. Reads/segments having identical QNAME are regarded as coming from the same template. A QNAME of '*' indicates that the information is unavailable.
(http://samtools.github.io/hts-specs/SAMv1.pdf)
SAM format (2)
FLAG: bitwise FLAG (ideal for compression).
Boolean flags (12 bits in the current specification), all stored in a single column.
(http://samtools.github.io/hts-specs/SAMv1.pdf)
SAM flag: example
SAM file
Read mapped to position 7: FLAG 163 (= 1 + 2 + 32 + 128):
- The read is the second read in the pair (128)
- The read is paired and properly paired (1 + 2)
- Its mate is mapped to position 37 on the reverse strand (32)
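The FLAG 163 example can be decoded bit by bit with a bitwise AND. In the sketch below, the bit values come from the SAM specification; the short descriptions are paraphrased, and the names `FLAG_BITS` and `decode_flag` are hypothetical.

```python
# Bit values from the SAM specification; descriptions are paraphrased.
FLAG_BITS = [
    (1, "read paired"),
    (2, "read mapped in proper pair"),
    (4, "read unmapped"),
    (8, "mate unmapped"),
    (16, "read on reverse strand"),
    (32, "mate on reverse strand"),
    (64, "first in pair"),
    (128, "second in pair"),
    (256, "not primary alignment"),
    (512, "read fails quality checks"),
    (1024, "PCR or optical duplicate"),
    (2048, "supplementary alignment"),
]

def decode_flag(flag):
    # Keep the description of every bit that is set in the FLAG value.
    return [name for bit, name in FLAG_BITS if flag & bit]

print(decode_flag(163))
# ['read paired', 'read mapped in proper pair', 'mate on reverse strand', 'second in pair']
```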
Decoding SAM flags
Explain flag tool: https://broadinstitute.github.io/picard/explain-flags.html
SAM format (3)
MAPQ equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer.
The MAPQ value can be used to figure out how unique an alignment is in the genome.
✓ A large number (> 10) indicates that the alignment is likely to be unique.
✓ 255 indicates that the mapping quality is not available.
(http://samtools.github.io/hts-specs/SAMv1.pdf)
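The MAPQ formula can be checked numerically. The sketch below converts between an error probability and a MAPQ value in both directions; the function names are hypothetical, and the probability must be greater than zero. Note that each aligner estimates this probability in its own way, so the same value is not strictly comparable across tools.

```python
import math

def mapq_from_error_prob(p_wrong):
    # MAPQ = -10 * log10(Pr{mapping position is wrong}), rounded to an integer.
    return int(round(-10 * math.log10(p_wrong)))

def error_prob_from_mapq(mapq):
    # Inverse relation: probability that the mapping position is wrong.
    return 10 ** (-mapq / 10)

print(mapq_from_error_prob(0.01))   # 20
print(mapq_from_error_prob(0.1))    # 10
```

With this relation, the MAPQ > 10 rule of thumb corresponds to a probability below 0.1 that the mapping position is wrong.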
SAM format: CIGAR string
• The CIGAR string is a sequence of numbers and letters that describes how the bases of a read align to the reference: which bases align (as a match or a mismatch), which reference bases are deleted from the read, and which read bases are insertions that are not in the reference.
More information about these formats is available here:
http://samtools.sourceforge.net
https://samtools.github.io/hts-specs/SAMv1.pdf
SAM format: CIGAR string
Mapped and unmapped reads are imported into SAM/BAM format.
The standard CIGAR description of pairwise alignment defines three operations: 'M' for alignment match, 'I' for insertion compared with the reference and 'D' for deletion.
POS: 5
CIGAR: 3M1I3M1D2M
(NB: The POS indicates that the read aligns starting at position 5 on the reference.)
The CIGAR:
- 3M = 3 bases in the read sequence align with the reference.
- 1I = the next base in the read does not exist in the reference.
- 1D = the reference base does not exist in the read sequence.
http://genome.sph.umich.edu/wiki/SAM
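The POS 5 / 3M1I3M1D2M example can be worked through in code. The sketch below splits a CIGAR string into operations and counts how many read bases and reference bases the alignment covers. The slide describes only M, I and D; the other operation letters (N, S, H, P, =, X) are taken from the SAM specification, and the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases

def parse_cigar(cigar):
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to reject anything the pattern did not fully match.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def cigar_lengths(cigar):
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in CONSUMES_QUERY)
    reference = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return query, reference

pos, cigar = 5, "3M1I3M1D2M"
query_len, ref_len = cigar_lengths(cigar)
# Read length, reference span, and last reference position covered (1-based).
print(query_len, ref_len, pos + ref_len - 1)   # 9 9 13
```

The insertion adds one base to the read only and the deletion adds one base to the reference only, so both lengths come out at 9 here and the alignment ends at reference position 13.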
SAM format: CIGAR string
Examples of CIGAR strings for different types of alignments
(Figure panels: Alignments; SAM file)
(Li et al., 2009)
SAM format (5)
RNEXT: reference name of the mate's alignment (mate pair information for paired-end sequencing)
PNEXT: position of the mate (mate pair information)
Obviously, the chromosome and position are important. The CIGAR string is also important for knowing where gaps relative to the reference (insertions, deletions, or introns in spliced RNA-seq reads) exist in your read.
(http://samtools.github.io/hts-specs/SAMv1.pdf)
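Combining the position with the CIGAR string gives the reference coordinates of each gap. In the SAM specification, a skipped reference region such as an intron in a spliced read is written with the `N` operation. The sketch below reports those regions; the function name `intron_intervals` and the example values are made up for illustration.

```python
import re

def intron_intervals(pos, cigar):
    # Return 1-based inclusive (start, end) reference intervals for each N operation.
    introns = []
    ref_pos = pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":
            introns.append((ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # operations that advance along the reference
            ref_pos += length
    return introns

# A read starting at position 100 with a 500-base skipped region.
print(intron_intervals(100, "10M500N15M"))   # [(110, 609)]
```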