Can existing SAM/BAM filter tools give me mapped/unmapped pairs?

classic Classic list List threaded Threaded
3 messages Options
| Threaded
Open this post in threaded view
|

Can existing SAM/BAM filter tools give me mapped/unmapped pairs?

Peter Cock
Hi all,

I'm looking for a little advice on the pre-existing SAM/BAM filtering
tools already in the Galaxy Tool Shed (to avoid reinventing the wheel).

As I mentioned on another thread, I'm working on a wrapper for the
"samtools bam2fq" command (targeting samtools 1.1 which fixed
some bugs in this tool and added new functionality compared to
samtools 0.1.19), see:

https://github.com/peterjc/pico_galaxy/tree/master/tools/samtools_bam2fq
https://toolshed.g2.bx.psu.edu/view/peterjc/samtools_bam2fq
https://testtoolshed.g2.bx.psu.edu/view/peterjc/samtools_bam2fq

One of my motivating use cases is a workflow like this:

1. Upload paired end FASTQ files.
2. Map them against a known contaminant genome giving a BAM file
(note I need the mapper to report unmapped reads in the output).
3. Filter the BAM to get unmapped reads, plus reads whose partner is
unmapped (conversely, remove reads where both partners are mapped).
4. Convert the filtered BAM back into FASTQ (with samtools bam2fq).
5. Proceed with analysis (e.g. de novo assembly).

Assuming I have understood "samtools view", this filtering step
has to be multiple parts:

This would get the unmapped reads
$ samtools view -f 0x4 ...

This would get reads with an unmapped partner:
$ samtools view -f 0x8 ...

However this would only get unmapped reads with an unmapped partner:
$ samtools view -f 0x12 ...

i.e. samtools view allows logical AND, not logical OR, when combining
flag filters.

So, I believe using samtools directly, a two stage filter is needed followed
by a merge (and sort), taking care not to duplicate reads, perhaps:

$ samtools view -f 4 ... > unmapped.bam
$ samtools view -f 8 -F 4 ... > mapped_with_partner_unmapped.bam
$ samtools merge unmapped.bam mapped_with_partner_unmapped.bam > ...

That could be repeated within Galaxy but is surprisingly complicated
with multiple steps in the history - so I do not want to go that route.

Have I overlooked a simple ToolShed solution using samtools?

As far as I could tell, the only other option on the current Tool Shed
is the Sambamba Filter tool (using "unmapped or mate_is_unmapped"),
which has a very capable looking filter system:
https://toolshed.g2.bx.psu.edu/view/lomereiter/sambamba_filter

@Artem - have you explored updating your tool_dependencies.xml
to download your pre-compiled binaries by default? That would
make deployment far easier, since D compilers are still rare, and
would mean we can see the test results on the Tool Shed :)
Please ask if you'd like advice on Tool Shed packaging.

Thanks,

Peter
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
| Threaded
Open this post in threaded view
|

Re: Can existing SAM/BAM filter tools give me mapped/unmapped pairs?

Artem Tarasov
Hi Peter,

I've uploaded updated tool_dependencies.xml to the Tool Shed, hopefully it works (I didn't bother installing Galaxy on local machine...)


--
Artem

On Tue, Nov 4, 2014 at 5:44 PM, Peter Cock <[hidden email]> wrote:
Hi all,

I'm looking for a little advice on the pre-existing SAM/BAM filtering
tools already in the Galaxy Tool Shed (to avoid reinventing the wheel).

As I mentioned on another thread, I'm working on a wrapper for the
"samtools bam2fq" command (targeting samtools 1.1 which fixed
some bugs in this tool and added new functionality compared to
samtools 0.1.19), see:

https://github.com/peterjc/pico_galaxy/tree/master/tools/samtools_bam2fq
https://toolshed.g2.bx.psu.edu/view/peterjc/samtools_bam2fq
https://testtoolshed.g2.bx.psu.edu/view/peterjc/samtools_bam2fq

One of my motivating use cases is a workflow like this:

1. Upload paired end FASTQ files.
2. Map them against a known contaminant genome giving a BAM file
(note I need the mapper to report unmapped reads in the output).
3. Filter the BAM to get unmapped reads, plus reads whose partner is
unmapped (conversely, remove reads where both partners are mapped).
4. Convert the filtered BAM back into FASTQ (with samtools bam2fq).
5. Proceed with analysis (e.g. de novo assembly).

Assuming I have understood "samtools view", this filtering step
has to be multiple parts:

This would get the unmapped reads
$ samtools view -f 0x4 ...

This would get reads with an unmapped partner:
$ samtools view -f 0x8 ...

However this would only get unmapped reads with an unmapped partner:
$ samtools view -f 0x12 ...

i.e. samtools view allows logical AND, not logical OR, when combining
flag filters.

So, I believe using samtools directly, a two stage filter is needed followed
by a merge (and sort), taking care not to duplicate reads, perhaps:

$ samtools view -f 4 ... > unmapped.bam
$ samtools view -f 8 -F 4 ... > mapped_with_partner_unmapped.bam
$ samtools merge unmapped.bam mapped_with_partner_unmapped.bam > ...

That could be repeated within Galaxy but is surprisingly complicated
with multiple steps in the history - so I do not want to go that route.

Have I overlooked a simple ToolShed solution using samtools?

As far as I could tell, the only other option on the current Tool Shed
is the Sambamba Filter tool (using "unmapped or mate_is_unmapped"),
which has a very capable looking filter system:
https://toolshed.g2.bx.psu.edu/view/lomereiter/sambamba_filter

@Artem - have you explored updating your tool_dependencies.xml
to download your pre-compiled binaries by default? That would
make deployment far easier, since D compilers are still rare, and
would mean we can see the test results on the Tool Shed :)
Please ask if you'd like advice on Tool Shed packaging.

Thanks,

Peter


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
| Threaded
Open this post in threaded view
|

Re: Can existing SAM/BAM filter tools give me mapped/unmapped pairs?

Peter Cock
In reply to this post by Peter Cock
On Tue, Nov 4, 2014 at 2:44 PM, Peter Cock <[hidden email]> wrote:

> Hi all,
>
> I'm looking for a little advice on the pre-existing SAM/BAM filtering
> tools already in the Galaxy Tool Shed (to avoid reinventing the wheel).
>
> As I mentioned on another thread, I'm working on a wrapper for the
> "samtools bam2fq" command (targeting samtools 1.1 which fixed
> some bugs in this tool and added new functionality compared to
> samtools 0.1.19), see:
>
> https://github.com/peterjc/pico_galaxy/tree/master/tools/samtools_bam2fq
> https://toolshed.g2.bx.psu.edu/view/peterjc/samtools_bam2fq
> https://testtoolshed.g2.bx.psu.edu/view/peterjc/samtools_bam2fq


Going off topic, but I just hit a problem here:
https://github.com/samtools/samtools/issues/313

Depending on if the reads have a QUAL value or not, samtool bam2fq
will produce either FASTA or FASTQ output - and will happily give
a mixture in one file. I know Heng Li has a parser that will take this
kind of input, but Galaxy likes to have well defined file formats.

I may have to "fix" samtools, perhaps by adding a strict FASTQ
output mode?

Peter
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/