bam split and gatk calling

classic Classic list List threaded Threaded
7 messages Options
| Threaded
Open this post in threaded view
|

bam split and gatk calling

Roberto Alonso CIPF
Hello,

I have been working in the Galaxy parallelization module and I would like to ask you some questions that I have about how to face one problem.
I have done one pull request about splitting bams: 

Regarding this, I think it is useful but it could be more while accessing somehow the interval. I better explain it with an example:
If I define a simple tool like this, with the parallelism tag "actived":

<tool id="gatk" name="call with gatk">
  <description>gatk</description>
  <parallelism method="multi" split_mode="by_interval" split_size="100000000" merge_outputs="output" split_inputs="input" ></parallelism>

  <command>
## by_rname      
ln -s $input input.bam;
samtools index input.bam;
UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam -o $output -L REGION ;

  </command>
  <inputs>
    <param format="bam" name="input" type="data" label="bam"/>
  </inputs>
  <outputs>
      <data format="vcf" name="output" />
  </outputs>

  <help>
  bwa
  </help>
</tool>

The region is based on the field split_size, it is better explained in the PR.
How does the code from the PR work? It goes through the bam file and does something like "samtools view REGION -o bam_splitted.bam", so then GATK does the calling for this small bam, but what is the problem? As you know, in the software GATK if you don't pass the region as an argument in the command line it goes through all the genome, so it is very slow. So, what would you recommend to me to be able to pass this information to GATK? I was thinking to create, at the same time the bam is splitted, a file region.bed and use it in the tool definition xml, so the command would be like this:
  <command>
...
UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam -o $output -L region.bed;
</command>

This solution does not convince me too much because it is a bit intrusive in the tool definition and also because you have to trust that the region.bed file exists.
Do you have any opinion, suggestion...?

Thanks a lot!

Best regards


--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: [hidden email]

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
| Threaded
Open this post in threaded view
|

Re: bam split and gatk calling

Peter Cock
Hi Roberto,

Given the way BAM indexing works, I see no reason to actually
split the BAM file at all - it seems like wasted disk IO.

Instead, can you split a BED file into sub-regions? This way
each child GATK job would look at the full BAM file but only for
a small region described in the split BED region file?

Peter


On Wed, May 6, 2015 at 11:19 AM, Roberto Alonso CIPF <[hidden email]> wrote:

> Hello,
>
> I have been working in the Galaxy parallelization module and I would like to
> ask you some questions that I have about how to face one problem.
> I have done one pull request about splitting bams:
> https://github.com/galaxyproject/galaxy/pull/184
>
> Regarding this, I think it is useful but it could be more while accessing
> somehow the interval. I better explain it with an example:
> If I define a simple tool like this, with the parallelism tag "actived":
>
> <tool id="gatk" name="call with gatk">
>   <description>gatk</description>
>   <parallelism method="multi" split_mode="by_interval"
> split_size="100000000" merge_outputs="output" split_inputs="input"
>></parallelism>
>
>   <command>
> ## by_rname
> ln -s $input input.bam;
> samtools index input.bam;
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L REGION ;
>
>   </command>
>   <inputs>
>     <param format="bam" name="input" type="data" label="bam"/>
>   </inputs>
>   <outputs>
>       <data format="vcf" name="output" />
>   </outputs>
>
>   <help>
>   bwa
>   </help>
> </tool>
>
> The region is based on the field split_size, it is better explained in the
> PR.
> How does the code from the PR work? It goes through the bam file and does
> something like "samtools view REGION -o bam_splitted.bam", so then GATK does
> the calling for this small bam, but what is the problem? As you know, in the
> software GATK if you don't pass the region as an argument in the command
> line it goes through all the genome, so it is very slow. So, what would you
> recommend to me to be able to pass this information to GATK? I was thinking
> to create, at the same time the bam is splitted, a file region.bed and use
> it in the tool definition xml, so the command would be like this:
>   <command>
> ...
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L region.bed;
> </command>
>
> This solution does not convince me too much because it is a bit intrusive in
> the tool definition and also because you have to trust that the region.bed
> file exists.
> Do you have any opinion, suggestion...?
>
> Thanks a lot!
>
> Best regards
>
>
> --
> Roberto Alonso
> Functional Genomics Unit
> Bioinformatics and Genomics Department
> Prince Felipe Research Center (CIPF)
> C./Eduardo Primo Yúfera (Científic), nº 3
> (junto Oceanografico)
> 46012 Valencia, Spain
> Tel: +34 963289680 Ext. 1021
> Fax: +34 963289574
> E-Mail: [hidden email]
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
| Threaded
Open this post in threaded view
|

Re: bam split and gatk calling

Roberto Alonso CIPF
Hello,

I agree, what you say fits perfectly for GATK, but as I wanted to create a more generic code I did it this way (also because I am a newbie in the galaxy code and I didn't know so well how to implement this ). What about a tool that doesn't accept a region, just a bam? Maybe we can put another parameter in the parallelism tag that force to split the bam. 
Mostly, just to create a bed file would be better, right? 
What do you think?

Regards

On 6 May 2015 at 12:23, Peter Cock <[hidden email]> wrote:
Hi Roberto,

Given the way BAM indexing works, I see no reason to actually
split the BAM file at all - it seems like wasted disk IO.

Instead, can you split a BED file into sub-regions? This way
each child GATK job would look at the full BAM file but only for
a small region described in the split BED region file?

Peter


On Wed, May 6, 2015 at 11:19 AM, Roberto Alonso CIPF <[hidden email]> wrote:
> Hello,
>
> I have been working in the Galaxy parallelization module and I would like to
> ask you some questions that I have about how to face one problem.
> I have done one pull request about splitting bams:
> https://github.com/galaxyproject/galaxy/pull/184
>
> Regarding this, I think it is useful but it could be more while accessing
> somehow the interval. I better explain it with an example:
> If I define a simple tool like this, with the parallelism tag "actived":
>
> <tool id="gatk" name="call with gatk">
>   <description>gatk</description>
>   <parallelism method="multi" split_mode="by_interval"
> split_size="100000000" merge_outputs="output" split_inputs="input"
>></parallelism>
>
>   <command>
> ## by_rname
> ln -s $input input.bam;
> samtools index input.bam;
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L REGION ;
>
>   </command>
>   <inputs>
>     <param format="bam" name="input" type="data" label="bam"/>
>   </inputs>
>   <outputs>
>       <data format="vcf" name="output" />
>   </outputs>
>
>   <help>
>   bwa
>   </help>
> </tool>
>
> The region is based on the field split_size, it is better explained in the
> PR.
> How does the code from the PR work? It goes through the bam file and does
> something like "samtools view REGION -o bam_splitted.bam", so then GATK does
> the calling for this small bam, but what is the problem? As you know, in the
> software GATK if you don't pass the region as an argument in the command
> line it goes through all the genome, so it is very slow. So, what would you
> recommend to me to be able to pass this information to GATK? I was thinking
> to create, at the same time the bam is splitted, a file region.bed and use
> it in the tool definition xml, so the command would be like this:
>   <command>
> ...
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L region.bed;
> </command>
>
> This solution does not convince me too much because it is a bit intrusive in
> the tool definition and also because you have to trust that the region.bed
> file exists.
> Do you have any opinion, suggestion...?
>
> Thanks a lot!
>
> Best regards
>
>
> --
> Roberto Alonso
> Functional Genomics Unit
> Bioinformatics and Genomics Department
> Prince Felipe Research Center (CIPF)
> C./Eduardo Primo Yúfera (Científic), nº 3
> (junto Oceanografico)
> 46012 Valencia, Spain
> Tel: <a href="tel:%2B34%20963289680%20Ext.%201021" value="+34963289680">+34 963289680 Ext. 1021
> Fax: <a href="tel:%2B34%20963289574" value="+34963289574">+34 963289574
> E-Mail: [hidden email]
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/



--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: [hidden email]

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
| Threaded
Open this post in threaded view
|

Re: bam split and gatk calling

Peter Cock
On Wed, May 6, 2015 at 11:33 AM, Roberto Alonso CIPF <[hidden email]> wrote:

> Hello,
>
> I agree, what you say fits perfectly for GATK, but as I wanted to create a
> more generic code I did it this way (also because I am a newbie in the
> galaxy code and I didn't know so well how to implement this ). What about a
> tool that doesn't accept a region, just a bam? Maybe we can put another
> parameter in the parallelism tag that force to split the bam.
> Mostly, just to create a bed file would be better, right?
> What do you think?
>
> Regards

Maybe you're right - BAM splitting might be useful for some tools
(any examples?), even though BED splitting is a much more elegant
solution.

Peter
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
| Threaded
Open this post in threaded view
|

Re: bam split and gatk calling

Roberto Alonso CIPF
I agree, I prefer your solution, I will focus on that solution, thanks!
Although there is some software more or less used in the community such Delly https://github.com/tobiasrausch/delly and Breakdancer http://gmt.genome.wustl.edu/packages/breakdancer/documentation.html, that doesn't use bed files, the only way to parallelize their execution is through smaller bams

Regards


On 6 May 2015 at 15:00, Peter Cock <[hidden email]> wrote:
On Wed, May 6, 2015 at 11:33 AM, Roberto Alonso CIPF <[hidden email]> wrote:
> Hello,
>
> I agree, what you say fits perfectly for GATK, but as I wanted to create a
> more generic code I did it this way (also because I am a newbie in the
> galaxy code and I didn't know so well how to implement this ). What about a
> tool that doesn't accept a region, just a bam? Maybe we can put another
> parameter in the parallelism tag that force to split the bam.
> Mostly, just to create a bed file would be better, right?
> What do you think?
>
> Regards

Maybe you're right - BAM splitting might be useful for some tools
(any examples?), even though BED splitting is a much more elegant
solution.

Peter



--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: [hidden email]

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
| Threaded
Open this post in threaded view
|

Re: bam split and gatk calling

Joshua Udall
In reply to this post by Peter Cock
Use subBam from the BamBam package. Written in C.

subBam -g targets.bed sorted.bam -o sorted.subset.bam -m 0


On Wed, May 6, 2015 at 4:23 AM, Peter Cock <[hidden email]> wrote:
Hi Roberto,

Given the way BAM indexing works, I see no reason to actually
split the BAM file at all - it seems like wasted disk IO.

Instead, can you split a BED file into sub-regions? This way
each child GATK job would look at the full BAM file but only for
a small region described in the split BED region file?

Peter


On Wed, May 6, 2015 at 11:19 AM, Roberto Alonso CIPF <[hidden email]> wrote:
> Hello,
>
> I have been working in the Galaxy parallelization module and I would like to
> ask you some questions that I have about how to face one problem.
> I have done one pull request about splitting bams:
> https://github.com/galaxyproject/galaxy/pull/184
>
> Regarding this, I think it is useful but it could be more while accessing
> somehow the interval. I better explain it with an example:
> If I define a simple tool like this, with the parallelism tag "actived":
>
> <tool id="gatk" name="call with gatk">
>   <description>gatk</description>
>   <parallelism method="multi" split_mode="by_interval"
> split_size="100000000" merge_outputs="output" split_inputs="input"
>></parallelism>
>
>   <command>
> ## by_rname
> ln -s $input input.bam;
> samtools index input.bam;
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L REGION ;
>
>   </command>
>   <inputs>
>     <param format="bam" name="input" type="data" label="bam"/>
>   </inputs>
>   <outputs>
>       <data format="vcf" name="output" />
>   </outputs>
>
>   <help>
>   bwa
>   </help>
> </tool>
>
> The region is based on the field split_size, it is better explained in the
> PR.
> How does the code from the PR work? It goes through the bam file and does
> something like "samtools view REGION -o bam_splitted.bam", so then GATK does
> the calling for this small bam, but what is the problem? As you know, in the
> software GATK if you don't pass the region as an argument in the command
> line it goes through all the genome, so it is very slow. So, what would you
> recommend to me to be able to pass this information to GATK? I was thinking
> to create, at the same time the bam is splitted, a file region.bed and use
> it in the tool definition xml, so the command would be like this:
>   <command>
> ...
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L region.bed;
> </command>
>
> This solution does not convince me too much because it is a bit intrusive in
> the tool definition and also because you have to trust that the region.bed
> file exists.
> Do you have any opinion, suggestion...?
>
> Thanks a lot!
>
> Best regards
>
>
> --
> Roberto Alonso
> Functional Genomics Unit
> Bioinformatics and Genomics Department
> Prince Felipe Research Center (CIPF)
> C./Eduardo Primo Yúfera (Científic), nº 3
> (junto Oceanografico)
> 46012 Valencia, Spain
> Tel: <a href="tel:%2B34%20963289680%20Ext.%201021" value="+34963289680">+34 963289680 Ext. 1021
> Fax: <a href="tel:%2B34%20963289574" value="+34963289574">+34 963289574
> E-Mail: [hidden email]
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/



--
Joshua Udall (5133 LSB)
Brigham Young University
701 E. University Parkway
Plant and Wildlife Science Depart.
Provo, UT 84602

Office: 801-422-9307

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
| Threaded
Open this post in threaded view
|

Re: bam split and gatk calling

Roberto Alonso CIPF
Indeed, I have rewritten the code with the Peter suggestions and I was thinking to update the PR with this code

On 6 May 2015 at 17:45, Joshua Udall <[hidden email]> wrote:
Use subBam from the BamBam package. Written in C.

subBam -g targets.bed sorted.bam -o sorted.subset.bam -m 0


On Wed, May 6, 2015 at 4:23 AM, Peter Cock <[hidden email]> wrote:
Hi Roberto,

Given the way BAM indexing works, I see no reason to actually
split the BAM file at all - it seems like wasted disk IO.

Instead, can you split a BED file into sub-regions? This way
each child GATK job would look at the full BAM file but only for
a small region described in the split BED region file?

Peter


On Wed, May 6, 2015 at 11:19 AM, Roberto Alonso CIPF <[hidden email]> wrote:
> Hello,
>
> I have been working in the Galaxy parallelization module and I would like to
> ask you some questions that I have about how to face one problem.
> I have done one pull request about splitting bams:
> https://github.com/galaxyproject/galaxy/pull/184
>
> Regarding this, I think it is useful but it could be more while accessing
> somehow the interval. I better explain it with an example:
> If I define a simple tool like this, with the parallelism tag "actived":
>
> <tool id="gatk" name="call with gatk">
>   <description>gatk</description>
>   <parallelism method="multi" split_mode="by_interval"
> split_size="100000000" merge_outputs="output" split_inputs="input"
>></parallelism>
>
>   <command>
> ## by_rname
> ln -s $input input.bam;
> samtools index input.bam;
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L REGION ;
>
>   </command>
>   <inputs>
>     <param format="bam" name="input" type="data" label="bam"/>
>   </inputs>
>   <outputs>
>       <data format="vcf" name="output" />
>   </outputs>
>
>   <help>
>   bwa
>   </help>
> </tool>
>
> The region is based on the field split_size, it is better explained in the
> PR.
> How does the code from the PR work? It goes through the bam file and does
> something like "samtools view REGION -o bam_splitted.bam", so then GATK does
> the calling for this small bam, but what is the problem? As you know, in the
> software GATK if you don't pass the region as an argument in the command
> line it goes through all the genome, so it is very slow. So, what would you
> recommend to me to be able to pass this information to GATK? I was thinking
> to create, at the same time the bam is splitted, a file region.bed and use
> it in the tool definition xml, so the command would be like this:
>   <command>
> ...
> UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/hg19_ucsc.fa -I input.bam
> -o $output -L region.bed;
> </command>
>
> This solution does not convince me too much because it is a bit intrusive in
> the tool definition and also because you have to trust that the region.bed
> file exists.
> Do you have any opinion, suggestion...?
>
> Thanks a lot!
>
> Best regards
>
>
> --
> Roberto Alonso
> Functional Genomics Unit
> Bioinformatics and Genomics Department
> Prince Felipe Research Center (CIPF)
> C./Eduardo Primo Yúfera (Científic), nº 3
> (junto Oceanografico)
> 46012 Valencia, Spain
> Tel: <a href="tel:%2B34%20963289680%20Ext.%201021" value="+34963289680" target="_blank">+34 963289680 Ext. 1021
> Fax: <a href="tel:%2B34%20963289574" value="+34963289574" target="_blank">+34 963289574
> E-Mail: [hidden email]
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/



--
Joshua Udall (5133 LSB)
Brigham Young University
701 E. University Parkway
Plant and Wildlife Science Depart.
Provo, UT 84602

Office: 801-422-9307



--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: [hidden email]

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/