question about splitting bams


question about splitting bams

Roberto Alonso CIPF
Hello,
I am trying to write some code to make it possible to parallelize some tasks. I ran into the problem of splitting a BAM into several parts, and for this I created this simple tool:

<parallelism method="multi" split_size="3" split_mode="number_of_parts" merge_outputs="output" split_inputs="input" ></parallelism>

  <command>
    java -jar /home/ralonso/software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /home/ralonso/BiB/Galaxy/data/chr_19_hg19_ucsc.fa -I $input -o $output 2&gt; /dev/null;

  </command>
  <inputs>
    <param format="bam" name="input" type="data" label="bam"/>
  </inputs>
  <outputs>
      <data format="vcf" name="output" />
  </outputs>

But I have one problem: when I execute the tool it goes through this part of the code (I am working on the dev branch):

$galaxy/lib/galaxy/jobs/splitters/multi.py, line 75:

    for input in parent_job.input_datasets:
        if input.name in split_inputs:
            this_input_files = job_wrapper.get_input_dataset_fnames(input.dataset)
            if len(this_input_files) > 1:
                log_error = "The input '%s' is composed of multiple files - splitting is not allowed" % str(input.name)
                log.error(log_error)
                raise Exception(log_error)
            input_datasets.append(input.dataset)

So it is raising the exception because len(this_input_files) == 2, concretely: ['/home/ralonso/galaxy/database/files/000/dataset_171.dat', '/home/ralonso/galaxy/database/files/_metadata_files/000/metadata_13.dat']. I guess that:
dataset_171.dat is the BAM file.
metadata_13.dat is the BAI index.

So Galaxy can't move on, and I don't know what the best solution would be. Maybe change the if to count only non-metadata files? I think I should use both files in order to create the BAM sub-files, but that would go inside the Bam class, in the binary.py file.
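
For illustration, a minimal standalone sketch (using pysam) of the kind of splitting and re-indexing such a method would have to do; the function name, the round-robin strategy and the output layout are assumptions for the example, not the actual Galaxy datatype API:

    # Sketch only: split a BAM into n parts and regenerate a .bai for each part.
    import os
    import pysam

    def split_bam(input_path, n_parts, output_dir):
        """Distribute reads round-robin into n_parts BAM files and index each one."""
        if not os.path.isdir(output_dir):
            os.makedirs(output_dir)
        out_paths = [os.path.join(output_dir, "part_%d.bam" % i) for i in range(n_parts)]
        with pysam.AlignmentFile(input_path, "rb") as infile:
            # template=infile copies the header (reference names, lengths, sort order)
            writers = [pysam.AlignmentFile(p, "wb", template=infile) for p in out_paths]
            for i, read in enumerate(infile):
                writers[i % n_parts].write(read)
        for writer in writers:
            writer.close()
        for path in out_paths:
            # Each part is a subsequence of a coordinate-sorted file, so it is still
            # sorted and can be indexed directly; this recreates a .bai per part.
            pysam.index(path)
        return out_paths

A real splitter would more likely split by genomic region so that downstream variant calling behaves sensibly, but the mechanics of writing sub-BAMs and regenerating their indexes are the same.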
Could you please guide me before I mess things up?

Thanks so much
--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: [hidden email]


Re: question about splitting bams

Roberto Alonso CIPF
Regarding my previous mail, I found this thread.

Is it still alive? Is it perhaps the best choice for doing the BAM parallelization?

Thanks!
Best regards



Re: question about splitting bams

John Chilton-4
In reply to this post by Roberto Alonso CIPF
I am a pragmatist: I have no problem just splitting the inputs and skipping the metadata files. I would just convert the error into a log.info() and warn that the tool cannot use metadata files. If the underlying tool needs an index, it can recreate it instead, I think. One can imagine a more intricate solution that would recreate the metadata files as needed, but that would be a lot of work.

Does that make sense?
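
For concreteness, a sketch of what that change to the loop in multi.py quoted earlier might look like; only the error handling changes, and the wording of the log message is illustrative:

    for input in parent_job.input_datasets:
        if input.name in split_inputs:
            this_input_files = job_wrapper.get_input_dataset_fnames(input.dataset)
            if len(this_input_files) > 1:
                # Instead of refusing to split, warn that only the primary file is
                # split; extra (metadata) files such as the .bai are skipped and the
                # tool has to recreate them if it needs them.
                log.info(
                    "The input '%s' is composed of multiple files; splitting the "
                    "primary file only and ignoring its metadata files." % str(input.name)
                )
            input_datasets.append(input.dataset)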

About BB PR 175: there were some recent discussions about that approach; I would check out http://dev.list.galaxyproject.org/Parallelism-using-metadata-td4666763.html.

-John


Re: question about splitting bams

Roberto Alonso CIPF
Hello,

I have been reading those different threads and I have some doubts that maybe you can clarify. In the thread you said: "ability to write tools that split up a single input into a collection". I think this is aimed at workflows, but in any case, could we use this to split BAMs?
Another point is the following:

"These common pipelines where you split up a BAM files, run a bunch of 
steps, and then merge the results will be executable in the near 
future (though 15.03 won't have workflow editor support for it - I 
will try to get to this by the following release - and you can 
manually build up workflows to do this - "

Since I was trying to write something that does exactly this, and I guess there is already someone working on it, do you think it is worth continuing with this or should I just switch to something else? Do you know the road-map for this feature?

Thanks a lot,

Roberto




--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: [hidden email]

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/