1) by_rname -> splits the BAM into files based on the chromosome
2) by_interval -> splits the BAM into files based on a defined length in bp, and does so across the entire genome present in the BAM file
3) by_read -> splits the BAM into files based on the number of reads encountered (if there are multiple files, all other files use the same interval as the first)
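To make the second option a bit more concrete, here is a minimal sketch of how fixed-length windows could be laid out across every reference sequence. The function name make_intervals and the reference lengths are made up for illustration; this is not code from Galaxy.

```python
def make_intervals(ref_lengths, window):
    """Yield (rname, start, end) windows, 1-based and inclusive, per reference."""
    if window <= 0:
        raise ValueError("window must be positive")
    for rname, length in ref_lengths:
        start = 1
        while start <= length:
            # The last window of a reference is truncated at its end.
            end = min(start + window - 1, length)
            yield (rname, start, end)
            start = end + 1


# Hypothetical reference lengths, only for illustration.
refs = [("chr1", 2500), ("chr2", 1000)]
intervals = list(make_intervals(refs, 1000))
```

Each window never crosses a chromosome boundary, so every part can be extracted with a single region query.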
Since I think the first one is the easiest, I started with it.
First, I had to change line 82 of lib/galaxy/jobs/splitters/multi.py, because the "if" there stopped the code from continuing (I discussed this in another thread).
Next, I had to make some changes in lib/galaxy/datatypes/binary.py. I added a "split" method that creates the JSON for the script extract_dataset_parts.sh. In that method I call samtools view -H to get the chromosome names;
I have since realized that I can get that information directly from the metadata in the input_datasets variable, so I will change this in the future.
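For reference, the chromosome names sit in the @SQ lines of the BAM header (SN is the name, LN the length). A minimal sketch of pulling them out of the header text, assuming the text has already been obtained from samtools; the function name parse_sq_lines and the sample header are made up for this example:

```python
def parse_sq_lines(header_text):
    """Return [(name, length), ...] from the @SQ lines of a SAM header."""
    refs = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        if "SN" in fields and "LN" in fields:
            refs.append((fields["SN"], int(fields["LN"])))
    return refs


# Made-up header text standing in for real samtools output.
sample_header = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:2500\n"
    "@SQ\tSN:chr2\tLN:1000\n"
    "@PG\tID:bwa\tPN:bwa\n"
)
references = parse_sq_lines(sample_header)
```

Reading the same names from the dataset metadata would avoid the external call altogether.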
This works correctly and writes the JSON as expected. Now I have to write the code that scripts/extract_dataset_part.py (run from extract_dataset_parts.sh) calls: "cls.process_split_file(data)".
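As a rough illustration of the JSON-writing step, here is a sketch that writes one descriptor per chromosome. The keys (class_name, input_name, output_name, args), the directory and file naming, and the helper name write_split_files are all assumptions for this example, not necessarily what Galaxy expects.

```python
import json
import os
import tempfile


def write_split_files(input_path, rnames, out_dir):
    """Write one JSON descriptor per reference name and return the paths."""
    paths = []
    for index, rname in enumerate(rnames):
        part_dir = os.path.join(out_dir, "task_%d" % index)  # assumed layout
        os.makedirs(part_dir, exist_ok=True)
        split_data = {
            "class_name": "galaxy.datatypes.binary.Bam",  # assumed dotted name
            "input_name": input_path,
            "output_name": os.path.join(part_dir, os.path.basename(input_path)),
            "args": {"rname": rname},  # assumed argument name
        }
        json_path = os.path.join(part_dir, "split_info.json")  # assumed name
        with open(json_path, "w") as handle:
            json.dump(split_data, handle)
        paths.append(json_path)
    return paths


with tempfile.TemporaryDirectory() as work_dir:
    written = write_split_files("input.bam", ["chr1", "chr2"], work_dir)
    with open(written[1]) as handle:
        second = json.load(handle)
```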
So I created the following two functions in the Bam class:
This is called in the context of an external process launched by a Task (possibly not on the Galaxy machine)
to create the input files for the Task. The parameters:
data - a dict containing the contents of the split file
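A minimal sketch of what the by_rname case might boil down to: take the dict read from the split file and build a samtools command that extracts a single chromosome. The dict keys and the function name build_extract_command are assumptions for this example. Region queries with samtools view need an indexed BAM, and the command is only built here, not run.

```python
def build_extract_command(data):
    """Build the samtools command that copies one reference out of a BAM."""
    args = data["args"]
    return [
        "samtools", "view",
        "-b",                       # write BAM output
        "-o", data["output_name"],  # output file for this part
        data["input_name"],         # input BAM (must be indexed)
        args["rname"],              # region: the whole reference sequence
    ]


# Hypothetical split-file contents.
example = {
    "input_name": "input.bam",
    "output_name": "part/chr1.bam",
    "args": {"rname": "chr1"},
}
command = build_extract_command(example)
```

The resulting list could then be handed to subprocess.check_call on the machine that runs the task.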