splitting bams bai

classic Classic list List threaded Threaded
1 message Options
| Threaded
Open this post in threaded view
|

splitting bams bai

Roberto Alonso CIPF
Hello,

I am writing some code in Galaxy for splitting bams. Up to know, I am following the ideas that Marco Albuquerque proposed in this thread http://dev.list.galaxyproject.org/Parallelism-using-metadata-td4666763.html
He proposed three ways of splitting:

1) by_rname -> splits the bam into files based on the chromosome
2) by_interval -> splits the bam into files based on  a defined bp length, and does so across the entire genome present in the BAM file
3) by_read -> splits the bam into files based on the number of reads encountered (if multiple files, all other files match the interval as the first)

As I think the easiest is the first one, I started with this option. 

First of all , I had to change line 82 of lib/galaxy/jobs/splitters/multi.py as that "if" didn't let the code to continue (I talked this in another thread).

Next, I had to do some changes in lib/galaxy/datatypes/binary.py. I added a method "split" that creates the json for the script extract_dataset_parts.sh. Here, in the next code you can see that I call samtools -H in order to get the chromosome names,
now I realized that I can get that information directly from metadatas in the input_datasets variable, so in the future I will change this.

def split(cls, input_datasets, subdir_generator_function, split_params):

        # 1) by_rname -> splits the bam into files based on the chromosome
        # 2) by_interval -> splits the bam into files based on  a defined bp length, and does so across the entire genome present in the BAM file
        # 3) by_read -> splits the bam into files based on the number of reads encountered (if multiple files, all other files match the interval as the first)

        if split_params is None:
            return
        if len(input_datasets) > 1:
            raise Exception("BAM file splitting does not support multiple files")
        input_file = input_datasets[0].file_name


        if 'split_mode' not in split_params:
            raise Exception('Tool does not define a split mode')
        elif split_params['split_mode'] == 'by_rname':
            log.debug("Attemping to split BAM file %s by chromosome", input_file)

            #First get bam header
            params = ["samtools", "view", "-H", input_file]
            output = subprocess.Popen( params, stderr=subprocess.PIPE, stdout=subprocess.PIPE ).communicate()[0]
            output = output.split("\n")
            chrList = []
   #Get chromosome list from the header.
            for line in output:
                fields = line.strip().split("\t")
                if fields[0].startswith("@SQ") and fields[1].startswith("SN:"):
                    chrList.append(fields[1].split("SN:")[1])

   # Write json for extract_dataset_parts
            for chrName in chrList:
                try:
                    part_dir = subdir_generator_function()
                    base_name = os.path.basename(input_file)
                    part_path = os.path.join(part_dir, base_name)
                    split_data = dict(class_name='%s.%s' % (cls.__module__, cls.__name__),
                                      output_name=part_path,
                                      input_name=input_file,
                                      args=dict(chromosome=chrName))
                    f = open(os.path.join(part_dir, 'split_info_%s.json' % base_name), 'w')
                    json.dump(split_data, f)
                    f.close()

                except Exception, e:
                    log.error("Error: " + str(e))
                    raise
        else:
            raise Exception('Unsupported split mode %s' % split_params['split_mode'])
    split = classmethod(split)

Well, this works correctly and writes the json as expected. Now I have to write the code that is called by scripts/extract_dataset_part.py (inside of extract_dataset_parts.sh) "cls.process_split_file(data)".
So I created the next two function in the Bam class:

def process_split_file(data):
        """
        This is called in the context of an external process launched by a Task (possibly not on the Galaxy machine)
        to create the input files for the Task. The parameters:
        data - a dict containing the contents of the split file
        """
        args = data['args']
        input_name = data['input_name']
        output_name = data['output_name']
        chromosome = args['chromosome']

        commands = Bam.get_split_commands_chromosome(input_name, output_name, chromosome)
        for cmd in commands:
            if 0 != os.system(cmd):
                raise Exception("Executing '%s' failed" % cmd)
        return True
process_split_file = staticmethod(process_split_file)

def get_split_commands_chromosome(input_name, output_name, chromosome):
        params = ["samtools view -h " + input_name + " " + output_name + " " + chromosome]
        return params
get_split_commands_chromosome = staticmethod(get_split_commands_chromosome)

Which is my problem? That I need the .bai related with that dataset "input_name", I think is in a metadata table, but I don't know how to get it, Could you please help me with this?
In any case, if you find that I am doing something wrong, or you have a better idea of implementing this, please don't hesitate to contact me.

Best regards



--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: [hidden email]

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/