Galaxy for Natural Language Processing

10 messages

Galaxy for Natural Language Processing

Keith Suderman
Hello,

Our group is investigating using Galaxy as a workflow engine for NLP (Natural Language Processing) tasks. I have installed a local Galaxy instance and created wrappers for the services we use, and so far everything is working great. I do have a few questions, and they all fall under the “Advanced Topics” section as defined at the end of the tutorial for creating a Histogram [1].

1. parameter validation:

Many of our tools rely on additions made by previous tools in the workflow; for example, a tool that identifies noun phrases may require that the input has been run through a part of speech (POS) tagger, the POS tagger may require that the input has been run through a tokenizer, etc.  Our tools can do this validation, I am just looking for a way to wire this into Galaxy so a user can only connect tools in the workflow editor if this validation passes.

I have been looking at the code for lib/galaxy/tools/parameters/validation.py and I don’t see anything that I can (easily) bend to our use case.  What I was hoping for was something like:

        <input type="data" format="our_custom_format" name="input">
                <validator type="dataset_custom">
                        <command interpreter="bash">validate.sh $input</command>
                        <!-- OR -->
                        <tool file="custom_validator.xml"/>
                </validator>
        </input>

I also see the tantalizing sentence, "Custom code execution at various time points of the workflow that allows a fine grained control over the execution process", but I can't find any examples of how this is done.


2. data repositories / data collections

I need to be able to process collections of data pulled from remote servers. I have been looking at DataManagers and data collections in Galaxy, but everything seems to assume the data is local to the server, or can be copied/uploaded to the server.  For practical and legal reasons beyond my pay grade this is not a solution in our case.  For example, an organization may be willing to allow our users to query their service for documents, run the documents through our workflow, and store the intermediate results; but they will not allow us to copy their data to another server verbatim.  There are possibilities for me to cache data, but the general use case is that I have to call an external service to fetch documents one at a time and then run the same workflow on each document.

Any suggestions on how to accomplish this in Galaxy?  I can do single documents, I just need to expand this to include collections of documents.  A typical workflow might look something like:

a) Query Tool -> Server, find all documents that contain the word “cheese”
b) Server -> Here is the list of document IDs [ id1, id2, …, idn ]
c) WorkFlow -> for each id in the list do
        c1) Download document
        c2 ) Work work work work…
        c3) Persist output

I can do all of the above except the most important bit: iterating…


3. format conversion:  

Is it possible for Galaxy to automatically convert between formats when designing a workflow?  I see the <change_format/> tag, but that seems to change the output format of a tool based on the input (or some other condition) in the same tool; I need to be able to change the format based on the input requirements of the next tool in the workflow. For example, if Tool A produces format X, Tool B requires format Y,  and a converter from X to Y has been defined in the datatypes_conf.xml; I would like for Galaxy to implicitly insert the converter from X to Y when I drag the output noodle from Tool A to Tool B in the designer.  Is this possible?


4. OAuth 2.0 / OpenID Connect:

I need to be able to fetch documents from data providers that require an OAuth 2.0 access token. Currently, I use a separate service to go through the OAuth authentication/authorization process and then have the user copy/paste their access token into a text field in Galaxy.   Is there a way to perform the OAuth authentication dance required by the remote service inside Galaxy itself?   I’ve looked at the Trello site for Galaxy and see that both OAuth 2.0 and OpenID Connect are on the radar, hopefully this use case is being considered as well.


I’m sure to have more questions after working through some visualization examples, but this should keep me busy for now.

Sincerely,
Keith Suderman

REFERENCES

1. https://wiki.galaxyproject.org/Admin/Tools/AddingTools

------------------------------
Research Associate
Department of Computer Science
Vassar College
Poughkeepsie, NY


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: Galaxy for Natural Language Processing

Björn Grüning-3
Hi Keith,

On 13.04.2015 at 20:12, Keith Suderman wrote:
> Hello,
>
> Our group is investigating using Galaxy as a workflow engine for NLP (Natural Language Processing) tasks.

Good choice! :)

> I have installed a local Galaxy instance and created wrappers for the services we use and so far everything is working great.  I do have a few questions and they all fall under the “Advanced Topics”  section as defined at the end of the tutorial for creating a Histogram [1]
>
> 1. parameter validation:
>
> Many of our tools rely on additions made by previous tools in the workflow; for example, a tool that identifies noun phrases may require that the input has been run through a part of speech (POS) tagger, the POS tagger may require that the input has been run through a tokenizer, etc.  Our tools can do this validation, I am just looking for a way to wire this into Galaxy so a user can only connect tools in the workflow editor if this validation passes.
>
> I have been looking looking at the code for lib/galaxy/tools/parameters/validation.py and I don’t see anything that I can (easily) bend to our use case.  What I was hoping for was something like:
>
> <input type="data" format="our_custom_format" name="input">
> <validator type="dataset_custom">
> <command interpreter="bash">validate.sh $input</command>
> <!-- OR -->
> <tool file="custom_validator.xml"/>
> </validator>
> </input>

Can you tell me how your tool detects whether the input was processed
before by another tool? Metadata detection? Is it a different file type?
If so, you can define your own datatype(s): one of your tools can then
only consume the file types of another tool's output, and so on.
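Sketched concretely in plain Python (illustrative only: the class names below are invented and are not real Galaxy datatypes), the idea is that each processing stage gets its own datatype, with later stages subclassing earlier ones, so "has been tokenized" is encoded in the type itself and the editor's format check becomes an ordering check:

```python
# Illustrative sketch only -- invented class names, not Galaxy datatypes.
# Each NLP stage has its own datatype; later stages subclass earlier ones.
class Document:
    file_ext = "lif"

class TokenizedDoc(Document):
    file_ext = "lif.tok"

class PosTaggedDoc(TokenizedDoc):
    file_ext = "lif.pos"

def can_connect(output_type, input_type):
    """A connection is valid when the produced type IS-A required type."""
    return issubclass(output_type, input_type)
```

A noun-phrase tool declaring its input as PosTaggedDoc would then accept the POS tagger's output but reject a merely tokenized document.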

> I also see the tantalizing sentence, "Custom code execution at various time points of the workflow that allows a fine grained control over the execution process", but I can't find any examples of how this is done.

This is currently only accessible via the API, I think. The backend is
currently under testing and it will be integrated during the next
releases, afaik.

> 2. data repositories / data collections
>
> I need to be able to process collections of data pulled from remote servers.
> I have been looking at DataManagers and data collections in Galaxy, but everything seems to assume
> the data is local to the server, or can be copied/uploaded to the server.

This is the preferred way, for reproducibility reasons.

> For practical and legal reasons beyond my pay grade this is not a solution in our case.  

> For example, an organization may be willing to allow our users to query their service for documents,
> run the documents through our workflow, and store the intermediate results; but they will not allow us to copy their data to another server verbatim.  

> There are possibilities for me to cache data, but the general use case is that I have to call an external service to fetch documents one at a
> time and then run the same workflow on each document.

I don't think you can use data collections for this :(
What you can do is simply write a tool which takes a URL, consumes
the document, and does the first step. But in the end this result/document
will be stored on the server.

> Any suggestions on how to accomplish this in Galaxy?  I can do single documents, I just need to expand this to include collections of documents.  
> A typical workflow might look something like:
>
> a) Query Tool -> Server, find all documents that contain the word “cheese”
> b) Server -> Here is the list of document IDs [ id1, id2, …, idn ]
> c) WorkFlow -> for each id in the list do
> c1) Download document
> c2 ) Work work work work…
> c3) Persist output
>
> I can do all of the above except the most important bit; iterating…

Oh yes, this is simple. Just create one workflow that deals with one ID;
you can then run this workflow on multiple ids.

>
> 3. format conversion:  
>
> Is it possible for Galaxy to automatically convert between formats when designing a workflow?  
> I see the <change_format/> tag, but that seems to change the output format of a tool based on the input
> (or some other condition) in the same tool; I need to be able to change the format based on the input requirements
> of the next tool in the workflow. For example, if Tool A produces format X, Tool B requires format Y,  and a converter
> from X to Y has been defined in the datatypes_conf.xml; I would like for Galaxy to implicitly insert the converter
> from X to Y when I drag the output noodle from Tool A to Tool B in the designer.  Is this possible?

Oh yes this is supported out of the box!
See here for a small documentation:
https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox#supported-filetypes

Here is an example of how you can write your own datatypes:

https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox/datatypes

>
> 4. OAuth 2.0 / OpenID Connect:
>
> I need to be able to fetch documents from data providers that require an OAuth 2.0 access token. Currently, I use a separate service to go
> through the OAuth authentication/authorization process and then have the user copy/paste their access token into a text field in Galaxy.  
> Is there a way to perform the OAuth authentication dance required by the remote service inside Galaxy itself?  

I don't think so, but maybe someone else has an idea here.

> I’ve looked at the Trello site for Galaxy and see that both OAuth 2.0 and OpenID Connect are on the radar, hopefully this use case is being considered as well.
>
> I’m sure to have more questions after working through some visualization examples, but this should keep me busy for now.

Hope you are busy now :)
Cheers and keep us up to date!
Bjoern


Re: Galaxy for Natural Language Processing

Keith Suderman
Hi Björn,

Thanks for the quick reply.

It might help to take a look at what I have so far. Our Galaxy server is running at http://grid.anc.org:8000  The front page contains a very basic tutorial (and I use the term loosely) with instructions for creating a simple workflow with our tools.  The "simple tutorial" assumes a good understanding of Galaxy.

It is also likely worthwhile pointing out that all of our "tools" invoke SOAP/REST web services running on other servers; nothing is "local".  I tend to use the terms "tool" and "service" interchangeably since all of our tools are web services.


> Can you tell me how your tool detects whether the input was processed before by another tool?
> Metadata detection? Is it a different file type? If so, you can define your own datatype(s):
> one of your tools can then only consume the file types of another tool's output, and so on.

We have two issues:
1. the format of the data
2. what the data contains.

Format:  This is easy; I have converters defined in the datatypes_conf.xml file and the workflow editor won't let me connect tools if the output/input formats don't match. 

Data Contents:  Just because Tool A produces format X and Tool B accepts format X does not mean the tools can be connected; I need to do deeper validation than simple format matching.  For example, if you drop several of the GATE tools into a workflow you can connect them in any order, as they all accept and produce the same format; however, the tools must be connected in the correct order to produce something other than error messages.

There are two ways we do validation right now: each document contains metadata that describes the contents, and each service (tool) can produce metadata describing its input and output.  So we can check if two tools can be connected (the output of the first satisfies the input requirements of the second) and each tool checks at runtime that the input contains the necessary data.
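At its core, the "can these two tools be connected" check described above reduces to a set comparison (a sketch only; the metadata shapes are invented for illustration):

```python
# Sketch only: the metadata shapes here are invented for illustration.
def satisfies(produced_annotations, required_annotations):
    """True when the annotations the upstream tool's output carries
    cover everything the downstream tool declares it needs."""
    return set(required_annotations) <= set(produced_annotations)
```

For example, a POS tagger producing {"token", "pos"} satisfies a chunker requiring {"pos"}, while a bare tokenizer producing {"token"} does not.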


>> A typical workflow might look something like:
>>
>> a) Query Tool -> Server, find all documents that contain the word “cheese”
>> b) Server -> Here is the list of document IDs [ id1, id2, …, idn ]
>> c) WorkFlow -> for each id in the list do
>> c1) Download document
>> c2) Work work work work…
>> c3) Persist output
>>
>> I can do all of the above except the most important bit: iterating…
>
> Oh yes, this is simple. Just create one workflow that deals with one ID.
> This workflow you can run on multiple ids.

That is the question; how do I run the same workflow on multiple ids?  The server may return hundreds or thousands of id values so running the workflow manually for each id is not really an option.


> Oh yes this is supported out of the box!
> See here for a small documentation:
> https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox#supported-filetypes
>
> Here is an example of how you can write your own datatypes:
>
> https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox/datatypes

I feel like I must be missing the obvious.  Here is the relevant section of my datatypes_conf.xml (you can see the full file at https://github.com/oanc/Galaxy/blob/master/config/datatypes_conf.xml)

<datatype extension="lif" type="galaxy.datatypes.text:Json" display_in_upload="True">
    <converter file="convert.json2gate_2.0.0.xml" target_datatype="gate"/>
</datatype>
<datatype extension="gate" type="galaxy.datatypes.xml:GenericXml" mimetype="application/xml" display_in_upload="true">
    <converter file="convert.gate2json_2.0.0.xml" target_datatype="lif"/>
</datatype>
 
Is there anything I need to do beyond defining the datatypes for implicit conversions to take place?

Thanks,
Keith Suderman







Re: Galaxy for Natural Language Processing

Björn Grüning-3
Hi Keith,


> Hi Björn,
>
> Thanks for the quick reply.
>
> It might help to take a look at what I have so far. Our Galaxy server is running at http://grid.anc.org:8000 
> The front page contains a very basic tutorial (and I use the term loosely) with instructions for creating a simple workflow with our tools.  
> The "simple tutorial" assumes a good understanding of Galaxy.

Nice! Can I convince you to put these tools into the TS if everything is
working? Maybe with a German beer? ;)

> It is also likely worthwhile pointing out that all of our "tools" invoke SOAP/REST web services running on other servers; nothing is "local".  
> I tend to use the terms "tool" and "service" interchangeably since all of our tools are web services.
>
>
>> Can you tell me how your tool detects if it was processed before by an
>> other tool? Metadata detection? Is this is different file type? If so
>> you can define your own datatype(s). One of your tools can only consume
>> the file types of an other tools output and so on.
>
> We have two issues:
> 1. the format of the data
> 2. what the data contains.
>
> Format:  This is easy; I have converters defined in the datatypes_conf.xml file and the workflow editor won't let me connect tools if the output/input formats don't match.

True, but you can add a post-processing action where you can change the
data type.

> Data Contents:  Just because Tool A produces format X and Tool B accepts format X does not mean the tools can be connected,
> I need to do deeper validation than simple format matching.  For example, if you drop several of the GATE tools into a
> workflow you can connect them in any order as they all accept and produce the same format,  however the tools must be
> connected in the correct order to produce something other than error messages.

> There are two ways we do validation right now: each document contains metadata that describes the contents, and each service (tool)
> can produce metadata describing its input and output.  So we can check if two tools can be connected (the output of the first satisfies
> the input requirements of the second) and each tool checks at runtime that the input contains the necessary data.

You can probably use datatype metadata for such a use case.
Please have a look at the sqlite datatype and gemini, which uses it:

https://github.com/galaxyproject/tools-iuc/blob/master/tools/gemini/gemini_macros.xml#L127

You can filter sqlite datasets according to attached metadata, for
example the version. You need to define your own metadata; during file
creation the metadata will be 'calculated' and set. Your next tool can
then filter according to this metadata.
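Following the gemini macro linked above, a tool input might filter on such metadata roughly like this (a sketch only: the metadata name `annotation_types` is invented, and the exact element syntax should be checked against the gemini example):

```xml
<!-- Sketch modeled on the gemini/sqlite approach; metadata name invented -->
<param name="input" type="data" format="lif" label="Annotated document">
    <options options_filter_attribute="metadata.annotation_types">
        <filter type="add_value" value="pos" />
    </options>
</param>
```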

>
>>> A typical workflow might look something like:
>>>
>>> a) Query Tool -> Server, find all documents that contain the word “cheese”
>>> b) Server -> Here is the list of document IDs [ id1, id2, …, idn ]
>>> c) WorkFlow -> for each id in the list do
>>> c1) Download document
>>> c2 ) Work work work work…
>>> c3) Persist output
>>>
>>> I can do all of the above except the most important bit; iterating…
>>
>> Oh yes, this is simple. Just create one workflow that deals with one ID.
>> This workflow you can run on multiple ids.
>
> That is the question; how do I run the same workflow on multiple ids?  
> The server may return hundreds or thousands of id values so running the workflow manually for each id is not really an option.

You have all the ids in one file?
You could split this file into hundreds of files, one per id, collect
everything in a data collection (a new feature in Galaxy), and run
your workflow over this collection.

Should work now, and will work for sure in the near future, no promises ;)
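The split step sketched above could be a tiny tool script along these lines (illustrative only; how the resulting files get wired into a list collection depends on the Galaxy version):

```python
# Illustrative sketch: split a whitespace/newline-delimited ID list into
# (filename, content) pairs, one file per document id -- the shape a
# Galaxy list collection would be built from.
def split_ids(id_list_text):
    ids = id_list_text.split()
    return [("%s.txt" % doc_id, doc_id + "\n") for doc_id in ids]
```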

>> Oh yes this is supported out of the box!
>> See here for a small documentation:
>> https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox#supported-filetypes
>>
>> Here is a example of how you can write your own datatypes:
>>
>> https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox/datatypes
>
> I feel like I must be missing the obvious.  Here is the relevant section of my datatypes_conf.xml (you can see the full file at https://github.com/oanc/Galaxy/blob/master/config/datatypes_conf.xml)
>
> <datatype extension="lif" type="galaxy.datatypes.text:Json" display_in_upload="True">
> <converter file="convert.json2gate_2.0.0.xml" target_datatype="gate"/>
> </datatype>
> <datatype extension="gate" type="galaxy.datatypes.xml:GenericXml" mimetype="application/xml" display_in_upload="true">
> <converter file="convert.gate2json_2.0.0.xml" target_datatype="lif"/>
> </datatype>
>  
> Is there anything I need to do beyond defining the datatypes for implicit conversions to take place?

I guess you need to place your converters under
https://github.com/oanc/Galaxy/tree/master/lib/galaxy/datatypes/converters/

and drop the 'convert.' prefix in your datatypes_conf.xml, at least if
you are not using the TS.
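Applied to the snippet quoted above, that advice would give roughly the following (assuming the renamed converter XML files are placed in that converters directory):

```xml
<datatype extension="lif" type="galaxy.datatypes.text:Json" display_in_upload="true">
    <converter file="json2gate_2.0.0.xml" target_datatype="gate"/>
</datatype>
<datatype extension="gate" type="galaxy.datatypes.xml:GenericXml" mimetype="application/xml" display_in_upload="true">
    <converter file="gate2json_2.0.0.xml" target_datatype="lif"/>
</datatype>
```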

Hope this helps you a little bit more,
Bjoern


Re: Galaxy for Natural Language Processing

Keith Suderman
On Apr 15, 2015, at 5:35 AM, Björn Grüning <[hidden email]> wrote:

> Nice! Can I convince you to put these tools into the TS if everything is
> working? Maybe with a German beer? ;)

Do you have a beer preference?  

We will definitely share our tools when they are stable, either on the public tool shed or we will set up our own (public) tool shed.  However, there are two issues I will have to address before releasing anything:

1. All of our tools use a custom command interpreter; anyone wanting to install our tools would have to install our interpreter, make it available on the $PATH, and restart Galaxy.  That is why I was thinking we should set up our own tool shed; installing from our tool shed would imply that you have our interpreter installed.

2. Tests: almost all of our tools call the same script that in turn calls some remote web service.  The web services go through their own unit/integration tests before they are deployed so all the Galaxy tests really do is use a lot of bandwidth to check if the server has an internet connection.


> True, but you can add a post-processing action, where you can change the
> data type.

Ok, I found it. I can add the post-processing step to change the data type, and that allows me to connect the tools in the workflow editor, but the converter is not invoked when I run the workflow.  The editor also allows me to select output formats that have no converters defined, so either I am still missing something or the workflow editor does not do what I want.  I can convert formats through the "Edit attributes" menu, so Galaxy knows about my converters and how to invoke them, just not in the workflow editor.


> You can filter sqlite according to some attached metadata, for example
> the version. You need to define your own metadata, and during file
> creation the metadata will be 'calculated' and set. You can then filter
> in your next tool according to this metadata.

Do you have more pointers to tools that use the attached metadata?  In particular tools that set metadata that is consumed by subsequent tools.


>>>> A typical workflow might look something like:
>>>>
>>>> a) Query Tool -> Server, find all documents that contain the word “cheese”
>>>> b) Server -> Here is the list of document IDs [ id1, id2, …, idn ]
>>>> c) WorkFlow -> for each id in the list do
>>>> c1) Download document
>>>> c2 ) Work work work work…
>>>> c3) Persist output
>>>>
>>>> I can do all of the above except the most important bit; iterating…
>>>
>>> Oh yes, this is simple. Just create one workflow that deals with one ID.
>>> This workflow you can run on multiple ids.
>>
>> That is the question; how do I run the same workflow on multiple ids?  
>> The server may return hundreds or thousands of id values so running the workflow manually for each id is not really an option.
>
> You have all ids in one file?
> You could split this file into 100 of files per id and collect
> everything in a data collection (this is new feature in Galaxy) and run
> your workflow over this collection.
>
> Should work, will work for sure in the near future, no promises ;)

The ids are all in one file, but it is easy enough to split them into 100s (1000s) of files if needed.  

Do you have pointers to any documentation on data collections?  My searches haven't turned up much besides tantalizing references [1], and my experiments trying to return a data collection from a tool have been unsuccessful.

I have tried the instructions for tools that generate multiple output files [2], but the Galaxy UI starts having problems when I add more than a few hundred history items; sorry I didn't take better notes, but the problem (timeouts) seems to be with jQuery updating the CSS styles in the history panel.  It also makes the UI a bit unwieldy with that many history items.

I have also been trying John Chilton's blend4j and managed to populate a data library, and this is almost what I want, but I would like a tool that can be included in a workflow, as loading data from the library may not necessarily be the first step.  I have no problem calling the Galaxy API from my tools, except that between the bioinformatics lingo and Python (I'm a Java programmer) it's slow going.

Cheers,
Keith

REFERENCES

1. https://wiki.galaxyproject.org/Learn/API#Collections
2. https://wiki.galaxyproject.org/Admin/Tools/Multiple%20Output%20Files#Number_of_Output_datasets_cannot_be_determined_until_tool_run

>
>>> Oh yes this is supported out of the box!
>>> See here for a small documentation:
>>> https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox#supported-filetypes
>>>
>>> Here is a example of how you can write your own datatypes:
>>>
>>> https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox/datatypes
>>
>> I feel like I must be missing the obvious.  Here is the relevant section of my datatypes_conf.xml (you can see the full file at https://github.com/oanc/Galaxy/blob/master/config/datatypes_conf.xml)
>>
>> <datatype extension="lif" type="galaxy.datatypes.text:Json" display_in_upload="True">
>> <converter file="convert.json2gate_2.0.0.xml" target_datatype="gate"/>
>> </datatype>
>> <datatype extension="gate" type="galaxy.datatypes.xml:GenericXml" mimetype="application/xml" display_in_upload="true">
>> <converter file="convert.gate2json_2.0.0.xml" target_datatype="lif"/>
>> </datatype>
>>
>> Is there anything I need to do beyond defining the datatypes for implicit conversions to take place?
>
> I guess you need to place your converters under
> https://github.com/oanc/Galaxy/tree/master/lib/galaxy/datatypes/converters/
>
> And get rid of 'convert.' in your datatypes_conf.xml at least if you are
> not using the TS.
>
> Hope this helps you a little bit more,
> Bjoern
>
>> Thanks,
>> Keith Suderman
>>
>>>
>>>>
>>>> 4. OAuth 2.0 / OpenID Connect:
>>>>
>>>> I need to be able to fetch documents from data providers that require an OAuth 2.0 access token. Currently, I use a separate service to go
>>>> through the OAuth authentication/authorization process and then have the user copy/paste their access token into a text field in Galaxy.  
>>>> Is there a way to perform the OAuth authentication dance required by the remote service inside Galaxy itself?  
>>>
>>> I don't think so, but maybe someone else has an idea here.
>>>
>>>> I’ve looked at the Trello site for Galaxy and see that both OAuth 2.0 and OpenID Connect are on the radar, hopefully this use case is being considered as well.
>>>>
>>>> I’m sure to have more questions after working through some visualization examples, but this should keep me busy for now.
>>>
>>> Hope you are busy now :)
>>> Cheers and keep us up to date!
>>> Bjoern
>>>
>>>> Sincerely,
>>>> Keith Suderman
>>>>
>>>> REFERENCES
>>>>
>>>> 1. https://wiki.galaxyproject.org/Admin/Tools/AddingTools
>>>>
>>>> ------------------------------
>>>> Research Associate
>>>> Department of Computer Science
>>>> Vassar College
>>>> Poughkeepsie, NY
>>>>
>>>>
>>>> ___________________________________________________________
>>>> Please keep all replies on the list by using "reply all"
>>>> in your mail client.  To manage your subscriptions to this
>>>> and other Galaxy lists, please use the interface at:
>>>> https://lists.galaxyproject.org/
>>>>
>>>> To search Galaxy mailing lists use the unified search at:
>>>> http://galaxyproject.org/search/mailinglists/
>>>>

Re: Galaxy for Natural Language Processing

Björn Grüning-3
Hi Keith,

Am 21.04.2015 um 21:21 schrieb Keith Suderman:
> On Apr 15, 2015, at 5:35 AM, Björn Grüning <[hidden email]> wrote:
>
>> Nice! Can I convince you to put these tools into the TS if everything is
>> working? Maybe with a German beer? ;)
>
> Do you have a beer preference?  

Outing: I'm one of the rare Germans that do not drink alcohol ;)

> We will definitely share our tools when they are stable, either on the public tool shed or we will set up our own (public) tool shed.  

Nice!

> However, there are two issues I will have to address before releasing anything:
>
> 1. All of our tools use a custom command interpreter; anyone wanting to install our tools would have to install our interpreter, make it available on the $PATH and restart Galaxy.  That is why I was thinking we should set up our own tool shed; to install from our tool shed assumes you have our interpreter installed.

This can be done via the ToolShed. I assume your custom command
interpreter is no different from python or perl as an interpreter? Fine:
we have python and perl in the TS, so we can have "your custom
interpreter" as well. In the end your tools will depend on it and
voila! No extra Tool Shed needed.
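
For example, a tool_dependencies.xml along these lines could declare the
interpreter as an installable package (the package name, version and URL
below are purely illustrative - a sketch, not a tested recipe):

  <?xml version="1.0"?>
  <tool_dependency>
      <package name="my_nlp_interpreter" version="1.0">
          <install version="1.0">
              <actions>
                  <action type="download_by_url">http://example.org/my_nlp_interpreter-1.0.tar.gz</action>
                  <action type="set_environment">
                      <environment_variable name="PATH" action="prepend_to">$INSTALL_DIR/bin</environment_variable>
                  </action>
              </actions>
          </install>
      </package>
  </tool_dependency>

Tools would then reference it via a <requirement type="package"> entry.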

> 2. Tests: almost all of our tools call the same script that in turn calls some remote web service.  
> The web services go through their own unit/integration tests before they are deployed so all the Galaxy tests really do is use a
> lot of bandwidth to check if the server has an internet connection.

Is this a question? Do you want to improve this? Even if you depend on
such a webservice, these tests will still check the syntax and the
outputs of your tool.

>
>> True, but you can add a post-processing action, where you can change the
>> data type.
>
> Ok, I found it. I can add the post-processing step to change the data type and that allows me to connect the tools in the workflow editor,
> but the converter is not being invoked when I run the workflow.

This sounds like a bug. Or you are mixing conversion with "relabelling".
Galaxy has two concepts. One is really converting files ... creating a
new dataset. The other is changing metadata ... telling Galaxy you know
better what file type it is, without creating a new dataset. You are
probably looking for conversion tools. These need to be reachable via
the normal toolbox, if I'm not mistaken.

> The editor also allows me to select output formats that have no converters defined,
> so either I am still missing something or the workflow editor does not do what I want.  I can convert formats through the "Edit attributes" menu,
> so Galaxy knows about my converters and how to invoke them, just not in the workflow editor.

Ok, I think I understood. Not sure if this is the best way but put your
converters into the toolbox.

>> You can filter sqlite according to some attached metadata, for example
>> the version. You need to define your own metadata, and during file
>> creation the metadata will be 'calculated' and set. You can then filter
>> in your next tool according to this metadata.
>
> Do you have more pointers to tools that use the attached metadata?  In particular tools that set metadata that is consumed by subsequent tools.

The sqlite datatype should be a good example. Keep in mind, we cannot
set metadata from inside a tool. Imho this is not possible yet, but it
is a commonly requested feature. However, you can "calculate" such
metadata inside your datatype definition, and it will be set implicitly
after your tool has finished.
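
The "calculate" step is just ordinary code over the file contents. As a
standalone illustration (NOT the real Galaxy datatype API - in Galaxy this
logic would live in the datatype's set_meta() hook, and the LIF
"views"/"annotations" layout used here is an assumption):

```python
import json

def calculate_token_metadata(lif_text):
    """Decide whether a LIF document already contains token annotations.

    In Galaxy, this kind of computation would sit inside the datatype's
    set_meta() hook so the result is attached to the dataset as metadata;
    here it is a plain function for illustration only.
    """
    doc = json.loads(lif_text)
    has_tokens = any(
        ann.get("@type", "").endswith("Token")
        for view in doc.get("views", [])
        for ann in view.get("annotations", [])
    )
    return {"tokens": has_tokens}

sample = '{"views": [{"annotations": [{"@type": "vocab/Token"}]}]}'
print(calculate_token_metadata(sample))  # {'tokens': True}
```

A downstream tool (the POS tagger) would then only accept datasets whose
"tokens" metadata is set.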

>>>>> A typical workflow might look something like:
>>>>>
>>>>> a) Query Tool -> Server, find all documents that contain the word “cheese”
>>>>> b) Server -> Here is the list of document IDs [ id1, id2, …, idn ]
>>>>> c) WorkFlow -> for each id in the list do
>>>>> c1) Download document
>>>>> c2 ) Work work work work…
>>>>> c3) Persist output
>>>>>
>>>>> I can do all of the above except the most important bit; iterating…
>>>>
>>>> Oh yes, this is simple. Just create one workflow that deals with one ID.
>>>> This workflow you can run on multiple ids.
>>>
>>> That is the question; how do I run the same workflow on multiple ids?  
>>> The server may return hundreds or thousands of id values so running the workflow manually for each id is not really an option.
>>
>> You have all ids in one file?
>> You could split this file into 100 of files per id and collect
>> everything in a data collection (this is new feature in Galaxy) and run
>> your workflow over this collection.
>>
>> Should work, will work for sure in the near future, no promises ;)
>
> The ids are all in one file, but it is easy enough to split them into 100s (1000s) of files if needed.  
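
That split step really is tiny - for instance (file layout illustrative):

```python
import os

def split_ids(id_file_text, out_dir):
    """Write one file per document id, so that each file can become one
    element of a Galaxy dataset collection. Returns the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for doc_id in id_file_text.split():
        path = os.path.join(out_dir, doc_id + ".txt")
        with open(path, "w") as fh:
            fh.write(doc_id + "\n")
        paths.append(path)
    return paths
```

Each file then becomes one element of the collection, and the
per-document workflow is mapped over the collection.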
>
> Do you have pointers to any documentation on data collections?  My searches haven't turned up much but tantalizing references [1],
> and my experiments trying to return a data collection from a tool have been unsuccessful.

https://wiki.galaxyproject.org/Histories?highlight=%28collection%29#Dataset_Collections

https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax ->
data_collection

And have a look at:
https://github.com/galaxyproject/galaxy/tree/dev/test/functional/tools
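
For a tool that produces a collection, the essential output pattern in
those test tools looks roughly like this (attribute names from memory,
so double-check against the test tools themselves):

  <outputs>
      <collection name="documents" type="list" label="Fetched documents">
          <discover_datasets pattern="__name__" directory="out" ext="lif"/>
      </collection>
  </outputs>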

> I have tried the instructions for tools that generate multiple output files [2],  but the Galaxy UI starts having problems when I add more than a few hundred history items;
> sorry I didn't make better notes, but the problem (timeouts) seem to be with JQuery updating the CSS styles in the history panel.  
> It also makes the UI a bit unwieldy with that many history items.

Are you running a recent Galaxy version?
Try to run the latest developer version; data collections are really new
and I hope they will shine even more with the next release.

> I have also been trying John Chiltons blend4j and managed to populate a data library, and this is almost what I want,
> but I would like a tool that can be included in a workflow as the data from the library may not necessarily be the first step.  
> I have no problem calling the Galaxy API from my tools, except that between the bioinformatics lingo and Python (I'm a Java programmer) it's slow going.

If at all possible you should avoid this, but as a last resort it is
probably an option.

Ciao,
Bjoern

> Cheers,
> Keith
>
> REFERENCES
>
> 1. https://wiki.galaxyproject.org/Learn/API#Collections
> 2. https://wiki.galaxyproject.org/Admin/Tools/Multiple%20Output%20Files#Number_of_Output_datasets_cannot_be_determined_until_tool_run
>
>>
>>>> Oh yes this is supported out of the box!
>>>> See here for a small documentation:
>>>> https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox#supported-filetypes
>>>>
>>>> Here is a example of how you can write your own datatypes:
>>>>
>>>> https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox/datatypes
>>>
>>> I feel like I must be missing the obvious.  Here is the relevant section of my datatypes_conf.xml (you can see the full file at https://github.com/oanc/Galaxy/blob/master/config/datatypes_conf.xml)
>>>
>>> <datatype extension="lif" type="galaxy.datatypes.text:Json" display_in_upload="True">
>>> <converter file="convert.json2gate_2.0.0.xml" target_datatype="gate"/>
>>> </datatype>
>>> <datatype extension="gate" type="galaxy.datatypes.xml:GenericXml" mimetype="application/xml" display_in_upload="true">
>>> <converter file="convert.gate2json_2.0.0.xml" target_datatype="lif"/>
>>> </datatype>
>>>
>>> Is there anything I need to do beyond defining the datatypes for implicit conversions to take place?
>>
>> I guess you need to place your converters under
>> https://github.com/oanc/Galaxy/tree/master/lib/galaxy/datatypes/converters/
>>
>> And get rid of 'convert.' in your datatypes_conf.xml at least if you are
>> not using the TS.
>>
>> Hope this helps you a little bit more,
>> Bjoern
>>
>>> Thanks,
>>> Keith Suderman
>>>

Re: Galaxy for Natural Language Processing

Keith Suderman
Hi Björn,

On Apr 22, 2015, at 8:00 AM, Björn Grüning <[hidden email]> wrote:

Do you have a beer preference?  

Outing: I'm one of the rare Germans that do not drink alcohol ;)

That must be awkward ;)


This can be done via the ToolShed. I assume your custom command
interpreter is not different than python or perl as interpreter? 

One difference is that my interpreter is a Java program. I likely should have mentioned that little detail... anyone wanting to install our tools would need my interpreter AND Java 1.7+ on their server.  Hopefully that is not an insurmountable problem. 

However, does the bioinformatics community really want a bunch of NLP tools in their tool shed?


The editor also allows me to select output formats that have no converters defined,
so either I am still missing something or the workflow editor does not do what I want.  I can convert formats through the "Edit attributes" menu,
so Galaxy knows about my converters and how to invoke them, just not in the workflow editor.

Ok, I think I understood. Not sure if this is the best way but put your
converters into the toolbox.

By the "toolbox" do you mean adding my converters to the tool_conf.xml file so they are available on the Tools menu?  I have done that and I can add the converters to a workflow manually. I was just hoping the workflow editor could detect when it could perform the conversion and insert the converters as needed; it seems this is not possible.


Do you have more pointers to tools that use the attached metadata?  In particular tools that set metadata that is consumed by subsequent tools.

The sqlite datatype should be a good example. Keep in mind, we can not
set metadata from inside a tool.
Imho this is not possible, yet, but a
common requested feature. But you can "calculate" such metadata inside
your datatype definition and set it implicitly after your tool is finished.

Setting the metadata in the tool wrapper is fine, and after grepping through some of the other wrappers I think I need something like:

  <!-- Output from a tokenizer -->
  <outputs>
    <data name="output" format="xml" label="Output">
        <actions>
            <action type="metadata" name="tokens">True</action>
        </actions>
    </data>
  </outputs>

  <!-- Input to a part of speech tagger -->
  <inputs>
    <param name="input" type="data" format="xml">
        <validator type="expression" message="Please run a tokenizer first.">metadata.tokens is not None</validator>
    </param>
  </inputs>

That is, the input validator simply checks if some value has been set in the metadata, and the output sets a value in the metadata.  The above does not work, but at least Galaxy stopped complaining about the tool XML with this.  However, the documentation for <option/> <filter/> and <action/> does not match up with what existing wrappers (in the dev branch) are doing so I am having problems with the exact syntax.


Do you have pointers to any documentation on data collections?  My searches haven't turned up much but tantalizing references [1],
and my experiments trying to return a data collection from a tool have been unsuccessful.

https://wiki.galaxyproject.org/Histories?highlight=%28collection%29#Dataset_Collections

https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax ->
data_collection

And have a look at:
https://github.com/galaxyproject/galaxy/tree/dev/test/functional/tools

Success!  I was running the code from master, so I suspect that was part of my problem.  

However, my browser is still complaining about long running scripts.

A script on this page may be busy, or it may have stopped responding. You can stop the script now, open the script in the debugger, or let the script continue.


I accidentally left visible="true" when creating the dataset collection and ended up with +1500 items in my history; the above message kept popping up while the workflow was running (at least until I selected "Don't show this again").  Deleting +1500 datasets from the history is also very slow, but that is a different issue. On the bright side, at least I had +1500 items in the history to delete.


I have also been trying John Chiltons blend4j and managed to populate a data library, and this is almost what I want,
but I would like a tool that can be included in a workflow as the data from the library may not necessarily be the first step.   
I have no problem calling the Galaxy API from my tools, except that between the bioinformatics lingo and Python (I'm a Java programmer) it's slow going.

If possible at all you should avoid this, but as last resort probably an
option.

Out of curiosity, what exactly should I avoid; making calls to the Galaxy REST/API from inside a tool, using blend4j, or populating a data library from inside a tool?  I can see myself doing all three in the near future.

Cheers,
Keith



Re: Galaxy for Natural Language Processing

Nicola Soranzo-2
Il 24.04.2015 03:18 Keith Suderman ha scritto:

>> This can be done via the ToolShed. I assume your custom command
>> interpreter is not different than python or perl as interpreter?
>
> One difference is that my interpreter is a Java program. I likely
> should
> have mentioned that little detail... anyone wanting to install our
> tools
> would need my interpreter AND Java 1.7+ on their server. Hopefully
> that
> is not an insurmountable problem.
> However, does the bioinformatics community really want a bunch of NLP
> tools in their tool shed?

Sorry to jump in here; I'd just like to confirm that we would appreciate
having your NLP tools in the ToolShed. That would help us demonstrate
that Galaxy is a useful platform in other scientific fields as well.

Cheers,
Nicola





Re: Galaxy for Natural Language Processing

Björn Grüning-3
In reply to this post by Keith Suderman

On 24.04.2015 03:18, Keith Suderman wrote:
Hi Björn,

On Apr 22, 2015, at 8:00 AM, Björn Grüning <[hidden email]> wrote:

Do you have a beer preference?  

Outing: I'm one of the rare Germans that do not drink alcohol ;)

That must be awkward ;)

:)


This can be done via the ToolShed. I assume your custom command
interpreter is not different than python or perl as interpreter? 

One difference is that my interpreter is a Java program. I likely should have mentioned that little detail... anyone wanting to install our tools would need my interpreter AND Java 1.7+ on their server.  Hopefully that is not an insurmountable problem.

In Galaxy we advise people to have a JRE available if they use the TS. Currently, the TS is not able to install Java; I was not able to compile Java on my own :(
https://wiki.galaxyproject.org/Admin/Config/ToolDependenciesList

So this is ok!

However, does the bioinformatics community really want a bunch of NLP tools in their tool shed?

Yes, Yes, Yes!

The editor also allows me to select output formats that have no converters defined,
so either I am still missing something or the workflow editor does not do what I want.  I can convert formats through the "Edit attributes" menu,
so Galaxy knows about my converters and how to invoke them, just not in the workflow editor.

Ok, I think I understood. Not sure if this is the best way but put your
converters into the toolbox.

By the "toolbox" do you mean adding my converters to the tool_conf.xml file so they are available on the Tools menu?  I have done that and I can add the converters to a workflow manually. I was just hoping the workflow editor could detect when it could perform the conversion and insert the converters as needed; it seems this is not possible.

Maybe someone else can jump in here; I do not see why this shouldn't be possible? Maybe it is just a UI issue?!

Do you have more pointers to tools that use the attached metadata?  In particular tools that set metadata that is consumed by subsequent tools.

The sqlite datatype should be a good example. Keep in mind, we can not
set metadata from inside a tool.
Imho this is not possible, yet, but a
common requested feature. But you can "calculate" such metadata inside
your datatype definition and set it implicitly after your tool is finished.

Setting the metadata in the tool wrapper is fine, and after grepping through some of the other wrappers I think I need something like:

  <!-- Output from a tokenizer -->
  <outputs>
    <data name="output" format="xml" label="Output">
        <actions>
            <action type="metadata" name="tokens">True</action>
        </actions>
    </data>
  </outputs>

  <!-- Input to a part of speech tagger -->
  <inputs>
    <param name="input" type="data" format="xml">
        <validator type="expression" message="Please run a tokenizer first.">metadata.tokens is not None</validator>
    </param>
  </inputs>

That is, the input validator simply checks if some value has been set in the metadata, and the output sets a value in the metadata.  The above does not work, but at least Galaxy stopped complaining about the tool XML with this.  However, the documentation for <option/> <filter/> and <action/> does not match up with what existing wrappers (in the dev branch) are doing so I am having problems with the exact syntax.

Can you try:         <action type="metadata" name="tokens" default="True"/>

You can also filter your inputs in the speech tagger:

 <options options_filter_attribute="metadata.tokens" > <filter type="add_value" value="True" /> </options>
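
Put together, the two wrapper fragments would then look roughly like
this (an untested sketch of the above suggestion):

  <!-- Output side: the tokenizer records that tokens are present -->
  <outputs>
      <data name="output" format="xml" label="Output">
          <actions>
              <action type="metadata" name="tokens" default="True"/>
          </actions>
      </data>
  </outputs>

  <!-- Input side: the POS tagger only offers tokenized datasets -->
  <inputs>
      <param name="input" type="data" format="xml">
          <options options_filter_attribute="metadata.tokens">
              <filter type="add_value" value="True"/>
          </options>
      </param>
  </inputs>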



Do you have pointers to any documentation on data collections?  My searches haven't turned up much but tantalizing references [1],
and my experiments trying to return a data collection from a tool have been unsuccessful.

https://wiki.galaxyproject.org/Histories?highlight=%28collection%29#Dataset_Collections

https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax ->
data_collection

And have a look at:
https://github.com/galaxyproject/galaxy/tree/dev/test/functional/tools

Success!  I was running the code from master, so I suspect that was part of my problem. 

Nice!

However, my browser is still complaining about long running scripts.

Can you put this in a different thread?

A script on this page may be busy, or it may have stopped responding. You can stop the script now, open the script in the debugger, or let the script continue.


I accidentally left visible="true" when creating the dataset collection and ended up with +1500 items in my history; the above message kept popping up while the workflow was running (at least until I selected "Don't show this again").  Deleting +1500 datasets from the history is also very slow, but that is a different issue. On the bright side, at least I had +1500 items in the history to delete.

+1500 different elements is a lot for a history; for usability we should try to use collections here. No one wants to deal with such an amount of history objects :)


I have also been trying John Chiltons blend4j and managed to populate a data library, and this is almost what I want,
but I would like a tool that can be included in a workflow as the data from the library may not necessarily be the first step.   
I have no problem calling the Galaxy API from my tools, except that between the bioinformatics lingo and Python (I'm a Java programmer) it's slow going.

If possible at all you should avoid this, but as last resort probably an
option.

Out of curiosity, what exactly should I avoid; making calls to the Galaxy REST/API from inside a tool, using blend4j, or populating a data library from inside a tool?  I can see myself doing all three in the near future.

* making calls to the Galaxy REST/API from inside a tool

Think big! Your tools will run in large cluster environments, under job schedulers and on cloud infrastructures. You don't know whether your job is allowed to connect to your Galaxy instance - security-wise. You also need to authenticate; more issues ...

Ciao,
Bjoern


Re: Galaxy for Natural Language Processing

Keith Suderman
Hi Björn and Nicola,

You can expect NLP tools on the ToolShed sometime in the (near?) future.

Our project [1] shares funding sources with Galaxy (NSF SI2) and the NSF loves this kind of cross-domain collaboration.  I also know of at least one other NLP project [2] that uses Galaxy that would likely share their tools as well.

I am going to break the rest of my reply into separate threads so hopefully they will get a little more visibility.

Cheers,
Keith

REFERENCES


On Apr 24, 2015, at 9:30 AM, Bjoern Gruening <[hidden email]> wrote:

However, does the bioinformatics community really want a bunch of NLP tools in their tool shed?

Yes, Yes, Yes!
By the "toolbox" do you mean adding my converters to the tool_conf.xml file so they are available on the Tools menu?  I have done that and I can add the converters to a workflow manually. I was just hoping the workflow editor could detect when it could perform the conversion and insert the converters as needed; it seems this is not possible.

Maybe someone else can jump in here; I do not see why this shouldn't be possible. Maybe this is just a UI issue?!

I am going to break this into its own thread so hopefully it gets more visibility.


Setting the metadata in the tool wrapper is fine, and after grepping through some of the other wrappers I think I need something like:

  <!-- Output from a tokenizer -->
  <outputs>
    <data name="output" format="xml" label="Output">
        <actions>
            <action type="metadata" name="tokens">True</action>
        </actions>
    </data>
  </outputs>

  <!-- Input to a part of speech tagger -->
  <inputs>
    <param name="input" type="data" format="xml">
        <validator type="expression" message="Please run a tokenizer first.">metadata.tokens is not None</validator>
    </param>
  </inputs>

That is, the input validator simply checks whether some value has been set in the metadata, and the output sets that value.  The above does not work, but at least Galaxy stopped complaining about the tool XML.  However, the documentation for <option/>, <filter/>, and <action/> does not match what existing wrappers (in the dev branch) are doing, so I am having problems with the exact syntax.

Can you try:         <action type="metadata" name="tokens" default="True"/>

You can also filter your inputs in the speech tagger:

 <options options_filter_attribute="metadata.tokens">
     <filter type="add_value" value="True"/>
 </options>
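
Folding the two suggestions above into the wrapper pair Keith sketched earlier would give something like the following. This is an untested sketch; in particular, whether `options_filter_attribute` filters on dataset metadata this way should be verified against the wrappers in the dev branch:

```xml
<!-- Tokenizer wrapper: record in the metadata that tokens are present -->
<outputs>
  <data name="output" format="xml" label="Output">
    <actions>
      <action type="metadata" name="tokens" default="True"/>
    </actions>
  </data>
</outputs>

<!-- POS tagger wrapper: only offer datasets whose metadata says so -->
<inputs>
  <param name="input" type="data" format="xml">
    <options options_filter_attribute="metadata.tokens">
      <filter type="add_value" value="True"/>
    </options>
  </param>
</inputs>
```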



Do you have pointers to any documentation on data collections?  My searches haven't turned up much beyond tantalizing references [1],
and my experiments trying to return a data collection from a tool have been unsuccessful.

https://wiki.galaxyproject.org/Histories?highlight=%28collection%29#Dataset_Collections

https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax ->
data_collection

And have a look at:
https://github.com/galaxyproject/galaxy/tree/dev/test/functional/tools
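
In the style of those test tools, a tool that returns a collection can declare a `<collection>` output whose elements are discovered after the job finishes. A minimal sketch (the `pattern` and `directory` values here are illustrative, not prescribed):

```xml
<outputs>
  <collection name="documents" type="list" label="Fetched documents">
    <!-- Collect every file written to the outputs/ directory as one
         list element, taking its name and extension from the filename. -->
    <discover_datasets pattern="__name_and_ext__" directory="outputs"/>
  </collection>
</outputs>
```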

Success!  I was running the code from master, so I suspect that was part of my problem. 

Nice!

However, my browser is still complaining about long running scripts.

Can you put this in a different thread?

A script on this page may be busy, or it may have stopped responding. You can stop the script now, open the script in the debugger, or let the script continue.


I accidentally left visible="true" when creating the dataset collection and ended up with +1500 items in my history; the above message kept popping up while the workflow was running (at least until I selected "Don't show this again").  Deleting +1500 datasets from the history is also very slow, but that is a different issue. On the bright side, at least I had +1500 items in the history to delete.

+1500 different elements is a lot for one history; for usability we should try to use collections here. No one wants to deal with such an amount of history objects :)


I have also been trying John Chilton's blend4j and managed to populate a data library, which is almost what I want,
but I would like a tool that can be included in a workflow, since fetching the data from the library may not necessarily be the first step.
I have no problem calling the Galaxy API from my tools, except that between the bioinformatics lingo and Python (I'm a Java programmer) it's slow going.

If at all possible you should avoid this, but as a last resort it is probably an option.

Out of curiosity, what exactly should I avoid; making calls to the Galaxy REST/API from inside a tool, using blend4j, or populating a data library from inside a tool?  I can see myself doing all three in the near future.

* making calls to the Galaxy REST/API from inside a tool

Think big! Your tools will run in large cluster environments, under job schedulers, and on cloud infrastructures. You don't know whether your job is even allowed to connect to your Galaxy instance, security-wise. You also need to authenticate, which brings more issues still.

Ciao,
Bjoern

Cheers,
Keith


Ciao,
Bjoern

Cheers,
Keith

REFERENCES

1. https://wiki.galaxyproject.org/Learn/API#Collections
2. https://wiki.galaxyproject.org/Admin/Tools/Multiple%20Output%20Files#Number_of_Output_datasets_cannot_be_determined_until_tool_run


Oh yes this is supported out of the box!
See here for some brief documentation:
https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox#supported-filetypes

Here is an example of how you can write your own datatypes:

https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox/datatypes

I feel like I must be missing the obvious.  Here is the relevant section of my datatypes_conf.xml (you can see the full file at https://github.com/oanc/Galaxy/blob/master/config/datatypes_conf.xml)

<datatype extension="lif" type="galaxy.datatypes.text:Json" display_in_upload="true">
    <converter file="convert.json2gate_2.0.0.xml" target_datatype="gate"/>
</datatype>
<datatype extension="gate" type="galaxy.datatypes.xml:GenericXml" mimetype="application/xml" display_in_upload="true">
    <converter file="convert.gate2json_2.0.0.xml" target_datatype="lif"/>
</datatype>

Is there anything I need to do beyond defining the datatypes for implicit conversions to take place?

I guess you need to place your converters under
https://github.com/oanc/Galaxy/tree/master/lib/galaxy/datatypes/converters/

And get rid of the 'convert.' prefix in your datatypes_conf.xml, at least if you are
not using the Tool Shed.
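
Applied to the snippet above, that would mean moving the converter XML files into lib/galaxy/datatypes/converters/ and registering them without the prefix, roughly like this (a sketch of the renamed registration, assuming the files are moved as suggested):

```xml
<datatype extension="lif" type="galaxy.datatypes.text:Json" display_in_upload="true">
    <converter file="json2gate_2.0.0.xml" target_datatype="gate"/>
</datatype>
<datatype extension="gate" type="galaxy.datatypes.xml:GenericXml" mimetype="application/xml" display_in_upload="true">
    <converter file="gate2json_2.0.0.xml" target_datatype="lif"/>
</datatype>
```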

Hope this helps you a little bit more,
Bjoern

Thanks,
Keith Suderman



4. OAuth 2.0 / OpenID Connect:

I need to be able to fetch documents from data providers that require an OAuth 2.0 access token. Currently, I use a separate service to go
through the OAuth authentication/authorization process and then have the user copy/paste their access token into a text field in Galaxy.   
Is there a way to perform the OAuth authentication dance required by the remote service inside Galaxy itself?   

I don't think so, but maybe someone else has an idea here.

I’ve looked at the Trello site for Galaxy and see that both OAuth 2.0 and OpenID Connect are on the radar, hopefully this use case is being considered as well.
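
For what it's worth, the out-of-band dance described above boils down to the standard OAuth 2.0 authorization-code flow. A minimal sketch of the two pieces the separate service has to build (every endpoint, client ID, and function name here is a placeholder, not a Galaxy or provider API; the resulting token would still be pasted into the tool's text field):

```python
from urllib.parse import urlencode


def authorization_url(auth_endpoint, client_id, redirect_uri, scope, state):
    """Build the URL the user visits to grant access (RFC 6749, section 4.1.1)."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,  # opaque value echoed back to guard against CSRF
    }
    return auth_endpoint + "?" + urlencode(params)


def token_request_payload(code, client_id, client_secret, redirect_uri):
    """Form fields POSTed to the token endpoint to exchange the code for a token."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,
        "redirect_uri": redirect_uri,
    }
```

The separate service redirects the user to `authorization_url(...)`, receives the `code` on its callback, POSTs `token_request_payload(...)` to the provider's token endpoint, and shows the returned access token to the user to paste into Galaxy.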

I’m sure to have more questions after working through some visualization examples, but this should keep me busy for now.

Hope you are busy now :)
Cheers and keep us up to date!
Bjoern

Sincerely,
Keith Suderman

REFERENCES

1. https://wiki.galaxyproject.org/Admin/Tools/AddingTools

------------------------------
Research Associate
Department of Computer Science
Vassar College
Poughkeepsie, NY

