madgik/mip-algorithms

Medical Informatics Platform

Data Mining Algorithms - Algorithm developer's manual

This repository contains all data mining algorithms for the Medical Informatics Platform of the Human Brain Project that are executed on Exareme.

Types of algorithm workflows currently supported

  • local execution of an SQL script on the master node only.
  • local-global execution of a local SQL script on the worker nodes/containers, followed by a global SQL script on the global node.
  • multiple local-global execution of a fixed number of local-global workflows in sequence.
    Each local_global is executed in the order in which it appears in the algorithm's directory structure.
  • iterative execution of an iterative algorithm, which is expressed in four phases:

    1. initialization (actually a multiple_local_global)
    2. step (actually a multiple_local_global)
    3. termination condition (actually a local) and
    4. finalization (actually a multiple_local_global)

    First, the init phase is executed, followed by repeated pairs of step and termination_condition phases. After each termination condition phase, the iterations module of the execution engine reads the value of the termination_condition-related table and decides whether to continue the iterative execution. If so, another step phase is submitted; otherwise, the finalize phase of the algorithm is submitted.
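
The phase ordering described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not Exareme's actual iterations module: the function `run_iterative` and its callable arguments are hypothetical stand-ins for the SQL phases, and the stopping rule assumes that iterations continue only while both the termination condition and `iterations_number < iterations_max_number` hold.

```python
def run_iterative(init, step, should_continue, finalize, iterations_max_number):
    """Run init once, then step/termination_condition pairs, then finalize.

    All arguments are hypothetical stand-ins for the SQL phases of an
    iterative algorithm; the real phases are submitted by the engine.
    """
    trace = ["init"]
    state = init()
    iterations_number = 0
    while True:
        # Each step phase is followed by a termination condition phase.
        state = step(state)
        iterations_number += 1
        trace.append("step")
        trace.append("termination_condition")
        keep_going = (
            should_continue(state)
            and iterations_number < iterations_max_number
        )
        if not keep_going:
            break
    trace.append("finalize")
    return finalize(state), trace


# Toy algorithm: add 2 per step and continue while the sum is below 5.
result, trace = run_iterative(
    init=lambda: 0,
    step=lambda total: total + 2,
    should_continue=lambda total: total < 5,
    finalize=lambda total: {"sum": total},
    iterations_max_number=10,
)
```

With these toy phases the loop runs three steps before the condition fails; lowering `iterations_max_number` to 2 would stop it after two steps regardless of the condition.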

Expected format for each algorithm workflow

Algorithm properties file

For all algorithms, a properties file is required (namely properties.json). This JSON file contains the algorithm's:

  1. name
  2. description (appears in web portal)
  3. type, specifically one of:
    • local
    • local_global
    • multiple_local_global
    • iterative
  4. parameters
    These parameters are required for the algorithm to run and are provided by the user. The algorithm's SQL files require these variables as input. Each parameter has the following properties:
    • name (String)
    • desc (String) Will be shown in the properties of the algorithm.
    • type Defines the type of the parameter. It can take the following values:
      1. column (Used for querying the columns of the database.)
      2. formula (Same as the column type, but it is parsed as an R formula. Allowed characters are '+ - * : 0.' )
      3. filter (Used to filter the results of the database.)
      4. dataset (A parameter of type dataset selects the dataset on which the algorithm runs.)
      5. other (Use this type for anything else.)
    • columnValuesSQLType (String) Required if type is column or formula. Specifies the SQL types that the column may have. Allowed types are 'text', 'integer' and 'real'; several can be combined, separated by commas. An empty string means that there is no constraint.
    • columnValuesIsCategorical (String) Required if type is column or formula. Specifies whether the column must be categorical. Allowed values are 'true' and 'false'. An empty string means that there is no constraint.
    • columnValuesNumOfEnumerations (String) Required if type is column or formula. Specifies the number of enumerations that the column may have, for example '1' or '2'. An empty string means that there is no constraint.
    • value (String) It is used as an example value.
    • valueNotBlank (Boolean) Defines if the value can be blank.
    • valueMultiple (Boolean) Defines if the parameter can have multiple values.
    • valueType Defines the type of the value. It can take the following values:
      1. string
      2. integer
      3. real
      4. json

Example: See here for the properties file of LINEAR_REGRESSION algorithm.
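
As a rough illustration of the rules listed above, the following Python sketch builds a properties structure and checks it against them. The parameter property names come from the list above; the top-level key spellings (`name`, `desc`, `type`, `parameters`), the algorithm name `SAMPLE_ALGORITHM`, the column `subjectage` and the helper `check_properties` are assumptions made for the example and are not taken from the repository.

```python
import json

ALGORITHM_TYPES = {"local", "local_global", "multiple_local_global", "iterative"}
PARAMETER_TYPES = {"column", "formula", "filter", "dataset", "other"}
VALUE_TYPES = {"string", "integer", "real", "json"}
COLUMN_FIELDS = (
    "columnValuesSQLType",
    "columnValuesIsCategorical",
    "columnValuesNumOfEnumerations",
)


def check_properties(props):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    if props.get("type") not in ALGORITHM_TYPES:
        problems.append("unknown algorithm type")
    for param in props.get("parameters", []):
        name = param.get("name", "<unnamed>")
        if param.get("type") not in PARAMETER_TYPES:
            problems.append(f"{name}: unknown parameter type")
        if param.get("valueType") not in VALUE_TYPES:
            problems.append(f"{name}: unknown valueType")
        # The columnValues* fields are required for column and formula types.
        if param.get("type") in ("column", "formula"):
            for field in COLUMN_FIELDS:
                if field not in param:
                    problems.append(f"{name}: missing {field}")
    return problems


example = {
    "name": "SAMPLE_ALGORITHM",  # hypothetical algorithm name
    "desc": "Illustrative algorithm description",  # key spelling is an assumption
    "type": "local_global",
    "parameters": [
        {
            "name": "x",
            "desc": "Column to analyse",
            "type": "column",
            "columnValuesSQLType": "integer, real",
            "columnValuesIsCategorical": "false",
            "columnValuesNumOfEnumerations": "",  # empty string: no constraint
            "value": "subjectage",  # hypothetical example value
            "valueNotBlank": True,
            "valueMultiple": True,
            "valueType": "string",
        }
    ],
}

problems = check_properties(example)
properties_json = json.dumps(example, indent=2)  # what properties.json might hold
```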

Expected directory structure for each algorithm workflow

For each algorithm workflow refer to the corresponding link for a hands-on example:

  1. local => LIST_VARIABLES algorithm
  2. local_global => VARIABLE_PROFILE algorithm
  3. multiple_local_global => LINEAR_REGRESSION algorithm
  4. iterative => SAMPLE_ITERATIVE algorithm

General directions for writing algorithms

Input of algorithm workflows

The input of an algorithm workflow can be retrieved in the first local.template.sql through the input_local_tbl variable, which must also be declared in requirevars.

Sharing context among different SQL template scripts

defaultDB
To share context (and thus data) among SQL template files, a database named defaultDB is provided.
For example, a local.template.sql can create a table in it and insert values, which the global.template.sql can then read.
To use defaultDB in a template.sql, the script file must begin with:

- `requirevars 'defaultDB'` (more variables can be _required_ using this command, see [here](WP_LINEAR_REGRESSION/1/global.template.sql))
- `attach Database '%{defaultDB}' as defaultDB`  

Output of previous phase
In a local_global or multiple_local_global algorithm workflow, context can also be shared through the output of the previous phase:

  • _[only for multiple_local_global]_ for local.template.sql files, the output from the previous global.template.sql execution can be read by using the input_local_tbl variable.
  • for global.template.sql files, the output from the previous local.template.sql file can be read by using the input_global_tbl variable.

N.B.: defaultDB is shared over the network from the local nodes to the global node and vice versa.
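
The attach-then-share pattern can be tried outside the platform with plain SQLite. The sketch below is only an analogy: it uses Python's `sqlite3` module instead of madIS, passes the database path as an ordinary argument instead of the `requirevars` and `%{defaultDB}` substitution, and uses a single file on one machine rather than a database shared over the network. The table `partial_sums` and the functions `run_local` and `run_global` are hypothetical.

```python
import os
import sqlite3
import tempfile


def run_local(default_db_path):
    """Play the role of a local script: write intermediate data to defaultDB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("attach database ? as defaultDB", (default_db_path,))
    conn.execute("create table defaultDB.partial_sums (node text, sum_val real)")
    conn.execute(
        "insert into defaultDB.partial_sums values ('worker-1', 2.5), ('worker-2', 4.0)"
    )
    conn.commit()
    conn.close()


def run_global(default_db_path):
    """Play the role of a global script: read what the local script stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute("attach database ? as defaultDB", (default_db_path,))
    (total,) = conn.execute(
        "select sum(sum_val) from defaultDB.partial_sums"
    ).fetchone()
    conn.close()
    return total


with tempfile.TemporaryDirectory() as workdir:
    db_path = os.path.join(workdir, "defaultDB.db")
    run_local(db_path)
    total = run_global(db_path)
```

Both functions open their own connection, so the only thing they share is the attached database file, which mirrors how separate template scripts exchange data through defaultDB.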

Every template file must have output

The runtime engine requires every *.template.sql file to produce some output.
If a script has no natural output, simply write select "ok"; at the end.

Algorithm's output format

The final results (i.e. the algorithm's output) must be formatted using the jdict UDF of madIS,
which converts the results to JSON.

Specifics pertaining to iterative algorithms

Regarding context sharing among iteration execution phases

For sharing context among iteration execution phases, the previous_phase_output_tbl variable can be used. This follows the same convention as the one used for sharing context between local and global scripts. In other words, output of the previous iterative execution phase is "forwarded" as input to the next one (e.g. output of step-1 is forwarded as input to step-2 and output of step-N is forwarded as input to finalize).

Regarding the properties file

Every iterative algorithm must define the following properties in the parameters JSON array of its properties file:

- `iterations_max_number`   
   The iterative algorithm will run at most `iterations_max_number` times.  
- `iterations_condition_query_provided`    
   Defines if a termination query is provided (under the `termination_condition` directory, in the corresponding file). 
   Otherwise `iterations_max_number` will be solely used as a termination condition criterion.  
   **Note 1**: When a termination condition query has been provided, the iterations module in Exareme takes its
    output into account along with the `iterations_number < iterations_max_number` condition.  
   **Note 2**: When a termination condition query has **not** been provided, the `termination_condition.template.sql`
   file must still exist and contain only a `select "ok";` query.

Regarding iterations logic requirements

The algorithm developer need not worry about iterations control logic, such as setting up an iterations counter or writing a query that ensures iterations_number < iterations_max_number. This is all handled by the iterations module of Exareme.
The only requirement imposed by the iterations module is the one described below.

Regarding iterations_condition_query

If an iterative algorithm requires a termination condition that is not based solely on the
iterations_number < iterations_max_number criterion, the algorithm developer needs to write a query that:

  • updates the iterationsDB.iterations_condition_check_result_tbl table, and specifically
  • sets the value of the iterations_condition_check_result column to the output of the termination condition query.

The template which must be followed is this:

update iterationsDB.iterations_condition_check_result_tbl set iterations_condition_check_result = (
  select termination_condition_query... 
);

N.B.: iterationsDB does not need to be defined in the requirevars section. Again, this is handled by the runtime engine's iterations module.
An example of a termination condition query is presented below:

update iterationsDB.iterations_condition_check_result_tbl set iterations_condition_check_result = (
  select sum_tbl.sum_val < 5
    from defaultDB.sum_tbl
);

In this example, the iterative algorithm calculates a sum (saved in the defaultDB.sum_tbl table) and the termination condition reads:

if the sum is lower than 5 AND iterations_number < iterations_max_number, then continue iterations.
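
The example query can be exercised against plain SQLite to see what value ends up in the result table. This is a sketch under assumptions: on the platform the iterations module creates and manages iterationsDB, so the one-row, one-column table schema used here is a guess made for the demonstration, and `check_condition` is a hypothetical helper.

```python
import sqlite3

# The termination condition query from the example above.
TERMINATION_QUERY = """
update iterationsDB.iterations_condition_check_result_tbl
   set iterations_condition_check_result = (
       select sum_tbl.sum_val < 5
         from defaultDB.sum_tbl
   );
"""


def check_condition(sum_val):
    """Run the query for a given sum and return the stored flag as a bool."""
    conn = sqlite3.connect(":memory:")
    conn.execute("attach database ':memory:' as iterationsDB")
    conn.execute("attach database ':memory:' as defaultDB")
    # Assumed schema: normally set up by the engine, not by the developer.
    conn.execute(
        "create table iterationsDB.iterations_condition_check_result_tbl "
        "(iterations_condition_check_result integer)"
    )
    conn.execute(
        "insert into iterationsDB.iterations_condition_check_result_tbl values (null)"
    )
    conn.execute("create table defaultDB.sum_tbl (sum_val real)")
    conn.execute("insert into defaultDB.sum_tbl values (?)", (sum_val,))
    conn.execute(TERMINATION_QUERY)
    (flag,) = conn.execute(
        "select iterations_condition_check_result "
        "from iterationsDB.iterations_condition_check_result_tbl"
    ).fetchone()
    conn.close()
    return bool(flag)


continue_at_3 = check_condition(3)
continue_at_7 = check_condition(7)
```

A sum of 3 stores a true value (continue, subject to the iterations limit), while a sum of 7 stores a false value.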