Skip to content

Configuration

Kai Keller edited this page Sep 17, 2018 · 3 revisions

[Basic]
head
node_size
ckpt_dir
glbl_dir
meta_dir
ckpt_L1
ckpt_L2
ckpt_L3
ckpt_L4
dcp_L4
inline_L2
inline_L3
inline_L4
keep_last_ckpt
keep_l4_ckpt
group_size
max_synch_intv
ckpt_io
enable_staging
enable_dcp
dcp_mode
dcp_block_size
verbosity
[Restart]
failure
exec_id
[Advanced]
block_size
transfer_size
general_tag
ckpt_tag
stage_tag
final_tag
local_test
lustre_striping_unit
lustre_striping_factor
lustre_striping_offset


[Basic]

head

⬆️ Top

The checkpointing safety levels L2, L3 and L4 produce additional overhead due to the necessary postprocessing work on the checkpoints. FTI offers the possibility to create an MPI process, called HEAD, in which this postprocessing will be accomplished. This allows it for the application processes to continue the execution immediately after the checkpointing.

Value Meaning
0 The checkpoint postprocessing work is covered by the application processes
1 The HEAD process accomplishes the checkpoint postprocessing work (notice: In this case, the number of application processes will be (n-1)/node)

(default = 0)

node_size

⬆️ Top

Lets FTI know, how many processes will run on each node (ppn). In most cases this will be the amount of processing units within the node (e.g. 2 CPU’s/node and 8 cores/CPU ! 16 processes/node).

Value Meaning
ppn (int > 0)
Number of processing units within each node (notice: The total number of processes must be a multiple of group_size*node_size)

(default = 2)

ckpt_dir

⬆️ Top

This entry defines the path to the local hard drive on the nodes.

Value Meaning
string Path to the local hard drive on the nodes

(default = /scratch/username/)

glbl_dir

⬆️ Top

This entry defines the path to the checkpoint folder on the PFS (L4 checkpoints).

Value Meaning
string Path to the checkpoint directory on the PFS

(default = /work/project/)

meta_dir

⬆️ Top

This entry defines the path to the meta files directory. The directory has to be accessible from each node. It keeps files with information about the topology of the execution.

Value Meaning
string Path to the meta files directory

(default = /home/user/.fti)

ckpt_L1

⬆️ Top

Here, the user sets the checkpoint frequency of L1 checkpoints when using FTI_Snapshot().

Value Meaning
L1 intv. (int >= 0)
L1 checkpointing interval in minutes
0 Disable L1 checkpointing

(default = 3)

ckpt_L2

⬆️ Top

Here, the user sets the checkpoint frequency of L2 checkpoints when using FTI_Snapshot().

Value Meaning
L2 intv. (int >= 0)
L2 checkpointing interval in minutes
0 Disable L2 checkpointing

(default = 5)

ckpt_L3

⬆️ Top

Here, the user sets the checkpoint frequency of L3 checkpoints when using FTI_Snapshot().

Value Meaning
L3 intv. (int >= 0)
L3 checkpointing interval in minutes
0 Disable L3 checkpointing

(default = 7)

ckpt_L4

⬆️ Top

Here, the user sets the checkpoint frequency of L4 checkpoints when using FTI_Snapshot().

Value Meaning
L4 intv. (int >= 0)
L4 checkpointing interval in minutes
0 Disable L4 checkpointing

(default = 11)

dcp_L4

⬆️ Top

Here, the user sets the checkpoint frequency of L4 differential checkpoints when using FTI_Snapshot().

Value Meaning
L4 dCP intv. (int >= 0)
L4 dCP checkpointing interval in minutes
0 Disable L4 dCP checkpointing

(default = 0)

inline_L2

⬆️ Top

In this setting, the user chose whether the post-processing work on the L2 checkpoints is done by an FTI process or by the application process.

Value Meaning
0 The post-processing work of the L2 checkpoints is done by an FTI process (notice: This setting is only alowed if head = 1)
1 The post-processing work of the L2 checkpoints is done by the application process

(default = 1)

inline_L3

⬆️ Top

In this setting, the user chose whether the post-processing work on the L3 checkpoints is done by an FTI process or by the application process.

Value Meaning
0 The post-processing work of the L3 checkpoints is done by an FTI process (notice: This setting is only alowed if head = 1)
1 The post-processing work of the L3 checkpoints is done by the application process

(default = 1)

inline_L4

⬆️ Top

In this setting, the user chose whether the post-processing work on the L4 checkpoints is done by an FTI process or by the application process.

Value Meaning
0 The post-processing work of the L4 checkpoints is done by an FTI process (notice: This setting is only alowed if head = 1)
1 The post-processing work of the L4 checkpoints is done by the application process

(default = 1)

keep_last_ckpt

⬆️ Top

This setting tells FTI whether the last checkpoint taken during the execution will be kept in the case of a successful run or not.

Value Meaning
0 During FTI_Finalize(), all checkpoints will be removed (except case 'keep_l4_ckpt=1')
1 After FTI_Finalize(), the last checkpoint will be kept and stored on the PFS as a L4 checkpoint (notice: Additionally, the setting failure in the configuration file is set to 2. This will lead to a restart from the last checkpoint if the application is executed again)

(default = 0)

keep_l4_ckpt

⬆️ Top

This setting triggers FTI to keep all level 4 checkpoints taken during the execution. The checkpoint files will be saved in glbl_dir/l4_archive.

Value Meaning
0 During FTI_Finalize(), all checkpoints will be removed (except case 'keep_last_ckpt=1')
1 All level 4 checkpoints taken during the execution, will be stored under glbl_dir/l4_archive. This folder will not be deleted during the FTI_Finalize() call.

(default = 0)

group_size

⬆️ Top

The group size entry sets, how many nodes (members) forming a group.

Value Meaning
int i (2 <= i <= 32)
Number of nodes contained in a group (notice: The total number of processes must be a multiple of group_size*node_size)

(default = 4)

max_sync_intv

⬆️ Top

Sets the maximum number of iterations between synchronisations of the iteration length (used for FTI_Snapshot()). Internally the value will be rounded to the next lower value which is a power of 2.

Value Meaning
int i (0 <= i <= INT_MAX )
maximum number of iterations between measurements of the global mean iteration time (MPI_Allreduce call)
0 Sets the value to 512, the default value for FTI

(default = 0)

ckpt_io

⬆️ Top

Sets the I/O mode.

Value Meaning
1 POSIX I/O mode
2 MPI-IO I/O mode
3 FTI-FF I/O mode
4 SIONLib I/O mode
5 HDF5 I/O mode

(default = 1)

enable_staging

⬆️ Top

Enable the staging feature. This feature allows to stage files asynchronously from local (e.g. node local NVMe storage) to the PFS. FTI offers the API functions FTI_SendFile, FTI_GetStageDir and FTI_GetStageStatus for that.

Value Meaning
0 Staging disabled
1 Stagin enabled (creation of the staging directory in folde 'ckpt_dir')

(default = 0)

enable_dcp

⬆️ Top

Enable differential checkpointing. In order to use this feature, ckpt_io has to be set to 3 (FTI-FF). To trigger differential checkpoints, use either level FTI_L4_DCP in FTI_Checkpoint or set the interval in dcp_L4 for usage in FTI_Snapshot.

Value Meaning
0 dCP disabled
1 dCP enabled

dcp_mode

⬆️ Top

Set the hash algorithm used for differential checkpointing.

Value Meaning
0 MD5
1 CRC32

(default = 0)

dcp_block_size

⬆️ Top

Set the desired partition block size for differential checkpointing in bytes. The block size must be within 512 .. USHRT_MAX (65535 on most systems).

Value Meaning
b (512 <= i <= USHRT_MAX)
block size for dataset partition for dCP

(default = 16384)

verbosity

⬆️ Top

Sets the level of verbosity.

Value Meaning
1 Debug sensitive. Beside warnings, errors and information, FTI debugging information will be printed
2 Information sensitive. FTI prints warnings, errors and information
3 FTI prints only warnings and errors
4 FTI prints only errors

(default = 2)

[Restart]

failure

⬆️ Top

This setting should mainly set by FTI itself. The behaviour within FTI is the following:

  • Within FTI_Init(), it remains on it initial value.
  • After the first checkpoint is taken, it is set to 1.
  • After FTI_Finalize() and keep_last_ckpt = 0, it is set to 0.
  • After FTI_Finalize() and keep_last_ckpt = 1, it is set to 2.
Value Meaning
0 The application starts with its initial conditions (notice: In order to force a clean start, the value may be set to 0 manually. In this case the user has to take care about removing the checkpoint data from the last execution)
1 FTI is searching for checkpoints and starts from the highest checkpoint level (notice: If no readable checkpoints are found, the execution stops)
2 FTI is searching for the last L4 checkpoint and restarts the execution from there (notice: If checkpoint is not L4 or checkpoint is not readable, the execution stops)

(default = 0)

exec_id

⬆️ Top

This setting should mainly set by FTI itself. During FTI_Init() the execution ID is set if the application starts for the first time (failure = 0) or the execution ID is used by FTI in order to find the checkpoint files for the case of a restart (failure = 1,2)

Value Meaning
yyyy-mm-dd_hh-mm-ss
Execution ID (notice: If variate checkpoint data is available, the execution ID may set by the user to assign the desired starting point)

(default = NULL)

[Advanced]

The settings in this section, should ONLY be changed by advanced users.

block_size

⬆️ Top

FTI temporarily copies small blocks of the L2 and L3 checkpoints to send them through MPI. The size of the data blocks can be set here.

Value Meaning
int Size in KB of the data blocks send by FTI through MPI for the checkpoint levels L2 and L3

(default = 1024)

transfer_size

⬆️ Top

FTI transfers in chunks local checkpoint files to PFS. The size of the chunk can be set here.

Value Meaning
int Size in MB of the chunks send by FTI from local to PFS

(default = 16)

general_tag

⬆️ Top

FTI uses a certain tags for the MPI messages. The tag for general messages can be set here.

Value Meaning
int Tag, used for general MPI messages within FTI

(default = 2612)

ckpt_tag

⬆️ Top

FTI uses a certain tags for the MPI messages. The tag for messages related to checkpoint communication can be set here.

Value Meaning
int Tag, used for MPI messages related to a checkpoint context within FTI

(default = 711)

stage_tag

⬆️ Top

FTI uses a certain tags for the MPI messages. The tag for messages related to staging communication can be set here.

Value Meaning
int Tag, used for MPI messages related to a staging context within FTI

(default = 406)

final_tag

⬆️ Top

FTI uses a certain tags for the MPI messages. The tag for the message to the heads to trigger the end of the execution can be set here.

Value Meaning
int Tag, used for the MPI message that marks the end of the execution send from application processes to the heads within FTI

(default = 3107)

lustre_striping_unit

⬆️ Top

This option only impacts if -DENABLE_LUSTRE was added to the Cmake command. It sets the striping unit for the MPI-IO file.

Value Meaning
int i (0 <= i <= INT_MAX )
Striping size in Bytes. The default in Lustre systems is 1MB (1048576 Bytes), FTI uses 4MB (4194304 Bytes) as the dafault value
0 Assigns the Lustre default value

(default = 4194304)

lustre_striping_factor

⬆️ Top

This option only impacts if -DENABLE_LUSTRE was added to the Cmake command. It sets the striping factor for the MPI-IO file.

Value Meaning
int i (0 <= i <= INT_MAX )
Striping factor. The striping factor determines the number of OST’s to use for striping.
-1 Stripe over all available OST’s. This is the default in FTI.
0 Assigns the Lustre default value

(default = -1)

lustre_striping_offset

⬆️ Top

This option only impacts if -DENABLE_LUSTRE was added to the Cmake command. It sets the striping offset for the MPI-IO file.

Value Meaning
int i (0 <= i <= INT_MAX )
Striping offset. The striping offset selects a particular OST to begin striping at.
-1 Assigns the Lustre default value

(default = -1)

local_test

⬆️ Top

FTI is building the topology of the execution, by determining the hostnames of the nodes on which each process runs. Depending on the settings for group_size, node_size and head, FTI assigns each particular process to a group and decides which process will be Head or Application dedicated. This is meant to be a local test. In certain situations (e.g. to run FTI on a local machine) it is necessary to disable this function.

Value Meaning
0 Local test is disabled. FTI will simulate the situation set in the configuration
1 Local test is enabled (notice: FTI will check if the settings are correct on initialization and if necessary stop the execution)

(default = 1)