Intel® Cluster Checker executes in two phases. In the data collection phase, Intel® Cluster Checker collects data from the cluster for use in analysis. In the analysis phase, Intel® Cluster Checker analyzes the data in the database and produces the results of analysis. It is possible to invoke these phases together or separately and to customize their scope. By default, Intel® Cluster Checker verifies the overall health of the cluster using the health framework definition.
Intel® Cluster Checker is a Linux* command-line tool and can be executed using three different commands. The clck command executes data collection followed immediately by analysis and displays the results of analysis. A typical invocation of this command is:
clck -f nodefile
This command will run data collection and analysis using the specified nodefile to determine which nodes to examine and their roles.
It is also possible to run data collection and analysis separately. The clck-collect command executes only data collection without analyzing the data. Intel® Cluster Checker stores collected data in the shared directory (typically the home directory) in the database file .clck/201n/clck.db. A typical invocation of the data collection command is:
clck-collect -f nodefile
The clck-analyze command executes analysis using the most recent data available in the database. A typical invocation of the analysis command is:
clck-analyze -f nodefile
With these three command-line tools it is possible to execute data collection and analysis together (using the clck command) or separately (using the clck-collect and clck-analyze commands).
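As a rough illustration of running the two phases separately, the sketch below chains clck-collect and clck-analyze. The helper name run_or_print is hypothetical (it is not part of Intel® Cluster Checker) and is there only so the snippet can be tried safely on a machine where the tools are not installed.

```shell
nodefile="nodefile"   # placeholder file name, as in the invocations above

# Hypothetical helper: run the command if its tool is on PATH,
# otherwise print what would have been run.
run_or_print() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skipped (not installed): $*"
  fi
}

# Phase 1: collect data only. Phase 2: analyze the stored data.
# The && means analysis starts only if collection exited successfully.
run_or_print clck-collect -f "$nodefile" && run_or_print clck-analyze -f "$nodefile"
```

Separating the phases this way makes it possible to collect once and then rerun the analysis later without touching the cluster again.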
Additionally, Intel® Cluster Checker includes a database retrieval tool that displays data from the database in a readable format. To display the available data, use the command:
clckdb
Use the --help option for more information about this command.
Note that Intel® Cluster Checker requires a shared directory to run data collection. The shared directory defaults to $HOME, but in some cases (such as running as root) $HOME is not shared across nodes. To use a different directory, set the environment variable CLCK_SHARED_TEMP_DIR to the desired shared directory, either locally or in the Intel® Cluster Checker configuration file.
In some cases, a message may appear indicating that root access is required to obtain more information. Outside of these cases, running Intel® Cluster Checker with root access is not recommended.
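A minimal sketch of setting the variable in the current shell follows. The directory used here is only a stand-in so the snippet runs anywhere; it is an assumption, not a recommended location, and on a real cluster it should be replaced with a path that every node mounts.

```shell
# Stand-in path (hypothetical); a temporary directory is usually NOT shared
# across nodes, so substitute a genuinely shared location on a real cluster.
shared_dir="${TMPDIR:-/tmp}/clck_shared_example"
mkdir -p "$shared_dir"

# Export so that clck or clck-collect started from this shell sees the value.
export CLCK_SHARED_TEMP_DIR="$shared_dir"
echo "CLCK_SHARED_TEMP_DIR=$CLCK_SHARED_TEMP_DIR"
```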
A typical use of the three available commands includes a nodefile passed with the -f option, as displayed above. A custom nodefile specifies which nodes to include and, if applicable, their roles. Each hostname appears on a separate line. Intel® Cluster Checker contains a set of pre-defined roles, and a role can be specified after the hostname using an annotation such as # role: compute. If no role is specified for a node, that node is considered a compute node. The following example includes four nodes: one head node and three compute nodes.
node1 #role: head role: compute
node2 #role: compute
node3 #role: compute
node4
A cluster with a single node would only include one hostname in the nodefile.
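To make the default-role rule concrete, the sketch below writes a simplified four-node nodefile (one role per node, which is an assumption for this example) and prints each hostname with the role it would be treated as having. The awk logic is an illustration of the rule described above, not the parser Intel® Cluster Checker uses, and it reads only the first role annotation on a line.

```shell
# Write a simplified example nodefile to a scratch file.
nodefile="$(mktemp)"
cat > "$nodefile" <<'EOF'
node1 # role: head
node2 # role: compute
node3 # role: compute
node4
EOF

# Print "hostname role" for each line; a line without an annotation
# falls back to the compute role.
summary="$(awk '
  NF == 0 { next }                      # skip blank lines
  {
    host = $1
    role = "compute"                    # default when no role is annotated
    if (match($0, /role: *[A-Za-z_]+/)) {
      role = substr($0, RSTART, RLENGTH)
      sub(/role: */, "", role)          # keep only the role name
    }
    print host, role
  }
' "$nodefile")"
echo "$summary"
rm -f "$nodefile"
```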
Additionally, Intel® Cluster Checker provides automatic node detection for data collection using Slurm. On machines with Slurm available, Intel® Cluster Checker will automatically gather allocated hostnames using a Slurm query. To use this functionality, invoke either the clck command or the clck-collect command without using the -f command-line option. For example, the command:
clck
will automatically detect allocated hostnames, collect data on those nodes, and analyze the collected data. Calling clck-analyze without a nodefile will cause Intel® Cluster Checker to analyze recent data from every node available in the database.
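A wrapper script might pass -f only when a nodefile is actually present and otherwise leave node selection to the Slurm detection described above. The sketch below shows one way to do that; the function name choose_clck_args and the NODEFILE variable are hypothetical, and the snippet only prints the command rather than running it.

```shell
# Hypothetical helper: print "-f <file>" when the given nodefile exists,
# and print nothing otherwise.
choose_clck_args() {
  if [ -n "$1" ] && [ -f "$1" ]; then
    printf '%s\n' "-f $1"
  fi
}

# NODEFILE is an illustrative variable; leave it unset to fall back
# to a bare clck invocation.
args="$(choose_clck_args "${NODEFILE:-}")"
echo "would run: clck${args:+ $args}"
```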
For more information about writing nodefiles, see the Selecting Nodes section.
Framework definitions are XML files that define the behavior of Intel® Cluster Checker. They can specify what data is collected, how data is analyzed, and how that information is displayed. By default, Intel® Cluster Checker runs the health framework definition, which provides an overall examination of the health of the cluster. Intel® Cluster Checker provides a wide variety of framework definitions to customize your results, and all framework definitions are located in the Intel® Cluster Checker install directory in the path etc/fwd.
For example, to verify Intel® Omni-Path Architecture (Intel® OPA) Interface functionality, one could run the Intel® OPA framework definition (located at etc/fwd/opa.xml) using the following command:
clck -f nodefile -F opa
A full list of framework definitions and their descriptions is located in the Appendix. Additionally, the command
clck -X list
provides a full list of available framework definitions. To see a description of a specific framework definition (for example, opa.xml), run the following command:
clck -X opa
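Because framework definitions live under etc/fwd in the install directory, a script can check that a definition file exists before passing its name to -F. The following is a sketch under stated assumptions: the CLCK_INSTALL_DIR variable, its default path, and the fwd_path helper are all hypothetical names chosen for this example.

```shell
# Hypothetical helper: build the path of a framework definition file.
# $1 = install directory, $2 = framework definition name without .xml
fwd_path() {
  printf '%s/etc/fwd/%s.xml\n' "$1" "$2"
}

# Hypothetical variable and default; substitute the real install directory.
install_dir="${CLCK_INSTALL_DIR:-/opt/intel/clck}"
fwd="opa"

if [ -f "$(fwd_path "$install_dir" "$fwd")" ]; then
  echo "found $fwd.xml; run: clck -f nodefile -F $fwd"
else
  echo "no $fwd.xml under $install_dir/etc/fwd; list definitions with: clck -X list"
fi
```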
The following framework definitions are recommended for new users:
It is possible to create custom framework definitions to further configure desired results. For more information about the contents of framework definitions, see the Framework Definitions chapter.