Appendix

Terminology

 

Additional Configuration Options

Configuring the Database

You can specify a datastore configuration file in the main configuration file using the tags:

<datastore_extensions>
    <group path="datastore/intel64/">
        <entry config_file="default_sqlite.xml">libsqlite.so</entry>
    </group>
</datastore_extensions>

To use ODBC instead of SQLite, enter libodbc.so in place of libsqlite.so. To specify multiple databases, add multiple entry tags, each referencing its own datastore configuration file.

The datastore configuration file, by default, is located at /opt/intel/clck/201n/etc/datastore/default_sqlite.xml and takes the following format:

<configuration>
    <instance_name>clck_default</instance_name>
    <source_parameters>read_only=false|source=$HOME/.clck/201n/clck.db</source_parameters>
    <type>sqlite3</type>
    <source_types>data</source_types>
</configuration>

The instance_name tag defines a database source name. This value must be unique.

The source_parameters tag determines whether or not to open the database in read-only mode and indicates which database to use.

The type tag specifies what type of database to use. Currently, the only accepted value is sqlite3.

The source_types tag specifies what source type to use. Currently, the only accepted value is data.
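As an illustration of the format described above, the following Python sketch reads the four tags from a datastore configuration file and splits source_parameters into key/value pairs. It is not part of Intel® Cluster Checker. The function name parse_datastore_config is hypothetical, and the sketch assumes that source_parameters is always a '|'-separated list of key=value pairs, as in the example file.

```python
import xml.etree.ElementTree as ET

# Contents copied from the default_sqlite.xml example above.
CONFIG_XML = """
<configuration>
    <instance_name>clck_default</instance_name>
    <source_parameters>read_only=false|source=$HOME/.clck/201n/clck.db</source_parameters>
    <type>sqlite3</type>
    <source_types>data</source_types>
</configuration>
"""


def parse_datastore_config(xml_text):
    """Return the configuration tags as a dict (hypothetical helper).

    source_parameters is split into its own dict of key/value pairs.
    """
    root = ET.fromstring(xml_text)
    config = {child.tag: (child.text or "").strip() for child in root}

    # Assumption: parameters are '|'-separated key=value pairs.
    params = {}
    for item in config["source_parameters"].split("|"):
        key, _, value = item.partition("=")
        params[key.strip()] = value.strip()
    config["source_parameters"] = params
    return config


config = parse_datastore_config(CONFIG_XML)
print(config["instance_name"])
print(config["source_parameters"])
```

The source value is returned exactly as written, so $HOME is not expanded by this sketch.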

 

Database Schema

The database consists of a single SQL view named clck_1. The Intel® Cluster Checker database is a standard SQLite* database, so any SQLite*-compatible tool can be used to browse its contents. In addition, the clckdb utility is provided with Intel® Cluster Checker (see clckdb -h for more information). The clck_1 view has the following columns:

rowid (INTEGER)

 

Provider (TEXT)

 

Hostname (TEXT)

 

num_nodes (INTEGER)

 

node_names (TEXT)

 

Exit_status (INTEGER)

 

Timestamp (INTEGER)

 

Duration (REAL)

 

Encoding (INTEGER)

 

STDOUT (TEXT)

 

STDERR (TEXT)

 

OptionID (TEXT)

 

Version (INTEGER)

 

Username (TEXT)

 

Unique_timestamp (INTEGER)
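Because the database is a standard SQLite* database, it can also be queried from a script. The following Python sketch shows one possible query against the columns listed above. To keep the example self-contained, it builds an in-memory stand-in table named clck_1 with hypothetical sample rows. The function name failed_runs is hypothetical, and treating a nonzero Exit_status as a failure is an assumption.

```python
import sqlite3

# Column names and types as listed above. rowid is implicit in SQLite.
# This table is a stand-in for the real clck_1 view, created only so
# that the query below can run without an existing database.
COLUMNS = [
    ("Provider", "TEXT"), ("Hostname", "TEXT"), ("num_nodes", "INTEGER"),
    ("node_names", "TEXT"), ("Exit_status", "INTEGER"),
    ("Timestamp", "INTEGER"), ("Duration", "REAL"), ("Encoding", "INTEGER"),
    ("STDOUT", "TEXT"), ("STDERR", "TEXT"), ("OptionID", "TEXT"),
    ("Version", "INTEGER"), ("Username", "TEXT"),
    ("Unique_timestamp", "INTEGER"),
]

conn = sqlite3.connect(":memory:")
column_sql = ", ".join(f"{name} {sqltype}" for name, sqltype in COLUMNS)
conn.execute(f"CREATE TABLE clck_1 ({column_sql})")

# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO clck_1 (Provider, Hostname, Exit_status, Timestamp, Duration) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("example_provider", "node01", 0, 1500000000, 0.5),
        ("example_provider", "node02", 1, 1500000100, 0.7),
    ],
)


def failed_runs(connection):
    """Return (rowid, Provider, Hostname, Exit_status) for nonzero exit statuses."""
    cursor = connection.execute(
        "SELECT rowid, Provider, Hostname, Exit_status FROM clck_1 "
        "WHERE Exit_status != 0 ORDER BY Timestamp"
    )
    return cursor.fetchall()


rows = failed_runs(conn)
print(rows)
```

To run the same query against a real database, replace ":memory:" with the database path given in source_parameters and omit the CREATE TABLE and INSERT statements.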

 

List of Analyzer Extensions

all_to_all

 

cpu

 

datconf

 

devices

 

dgemm

 

environment

 

ethernet

 

files

 

hardware

 

hpcg_cluster

 

hpcg_single

 

hpl

 

imb_pingpong

 

infiniband

 

iozone

 

kernel

 

kernel_param

 

libraries

 

lsb_tools

 

lshw

 

lustre

 

memory

 

mount

 

mpi_internode

 

mpi_local

 

ntp

 

opa

 

perl

 

process

 

python

 

rpm

 

rpm_baseline

 

sgemm

 

shells

 

ssf_version

 

storage

 

stream

 

tcl

 

Blacklists

Kernel Parameters Blacklist

The following is a comprehensive list of blacklisted kernel parameters. The uniformity of these kernel parameters is checked in the kernel_parameter_uniformity Framework Definition. This list is located in the kernel_param analyzer extension and is not accessible to the user. The user can specify other blacklisted items through the default configuration file.

Lshw Blacklist

The following is a comprehensive list of items blacklisted by the lshw check through the regex function. This blacklist is located in the lshw analyzer extension and is not accessible to the user. The user can specify other blacklisted items through the default configuration file.

 

Included Framework Definitions

All included Framework Definitions are located at /opt/intel/clck/201n/etc/fwd.

basic_internode_connectivity.xml

Validates internode accessibility by confirming the consistency of node IP addresses. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

basic_shells.xml

Identifies missing and failing bash and sh shells. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

benchmarks.xml

Runs all benchmarks and their dependencies. These benchmarks evaluate CPU performance, floating point computation, network bandwidth and latency, I/O bandwidth, and memory bandwidth. Includes the Framework Definitions:

clock.xml

Verifies that the clock offset is not above the threshold, the NTP client is connected to the NTP server, and the ntpq or chronyc data is recent and available in the database. Includes the Framework Definition:

cluster.xml

Ensures that all nodes in the cluster are able to communicate with one another by confirming the consistency of node IP addresses, verifying Ethernet consistency, executing the HPL benchmark and the Intel® MPI Benchmarks PingPong benchmark, and ensuring that the Intel® MPI Library is functional and can successfully run across the cluster. Includes the Framework Definitions:

cpu.xml

Verifies the uniformity of CPU model names, the Intel® Turbo Boost Technology status, the number of logical cores, the number of threads per core, and the presence of kernel flags. Confirms that the CPU is a 64-bit Intel® processor. For Intel® Xeon Phi™ processors, verifies the uniformity of cluster/memory modes; verifies the nohz_full, isolcpus, and rcu_nocbs kernel configuration parameters; and confirms that the memory-side cache file is the latest version. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

dapl_fabric_providers_present.xml

Verifies that DAPL (Direct Access Programming Libraries) providers are present. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

dgemm_cpu_performance.xml

Verifies CPU performance using a double precision matrix multiplication routine. Reports nodes with substandard FLOPS relative to a threshold based on the hardware, and reports performance outliers outside the range defined by the median absolute deviation. Includes the providers:

Includes the analyzer extensions:

Includes the knowledge base module:
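Several performance Framework Definitions in this appendix report outliers "outside the range defined by the median absolute deviation". The following Python sketch shows one common way to flag such outliers. It is not the Intel® Cluster Checker implementation. The function name mad_outliers, the cutoff of 3.0 deviations, and the sample values are all assumptions made for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` MADs from the median.

    `threshold` is an assumed cutoff, not a documented tool setting.
    """
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        # At least half of the values equal the median; flag any value
        # that differs from it at all.
        return [i for i, d in enumerate(deviations) if d > 0]
    return [i for i, d in enumerate(deviations) if d / mad > threshold]


# Hypothetical per-node results; the last node is clearly slower.
gflops = [100.0, 102.0, 98.0, 101.0, 99.0, 60.0]
print(mad_outliers(gflops))  # prints [5]
```

The median absolute deviation is used here instead of the standard deviation because a single extreme node has little effect on the median, so the outlier does not widen the range it is measured against.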

environment_variables_uniformity.xml

Verifies the uniformity of all environment variables. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

ethernet.xml

Verifies the consistency of Ethernet drivers, driver versions, and MTU (maximum transmission unit) across the cluster. Verifies that Ethernet interrupt coalescing is enabled.

Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

exclude_hpl.xml

Provides a complete analysis of the cluster, excluding the hpl_cluster_performance framework definition and analysis related to specific specs. Includes the framework definitions:

Includes the providers:

files_snapshot.xml

Looks for configuration file changes between snapshot_x and snapshot_y. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

file_system_uniformity.xml

Confirms that the /tmp directory has appropriate permissions, that /dev/shm and /proc are properly mounted, and that the home path is uniform and shared across the cluster. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

hardware.xml

Verifies CPU configuration, InfiniBand functionality, hardware uniformity, and Intel® Omni-Path Host Fabric Interface functionality. Includes the Framework Definitions:

hardware_snapshot.xml

Looks for hardware location changes between snapshot_x and snapshot_y. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

health.xml

Provides a complete analysis of the cluster, excluding analysis related to specific specs. Includes the Framework Definitions:

hpcg_cluster

The High Performance Conjugate Gradients (HPCG) Benchmark project is an effort to create a new metric for ranking HPC systems. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of applications. This will give an incentive to computer system designers to invest in capabilities that will have an impact on the collective performance of these applications. Intel® Cluster Checker uses the Intel® Optimized High Performance Conjugate Gradient Benchmark, which is executed on four-node sub-clusters as an Intel® MPI Library based benchmark. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base modules:

hpcg_single

The High Performance Conjugate Gradients (HPCG) Benchmark project is an effort to create a new metric for ranking HPC systems. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of applications. This will give an incentive to computer system designers to invest in capabilities that will have an impact on the collective performance of these applications. Intel® Cluster Checker uses the Intel® Optimized High Performance Conjugate Gradient Benchmark, which is executed on each individual node as an Intel® MPI Library based benchmark. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base modules:

hpl_cluster_performance.xml

Reports whether the HPL benchmark ran successfully on the cluster and on each pair of nodes within the cluster. Reports performance outliers for the pairwise execution outside the range defined by the median absolute deviation. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

imb_pingpong_fabric_performance.xml

Confirms that the Intel® MPI Benchmarks PingPong benchmark ran successfully for nodes within the cluster. Also reports network bandwidth and latency outliers, defined relative to other measured values in the same grouping, and reports if latency or network bandwidth falls below a certain threshold. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

imb_pingpong.xml

Confirms that the Intel® MPI Benchmarks PingPong benchmark ran successfully for nodes within the cluster. Includes additional framework definitions that identify problems that could cause this benchmark to fail to run. Includes the Framework Definitions:

infiniband.xml

Verifies InfiniBand functionality by confirming the consistency of InfiniBand hardware and firmware, confirming that memlock size is sufficient and consistent across the cluster, verifying that InfiniBand HCA ports are in the Active state and the LinkUp physical state, verifying that HCA states are consistent, confirming that the InfiniBand HCA rate is consistent, and verifying InfiniBand card presence and functionality. Includes the framework definition:

Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

iozone_disk_bandwidth_performance.xml

Verifies the I/O performance of a storage device by searching for I/O bandwidth outliers outside the range defined by the median absolute deviation. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base modules:

kernel_parameter_preferred

Verifies that each kernel parameter value matches its preferred value across the cluster. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base modules:

To use this framework definition, specify any preferred kernel parameter values in the Intel® Cluster Checker configuration file using the following format:

<analyzer>
    <config>
        <kernel-param-preferred>
            <entry>kernel.parameter|node_role|value</entry>
        </kernel-param-preferred>
    </config>
</analyzer>

In this format, the first value is the kernel parameter, the second value is the node role, and the third value is the preferred value for the given kernel parameter.
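The three-field entry format can be checked before it is placed in the configuration file. The following Python sketch splits an entry into its parts and rejects entries with missing fields. It is not part of Intel® Cluster Checker. The names PreferredKernelParam and parse_preferred_entry are hypothetical, and the sketch assumes that only the value field may itself contain a '|' character.

```python
from typing import NamedTuple


class PreferredKernelParam(NamedTuple):
    """One <entry> from kernel-param-preferred (hypothetical helper type)."""
    parameter: str
    node_role: str
    value: str


def parse_preferred_entry(entry_text):
    """Split 'kernel.parameter|node_role|value' into its three fields."""
    # maxsplit=2 keeps any further '|' characters inside the value field.
    parts = entry_text.strip().split("|", 2)
    if len(parts) != 3 or not all(part.strip() for part in parts):
        raise ValueError(f"expected parameter|node_role|value, got {entry_text!r}")
    return PreferredKernelParam(*(part.strip() for part in parts))


# The placeholder entry from the format shown above.
entry = parse_preferred_entry("kernel.parameter|node_role|value")
print(entry.parameter, entry.node_role, entry.value)
```

An entry with fewer than three fields, such as "kernel.parameter|node_role", raises ValueError instead of being silently accepted.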

kernel_parameter_uniformity.xml

Verifies that kernel parameter data is uniform across the cluster. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base modules:

kernel_version_uniformity.xml

For each node, verifies that the kernel version is the same as at least 90% of the other nodes. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base modules:
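The 90% rule described for kernel_version_uniformity.xml can be read as a per-node comparison against all other nodes. The following Python sketch shows one interpretation of that rule. It is not the Intel® Cluster Checker implementation. The function name nonuniform_nodes, the node names, and the kernel version strings are hypothetical.

```python
from collections import Counter


def nonuniform_nodes(versions_by_node, required_fraction=0.9):
    """Return nodes whose kernel version matches fewer than
    `required_fraction` of the other nodes."""
    counts = Counter(versions_by_node.values())
    others = len(versions_by_node) - 1
    if others <= 0:
        return []  # a single node has nothing to be compared against

    flagged = []
    for node, version in versions_by_node.items():
        same_as_others = counts[version] - 1  # exclude the node itself
        if same_as_others / others < required_fraction:
            flagged.append(node)
    return sorted(flagged)


# Hypothetical cluster: 20 nodes share one version, 1 node differs.
versions = {f"node{i:02d}": "3.10.0-693" for i in range(1, 21)}
versions["node21"] = "3.10.0-514"
print(nonuniform_nodes(versions))  # prints ['node21']
```

In this example each of the 20 matching nodes agrees with 19 of its 20 peers (95%), so only node21 is reported.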

local_disk_storage.xml

Verifies that there is enough free memory on each node. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base modules:

lshw_hardware_uniformity.xml

Verifies the uniformity of hardware installed across the cluster. Determines missing hardware parameters. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

lustre_mounted.xml

Verifies that the Lustre kernel modules are loaded and the object storage targets are active, mounted, uniform and writable across the cluster. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

memory_uniformity.xml

Determines if the amount of physical memory is uniform across the cluster. Includes the data providers:

Includes the analyzer extensions:

Includes the knowledge base module:

mpi_local_functionality.xml

Determines if MPI is present and the path is uniform with all other nodes. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

mpi_multinode_functionality.xml

Verifies that the Intel® MPI Library is functional and can successfully run across the cluster. Includes the data providers:

Includes the connector extension module:

Includes the knowledge base module:

mpi.xml

Verifies that MPI is present, that the path is uniform across nodes, and that MPI successfully runs across the cluster. Runs benchmarks related to MPI performance. Includes the framework definitions:

network_time_uniformity.xml

Verifies that the clock offset is not above the threshold, the Network Time Protocol (NTP) client is connected to the NTP server, and the ntpq or chronyc data is recent and available in the database. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

node_process_status.xml

Identifies nodes with zombie processes and nodes with processes that have high CPU and memory requirements. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

opa.xml

Verifies Intel® Omni-Path Architecture (Intel® OPA) Interface functionality by confirming the consistency of Intel® OPA hardware and firmware, by verifying that Intel® OPA HCA ports are in the Active state and the LinkUp physical state, by verifying that HCA states are consistent, by confirming that the Intel® OPA HCA rate is consistent, by verifying that an Intel® OPA subnet manager is running, and by confirming that memlock size is sufficient and consistent across the cluster. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

perl_functionality.xml

Verifies the presence, functionality, and consistency of the Perl version. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

python_functionality.xml

Verifies the presence, functionality, and consistency of the Python version. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

rpm_snapshot.xml

Checks for RPMs installed across the cluster and compares the data from snapshot_x with the data from snapshot_y. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

rpm_uniformity.xml

Verifies the uniformity of the RPMs installed across the cluster and reports absent and superfluous RPMs. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

select_solutions_sim_mod_benchmarks

Checks benchmark performance against thresholds required by Intel® Select Solutions for Simulation and Modeling. These benchmarks evaluate CPU performance for double precision floating point operations on a single node and a four node cluster, network bandwidth and latency, and memory bandwidth. Includes the data providers:

Includes the analyzer extensions:

Includes the knowledge base module:

select_solutions_sim_mod_priv

Verifies that the cluster meets the part of the Intel® Select Solutions for Simulation and Modeling requirements that has to be checked as a privileged user. It checks the system requirements for processor, memory, and fabric. Must be run as a privileged user. A pass of this framework definition along with a pass of the framework definition select_solutions_sim_mod_user.xml (run as a normal user) will verify compliance with Intel® Select Solutions for Simulation and Modeling. Includes the data providers:

Includes the analyzer extensions:

Includes the knowledge base module:

select_solutions_sim_mod_user

Verifies that the cluster meets the part of the Intel® Select Solutions for Simulation and Modeling requirements that has to be checked as a non-privileged user. It checks benchmark performance and compliance with Intel® Scalable System Framework. A pass of this framework definition along with a pass of the framework definition select_solutions_sim_mod_priv.xml (run as a privileged user) will verify compliance with Intel® Select Solutions for Simulation and Modeling. Includes the framework definitions:

services_status

Verifies that the service status is as required by the provided configuration file. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

To use this framework definition, specify the preferred service status in the Intel® Cluster Checker configuration file using the following format:

<analyzer>
    <config>
        <preferred-services-status>
            <entry>service_name|compute|loaded|active|running</entry>
        </preferred-services-status>
    </config>
</analyzer>

This format takes five values:

  1. Service name
  2. Node role
  3. LOAD status - whether the unit definition was properly loaded
  4. ACTIVE status - the high level unit activation state (i.e. generalization of SUB)
  5. SUB status - the low level unit activation state (values depend on unit type)
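To make the five fields concrete, the following Python sketch parses one entry and compares it with an observed status. It is not part of Intel® Cluster Checker. The class name PreferredServiceStatus, its methods, and the observed values passed to mismatches are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferredServiceStatus:
    """One <entry> from preferred-services-status (hypothetical helper type)."""
    service: str
    node_role: str
    load: str
    active: str
    sub: str

    @classmethod
    def from_entry(cls, entry_text):
        """Parse 'service|role|LOAD|ACTIVE|SUB' into its five fields."""
        parts = [part.strip() for part in entry_text.split("|")]
        if len(parts) != 5:
            raise ValueError(
                f"expected 5 '|'-separated values, got {len(parts)}: {entry_text!r}"
            )
        return cls(*parts)

    def mismatches(self, load, active, sub):
        """Return the names of status fields that differ from the preferred values."""
        observed = {"load": load, "active": active, "sub": sub}
        return [name for name, value in observed.items()
                if getattr(self, name) != value]


# The example entry from the format shown above.
preferred = PreferredServiceStatus.from_entry(
    "service_name|compute|loaded|active|running"
)
# Hypothetical observed status for a stopped service.
print(preferred.mismatches("loaded", "inactive", "dead"))  # prints ['active', 'sub']
```

An empty list from mismatches means the observed LOAD, ACTIVE, and SUB values all match the preferred entry.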

sgemm_cpu_performance.xml

Verifies CPU performance using a single precision matrix multiplication routine and reports node outliers outside the range defined by the median absolute deviation. Includes the data providers:

Includes the analyzer extensions:

Includes the knowledge base module:

shell_functionality.xml

Identifies missing and failing bash, csh, sh and tcsh shells. Includes the framework definition:

Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

single.xml

Runs all framework definitions relevant to a single node. Evaluates CPU functionality, network connectivity, file systems, shell functionality, environment variables, and Perl and Python versions, and verifies clock offset and Intel® MPI Library functionality. Includes the framework definitions:

Includes the data providers:

ssf_compat-base-2016.0.xml

Verifies that the cluster meets Intel® Scalable System Framework base application compatibility requirements. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the framework definition:

Includes the data providers:

Includes the analyzer extensions:

Includes the knowledge base module:

ssf_compat-hpc-2016.0.xml

Verifies that the cluster meets Intel® Scalable System Framework high performance computer cluster application compatibility requirements. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the framework definitions:

Includes the data providers:

Includes the analyzer extensions:

Includes the knowledge base modules:

ssf_compliance_perl_version.xml

Determines if the Perl version is 5.10 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

Includes the analyzer extensions:

Includes the knowledge base module:

ssf_compliance_python_version.xml

Determines if the Python version is 2.6 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

ssf_compliance_shell.xml

Determines if shells meet Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

ssf_compliance_tcl_version.xml

Determines if the Tcl version is 8.5 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

Includes the connector extension module:

Includes the knowledge base module:

ssf_core-2016.0.xml

Verifies that the cluster meets Intel® Scalable System Framework core requirements. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the data providers:

Includes the analyzer extensions:

Includes the knowledge base modules:

ssf_environment_variables_mounted.xml

Verifies that TMPDIR and HOME environment variables meet Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

ssf_hpc-cluster-2016.0.xml

Verifies that the cluster meets Intel® Scalable System Framework requirements for a classic high performance compute cluster. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the Framework Definitions:

Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base modules:

ssf_kernel_version.xml

Verifies that the kernel is version 2.6.32 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data provider:

Includes the connector extension module:

Includes the knowledge base module:

ssf_libraries.xml

Verifies that the Intel® Scalable System Framework libraries are present. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

ssf_linux_based_tools_present.xml

Verifies that the Intel® Scalable System Framework (Intel® SSF) required Linux*-based tools are present. Includes the data providers:

Includes the connector extension module:

Includes the knowledge base module:

ssf_minimum_memory_requirements_base.xml

Verifies that the amount of physical memory per core is above 16 GiB per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

Includes the connector extension module:

Includes the knowledge base module:

ssf_minimum_memory_requirements_hpc

Verifies that the amount of physical memory per core is above 32 GiB per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the providers:

Includes the analyzer extension:

Includes the knowledge base module:

ssf_minimum_storage.xml

Verifies that the head node has at least 200 GiB of direct access storage and that all compute nodes have access to at least 80 GiB of persistent storage, per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

ssf_version.xml

Verifies that the Intel® Scalable System Framework (Intel® SSF) file is present and the file /etc/ssf-release contains the correct version and layers. Includes the data providers:

Includes the analyzer extension:

Includes the knowledge base module:

stream_memory_bandwidth_performance.xml

Identifies nodes with memory bandwidth outliers (as reported by the STREAM benchmark) outside the range defined by the median absolute deviation. Includes the data providers:

Includes the connector extension module:

Includes the knowledge base module:

tcl_functionality.xml

Verifies that Tcl is installed, functional and uniform across all nodes. Includes the data providers:

Includes the connector extension module:

Includes the knowledge base module:

tools.xml

Verifies that Tcl, Python, and Perl are installed, functional, and uniform. Includes the framework definitions:

 

Rules

The C Language Integrated Production System (CLIPS) is an expert system shell that combines an inference engine with a language for representing knowledge. Intel® Cluster Checker uses CLIPS to implement its knowledge base component and define CLIPS classes and rules. Each CLIPS class has one or more associated CLIPS rules. Each rule is identified by a unique ID. An example is all-to-all-data-is-too-old, which is associated with the all_to_all analyzer extension.

The remainder of this section contains a short description of the rules integrated into the knowledge base. Most rule names are composed of the class name plus a very short description of the rule. For instance, the cpu-data-is-too-old rule checks that the collected CPU data is recent.