Terminology
- Analyzer Extension: An analyzer extension takes raw data and converts it into a form that knowledge base modules can use.
- Configuration File: The configuration file is an XML file that provides additional configuration options for Intel® Cluster Checker.
- Data Provider: A data provider defines what data to collect from the cluster.
- Diagnosis: A diagnosis is a broader inference based on one or more observations. For example, non-uniform memory would lead to a broader non-uniform hardware diagnosis.
- Framework Definition: A framework definition is an XML file that defines the scope of data collection and analysis.
- Issue: An issue is an observation about the cluster. It may indicate a problem or provide additional information. An issue can be either an observation or a diagnosis.
- Knowledge Base Module: A knowledge base module contains a group of rules.
- Message Catalog: The message catalog contains messages for display. Each issue has a message ID that maps to a message in the message catalog.
- Nodefile: A nodefile is a file containing a list of nodes and their roles. The nodefile tells Intel® Cluster Checker which nodes to examine.
- Observation: An observation provides objective information about the cluster. It may indicate a problem or provide additional information. Observations may indicate a broader problem, in which case they would lead to a diagnosis. For example, a cluster with different amounts of memory per node would produce a memory not uniform observation.
- Remedy: A potential actionable solution to the issue.
- Rule: A rule takes data and, if the data meets certain conditions, triggers an observation or diagnosis. Rules are implemented in the CLIPS language.
Additional Configuration Options
Configuring the Database
You can specify a datastore configuration file in the main configuration file using the following tags:
<datastore_extensions>
<group path="datastore/intel64/">
<entry config_file="default_sqlite.xml">libsqlite.so</entry>
</group>
</datastore_extensions>
To use ODBC instead of sqlite3, enter libodbc.so in place of libsqlite.so. Use multiple entry tags to specify multiple databases, each with its own datastore configuration file.
The datastore configuration file, by default, is located at /opt/intel/clck/201n/etc/datastore/default_sqlite.xml and takes the following format:
<configuration>
<instance_name>clck_default</instance_name>
<source_parameters>read_only=false|source=$HOME/.clck/201n/clck.db</source_parameters>
<type>sqlite3</type>
<source_types>data</source_types>
</configuration>
The instance_name tag defines a database source name. This value must be unique.
The source_parameters tag determines whether or not to open the database in read-only mode and indicates which database to use.
The type tag specifies what type of database to use. Currently, the only accepted value is sqlite3.
The source_types tag specifies what source type to use. Currently, the only accepted value is data.
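As an illustration of the format above, the following Python sketch parses a datastore configuration and splits source_parameters into individual settings. It assumes that the settings are separated by | and written as key=value, which is inferred from the example rather than from a documented grammar. parse_datastore_config is a hypothetical helper, not part of Intel® Cluster Checker.

```python
import xml.etree.ElementTree as ET

# Sample copied from the default datastore configuration shown above.
SAMPLE = """<configuration>
<instance_name>clck_default</instance_name>
<source_parameters>read_only=false|source=$HOME/.clck/201n/clck.db</source_parameters>
<type>sqlite3</type>
<source_types>data</source_types>
</configuration>"""


def parse_datastore_config(xml_text):
    """Return the four documented tags, with source_parameters as a dict."""
    root = ET.fromstring(xml_text)
    # Assumption: parameters are '|'-separated key=value pairs.
    params = dict(
        item.split("=", 1)
        for item in root.findtext("source_parameters", default="").split("|")
        if "=" in item
    )
    return {
        "instance_name": root.findtext("instance_name"),
        "source_parameters": params,
        "type": root.findtext("type"),
        "source_types": root.findtext("source_types"),
    }


config = parse_datastore_config(SAMPLE)
print(config["instance_name"], config["source_parameters"])
```

A check like this can be run over several datastore configuration files to confirm that every instance_name is unique before the files are referenced from the main configuration file.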
Database Schema
The database consists of a single SQL view named clck_1. The Intel® Cluster Checker database is a standard SQLite* database and any SQLite* compatible tool may be used to browse the database contents. In addition, the clckdb utility is provided with Intel® Cluster Checker (see clckdb -h for more information).
rowid (INTEGER)
Provider (TEXT)
Hostname (TEXT)
- Hostname of the node where the data provider ran
num_nodes (INTEGER)
- Number of nodes used by the data provider
node_names (TEXT)
- Comma-separated list of nodes used by the data provider (empty if num_nodes = 1)
Exit_status (INTEGER)
- Exit status of the data provider
Timestamp (INTEGER)
- Timestamp when the data provider started (seconds since the UNIX epoch)
Duration (REAL)
- Data provider walltime (seconds)
Encoding (INTEGER)
- Encoding format of the STDOUT and STDERR columns (0 = no encoding, 1 = base64 encoding)
STDOUT (TEXT)
- Data provider standard output
STDERR (TEXT)
- Data provider standard error
OptionID (TEXT)
- The ID of the option set with which the provider was run
Version (INTEGER)
- Output format version of the data provider
Username (TEXT)
- Username of the user who ran the data provider
Unique_timestamp (INTEGER)
- Unique timestamp when the data was collected (seconds since the UNIX epoch)
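Because the database is a standard SQLite* database, the columns above can also be read with a short script. The following Python sketch selects the rows for one data provider, decodes STDOUT when the Encoding column is 1, and converts Timestamp to a UTC date. The function name read_provider_output and the in-memory stand-in database are assumptions made for illustration; a real database would be opened from its file path, and only a subset of the documented columns is recreated here.

```python
import base64
import sqlite3
from datetime import datetime, timezone


def read_provider_output(conn, provider):
    """Yield (hostname, start time, exit status, stdout) for one provider."""
    query = (
        "SELECT Hostname, Timestamp, Exit_status, Encoding, STDOUT "
        "FROM clck_1 WHERE Provider = ? ORDER BY Timestamp"
    )
    for host, ts, status, encoding, stdout in conn.execute(query, (provider,)):
        if encoding == 1:  # 1 = base64 encoding, 0 = no encoding
            stdout = base64.b64decode(stdout).decode("utf-8", errors="replace")
        started = datetime.fromtimestamp(ts, tz=timezone.utc)
        yield host, started, status, stdout


# Stand-in database for demonstration only: a table carrying a subset of
# the documented columns, with one hypothetical base64-encoded row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clck_1 (Provider TEXT, Hostname TEXT, Exit_status INTEGER, "
    "Timestamp INTEGER, Encoding INTEGER, STDOUT TEXT)"
)
conn.execute(
    "INSERT INTO clck_1 VALUES (?, ?, ?, ?, ?, ?)",
    ("uname", "node01", 0, 0, 1, base64.b64encode(b"Linux\n").decode("ascii")),
)
rows = list(read_provider_output(conn, "uname"))
print(rows)
```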
List of Analyzer Extensions
all_to_all
cpu
- CPU compliance and uniformity
datconf
- InfiniBand* DAPL configuration
devices
- Intel® Select Solutions for Simulation and Modeling devices compliance
dgemm
- Floating point performance by double precision matrix multiplication
environment
ethernet
- Ethernet driver uniformity and wellness
files
hardware
hpcg_cluster
- High Performance Conjugate Gradients (HPCG) benchmark four node
hpcg_single
- High Performance Conjugate Gradients (HPCG) benchmark single node
hpl
imb_pingpong
infiniband
- InfiniBand* uniformity and wellness
iozone
kernel
kernel_param
- Kernel parameter uniformity
libraries
- Intel® Scalable System Framework runtime library compliance
lsb_tools
lshw
lustre
- Lustre* storage cluster functionality
memory
mount
- Mount point compliance and uniformity
mpi_internode
- Multi-node Intel® MPI Library functionality
mpi_local
- Single-node Intel® MPI Library functionality
ntp
opa
- Intel® Omni-Path Host Fabric Interface uniformity and wellness
perl
- Perl* compliance, uniformity, and functionality
process
python
- Python* compliance, uniformity, and functionality
rpm
rpm_baseline
sgemm
- Floating point performance by single precision matrix multiplication
shells
ssf_version
- Intel® Scalable System Framework version compliance
storage
stream
- Memory bandwidth performance
tcl
- Tcl compliance, uniformity, and functionality
Blacklists
Kernel Parameters Blacklist
The following is a comprehensive list of blacklisted kernel parameters. The uniformity of these kernel parameters is checked in the kernel_parameter_uniformity Framework Definition. This list is located in the kernel_param analyzer extension and is not accessible to the user. The user can specify other blacklisted items through the default configuration file.
- dev.cdrom.autoclose
- dev.cdrom.autoeject
- dev.cdrom.check_media
- dev.cdrom.debug
- dev.cdrom.info
- dev.cdrom.lock
- fs.binfmt_misc.jexec
- fs.dentry-state
- fs.epoll.max_user_watches
- fs.file-max
- fs.file-nr
- fs.inode-nr
- fs.inode-state
- fs.nfs.
- fs.quota.syncs
- kernel.domainname
- kernel.host-name
- kernel.hostname
- kernel.hung_task_warnings
- kernel.ns_last_pid
- kernel.perf_event_max_sample_rate
- kernel.pty.nr
- kernel.random.
- kernel.sched_domain.
- kernel.shmmax
- kernel.threads-max
- lnet.buffers
- lnet.fefslog_daemon_pid
- lnet.lnet_memused
- lnet.memused
- lnet.net_status
- lnet.nis
- lnet.peers
- lnet.routes
- lnet.stats
- lustre.memused
- net.bridge.bridge-n
- net.core.netdev_rss_key
- net.ipv4.conf.
- net.ipv4.neigh.
- net.ipv4.net-filter.
- net.ipv4.netfilter.ip_conntrack_count
- net.ipv4.rt_cache_rebuild_count
- net.ipv4.tcp_mem
- net.ipv4.udp_mem
- net.ipv6
- net.netfilter.nf_conntrack_count
- sunrpc.transports
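Several entries in the list above end in a period (for example fs.nfs. and kernel.random.), which suggests that they cover a whole family of parameters. The following Python sketch shows one way such a list could be applied. It rests on two assumptions that this document does not confirm: that an entry matches every parameter name beginning with it, and that matching parameters are skipped when values are compared. The function name and the sample parameter names are hypothetical.

```python
# A few entries copied from the list above; entries ending in '.' look
# like prefixes for a whole family of parameters.
BLACKLIST = ("dev.cdrom.info", "fs.nfs.", "kernel.random.", "net.ipv6")


def is_blacklisted(name, blacklist=BLACKLIST):
    """Assumption: an entry matches any parameter name that starts with it."""
    return name.startswith(tuple(blacklist))


# Hypothetical sysctl parameter names for illustration.
names = [
    "kernel.random.boot_id",
    "kernel.shmall",
    "net.ipv6.conf.all.mtu",
    "vm.swappiness",
]
checked = [n for n in names if not is_blacklisted(n)]
print(checked)
```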
Lshw Blacklist
The following is a comprehensive list of items blacklisted by the lshw check through the regex function. This blacklist is located in the lshw analyzer extension and is not accessible to the user. The user can specify other blacklisted items through the default configuration file.
- regex(".*bank.*clock")
- regex(".*bank.*product")
- regex(".*bank.*vendor")
- regex(".*cache.*instruction")
- regex(".*cache.*unified")
- regex(".*cdrom.*")
- regex(".*generic.*")
- regex(".*irq")
- regex(".*isa.*")
- regex(".*network.*size")
- regex(".*physid")
- regex(".*signature.*")
- regex(".*sku.*")
- regex(".*usb.*")
- regex(".*volume.*")
- regex("^pci.*businfo.*$")
- regex("^pci.*cap_list.*$")
- regex("^pci.*ioport.*$")
- regex("^pci.*memory.*")
- regex("^pci.*width.*$")
- regex("^cpu:.*-size$")
- regex("^cpu:.*-capacity$")
- regex(".*scsi:*[0-9]*-driver")
- regex(".*scsi:*[0-9]*-businfo")
- regex(".*scsi:*[0-9]*-logicalname")
- regex(".*scsi:*[0-9]*-scsi-host")
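To see which items a pattern from the list above would exclude, the expressions can be tried out in Python. This is only a sketch: the key strings are hypothetical, since the format of the lshw keys is not described here, and it assumes that each pattern is matched from the start of the key, which may differ from how the lshw analyzer extension applies its regex function.

```python
import re

# Patterns copied from the list above (without the regex(...) wrapper).
PATTERNS = [r".*irq", r".*physid", r"^cpu:.*-size$", r"^pci.*width.*$"]
COMPILED = [re.compile(p) for p in PATTERNS]


def is_blacklisted(key):
    """Assumption: a key is skipped if any pattern matches from its start."""
    return any(p.match(key) for p in COMPILED)


# Hypothetical lshw-style keys; the real key format is not documented here.
keys = ["cpu:0-size", "cpu:0-product", "pci:1-width", "memory-physid"]
kept = [k for k in keys if not is_blacklisted(k)]
print(kept)
```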
Included Framework Definitions
All included Framework Definitions are located at /opt/intel/clck/201n/etc/fwd.
basic_internode_connectivity.xml
Validates internode accessibility by confirming the consistency of node IP addresses. Includes the providers:
Includes the analyzer extension:
Includes the knowledge base module:
- basic_internode_connectivity.clp
basic_shells.xml
Identifies missing and failing bash and sh shells. Includes the providers:
Includes the analyzer extension:
Includes the knowledge base module:
benchmarks.xml
Runs all benchmarks and their dependencies. These benchmarks evaluate CPU performance, floating point computation, network bandwidth and latency, I/O bandwidth, and memory bandwidth. Includes the Framework Definitions:
- dgemm_cpu_performance.xml
- ethernet.xml
- hpl_cluster_performance.xml
- imb_pingpong_fabric_performance.xml
- iozone_disk_bandwidth_performance.xml
- sgemm_cpu_performance.xml
- stream_memory_bandwidth_performance.xml
clock.xml
Verifies that the clock offset is not above the threshold, the NTP client is connected to the NTP server, and the ntpq or chronyc data is recent and available in the database. Includes the Framework Definition:
- network_time_uniformity.xml
cluster.xml
Ensures that all nodes in the cluster are able to communicate with one another by confirming the consistency of node IP addresses, verifying Ethernet consistency, executing the HPL benchmark and the Intel® MPI Benchmarks PingPong benchmark, and ensuring that the Intel® MPI Library is functional and can successfully run across the cluster. Includes the Framework Definitions:
- basic_internode_connectivity.xml
- ethernet.xml
- hpl_cluster_performance.xml
- imb_pingpong_fabric_performance.xml
- mpi_multinode_functionality.xml
cpu.xml
Verifies the uniformity of CPU model names, the Intel® Turbo Boost Technology status, the number of logical cores, the number of threads per core, and the presence of kernel flags. Confirms that the CPU is a 64-bit Intel® processor. For Intel® Xeon Phi™ processors, verifies the uniformity of cluster/memory modes; verifies the nohz_full, isolcpus, and rcu_nocbs kernel configuration parameters; and confirms that the memory-side cache file is the latest version. Includes the providers:
- cpuid
- cpuinfo
- cpupower
- dmesg
- hwloc_dump_hwdata
- intel_pstate_status
- kernel_tools
- lscpu
- numactl
- uname
Includes the analyzer extension:
Includes the knowledge base module:
dapl_fabric_providers_present.xml
Verifies that DAPL (Direct Access Programming Libraries) providers are present. Includes the providers:
- datconf
- ibstat
- ipaddr
- uname
Includes the analyzer extension:
Includes the knowledge base module:
- dapl_fabric_providers_present.clp
dgemm_cpu_performance.xml
Uses a double precision matrix multiplication routine to verify CPU performance. Reports nodes with substandard FLOPS relative to a threshold based on the hardware, and reports performance outliers outside the range defined by the median absolute deviation. Includes the providers:
- cpuid
- cpuinfo
- cpupower
- dgemm
- dmesg
- dmidecode
- hwloc_dump_hwdata
- intel_pstate_status
- kernel_tools
- lscpu
- meminfo
- numactl
- uname
Includes the analyzer extensions:
Includes the knowledge base module:
- dgemm_cpu_performance.clp
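Several framework definitions, including this one, report outliers outside the range defined by the median absolute deviation (MAD). The following Python sketch illustrates the general technique only. The multiplier k and the sample values are assumptions chosen for illustration, and the actual thresholds used by the knowledge base modules are not stated in this document.

```python
from statistics import median


def mad_outliers(values, k=3.0):
    """Return indices of values farther than k * MAD from the median.

    k is an illustrative choice, not a documented threshold.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations are zero for at least half of the values, so any
        # value that differs from the median stands out.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values) if abs(v - med) > k * mad]


# Hypothetical per-node results, for example GFLOPS reported by one benchmark.
gflops = [1010.0, 1002.0, 998.0, 1005.0, 640.0, 1001.0]
outliers = mad_outliers(gflops)
print(outliers)
```

The median and the MAD are used rather than the mean and standard deviation because a single very slow node shifts them far less, so the slow node does not mask itself.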
environment_variables_uniformity.xml
Verifies the uniformity of all environment variables. Includes the providers:
Includes the analyzer extension:
Includes the knowledge base module:
- environment_variables_uniformity.clp
ethernet.xml
Verifies the consistency of Ethernet drivers, driver versions, and MTU (maximum transmission unit) across the cluster. Verifies that Ethernet interrupt coalescing is enabled.
Includes the providers:
- ethtool
- ethtool_show_coalesce
- ipaddr
- uname
Includes the analyzer extension:
Includes the knowledge base module:
exclude_hpl.xml
Provides a complete analysis of the cluster, excluding the hpl_cluster_performance framework definition and analysis related to specific specs. Includes the framework definitions:
- basic_internode_connectivity.xml
- cpu.xml
- dapl_fabric_providers_present.xml
- dgemm_cpu_performance.xml
- environment_variables_uniformity.xml
- ethernet.xml
- file_system_uniformity.xml
- imb_pingpong_fabric_performance.xml
- infiniband.xml
- iozone_disk_bandwidth_performance.xml
- kernel_version_uniformity.xml
- kernel_parameter_uniformity.xml
- local_disk_storage.xml
- lshw_hardware_uniformity.xml
- lustre_mounted.xml
- memory_uniformity.xml
- mpi_local_functionality.xml
- mpi_multinode_functionality.xml
- network_time_uniformity.xml
- node_process_status.xml
- opa.xml
- perl_functionality.xml
- python_functionality.xml
- rpm_uniformity.xml
- sgemm_cpu_performance.xml
- shell_functionality.xml
- stream_memory_bandwidth_performance.xml
- tcl_functionality.xml
Includes the providers:
- chkconfig
- checksums
- loadavg
- mtab
- ulimit
- who
files_snapshot.xml
Looks for configuration file changes between snapshot_x and snapshot_y. Includes the providers:
- files_head
- files_compute
- uname
Includes the analyzer extension:
Includes the knowledge base module:
file_system_uniformity.xml
Confirms that the /tmp directory has appropriate permissions, /dev/shm and /proc are properly mounted, and the home path is uniform and shared across the cluster. Includes the providers:
- mount
- stat_home
- stat_tmp
- uname
Includes the analyzer extension:
Includes the knowledge base module:
- file_system_uniformity.clp
hardware.xml
Verifies CPU configuration, InfiniBand functionality, hardware uniformity, and Intel® Omni-Path Host Fabric Interface functionality. Includes the Framework Definitions:
- cpu.xml
- infiniband.xml
- lshw_hardware_uniformity.xml
- opa.xml
hardware_snapshot.xml
Looks for hardware location changes between snapshot_x and snapshot_y. Includes the providers:
Includes the analyzer extension:
Includes the knowledge base module:
health.xml
Provides a complete analysis of the cluster, excluding analysis related to specific specs. Includes the Framework Definitions:
- basic_internode_connectivity.xml
- basic_shells.xml
- cpu.xml
- dapl_fabric_providers_present.xml
- dgemm_cpu_performance.xml
- environment_variables_uniformity.xml
- ethernet.xml
- file_system_uniformity.xml
- hpl_cluster_performance.xml
- imb_pingpong_fabric_performance.xml
- infiniband.xml
- kernel_version_uniformity.xml
- kernel_parameter_uniformity.xml
- local_disk_storage.xml
- lshw_hardware_uniformity.xml
- lustre_mounted.xml
- memory_uniformity.xml
- mpi_local_functionality.xml
- mpi_multinode_functionality.xml
- network_time_uniformity.xml
- node_process_status.xml
- opa.xml
- perl_functionality.xml
- python_functionality.xml
- rpm_uniformity.xml
- services_status.xml
- sgemm_cpu_performance.xml
- shell_functionality.xml
- stream_memory_bandwidth_performance.xml
- tcl_functionality.xml
hpcg_cluster
The High Performance Conjugate Gradients (HPCG) Benchmark project is an effort to create a new metric for ranking HPC systems. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of applications. This will give an incentive to computer system designers to invest in capabilities that will have an impact on the collective performance of these applications. Intel® Cluster Checker uses the Intel® Optimized High Performance Conjugate Gradient Benchmark, which is executed on four-node sub-clusters as an Intel® MPI Library based benchmark. Includes the providers:
Includes the analyzer extension:
Includes the knowledge base modules
hpcg_single
The High Performance Conjugate Gradients (HPCG) Benchmark project is an effort to create a new metric for ranking HPC systems. HPCG is designed to exercise computational and data access patterns that more closely match a broad set of applications. This will give an incentive to computer system designers to invest in capabilities that will have an impact on the collective performance of these applications. Intel® Cluster Checker uses the Intel® Optimized High Performance Conjugate Gradient Benchmark, which is executed on each individual node as an Intel® MPI Library based benchmark. Includes the providers:
Includes the analyzer extension:
Includes the knowledge base modules
hpl_cluster_performance.xml
Reports if the HPL benchmark ran successfully on the cluster and each pair of nodes within the cluster. Reports performance outliers for the pairwise execution outside the range defined by the median absolute deviation. Includes the providers:
- hpl_cluster
- hpl_pairwise
- uname
Includes the analyzer extension:
Includes the knowledge base module:
- hpl_cluster_performance.clp
imb_pingpong_fabric_performance.xml
Confirms that the Intel® MPI Benchmarks PingPong benchmark ran successfully for nodes within the cluster. Also reports network bandwidth and latency outliers defined by other measured values in the same grouping and if latency or network bandwidth fall below a certain threshold. Includes the providers:
- datconf
- ethtool
- ethtool_show_coalesce
- ibstat
- imb_pingpong
- ipaddr
- lspci
- mpi_internode
- mpi_local
- ofedinfo
- tmiconf
- udevadm-net
- uname
Includes the analyzer extension:
Includes the knowledge base module:
- imb_pingpong_fabric_performance.clp
imb_pingpong.xml
Confirms that the Intel® MPI Benchmarks PingPong benchmark ran successfully for nodes within the cluster. Includes additional framework definitions that identify problems that could cause this benchmark to fail to run. Includes the Framework Definitions:
- imb_pingpong_fabric_performance.xml
- infiniband.xml
- mpi_multinode_functionality.xml
- mpi_local_functionality.xml
- opa.xml
infiniband.xml
Verifies InfiniBand functionality by confirming the consistency of InfiniBand hardware and firmware, confirming that memlock size is sufficient and consistent across the cluster, verifying that InfiniBand HCA ports are in the Active state and the LinkUp physical state, verifying that HCA states are consistent, confirming that the InfiniBand HCA rate is consistent, and verifying InfiniBand card presence and functionality. Includes the framework definition:
- dapl_fabric_providers_present.xml
Includes the data providers:
- datconf
- ibstat
- ibv_devinfo
- lspci
- ofedinfo
- ulimit
- uname
Includes the analyzer extension:
Includes the knowledge base module:
iozone_disk_bandwidth_performance.xml
Verifies the I/O performance of a storage device by searching for I/O bandwidth outliers outside the range defined by the median absolute deviation. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
- iozone_disk_bandwidth_performance.clp
kernel_parameter_preferred
Verifies that each specified kernel parameter has its preferred value across the cluster. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base modules:
- kernel_parameter_preferred.clp
In order to use this framework definition, specify any preferred kernel parameter values in the Intel® Cluster Checker config file using the following format:
<analyzer>
<config>
<kernel-param-preferred>
<entry>kernel.parameter|node_role|value</entry>
</kernel-param-preferred>
</config>
</analyzer>
In this format, the first value is the kernel parameter, the second value is the node role, and the third value is the preferred value for the given kernel parameter.
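The three-part entry format can be checked before the configuration file is used. The following Python sketch reads the entry tags and splits each one into its parameter, node role, and preferred value. The sample parameter vm.swappiness, the role compute, and the value 10 are hypothetical, and read_preferred is an illustrative helper rather than part of Intel® Cluster Checker.

```python
import xml.etree.ElementTree as ET

# Hypothetical preference: the parameter, role, and value are examples only.
SAMPLE = """<analyzer>
<config>
<kernel-param-preferred>
<entry>vm.swappiness|compute|10</entry>
</kernel-param-preferred>
</config>
</analyzer>"""


def read_preferred(xml_text):
    """Return (parameter, node_role, value) tuples from the entry tags."""
    root = ET.fromstring(xml_text)
    result = []
    for entry in root.iter("entry"):
        # maxsplit=2 keeps any further '|' characters inside the value.
        parameter, role, value = entry.text.strip().split("|", 2)
        result.append((parameter, role, value))
    return result


preferred = read_preferred(SAMPLE)
print(preferred)
```

Because ET.fromstring rejects XML that is not well formed, a check like this also catches an entry tag that was never closed.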
kernel_parameter_uniformity.xml
Verifies that kernel parameter data is uniform across the cluster. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
- kernel_parameter_uniformity.clp
kernel_version_uniformity.xml
For each node, verifies that the kernel version is the same as at least 90% of the other nodes. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
- kernel_version_uniformity.clp
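The 90% criterion described above can be sketched in Python as follows. This is an interpretation of the sentence, not the rule as implemented in kernel_version_uniformity.clp: for each node, it counts how many of the other nodes report the same kernel version and flags the node if that share is below 90%. The node names and kernel versions are hypothetical.

```python
from collections import Counter


def nonuniform_nodes(versions, fraction=0.9):
    """Return nodes whose kernel version is shared by fewer than
    `fraction` of the other nodes."""
    counts = Counter(versions.values())
    others = len(versions) - 1
    flagged = []
    for node, version in sorted(versions.items()):
        same = counts[version] - 1  # other nodes with the same version
        if others > 0 and same / others < fraction:
            flagged.append(node)
    return flagged


# Hypothetical cluster: 20 nodes on one kernel, 1 node on another.
versions = {f"node{i:02d}": "3.10.0-957" for i in range(20)}
versions["node20"] = "3.10.0-862"
flagged = nonuniform_nodes(versions)
print(flagged)
```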
local_disk_storage.xml
Verifies that there is enough free memory on each node. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base modules
lshw_hardware_uniformity.xml
Verifies the uniformity of hardware installed across the cluster. Determines missing hardware parameters. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
- lshw_hardware_uniformity.clp
lustre_mounted.xml
Verifies that the Lustre kernel modules are loaded and the object storage targets are active, mounted, uniform and writable across the cluster. Includes the data providers:
- lsmod
- lustre_check_servers
- lustre_logs
- lustre_df
- lustre_stripe
- uname
Includes the analyzer extension:
Includes the knowledge base module:
memory_uniformity.xml
Determines if the amount of physical memory is uniform across the cluster. Includes the data providers:
- cpuid
- cpuinfo
- cpupower
- dmesg
- dmidecode
- hwloc_dump_hwdata
- kernel_tools
- lscpu
- meminfo
- numactl
- uname
Includes the analyzer extensions:
Includes the knowledge base module:
mpi_local_functionality.xml
Determines if MPI is present and the path is uniform with all other nodes. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
- mpi_local_functionality.clp
mpi_multinode_functionality.xml
Verifies that the Intel® MPI Library is functional and can successfully run across the cluster. Includes the data providers:
Includes the connector extension module:
Includes the knowledge base module:
- mpi_multinode_functionality.clp
mpi.xml
Verifies that MPI is present, that the path is uniform across nodes, and that MPI successfully runs across the cluster. Runs benchmarks related to MPI performance. Includes the framework definitions:
- hpl_cluster_performance.xml
- imb_pingpong_fabric_performance.xml
- mpi_local_functionality.xml
- mpi_multinode_functionality.xml
network_time_uniformity.xml
Verifies that the clock offset is not above the threshold, the Network Time Protocol (NTP) client is connected to the NTP server, and the ntpq or chronyc data is recent and available in the database. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
- network_time_uniformity.clp
node_process_status.xml
Identifies nodes with zombie processes and nodes with processes that have high CPU and memory requirements. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
opa.xml
Verifies Intel® Omni-Path Architecture (Intel® OPA) Interface functionality by confirming the consistency of Intel® OPA hardware and firmware, by verifying that Intel® OPA HCA ports are in the Active state and the LinkUp physical state, by verifying that HCA states are consistent, by confirming that the Intel® OPA HCA rate is consistent, by verifying that an Intel® OPA subnet manager is running, and by confirming that memlock size is sufficient and consistent across the cluster. Includes the data providers:
- lspci
- opahfirev
- opatools
- opasmaquery
- saquery
- ulimit
- uname
Includes the analyzer extension:
Includes the knowledge base module:
perl_functionality.xml
Verifies the presence, functionality, and consistency of the Perl version. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
python_functionality.xml
Verifies the presence, functionality, and consistency of the Python version. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
rpm_snapshot.xml
Checks for RPMs installed across the cluster and compares the data from snapshot_x with the data from snapshot_y. Includes the providers:
Includes the analyzer extension:
Includes the knowledge base module:
rpm_uniformity.xml
Verifies the uniformity of the RPMs installed across the cluster and reports absent and superfluous RPMs. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
select_solutions_sim_mod_benchmarks.xml
Checks benchmark performance against thresholds required by Intel® Select Solutions for Simulation and Modeling. These benchmarks evaluate CPU performance for double precision floating point operations on a single node and a four node cluster, network bandwidth and latency, and memory bandwidth. Includes the data providers:
- dgemm
- hpcg_cluster
- hpcg_single
- hpl_cluster
- imb_pingpong
- stream
- uname
Includes the analyzer extensions:
- dgemm
- hpl
- hpcg_cluster
- hpcg_single
- imb_pingpong
- stream
Includes the knowledge base module:
- select_solutions_sim_mod_benchmarks.clp
select_solutions_sim_mod_priv.xml
Verifies that the cluster meets the part of the Intel® Select Solutions for Simulation and Modeling requirements that has to be checked as a privileged user. It checks the system requirements for processor, memory, and fabric. Must be run as a privileged user. A pass of this framework definition along with a pass of the framework definition select_solutions_sim_mod_user.xml (run as a non-privileged user) will verify compliance with Intel® Select Solutions for Simulation and Modeling. Includes the data providers:
- cpuid
- cpuinfo
- cpupower
- dmesg
- hwloc_dump_hwdata
- intel_pstate_status
- kernel_tools
- lspci_verbose
- lscpu
- numactl
- dmidecode
- meminfo
- uname
Includes the analyzer extensions:
Includes the knowledge base module:
- select_solutions_sim_mod_system_requirements.clp
select_solutions_sim_mod_user.xml
Verifies that the cluster meets the part of the Intel® Select Solutions for Simulation and Modeling requirements that has to be checked as a non-privileged user. It checks benchmark performance and compliance with Intel® Scalable System Framework. A pass of this framework definition along with a pass of the framework definition select_solutions_sim_mod_priv.xml (run as a privileged user) will verify compliance with Intel® Select Solutions for Simulation and Modeling. Includes the framework definitions:
- select_solutions_sim_mod_benchmarks.xml
- ssf_compat-hpc-2016.0.xml
services_status.xml
Verifies that the status of each service is as required by the provided configuration file. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
To use this framework definition, specify the preferred service status in the Intel® Cluster Checker configuration file using the following format:
<analyzer>
<config>
<preferred-services-status>
<entry>service_name|compute|loaded|active|running</entry>
</preferred-services-status>
</config>
</analyzer>
This format takes five values:
- Service name
- Node role
- LOAD status - whether the unit definition was properly loaded
- ACTIVE status - the high level unit activation state (i.e. generalization of SUB)
- SUB status - the low level unit activation state (values depend on unit type)
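When many services need to be listed, the block above can be generated rather than typed by hand. The following Python sketch builds the analyzer block from five-value tuples in the order given above. The service names sshd and crond are hypothetical examples, the status values follow the example entry, and services_config is an illustrative helper rather than part of Intel® Cluster Checker.

```python
import xml.etree.ElementTree as ET


def services_config(services):
    """Build the <analyzer> block from (name, role, load, active, sub) tuples."""
    analyzer = ET.Element("analyzer")
    config = ET.SubElement(analyzer, "config")
    status = ET.SubElement(config, "preferred-services-status")
    for service in services:
        if len(service) != 5:
            raise ValueError(f"expected five values, got {service!r}")
        ET.SubElement(status, "entry").text = "|".join(service)
    return ET.tostring(analyzer, encoding="unicode")


# Hypothetical service names; the status values follow the example above.
xml_text = services_config([
    ("sshd", "compute", "loaded", "active", "running"),
    ("crond", "compute", "loaded", "active", "running"),
])
print(xml_text)
```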
sgemm_cpu_performance.xml
Verifies CPU performance using a single precision matrix multiplication routine and reports node outliers outside the range defined by the median absolute deviation. Includes the data providers:
- cpuid
- cpuinfo
- cpupower
- sgemm
- dmesg
- hwloc_dump_hwdata
- intel_pstate_status
- kernel_tools
- lscpu
- numactl
- uname
Includes the analyzer extensions:
Includes the knowledge base module:
- sgemm_cpu_performance.clp
shell_functionality.xml
Identifies missing and failing bash, csh, sh and tcsh shells. Includes the framework definition:
Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
single.xml
Runs all framework definitions relevant to a single node. Evaluates CPU functionality, network connectivity, file systems, shell functionality, environment variables, and Perl and Python versions, and verifies clock offset and Intel® MPI Library functionality. Includes the framework definitions:
- cpu.xml
- ethernet.xml
- environment_variables_uniformity.xml
- file_system_uniformity.xml
- lustre_mounted.xml
- mpi_local_functionality.xml
- network_time_uniformity.xml
- opa.xml
- perl_functionality.xml
- python_functionality.xml
- shell_functionality.xml
Includes the data providers:
- checksums
- chkconfig
- datconf
- df
- dgemm
- ibstat
- ibv_devinfo
- ifconfig
- iozone
- issue
- kernel_tools
- ldconfig
- loadavg
- lsb
- lsb_tools
- lscpu
- lshw
- meminfo
- modinfo
- mtab
- numactl
- ofedinfo
- printenv
- ps
- resolvconf
- rpm_list
- ssf_version
- sshdconf
- stat_home
- stat_tmp
- stream
- sysctl
- tcl
- tmiconf
- tmp
- udevadm-net
- uptime
- who
ssf_compat-base-2016.0.xml
Verifies that the cluster meets Intel® Scalable System Framework base application compatibility requirements. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the framework definition:
Includes the data providers:
- cpuid
- cpuinfo
- cpupower
- df
- dmesg
- dmidecode
- hwloc_dump_hwdata
- kernel_tools
- libraries
- lsb_tools
- lscpu
- meminfo
- mount
- numactl
- perl
- python
- shells
- stat_home
- stat_tmp
- tcl
- uname
Includes the analyzer extensions:
Includes the knowledge base module:
- ssf_compat-base-2016.0.xml
ssf_compat-hpc-2016.0.xml
Verifies that the cluster meets Intel® Scalable System Framework high performance computer cluster application compatibility requirements. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the framework definitions:
- ssf_hpc-cluster-2016.0.xml
- ssf_compat-base-2016.0.xml
Includes the data providers:
- all_to_all
- mpi_local
- uname
Includes the analyzer extensions:
Includes the knowledge base modules:
- ssf_compat-hpc-2016.0.clp
ssf_compliance_perl_version.xml
Determines if the Perl version is 5.10 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:
Includes the analyzer extensions:
Includes the knowledge base module:
- ssf_compliance_perl_version.clp
ssf_compliance_python_version.xml
Determines if the Python version is 2.6 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
- ssf_compliance_python_version.clp
ssf_compliance_shell.xml
Determines if shells meet Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
ssf_compliance_tcl_version.xml
Determines if the Tcl version is 8.5 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:
Includes the connector extension module:
Includes the knowledge base module:
- ssf_compliance_tcl_version.clp
ssf_core-2016.0.xml
Verifies that the cluster meets Intel® Scalable System Framework core requirements. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the data providers:
- cpuid
- cpuinfo
- cpupower
- dmesg
- hwloc_dump_hwdata
- intel_pstate_status
- kernel_tools
- lscpu
- mount
- numactl
- printenv
- ssf_version
- stat_home
- stat_tmp
- uname
Includes the analyzer extensions:
- cpu
- environment
- kernel
- mount
- ssf_version
Includes the knowledge base modules
ssf_environment_variables_mounted.xml
Verifies that TMPDIR and HOME environment variables meet Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:
- mount
- stat_home
- stat_tmp
- uname
Includes the analyzer extension:
Includes the knowledge base module:
- ssf_environment_variables_mounted.clp
ssf_hpc-cluster-2016.0.xml
Verifies that the cluster meets Intel® Scalable System Framework requirements for a classic high performance compute cluster. See the Intel® Scalable System Framework Architecture Specification version 2016.0 for more information. Includes the Framework Definitions:
Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base modules
ssf_kernel_version.xml
Verifies that the kernel is version 2.6.32 or greater per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data provider:
Includes the connector extension module:
Includes the knowledge base module:
ssf_libraries.xml
Verifies that the Intel® Scalable System Framework libraries are present. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
ssf_linux_based_tools_present.xml
Verifies that the Intel® Scalable System Framework (Intel® SSF) required Linux*-based tools are present. Includes the data providers:
Includes the connector extension module:
Includes the knowledge base module:
- ssf_linux_based_tools_present.clp
ssf_minimum_memory_requirements_base.xml
Verifies that the amount of physical memory per core is above 16 GiB per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:
- cpuid
- cpuinfo
- cpupower
- dmesg
- hwloc_dump_hwdata
- kernel_tools
- lscpu
- meminfo
- numactl
- uname
Includes the connector extension module:
Includes the knowledge base module:
- ssf_minimum_memory_requirements_base.clp
ssf_minimum_memory_requirements_hpc.xml
Verifies that the amount of physical memory per core is above 32 GiB per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:
- cpuid
- cpuinfo
- cpupower
- dmesg
- hwloc_dump_hwdata
- kernel_tools
- lscpu
- meminfo
- numactl
- uname
Includes the analyzer extension:
Includes the knowledge base module:
- ssf_minimum_memory_requirements_hpc.clp
ssf_minimum_storage.xml
Verifies that the head node has at least 200 GiB of direct access storage and that all compute nodes have access to at least 80 GiB of persistent storage, per Intel® Scalable System Framework (Intel® SSF) requirements. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
ssf_version.xml
Verifies that the Intel® Scalable System Framework (Intel® SSF) file is present and the file /etc/ssf-release contains the correct version and layers. Includes the data providers:
Includes the analyzer extension:
Includes the knowledge base module:
stream_memory_bandwidth_performance.xml
Identifies nodes with memory bandwidth outliers (as reported by the STREAM benchmark) outside the range defined by the median absolute deviation. Includes the data providers:
Includes the connector extension module:
Includes the knowledge base module:
- stream_memory_bandwidth_performance.clp
tcl_functionality.xml
Verifies that Tcl is installed, functional and uniform across all nodes. Includes the data providers:
Includes the connector extension module:
Includes the knowledge base module:
tools.xml
Verifies that Tcl, Python, and Perl are installed, functional, and uniform. Includes the framework definitions:
- perl_functionality.xml
- python_functionality.xml
- tcl_functionality.xml
Rules
The C Language Integrated Production System (CLIPS) is an expert system shell that combines an inference engine with a language for representing knowledge. Intel® Cluster Checker uses CLIPS to implement its knowledge base component, which defines CLIPS classes and rules. Each CLIPS class has one or more associated CLIPS rules. Each rule is identified by a unique ID. An example is all-to-all-data-is-too-old, which is associated with the all_to_all analyzer extension.
The remainder of this section contains a short description of the rules integrated into the knowledge base. Most rule names are composed of the class name plus a very short description of the rule. For instance, the cpu-data-is-too-old rule checks that the CPU data collected is recent.
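The "data is too old" rules in this section define too old as no data from the last 7 days (604800 seconds). The following Python sketch illustrates that freshness test. It is not the CLIPS implementation; the function name is_too_old and the use of Unix timestamps are assumptions made for illustration.

```python
MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # 7 days = 604800 seconds


def is_too_old(latest_timestamp, now, max_age=MAX_AGE_SECONDS):
    """Return True if the most recent data point is older than max_age.

    Both arguments are Unix timestamps in seconds (an assumption of this
    sketch; the real rules operate on data stored in the database).
    """
    return (now - latest_timestamp) > max_age


now = 1_700_000_000
fresh = is_too_old(now - 3600, now)        # data collected one hour ago
stale = is_too_old(now - 8 * 86400, now)   # data collected eight days ago
```

In this sketch, fresh is False and stale is True.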
- all-logical-cores-not-available:
- all-to-all-data-is-too-old:
- Identify nodes where the most recent ALL_TO_ALL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- all-to-all-data-missing:
- Check that all-to-all data is available.
- approx-dimms-per-socket-not-balanced
- Check that DIMMs are installed in a balanced manner.
- cpu-data-is-too-old:
- Identify nodes where the most recent CPU data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- cpu-data-missing:
- Check that CPU data is available.
- cpu-min-processor-model
- Checks if the minimum processor model is met.
- cpu-min-sockets
- Checks that the minimum socket number is met.
- cpu-missing-kernel-flag:
- Check for missing CPU kernel flag.
- cpu-model-name-not-uniform:
- Check that the CPU model name is uniform.
- cpu-not-intel64:
- Check that the CPU is a 64-bit Intel® processor.
- cpu-tickless-error:
- Check if an error occurred while applying the nohz-full parameter when booting the Intel® Xeon Phi™ processor.
- cpu-tickless-isolcpus:
- Check if the CPU list in use for the nohz-full parameter for the Intel® Xeon Phi™ processor is a subset of the isolcpus parameter (if present).
- cpu-tickless-kernel:
- Check if the CPU list in use for the nohz-full parameter for the Intel® Xeon Phi™ processor is the same as the one applied by the kernel.
- cpu-tickless-list-not-uniform:
- Check that the nohz-full parameter is uniform for the Intel® Xeon Phi™ processor.
- cpu-tickless-preferred:
- Check if the CPU list in use for the nohz-full parameter for the Intel® Xeon Phi™ processor is in the preferred CPU list provided.
- cpu-tickless-rcu-nocbs:
- Check if CPU list in use for nohz-full parameter for the Intel® Xeon Phi™ processor is a subset of rcu-nocbs parameter (if present).
- cpu-turbo-status-not-preferred:
- Check if the Intel® Turbo Boost Technology status across nodes is the same as preferred by the user.
- cpu-turbo-status-not-uniform:
- Check for the consistency of Intel® Turbo Boost Technology status across a subcluster.
- data-is-too-old-initial:
- If there are any signs for out of date data, create a data-is-too-old diagnosis and mark the sign as diagnosed. This rule only fires for the first data-is-too-old sign per node; that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, data-is-too-old-subsequent, for the case where there are multiple signs leading to this diagnosis.
- datconf-data-is-too-old:
- Identify nodes where the most recent datconf data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- datconf-data-missing:
- Check that datconf data is available.
- datconf-no-dapl-providers:
- Check that the datconf data lists at least one DAPL provider.
- dgemm-data-is-substandard:
- For the most recent DGEMM data point, identify nodes with substandard FLOPS relative to a threshold based on the hardware. The severity depends on the amount of deviation from the threshold value; the larger the deviation, the higher the severity.
- dgemm-data-is-too-old:
- Identify nodes where the most recent DGEMM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- dgemm-data-missing:
- Detect cases where there is no DGEMM data.
- dgemm-outlier:
- Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviation. The statistics are computed using all samples on all nodes (that is, use the DGEMM statistics key). Note: the statistics-control condition is required to ensure that all samples are included when computing the statistics.
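The outlier rules in this section define an outlier as a value outside the range median +/- 6 * median absolute deviation. A minimal Python sketch of that test is shown here; the function name find_outliers and the sample values are illustrative assumptions, and the real rules run in CLIPS against stored samples.

```python
from statistics import median


def find_outliers(samples, k=6):
    """Return the values outside median +/- k * median absolute deviation."""
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    low, high = med - k * mad, med + k * mad
    return [x for x in samples if x < low or x > high]


# Hypothetical DGEMM results from seven nodes, in GFLOPS.
gflops = [100.0, 101.0, 99.0, 102.0, 98.0, 100.5, 60.0]
outliers = find_outliers(gflops)
```

For these values the median is 100.0 and the median absolute deviation is 1.0, so the accepted range is 94.0 to 106.0 and only 60.0 is reported.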
- dgemm-perf-pass
- Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
- dimms-per-socket-not-balanced:
- Checks the uniformity of the DIMMs installed per socket.
- dimms-per-socket-not-uniform:
- Checks the uniformity of the DIMMs installed per socket.
- dmidecode-command-not-found.clp:
- Check that dmidecode exists on a node.
- dmidecode-data-error.clp:
- Check that dmidecode data is available and parsable.
- dmidecode-data-missing.clp:
- Checks if dmidecode data is missing.
- environment-data-is-too-old:
- Identify nodes where the most recent ENVIRONMENT data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- environment-data-missing:
- Check that environment data is available.
- environment-variable-not-uniform:
- Check that an environment variable is uniform.
- ethernet-data-is-too-old:
- Identify nodes where the most recent ETHERNET data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- ethernet-data-missing:
- Check that ethernet data is available.
- ethernet-driver-is-not-consistent:
- Identify inconsistent Ethernet drivers.
- ethernet-driver-version-is-not-consistent:
- Identify inconsistent Ethernet driver versions.
- ethernet-firmware-version-is-not-consistent:
- Identify inconsistent Ethernet firmware versions.
- ethernet-interrupt-coalescing-is-enabled:
- Identify nodes where Ethernet interrupt coalescing is not disabled, that is, rx-usecs is not 0 or 1. This only matters when using Ethernet as the MPI message fabric. Since the same node may be in multiple IMB pingpong pairs, check to see if the sign has already been created to avoid duplicates.
- ethernet-mtu-is-not-consistent:
- Identify inconsistent Ethernet MTU values.
- failing-bash:
- Check if bash is failing.
- failing-csh:
- Check if csh is failing.
- failing-sh:
- Check if sh is failing.
- failing-tcsh:
- Check if tcsh is failing.
- files-added:
- Check if files have been added between snapshots.
- files-group:
- Compare the file group between snapshots.
- files-md5sum:
- Compare the file md5sum between snapshots.
- files-owner:
- Compare the file owner between snapshots.
- files-perms:
- Compare the file permissions between snapshots.
- files-removed:
- Check if files have been removed between snapshots.
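The files-* rules compare file attributes between two snapshots. The Python sketch below illustrates the same comparison on two dictionaries. The snapshot layout (a mapping from path to attributes) and the function name compare_snapshots are assumptions made for illustration and do not reflect the product's database format.

```python
def compare_snapshots(old, new):
    """Compare two {path: attributes} snapshots and list the differences."""
    findings = {
        "added": sorted(set(new) - set(old)),      # files-added
        "removed": sorted(set(old) - set(new)),    # files-removed
        "changed": [],                             # files-group, -md5sum, -owner, -perms
    }
    for path in sorted(set(old) & set(new)):
        for attr in ("group", "md5sum", "owner", "perms"):
            if old[path][attr] != new[path][attr]:
                findings["changed"].append((path, attr))
    return findings


old = {
    "/etc/hosts": {"group": "root", "md5sum": "aaa", "owner": "root", "perms": "644"},
    "/etc/motd": {"group": "root", "md5sum": "bbb", "owner": "root", "perms": "644"},
}
new = {
    "/etc/hosts": {"group": "root", "md5sum": "ccc", "owner": "root", "perms": "600"},
    "/etc/issue": {"group": "root", "md5sum": "ddd", "owner": "root", "perms": "644"},
}
result = compare_snapshots(old, new)
```

Here /etc/issue is reported as added, /etc/motd as removed, and /etc/hosts as changed in both md5sum and permissions.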
- hfi-width-permission-err
- Identify if lspci was run as a non-privileged user and width could not be determined.
- hfi_x16_missing
- Identify whether there is at least one x16 bus HFI on each compute node (100 Gbps).
- hpcg-4node-data-missing
- Check that HPCG data for a four node cluster is available.
- hpcg-4node-perf-pass
- Identify nodes that do not meet the HPCG cluster minimum performance requirements for Intel® Select Solutions for Simulation and Modeling.
- hpcg-cluster-data-missing
- Check that HPCG cluster data is available.
- hpcg-cluster-error
- Detects cases when the HPCG_CLUSTER data is invalid, i.e. data provider output exists in the database, but the analyzer extension could not parse it.
- hpl-cluster-failed:
- Look for cases where HPL cluster ran but there was no success in the output.
- hpcg-single-data-missing
- Check that HPCG single data is available.
- hpcg-single-error
- Detect cases when the HPCG_SINGLE data is invalid, i.e. data provider output exists in the database, but the analyzer extension could not parse it.
- hpcg-single-perf-pass
- Identify nodes that do not meet the HPCG single-node minimum performance requirements for Intel® Select Solutions for Simulation and Modeling.
- hpl-4node-perf-pass
- Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
- hpl-data-is-too-old:
- Identify nodes where the most recent HPL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- hpl-data-missing:
- Check that HPL data is available.
- hpl-pairwise-failed:
- Look for cases where HPL pairwise ran but there was no success in the output.
- hpl-pairwise-outlier:
- Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviations. The statistics are computed using all samples on nodes in the same grouping (that is, have the same HPL statistics key). Note: the statistics-control condition is required to ensure that all samples are included when computing the statistics.
- hw-added:
- Check if hardware has been added between snapshots.
- hw-modified:
- Compare the output line between snapshots.
- hw-removed:
- Check if hardware has been removed between snapshots.
- imb-pingpong-bandwidth-outlier:
- Check that the measured Intel® MPI Benchmarks PingPong benchmark bandwidth is within the statistical range defined by other measured values in the same grouping.
- imb-pingpong-bandwidth-perf-pass
- Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
- imb-pingpong-bandwidth-threshold:
- Check that the measured Intel® MPI Benchmarks PingPong benchmark bandwidth is greater than or equal to the expected bandwidth.
- imb-pingpong-data-is-too-old:
- Identify nodes where the most recent IMB-PINGPONG data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- imb-pingpong-latency-outlier:
- Check that the measured Intel® MPI Benchmarks PingPong benchmark latency is within the statistical range defined by other measured values in the same grouping.
- imb-pingpong-latency-perf-pass
- Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
- imb-pingpong-latency-threshold:
- Check that the measured Intel® MPI Benchmarks PingPong benchmark latency is less than or equal to the expected latency.
- imb-pingpong-data-missing:
- Check that Intel® MPI Benchmarks PingPong benchmark data is available.
- infiniband-ca-type-is-not-consistent:
- Identify inconsistent InfiniBand HCA types.
- infiniband-data-is-too-old:
- Identify nodes where the most recent INFINIBAND data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- infiniband-data-missing:
- Identify instances of missing InfiniBand information.
- infiniband-device-is-not-consistent:
- Identify inconsistent InfiniBand PCI devices.
- infiniband-driver-is-not-consistent:
- Identify inconsistent InfiniBand PCI drivers.
- infiniband-firmware-version-is-not-consistent:
- Identify inconsistent InfiniBand HCA firmware versions.
- infiniband-hardware-version-is-not-consistent:
- Identify inconsistent InfiniBand HCA hardware versions.
- infiniband-memlock-is-not-consistent:
- Identify inconsistent memlock limits.
- infiniband-memlock-too-small:
- Identify memlock limits that are too low.
- infiniband-ofed-version-is-not-consistent:
- Identify inconsistent OFED versions.
- infiniband-physical-state-is-not-consistent:
- Identify inconsistent InfiniBand HCA physical states.
- infiniband-physlot-is-not-consistent:
- Identify inconsistent InfiniBand PCI card physical slots.
- infiniband-port-physical-state-not-linkup:
- Identify InfiniBand HCA ports not in the LinkUp physical state.
- infiniband-port-state-not-active:
- Identify InfiniBand HCA ports not in the Active state.
- infiniband-rate-is-not-consistent:
- Identify inconsistent InfiniBand HCA rate.
- infiniband-rev-is-not-consistent:
- Identify inconsistent InfiniBand PCI card revision.
- infiniband-state-is-not-consistent:
- Identify inconsistent InfiniBand HCA states.
- intel-pstate-data-error:
- Check that intel-pstate data is available and parsable.
- intel-pstate-data-missing:
- Check if intel-pstate data is missing.
- invalid-dgemm-data:
- Detect cases where the DGEMM data is invalid; that is, data provider output exists in the database, but the connector could not parse it.
- invalid-services-data
- Identify the nodes where the provider failed to report the right services data.
- invalid-services-specification
- Identifies if the preferred services specifications are given in the right format.
- invalid-sgemm-data:
- Detect cases where the SGEMM data is invalid; i.e., data provider output exists in the database, but the connector could not parse it.
- iozone-data-missing:
- Check that IOzone data is available.
- iozone-data-is-too-old:
- Identify nodes where the most recent IOZONE data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- iozone-outlier:
- Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviation. The statistics are computed using all samples on all nodes (that is, use the IOZONE statistics key). Note: the statistics-control condition is required to ensure that all samples are included when computing the statistics.
- iozone-ran-no-bandwidth:
- This rule fires on nodes that report a bandwidth of 0.0. This is the default value; finding it means the connector did not find a regular expression match for the bandwidth.
- iozone-ran-not-complete:
- This rule fires on nodes where the bandwidth is greater than 0.0 (which means the test finished and the connector found a value), but the string 'iozone test complete' is missing from the output.
- ip-address-not-consistent:
- If the IP address of a node differs from the perspective of different nodes, this rule will fire. The IP address of a particular node must be the same on all cluster nodes.
- kernel-data-is-too-old:
- Identify nodes where the most recent KERNEL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- kernel-data-missing:
- Check that kernel data is available.
- kernel-not-ssf:
- If the kernel version is less than 2.6.32, the kernel is not Intel® Scalable System Framework compliant. If the base (everything before -) has letters, the connector passes a flag to CLIPS instead of the actual base version.
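The version test described for kernel-not-ssf can be sketched in Python as follows. This is an illustration only: the function name check_kernel and the returned strings, including the "unparsable-base" flag value, are hypothetical, and the sketch assumes the base is a dot-separated list of integers.

```python
MIN_KERNEL = (2, 6, 32)


def check_kernel(release):
    """Classify a kernel release string such as '3.10.0-957.el7.x86_64'."""
    base = release.split("-", 1)[0]           # everything before the first '-'
    if any(ch.isalpha() for ch in base):
        return "unparsable-base"              # hypothetical flag value
    version = tuple(int(part) for part in base.split("."))
    # Tuples compare element by element, so (2, 6, 18) < (2, 6, 32).
    return "compliant" if version >= MIN_KERNEL else "kernel-not-ssf"


status_new = check_kernel("3.10.0-957.el7.x86_64")
status_old = check_kernel("2.6.18-398.el5")
status_flag = check_kernel("4.4.0rc1-custom")
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating "2.6.9" as newer than "2.6.32".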
- kernel-not-uniform:
- If the kernel version is not the same as at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same kernel version, the higher the confidence that the node with the different version is incorrect.
- kernel-param-data-is-too-old:
- Identify nodes where the most recent KERNEL-PARAM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- kernel-param-data-missing:
- Check that kernel parameter data is available.
- kernel-param-not-uniform:
- Checks that kernel parameters are uniform.
- kernel-param-not-preferred:
- Checks that a specified kernel parameter is in the preferred state as defined in the configuration file.
- latest-ssf-version:
- Determine whether the self-identified Intel® Scalable System Framework version contains the latest version (2016.0).
- latest-xp-hwloc-memoryside-cache-file:
- Check that the memoryside cache file for the Intel® Xeon Phi™ processor is the latest version.
- libraries-data-missing:
- Check that libraries data is available.
- logical-cores-not-uniform:
- Check for uniformity of logical core(s) among nodes having equivalent CPU(s).
- lsb-tools-data-is-too-old:
- Identify nodes where the most recent LSB tools data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- lsb-tools-data-missing:
- Check that required LSB tool data is available.
- lscpu-data-error:
- Check that lscpu data is available and parsable.
- lscpu-data-missing:
- Check if lscpu data is missing.
- lshw-data-is-too-old:
- Identify nodes where the most recent LSHW data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- lshw-data-missing:
- Check that lshw data is available.
- lshw-key-missing:
- Check if lshw key is missing.
- lshw-not-uniform:
- Check if lshw is uniform.
- lspci_verbose_data_missing
- Identify if there is data missing for devices that use the provider lspci_verbose.
- lustre-data-is-too-old:
- Identify nodes where the most recent LUSTRE data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- lustre-data-missing:
- Emit a sign if there is no lustre data.
- lustre-kernel-modules-loaded-error
- Ensure the lustre kernel modules are loaded.
- lustre-kernel-modules-loaded-no-data:
- Emit a sign if there is no data from lsmod.
- lustre-mount-point-not-mounted:
- Check uniformity of mount points.
- lustre-target-inactive:
- Check if a target that is active on other nodes in the cluster is inactive on this node.
- lustre-write-targets-uniform:
- Checks uniformity of object targets that are written to by the stripe test.
- lustre-no-write-targets:
- Ensure that object targets are available for the stripe test.
- lustre-write-no-mount-points:
- Ensure that at least one filesystem is mounted.
- lustre-write-targets-mismatch:
- Emit a sign if the number of available object targets is not equal to the number of object targets written to.
- memory-data-is-too-old:
- Identify nodes where the most recent MEMORY data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- memory-data-missing:
- Check that memory data is available.
- memory-minimum-required-compat-base:
- Check that the amount of physical memory per core is >= 16 GiB.
- memory-minimum-required-compat-hpc:
- Check that the amount of physical memory per core is >= 32 GiB.
- memory-not-uniform:
- Check that the amount of physical memory is uniform.
- memory-sizes-not-uniform:
- Check if the installed DIMMs have uniform sizes.
- memory-speeds-not-uniform:
- Check if the installed DIMMs have uniform speeds.
- min-mem-per-core:
- Check that the amount of physical memory per core is >= 2 x the number of physical cores.
- min-mem-per-core-expected
- Check that the amount of physical memory per core is greater than the expected memory.
- min-mem-per-node
- Check that the amount of physical memory per node is >= 96 GiB.
- min-mem-per-node-expected
- Check that the amount of physical memory per node is greater than the expected memory.
- missing-bash:
- Check if bash is missing.
- missing-csh:
- Check if csh is missing.
- missing-libutil-x86-64:
- Advisory Intel® Scalable System Framework compat-base. See the ssf_libraries rules directory for a list of all missing library rules.
- missing-lsb-tools:
- Check for required tools that are missing.
- missing-opa-tools:
- Check if the Intel® Omni-Path Architecture tools used for various checks are missing.
- missing-saquery-tool:
- Check if saquery is missing.
- missing-sh:
- Check if sh is missing.
- missing-sh-ssf:
- Check if sh is missing per Intel® Scalable System Framework requirements.
- missing-tcsh:
- Check if tcsh is missing.
- mount-bad-tmp-perms:
- Check that /tmp has the permissions 777.
- mount-data-is-too-old:
- Identify nodes where the most recent MOUNT data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- mount-data-missing:
- Check that mount data is available.
- mount-dev-shm-not-mounted:
- Check that /dev/shm is properly mounted.
- mount-home-not-defined:
- HOME environment variable is not defined as per Intel® Scalable System Framework Architecture Specification.
- mount-not-uniform-home-inode:
- Check that the home path is shared on the cluster by checking the uniformity of the inodes of the home directory.
- mount-not-uniform-home-path:
- Check that the home path is uniform on the cluster.
- mount-proc-not-mounted:
- Check that /proc is properly mounted.
- mount-tmpdir-not-defined:
- TMPDIR environment variable is not defined as per Intel® Scalable System Framework Architecture Specification.
- mpi-internode-broken:
- Check whether MPI inter-node Hello World is functional.
- mpi-internode-data-is-too-old:
- Identify nodes where the most recent MPI-INTERNODE data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- mpi-local-broken:
- Identify cases where there are fewer than 4 lines of valid output in the parsed output, but an mpirun binary executable was found.
- mpi-local-data-is-too-old:
- Identify nodes where the most recent MPI-LOCAL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- mpi-local-not-found:
- Identify cases where an mpirun binary executable itself was not found.
- mpi-local-path-not-uniform:
- If the mpi-local-path found on a node is not the same as on at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same mpi-local-path, the greater the confidence that the node with the different path is incorrect.
- mpi-internode-data-missing:
- Check that MPI internode data is available.
- mpi-local-data-missing:
- Check that MPI local data is available.
- no-data-initial:
- If there are any signs for missing data, create a no data diagnosis and mark the sign as diagnosed. This rule only fires for the first no data sign per node, that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, no-data-subsequent, for the case where there are multiple signs leading to this diagnosis.
- node-extra:
- Check if RPM information has changed (extra node) between the snapshots.
- node-removed:
- Check if RPM information has changed (node removed) between the snapshots.
- no-data-subsequent:
- This rule is related to no-data-initial. The difference is that this rule fires only after the initial diagnosis has already been created. This rule marks the sign as diagnosed, and also adds to the list of signs that produced the diagnosis.
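The initial and subsequent rule pairs in this section share one pattern: the first sign for a node creates the diagnosis, and later signs are appended to it instead of creating a duplicate. The Python sketch below illustrates that pattern. The record layout and the function name diagnose are assumptions made for illustration; the product implements this logic as CLIPS rules.

```python
def diagnose(signs):
    """Group signs into a single diagnosis per (node, kind)."""
    diagnoses = {}
    for sign in signs:
        key = (sign["node"], sign["kind"])
        if key not in diagnoses:
            # "initial" case: the diagnosis does not exist yet, so create it.
            diagnoses[key] = {"signs": [sign["id"]]}
        else:
            # "subsequent" case: add the sign to the existing diagnosis.
            diagnoses[key]["signs"].append(sign["id"])
        sign["diagnosed"] = True
    return diagnoses


signs = [
    {"node": "node01", "kind": "no-data", "id": "cpu-data-missing"},
    {"node": "node01", "kind": "no-data", "id": "memory-data-missing"},
    {"node": "node02", "kind": "no-data", "id": "kernel-data-missing"},
]
diagnoses = diagnose(signs)
```

Two diagnoses result: one for node01 listing both signs, and one for node02 listing a single sign.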
- no_hfi_detected
- Checks if no HFI was found on the node.
- non-uniform-hardware-initial:
- If there are any signs for non-uniform hardware, create a non-uniform hardware diagnosis and mark the sign as diagnosed. This rule only fires for the first non-uniform hardware sign per node, that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, non-uniform-hardware-subsequent, for the case where there are multiple signs leading to this diagnosis.
- non-uniform-hardware-subsequent:
- This rule is related to non-uniform-hardware-initial. The difference is that this rule fires only after the initial diagnosis has already been created. This rule marks the sign as diagnosed, and also adds to the list of signs that produced the diagnosis.
- non-uniform-software-initial:
- If there are any signs for non-uniform software, create a non-uniform software diagnosis and mark the sign as diagnosed. This rule only fires for the first non-uniform software sign per node, that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, non-uniform-software-subsequent, for the case where there are multiple signs leading to this diagnosis.
- non-uniform-software-subsequent:
- This rule is related to non-uniform-software-initial. The difference is that this rule fires only after the initial diagnosis has already been created. This rule marks the sign as diagnosed, and also adds to the list of signs that produced the diagnosis.
- not-intel-ssf-compliant-initial-2016.0:
- If there are any signs for Intel® Scalable System Framework 2016.0 non-compliance, create a not Intel® SSF compliant diagnosis and mark the sign as diagnosed. This rule only fires for the first non-compliance sign per node, that is, when the diagnosis does not already exist. Once the diagnosis exists, it should not be duplicated. Thus, there is a corresponding rule, not-ssf-compliant-subsequent-2016.0, for the case where there are multiple signs leading to this diagnosis.
- ntp-data-is-too-old:
- Identify nodes where the most recent ntp data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- ntp-data-missing:
- Check that ntp data is available.
- ntp-not-connected:
- Check if ntp client is not connected to an ntp server. This is true if the remote slot is set to the default.
- ntp-offset-above-threshold:
- Check if reported time offset is larger than a threshold. Increase severity based on the size of the difference between the offset and threshold.
- opa-ca-is-not-consistent:
- Identify inconsistent Intel® Omni-Path Host Fabric Interface CA types.
- opa-data-is-too-old:
- Identify nodes where the most recent Intel® Omni-Path Host Fabric Interface data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- opa-data-missing:
- Identify instances of missing Intel® Omni-Path Host Fabric Interface information.
- opa-device-is-not-consistent:
- Identify inconsistent Intel® Omni-Path Host Fabric Interface PCI devices.
- opa-driver-is-not-consistent:
- Identify inconsistent Intel® Omni-Path drivers.
- opa-firmware-version-is-not-consistent:
- Identify inconsistent Intel® Omni-Path Host Fabric Interface firmware versions.
- opa-hardware-version-is-not-consistent:
- Identify inconsistent Intel® Omni-Path Host Fabric Interface hardware versions.
- opa-memlock-is-not-consistent:
- Identify inconsistent memlock limits.
- opa-memlock-too-small:
- Identify memlock limits that are deemed too low for the Intel® Omni-Path Fabric.
- opa-physical-state-is-not-consistent:
- Identify inconsistent Intel® Omni-Path Host Fabric Interface physical states.
- opa-physlot-is-not-consistent:
- Identify inconsistent Intel® Omni-Path Host Fabric Interface physical slots.
- opa-port-physical-state-not-linkup:
- Identify Intel® Omni-Path Host Fabric Interface ports not in the LinkUp physical state.
- opa-port-state-not-active:
- Identify Intel® Omni-Path Host Fabric Interface ports not in the Active state.
- opa-rate-is-not-consistent:
- Identify inconsistent Intel® Omni-Path Host Fabric Interface rate.
- opa-regex-error:
- If the connector regular expression fails to parse any of the Intel® Omni-Path Host Fabric Interface commands, this error should fire notifying the user of the issue.
- opa-state-is-not-consistent:
- Identify inconsistent Intel® Omni-Path Host Fabric Interface states.
- opa-subnet-manager-not-running:
- Check that an Intel® OPA subnet manager is running for Intel® Omni-Path Fabric.
- outlier-imb-pingpong-latency-due-to-ethernet-coalescing:
- Diagnose Intel® MPI Benchmarks PingPong latency performance outlier issues due to Ethernet interrupt coalescing not being disabled. If the imb-pingpong-latency-outlier sign is TRUE, the Intel® MPI Library settings are configured to use Ethernet, and the ethernet-interrupt-coalescing-is-enabled sign is TRUE, then conclude the inconsistent performance is due to Ethernet interrupt coalescing not being disabled. Note that the Ethernet interrupt coalescing only affects PingPong latency, not bandwidth, so there is no corresponding rule for bandwidth.
- perl-data-is-too-old:
- Identify nodes where the most recent Perl data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- perl-data-missing:
- Check that Perl data is available.
- perl-not-found:
- If no Perl version is found and stderr contains the string 'command not found', then Perl is not installed or incorrectly installed.
- perl-not-functional:
- If no Perl version is present or stderr is not empty, then Perl may not be functional. If a version is present and stderr is not empty, use lower confidence and severity values, since the stderr output may be unrelated. If no version is present and stderr is not empty, then Perl is definitely not functional, so use high confidence and severity values. Avoid matching the 'command not found' case that is handled separately.
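The decision logic of perl-not-found and perl-not-functional can be sketched in Python as follows. The function name classify_perl and the "low" and "high" labels are illustrative assumptions; the actual confidence and severity values used by the rules are not stated here.

```python
def classify_perl(version, stderr):
    """Return (rule, confidence) for one node, or None if no rule applies."""
    if not version and "command not found" in stderr:
        # Handled separately so that it never matches perl-not-functional.
        return ("perl-not-found", None)
    if version and stderr:
        # A version was reported, so the stderr output may be unrelated.
        return ("perl-not-functional", "low")
    if not version and stderr:
        return ("perl-not-functional", "high")
    if not version:
        # No version and empty stderr: no confidence level is given in the text.
        return ("perl-not-functional", None)
    return None


ok = classify_perl("5.16.3", "")
not_found = classify_perl("", "bash: perl: command not found")
suspect = classify_perl("5.16.3", "perl: warning: Setting locale failed.")
broken = classify_perl("", "Segmentation fault")
```

The same structure applies to the Python rules described later in this section, with the interpreter name changed.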
- perl-not-ssf:
- If the Perl version is less than 5.10, then Perl is not Intel® Scalable System Framework compliant.
- perl-not-uniform:
- If the Perl version is not the same as at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same Perl version, the greater the confidence that the node with the different version is incorrect.
- process-data-is-too-old:
- Identify nodes where the most recent PROCESS data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- process-data-missing:
- Check that process data is available.
- process-is-a-zombie:
- For the most recent PROCESS data point, identify nodes with zombie processes, that is, processes with a Z state.
- process-is-high-cpu:
- For the most recent PROCESS data point, identify nodes with high CPU processes, that is, processes using more than 20% of a CPU core.
- process-is-high-memory:
- For the most recent PROCESS data point, identify nodes with high memory processes, that is, processes using more than 50% of memory.
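The three process checks use fixed criteria: a Z state, more than 20% of a CPU core, and more than 50% of memory. A minimal Python sketch of those tests follows; the record field names (state, cpu_percent, mem_percent) and the function name check_process are assumptions made for illustration.

```python
def check_process(proc):
    """Return the names of the process rules that would fire for one record."""
    fired = []
    if proc["state"] == "Z":
        fired.append("process-is-a-zombie")
    if proc["cpu_percent"] > 20.0:
        fired.append("process-is-high-cpu")
    if proc["mem_percent"] > 50.0:
        fired.append("process-is-high-memory")
    return fired


idle = check_process({"state": "S", "cpu_percent": 0.3, "mem_percent": 1.2})
busy = check_process({"state": "R", "cpu_percent": 98.0, "mem_percent": 62.5})
zombie = check_process({"state": "Z", "cpu_percent": 0.0, "mem_percent": 0.0})
```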
- python-data-is-too-old:
- Identify nodes where the most recent Python data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- python-data-missing:
- Check that Python data is available.
- python-not-found:
- If no Python version is found and stderr contains the string 'command not found', then Python is not installed or incorrectly installed.
- python-not-functional:
- If no Python version is present or stderr is not empty, then Python may not be functional. If a version is present and stderr is not empty, use lower confidence and severity values, since the stderr output may be unrelated. If no version is present and stderr is not empty, then Python is definitely not functional, so use high confidence and severity values. Avoid matching the 'command not found' case that is handled separately.
- python-not-ssf:
- If the Python version is less than 2.6, then Python is not Intel® Scalable System Framework compliant.
- python-not-uniform:
- If the Python version is not the same as at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same Python version, the greater the confidence that the node with the different version is incorrect.
- rpm-added:
- Check if RPM information has changed (extra RPM) between snapshots.
- rpm-data-is-too-old:
- Identify nodes where the most recent RPM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- rpm-data-missing:
- Check that RPM data is available.
- rpm-is-extra:
- Check whether an RPM is present on this node, but missing on other nodes.
- rpm-is-missing:
- Check whether an RPM is present on other nodes, but missing on this one.
- rpm-missing:
- Check if RPM information has changed (RPM missing) between snapshots.
- rpm-modified:
- Check if RPM attributes (version, release, architecture) have been modified between snapshots.
- service-not-available:
- Identifies if the required services are available on the node.
- services-data-is-too-old:
- Identifies nodes where the most recent services data is considered too old. Too old is defined (by default) as no data from the last 7 days (604800 seconds).
- services-data-missing:
- Identifies the nodes missing services data.
- services-preferred-status:
- Identifies if the services status matches the given preferred specification.
- sgemm-data-is-substandard:
- For the most recent SGEMM data point, identify nodes with substandard FLOPS relative to a threshold based on the hardware. The severity depends on the amount of deviation from the threshold value; the larger the deviation, the higher the severity.
- sgemm-data-is-too-old:
- Identify nodes where the most recent SGEMM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- sgemm-data-missing:
- Detect cases where there is no SGEMM data.
- sgemm-numactl-missing:
- Checks if the numactl binary was not found. If this binary is not installed, then SGEMM performance may be affected.
- sgemm-outlier:
- Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviation. The statistics are computed using all samples on all nodes (i.e., use the SGEMM statistics key).
- sgemm-taskset-missing:
- Checks if the taskset binary was not found. If this binary is not installed, then SGEMM performance may be affected.
- shells-data-is-too-old:
- Identify nodes where the most recent SHELL data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- shells-data-missing:
- Check that shells data is available.
- ssf-file-not-found:
- If no Intel® Scalable System Framework (Intel® SSF) versions are found and stderr contains the string 'No such file or directory', then the file is missing.
- ssf-file-other-error:
- If no Intel® Scalable System Framework (Intel® SSF) versions are found or stderr is not empty, then the file may not be readable. If a version is present and stderr is not empty, use lower confidence and severity values, since the stderr output may be unrelated. If no version is present and stderr is not empty, then the file is definitely not readable, so use high confidence and severity values. Avoid matching the 'No such file or directory' case that is handled separately.
- ssf-layer-dependency-compat-hpc:
- Determine whether layer self is also in /etc/ssf-release.
- ssf-layer-dependency-hpc-cluster-compat-base:
- Determine whether all contained layers are also in /etc/ssf-release.
- ssf-layer-dependency-self:
- Determine whether all contained layers are also in /etc/ssf-release.
- ssf-libraries-data-is-too-old:
- Identify nodes where the most recent LIBRARIES data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- ssf-version-data-is-too-old:
- Identify nodes where the most recent Intel® SSF data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- ssf-version-data-missing:
- Check that Intel® Scalable System Framework (Intel® SSF) version data is available.
- storage-data-is-too-old:
- Identify nodes where the most recent STORAGE data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- storage-data-missing:
- Check that storage data is available.
- storage-ssf-compute:
- Checks the Intel® Scalable System Framework (Intel® SSF) required minimum for compute node storage. The compute node must have at least 16 GiB of RAM and access to at least 80 GiB of persistent storage. Login nodes should have at least 200 GiB of persistent storage.
- storage-ssf-head:
- Checks the Intel® Scalable System Framework (Intel® SSF) required minimum for head node storage. The head node must be attached to 200 GiB of direct access storage.
- stream-data-error:
- Looks for cases where STREAM failed for a reason other than libiomp5 not being found.
- stream-data-is-too-old:
- Identify nodes where the most recent STREAM data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- stream-data-missing:
- Check that STREAM data is available.
- stream-failed-validation:
- Identifies cases where the string "Failed validation" is found in STDOUT. In these cases, the triad value is still populated, so the existence of the triad value cannot be relied on.
- stream-no-runtimes:
- Look for cases where STREAM failed because libiomp5 could not be found.
- stream-outlier:
- Locate values that are outliers. An outlier is a value that is outside the range defined by the median +/- 6 * median absolute deviation. The statistics are computed using all samples on all nodes (that is, use the STREAM statistics key). Note: the statistics-control condition is required to ensure that all samples are included when computing the statistics.
- stream-perf-pass:
- Ensure that a system meets the performance requirements defined by Intel® Select Solutions for Simulation and Modeling.
- substandard-dgemm-due-to-dimms:
- Diagnose substandard DGEMM performance issues due to insufficient DIMMs. If the dgemm-performance sign is substandard and the number of DIMMs per socket is insufficient, then conclude the substandard performance is due to the insufficient DIMMs.
- substandard-dgemm-due-to-high-cpu-process:
- Diagnose substandard DGEMM performance issues due to a conflicting process that is consuming a high amount of CPU. If the dgemm-performance sign is substandard and the high-cpu-process sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the high CPU process.
- substandard-dgemm-due-to-high-memory-process:
- Diagnose substandard DGEMM performance issues due to a conflicting process that is consuming a large amount of memory. If the dgemm-performance sign is substandard and the high-memory-process sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the high memory process.
- substandard-dgemm-due-to-offline-cores:
- Diagnose substandard DGEMM performance issues due to detected offline cores. If the dgemm-performance sign is substandard and the all-logical-cores-not-available sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance may be due to the offline cores.
- substandard-imb-pingpong-latency-due-to-ethernet-coalescing:
- Diagnose substandard IMB pingpong latency performance issues due to Ethernet interrupt coalescing not being disabled. If the imb-pingpong-latency-threshold sign is TRUE (substandard), the Intel® MPI Library settings are configured to use Ethernet, and the ethernet-interrupt-coalescing-is-enabled sign is TRUE, then conclude the substandard performance is due to Ethernet interrupt coalescing not being disabled. Note that the Ethernet interrupt coalescing only affects IMB pingpong latency, not bandwidth, so there is no corresponding rule for bandwidth.
- substandard-sgemm-due-to-high-cpu-process:
- Diagnose substandard SGEMM performance issues due to a conflicting process that is consuming a high amount of CPU. If the sgemm-performance sign is substandard and the high-cpu-process sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the high CPU process.
- substandard-sgemm-due-to-high-memory-process:
- Diagnose substandard SGEMM performance issues due to a conflicting process that is consuming a large amount of memory. If the sgemm-performance sign is substandard and the high-memory-process sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the high memory process.
- substandard-sgemm-due-to-offline-cores:
- Diagnose substandard SGEMM performance issues due to detected offline cores. If the sgemm-performance sign is substandard and the all-logical-cores-not-available sign is true and the associated data points are close together in time (within 10 minutes), then conclude the substandard performance is due to the offline cores.
- tcl-data-is-too-old:
- Identify nodes where the most recent Tcl data is considered too old. Too old is defined as no data from the last 7 days (604800 seconds).
- tcl-data-missing:
- Check that Tcl data is available.
- tcl-not-found:
- If no Tcl version is found and stderr contains the string 'command not found', then Tcl is not installed or incorrectly installed.
- tcl-not-functional:
- If no Tcl version is present or stderr is not empty, then Tcl may not be functional. If a version is present and stderr is not empty, use lower confidence and severity values, since the stderr output may be unrelated. If no version is present and stderr is not empty, then Tcl is definitely not functional, so use high confidence and severity values. Avoid matching the 'command not found' case that is handled separately.
- tcl-not-ssf:
- If the Tcl version is less than 8.5, then Tcl is not Intel® Scalable System Framework (Intel® SSF) compliant.
- tcl-not-uniform:
- If the Tcl version is not the same as at least 90% of the other nodes, then the node should be flagged as non-uniform. The fewer other nodes that have the same Tcl version, the greater the confidence that the node with the different version is incorrect.
- threads-per-core-not-uniform:
- Check for uniformity of threads per core among nodes having equivalent CPU(s) (for valid thread count per core).
- threads-per-core-unusual:
- Check to see if there is an unusual number of threads.
- unable-to-obtain-ip-address:
- If hostname -i does not return a valid IP address, the connector passes an empty string to the CLIPS slot for the IP address and this rule fires.
- xp-cluster-mode-ambiguous:
- Check if cluster mode for the Intel® Xeon Phi™ processor is undetermined.
- xp-cluster-mode-not-uniform:
- Check that the cluster mode for the Intel® Xeon Phi™ processor is uniform.
- xp-cluster-mode-preferred:
- Check that the cluster mode for the Intel® Xeon Phi™ processor is in preferred mode.
- xp-data-source-numactl:
- Check if cluster/memory mode for the Intel® Xeon Phi™ processor is undetermined.
- xp-memory-mode-ambiguous:
- Check if memory mode for the Intel® Xeon Phi™ processor is undetermined.
- xp-memory-mode-not-uniform:
- Check that the memory mode for the Intel® Xeon Phi™ processor is uniform.
- xp-memory-mode-preferred:
- Check that the memory mode for the Intel® Xeon Phi™ processor is in preferred mode.
- xp-modes-data-is-too-old:
- Identify nodes where the most recent Intel® Xeon Phi™ processor modes data is too old. Data is considered too old when there is no data from the last 7 days (604800 seconds).
- xp-modes-data-missing:
- Check if the modes data for the Intel® Xeon Phi™ processor is available.
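
The sgemm-outlier and stream-outlier rules above define an outlier as a value outside the range median +/- 6 * median absolute deviation, with the statistics computed over all samples on all nodes. The following Python sketch illustrates that calculation only. It is not the CLIPS implementation used by Intel® Cluster Checker; the function name find_outliers, the input layout (a dictionary mapping node names to sample lists), and the sample values are assumptions made for illustration.

```python
from statistics import median


def find_outliers(samples_by_node, k=6.0):
    """Return (node, value) pairs outside median +/- k * MAD.

    samples_by_node is a hypothetical layout: node name -> list of samples.
    The median and the median absolute deviation (MAD) are computed over
    all samples on all nodes, as the rule descriptions state.
    """
    all_values = [v for values in samples_by_node.values() for v in values]
    if not all_values:
        return []

    med = median(all_values)
    mad = median([abs(v - med) for v in all_values])

    low = med - k * mad
    high = med + k * mad

    return [
        (node, v)
        for node, values in samples_by_node.items()
        for v in values
        if v < low or v > high
    ]


# Illustrative values only (not measured data).
samples = {
    "node01": [100.0, 101.0],
    "node02": [99.0, 100.0],
    "node03": [100.0, 60.0],
}
outliers = find_outliers(samples)
print(outliers)
```

In this sketch, a median absolute deviation of zero (for example, when most samples are identical) collapses the range to the median itself, so any differing value is reported. How the actual rules treat that case is not stated in the descriptions above.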