
SlurmNodeStatusTable

22 Apr 2024 - 13:50 | Version 22 | hoffmac00

Cluster node status and reboot instructions

Run all commands on allegro (allegro-ctrl01).

Plain reboot (after updates, etc.)

The command

scontrol reboot_nodes <name>

schedules a reboot of the given hosts at the next possible opportunity: no new jobs are started on a node until it is empty, and the reboot then happens automatically.
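
When several nodes are due, it can help to preview the requests before sending them. The following is a minimal sketch and not part of the original procedure: the helper name `reboot_cmds` is hypothetical, and the subcommand spelling `reboot_nodes` is copied from the command above. Newer Slurm releases may spell it `scontrol reboot`, so check `man scontrol` on allegro-ctrl01 first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) one reboot request per node given.
# The subcommand spelling is taken from this page and may differ by Slurm version.
reboot_cmds() {
    local node
    for node in "$@"; do
        printf 'scontrol reboot_nodes %s\n' "$node"
    done
}

# Example: review the printed commands, then run them on allegro-ctrl01.
reboot_cmds cmp213 cmp214
```

Printing instead of executing keeps the sketch safe to try on any machine; the actual reboot is still requested by hand.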

One node at a time, under control

The steps are:

scontrol update NodeName="cmp[###-###]" State=drain Reason="WHY"
now wait until the host has run empty; its state is then drained
then update, shut down, etc.
e.g.: sudo bash -c 'apt-get update && apt-get dist-upgrade && salt-call state.highstate'
and reboot at the end
use scontrol update node=cmp### state=resume to put it back into scheduling
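
The steps above can be sketched as a small script. This is an illustration under assumptions, not the procedure's official tooling: the function names (`run`, `drain_node`, `node_state`, `wait_drained`, `resume_node`), the `DRY_RUN` switch, and the 60-second polling interval are all hypothetical. The node state is read with `sinfo` and its `%T` format field.

```shell
#!/usr/bin/env bash
# Sketch of the drain -> maintain -> resume cycle; all function names are hypothetical.
# With DRY_RUN=1 (the default) scontrol calls are only printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"          # display only; quoting of arguments is not preserved
    else
        "$@"
    fi
}

# First step: stop scheduling new jobs on the node and record why.
drain_node() {
    run scontrol update NodeName="$1" State=drain Reason="$2"
}

# Current Slurm state of one node, e.g. "draining" or "drained".
# A node in several partitions is listed once per partition, hence sort -u.
node_state() {
    sinfo -h -N -n "$1" -o '%T' | sort -u | head -n 1
}

# Second step: poll until running jobs have finished (assumed interval: 60 s).
wait_drained() {
    until [ "$(node_state "$1")" = "drained" ]; do
        sleep 60
    done
}

# Last step: put the node back into scheduling after maintenance and reboot.
resume_node() {
    run scontrol update NodeName="$1" State=resume
}

# Dry-run example: prints the two scontrol calls for cmp215.
drain_node cmp215 maintenance
resume_node cmp215
```

Set `DRY_RUN=0` on allegro-ctrl01 to execute the calls. Note that `sinfo` marks a non-responding node with a trailing `*`, which the simple comparison in `wait_drained` would not match.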

Node checklists

OK == active in the cluster (unless commented otherwise)
WORK == active in the cluster, but maintenance work is needed
TODO == not running because of problems that need to be fixed
KO == no longer operable
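
To see at a glance how many nodes fall into each category, the tables below can be tallied by status. A minimal sketch, assuming the rows are available as tab-separated text (name, status, optional comment) on standard input; the function name `status_tally` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count nodes per status from tab-separated rows on stdin.
# Expected columns: Name <TAB> Status [<TAB> Comment]; header rows are skipped.
status_tally() {
    awk -F '\t' '
        $1 == "Name" || NF < 2 { next }   # skip header, blank and title lines
        { count[$2]++ }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Example with three made-up rows in the format of the tables on this page.
printf 'Name\tStatus\tComment\ncmp213\tOK\ncmp215\tWORK\tsmart error\ncmp221\tOK\n' | status_tally
```

Statuses outside the legend are counted under whatever label the table uses, so inconsistent entries show up as separate lines.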

'CMPs'

Name	Status	Comment
cmp213	OK
cmp214	OK	2023-11-06 no ping → hard reset; 2022-07-06 drain removed; 2021-08-31 infiniband works after controller swap; 2021-05-05 reboot to use /data/scratch did not help; 2020-09-09 reboot needed to use /data/scratch
cmp215	WORK	2023-04-11 Smart Error /dev/sg1 ST9146852SS S/N: 3TB07VGL, 146 GB
cmp216	OK	2023-11-22 /data/scratch2 in device wait; 2022-09-13 ok; 2020-10-12 /data/scratch hangs
cmp217	OK	2021-09-03 disk replaced and reinstalled; 2021-05-04 /dev/sg1 6XP1QLKM smart error; on repair fix smartd in salt node config
cmp218	OK	2023-09-20 down, ok after kvm restart; 2022-08-24 smart ok; 2022-06-22 smart sg1 impending failure; 2021-04-09 new disks and reinstalled, HDD defective: Device: /dev/sg2, SMART Failure: FAILURE PREDICTION THRESHOLD EXCEEDED: ascq=0x5
cmp219	OK	2023-12-05 sg2 hdd eol; 2023-10-11 segfault after oom, ok after kvm restart
cmp220	OK	2022-08-19 bullseye, highstate, and back in the cluster; 2022-08-18 → crashed → one new disk; 2022-06-22 smart sg2 failure prediction threshold exceeded; 2020-01-31 VB Data ECC or parity error fixed
cmp221	OK
cmp222	OK
cmp223	OK
cmp224	OK	2023-09-20 ok; 2022-08 dead according to manuel; 2021-05-06 memtest stop, ok with 60GB; 2021-04-29 memtest start 16:00; 2021-04-29 one of P2-DIMM-1A or P2-DIMM-1B has failed again, RAM down from 72GB to 59GB; 2021-04-07 all DIMMs back, under observation; 2020-05-15 low memory #258254
cmp225	OK	2022-08-24 infiniband ok; 2022-01-21 infiniband down
cmp226	OK	2022-08-24 ignore fan according to manuel; 2021-08: IPMI Status: Critical [Fan4 = Critical]; 2021-04-09: one power supply fewer, less RAM; 2020-05-15 Uncorrectable ECC @ DIMM1B CPU2, under observation
cmp227	OK
cmp228	OK	2022-08-24 smart ok; 2022-01: smart test failed on disk 0 = 6XP4J2BD
cmp229	OK
cmp230	OK
cmp231	OK
cmp232	OK
cmp233	OK
cmp234	OK
cmp235	OK
cmp236	OK
cmp238	OK
cmp239	OK
cmp240	OK
cmp241	OK	2023-09-20 ok; 2022-09-02 out of memory about 1h after reboot, then: The system board fail-safe voltage is outside of range. CPU 1 VCORE PG voltage is outside of range.
cmp242	OK	2023-09-20 ok; 2021-08-31 mainboard defective, CPU swap, and the error stayed with the 1st CPU; 2020-09-26 voltage error again; 2021-04-08 reset → up; 2021-04-07 CPU 1 VCORE PG voltage is outside of range.
cmp243	OK	2021-09-31 ok; 2021-04-28 did not survive the reboot, kvm hangs; cut power and restart.
cmp244	OK	2023-09-20 ok; 2022-01-21 infiniband down; 2021-04-09 new disk; 2021-03-15 Device: /dev/bus/0 [megaraid_disk_01], SMART Failure: HARDWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH; [FUJITSU MBD2147RC D80A ], S/N: D0A4PBA04PTN, 146 GB
cmp245	OK
cmp246	OK
cmp247	OK	2022-09-02 unintentionally set to drain, plus watchdog message
cmp248	OK	2022-09-02 unintentionally set to drain, plus watchdog message
cmp249	OK	2022-09-13 ok; 2022-09-13 kvm unreachable, no ping and no ssh; 2020-05-15 Kill task failed, 2020-05-18 Kill task failed
cmp250	OK	2022-10-10 Kill task failed, 2022-09-13 ok; 2022-09-13 kvm unreachable, no ping and no ssh; 2022-07-06 drain removed; 2022-01-17 reboot, kvm returned an error and then nothing more; running again.
cmp251	OK	2022-10-10 Kill task failed,
cmp252	OK	2022-10-10 Kill task failed, 2020-05-15 Kill task failed, 2020-05-18 Kill task failed

'GPUs'

Name	Status	Comment
gpu025	OK
gpu026	OK
gpu027	OK
gpu028	OK
gpu029	OK	2023-09 partition tesla100; slurm does not recognize the t100 as a gpu, hence 0 gpus vs. 2 gpus in nvidia-smi; open to AG compstatphys, has /dev/nvidia[01]
gpu030	OK	2023-09 partition tesla100; slurm does not recognize the t100 as a gpu, hence 0 gpus vs. 2 gpus in nvidia-smi
gpu031	OK	for direct work only (no slurm); open to bcp_ag-boettcher, has /dev/nvidia[0123]
gpu048	OK
gpu049	OK	2021-02-23 node is in, but chassis intrusion, should be checked (sensor not in place)?
gpu050	OK
gpu051	OK
gpu052	OK
gpu053	OK
gpu054	OK	2020-04-30 Intel Node unexpectedly re; 2021-05-14 Read only state, no login possible #315719 error gone after reboot
gpu055	OK
gpu056	OK	2020-04-30 Intel Node unexpectedly re
gpu057	OK	2020-04-30 Intel Node unexpectedly re
gpu058	OK	2020-04-30 Intel Node unexpectedly re
gpu059	OK	2020-04-30 Intel Node unexpectedly re
gpu060	OK	2022-08-18 switched eth port. Host back online 2022-03-09 unexpected eth0→eth1 switch: disabled state network
gpu061	OK	2024-04-03 ok; 2024-02-27: rbrk@gpu061:~$ nvidia-smi -L Unable to determine the device handle for gpu 0000:02:00.0: Unknown Error GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-cb84dc2c-2bb8-5b7c-cec8-f787861f2977) GPU 2: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-84c0b713-fbcd-67c7-37a9-9994303c0d5d) GPU 3: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-506585e5-5c90-dbb7-a430-fef24e3f5be3) 2023-09-20: Unable to determine the device handle for gpu 0000:02:00.0: Unknown Error
gpu062	OK
gpu063	KO	2020-05-15 PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID) ; device [8086:2f02] error status/mask=00000080/00002000 ; [ 7] BadDLLP
gpu064	OK	2024-04-03 ok; 2023-11-08 Unable to determine the device handle for gpu 0000:03:00.0: Unknown Error; 2023-09-20 waiting for jobs to end, then down; 2023-09-04 intermittently only 3 gpus again and aborted jobs; 2022-03-10 only 3 GPUs; reboot did not work; power off/on did restore 4th gpu; 2023-09-21 users report that jobs fail after a few seconds
gpu065	OK	2024-04-03 ok; 2023-11-07 Unable to determine the device handle for gpu 0000:03:00.0: Unknown Error; 2023-09-20 3 gpus, waiting for jobs to end, then down; 2022-11-23 nvidia-smi works but reports GPU 3 lost, OK after reboot; 2022-07-06 drain removed; 2020-09-22 lost cpu; 2021-06-28 was down; 2021-11-17 down
gpu066	TODO	2024-04-22 Unable to determine the device handle for gpu 0000:03:00.0: Unknown Error; 2024-04-03; 2023-09-20 set to drain because of aborted jobs, 4 GPUs visible; 2023-01-10 only 3 gpus again; 2023-01-05 power cut → 4 GPUs; 2023-01-03 3 gpus despite reboot; 2022-08-10 1: gpu and kvm lost, back up after power cycle; 2021-06-02 nvidia-smi hangs: reboot; 2020-09-23 reboot on frank's request
gpu067	OK
gpu068	OK
gpu069	OK	2024-04-03 ok; 2024-03-05 rbrk@gpu069:~$ nvidia-smi -L GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-0d8c786e-6663-9098-3368-cbf8a8df061a) GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-6a44c0f3-ff72-ce90-0fac-6f5221185404) GPU 2: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-683214d2-dc37-1129-12aa-6cd70a85ec14) Unable to determine the device handle for gpu 0000:82:00.0: Unknown Error; 2022-11-22 kvm console shows a black screen but still responds to power options; 2022-08-18 1 gpu down
gpu070	OK
gpu071	KO	2024-04-09 cannot reach kvm, ssl too old; 2024-04-03 resolved by apt upgrade, still on bullseye; 2023-09-20 down because of "slow computation speed", cornelis is looking into it.
gpu072	KO	2020-09-09 flood of pci-express-bus-errors; 2020-05-12 reboot hung 2020-05-22 unexpected reboot
gpu073	OK
gpu074	OK
gpu075	KO	2021-04-14 down for repairs; 2021-01-28 use only 2 gpus; 2020-05-15 only THREE /dev/nvidia{0,1,2} visible (1 card dead?), nvidia-smi sees only 2 cards; pull the cards and reseat them
gpu076	OK
gpu077	OK	2024-04-03 ok; 2023-10-11 Second gpu doesn't work for computations, shows errors in nvidia-smi;
gpu078	draining	2024-04-19 Unable to determine the device handle for gpu 0000:84:00.0: Unknown Error → Replace GPU in Physical Slot 6 (Others are in 2,4,8); 2024-04-03 ok; 2023-11-06 only 3 GPUs in nvidia-smi; 2023-09-22 only 3 GPUs in nvidia-smi; 2021-08-31 after reboot 4 gpus; 2021-05-31 nvidia-smi only sees 3 gpus. slurmd expects 4
gpu079	OK
gpu080	OK
gpu081	OK	2024-04-03 ok; 2022-05-02 state down (node was fine) ; 2021-07-12 state down (node was fine)
gpu082	OK
gpu083	OK
gpu084	OK
gpu085	OK
gpu086	OK
gpu087	OK
gpu088	OK
gpu089	OK
gpu090	OK
gpu091	OK
gpu092	OK
gpu093	OK
gpu094	OK
gpu095	OK
gpu096	OK
gpu097	OK
gpu098	OK	2024-04-03 ok; 2020-02-15 CPU2_ECC1 sensor of type memory logged a correctable ecc at DIMM E1 (under observation)
gpu099	OK
gpu100	OK
gpu101	OK
gpu102	OK
gpu103	OK
gpu104	OK
gpu105	OK
gpu106	OK
gpu107	OK
gpu108	OK	2022-10-02 CPU1_ECC1 sensor of type memory logged a correctable ecc at DIMM B1 (under observation)
gpu109	OK
gpu110	OK
gpu111	OK
gpu112	WORK	2024-04-18 smart errors; 2024-04-08 kvm does not respond; 2024-04-04 second slot of the first CPU does not work, card will be removed later, running with only 3 cards, see #417243; 2022-03-10 CPU2_ECC1 sensor of type memory logged a correctable ecc at DIMM D1 (under observation)
gpu113	OK	2022-11-23 nvidia-smi goes into device wait, OK after reboot; 2021-09-09 GPU 0 100% utilization FAN & PwrUsage ERR; 2021-06-21 down without reason?
gpu114	OK	2022-10-03 CPU1_ECC1 sensor of type memory logged a correctable ecc at DIMM A1 (under observation)
gpu115	OK	2020-10-27 CPU1_ECC1 sensor of type memory logged a correctable ecc at DIMM F1 (under observation)
gpu116	OK	2022-08-17 reboot; 2021-07-01 ID: 34 CPU1_ECC1 sensor of type memory logged a correctable ecc at DIMM C1, pull and reseat RAM; 2020-05-15 reboot hung
gpu117	WORK	2022-09-13 infiniband errors, please pull/plug the interface-cables; 2021-04-14 reinstall; 2020-10-20 /data/scratch again no permission, shutdown; 2020-09-14 umount lf /data/scratch, ls /data/scratch no permission 2020-06-02 reboot (ls /data/scratch no permission) reinstall
gpu118	OK
gpu119	OK	2023-04-12 CPU2_ECC1 sensor of type memory logged a correctable ecc at DIMM K1 (under observation)
gpu120	OK	2019-09-06 CPU1_ECC1 sensor of type memory logged a correctable ecc at DIMM E1 (under observation)
gpu121	OK	2023-01-23 CPU1_ECC1 sensor of type memory logged an uncorrectable ecc at DIMM D1, CPU1_ECC1 sensor of type memory logged a correctable ecc at DIMM D1 (under observation)
gpu122	OK
gpu123	OK
gpu124	OK
gpu125	OK	2023-03 down for jens's gpu installation test
gpu126	OK	2022-07-12: new partition gpubig; 2022-07-05 drain removed; 2022-06-22: preparing removal for jens's next gpu test
gpu127	OK	2023-01-10 CPU2_ECC1 sensor of type memory logged a correctable ecc at DIMM E1 (under observation)
gpu128	OK
gpu129	OK
gpu130	OK	2022-06-20: new partition gpubig
gpu131	OK	2022-05-02: new partition gpubig; 2022-03-09 down for jens's gpu installation test

Comments

Obscure errors

If "infiniband" does not start, the connection to 'slurmd' dies. This apparently shows up in the journal as the error message

apt-helper[25557]: E: Unterprozess /lib/systemd/systemd-networkd-wait-online hat Fehlercode zurückgegeben (1)

This can be checked with, among other things,

networkctl status -a |& less -n

and looking for 'configuring' as a status in the output.
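
The same check can be scripted so that only the affected links are printed. A minimal sketch, assuming the five-column layout of `networkctl list` (IDX, LINK, TYPE, OPERATIONAL, SETUP); the function name `configuring_links` and the sample rows are hypothetical, and the column layout may differ between systemd versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list links whose SETUP column is "configuring".
# stdin: output of `networkctl list --no-legend`
# (assumed columns: IDX LINK TYPE OPERATIONAL SETUP).
configuring_links() {
    awk '$5 == "configuring" { print $2 }'
}

# On a node: networkctl list --no-legend --no-pager | configuring_links
# Example with made-up sample output:
printf '  1 lo  loopback   carrier    unmanaged\n  3 ib0 infiniband no-carrier configuring\n' | configuring_links
```

An empty result means no link is stuck in 'configuring', so the problem lies elsewhere.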

-- ChristophVonStuckrad - 06 Feb 2020

"Bad Request" as the response from a Dell KVM console

ssh <hostname-kvm>
racadm set idrac.webserver.HostHeaderCheck Disabled

-- BodoRiedigerKlaus - 13 Sep 2022

