//alicebeast — last modified 2021/01/19 17:23 by florido//
  * **swRAID**: a device managed by software RAID, a kernel feature for creating RAID arrays. Software RAID is less reliable than an actual hardware RAID, but it should give you enough time to act when something breaks, for instance to finish a running job before shutting down the machine.
All commands must be run as root (''sudo -s'').
==== HW RAID disk failing ====
This can happen when one or more of the RAID disks are broken or starting to fail. Take action as soon as possible.

The HW RAID is the big storage array holding most of the data. It should survive two broken disks before total failure (loss of data, unreadable data).

A list of commands to interact with the RAID is here: [[it_tips:storcliqr]]
> How to detect?

  * I have a script that informs me of the status of the RAID. I will send its output to you as well after some testing.
  * Alternatively, run this command and verify that no errors are reported: <code bash>storcli /c0/v0 show all</code>
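Until the notification script is in place, a minimal health check could look like the sketch below. The ''check_report'' helper and the matched keywords (''Dgrd'', ''Offln'', etc.) are assumptions about storcli's table abbreviations; verify them against the actual output of your controller.

```shell
#!/bin/bash
# Sketch of a minimal HW RAID health check around storcli (run as root).
# check_report prints ALERT if the report mentions a degraded/failed state,
# OK otherwise. The keywords are assumptions about storcli's wording.
check_report() {
    if echo "$1" | grep -qiE 'dgrd|degraded|failed|offln'; then
        echo "ALERT"
    else
        echo "OK"
    fi
}

check_report "$(storcli /c0/v0 show all 2>&1)"
```

Such a check could be run periodically from cron and piped to ''mail'' to get notified automatically.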
> What to do

  * Contact me.
  * Identify the broken disk.
  * Substitute the broken disk with a new one.
  * The system should automatically rebuild the array. You can check the status of the rebuild with: <code bash># show background activity
storcli /c0/v0 show bgi</code><code bash># show array initialization or rebuild
storcli /c0/v0 show init</code>
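Waiting for the rebuild to finish can be sketched as a polling loop. This assumes the controller reports an active operation with the words ''In progress'' in the ''show bgi'' output; adjust the pattern to what your storcli version actually prints.

```shell
#!/bin/bash
# Sketch: wait until no background initialization/rebuild runs on /c0/v0.
# bgi_running succeeds while the report mentions an operation in progress
# (the "In progress" wording is an assumption about this storcli version).
bgi_running() {
    echo "$1" | grep -qi 'in progress'
}

while bgi_running "$(storcli /c0/v0 show bgi)"; do
    sleep 300    # check again in five minutes
done
echo "no background operation running on /c0/v0"
```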
==== Home folder swRAID failing ====
> How to detect?

  * I have a script that informs me of the status of the RAID. I will send its output to you as well after some testing.
  * A command that can be used to check the arrays: <code bash>for i in 5 6 7; do mdadm --detail /dev/md12$i; done</code>
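The check above can be automated. The sketch below flags any of the three arrays whose ''mdadm'' ''State'' line is not plain ''clean'' or ''active'' (for example ''clean, degraded'' after a member disk fails); ''md_state_ok'' is a hypothetical helper, not an existing tool.

```shell
#!/bin/bash
# Sketch: report swRAID arrays that are not healthy (run as root).
# md_state_ok reads "mdadm --detail" output and succeeds only when the
# "State :" line is exactly clean or active.
md_state_ok() {
    local state
    state=$(echo "$1" | sed -n 's/^ *State : *//p')
    case "$state" in
        clean|active) return 0 ;;
        *)            return 1 ;;
    esac
}

for i in 5 6 7; do
    md_state_ok "$(mdadm --detail /dev/md12$i 2>/dev/null)" \
        || echo "/dev/md12$i needs attention"
done
```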
> What to do
If it didn't work, the disk device at step 12.2 was wrong; repeat step 12.

==== Raw disks info ====
+ | |||
+ | <code bash> | ||
+ | Disk /dev/nvme0n1: 250.1 GB, 250059350016 bytes, 488397168 sectors | ||
+ | Units = sectors of 1 * 512 = 512 bytes | ||
+ | Sector size (logical/physical): 512 bytes / 512 bytes | ||
+ | I/O size (minimum/optimal): 512 bytes / 512 bytes | ||
+ | Disk label type: dos | ||
+ | Disk identifier: 0x000c07b1 | ||
+ | |||
+ | Device Boot Start End Blocks Id System | ||
+ | /dev/nvme0n1p1 * 2048 195444735 97721344 fd Linux raid autodetect | ||
+ | /dev/nvme0n1p2 195444736 390887423 97721344 fd Linux raid autodetect | ||
+ | /dev/nvme0n1p3 390887424 422135807 15624192 82 Linux swap / Solaris | ||
+ | |||
+ | Disk /dev/nvme1n1: 960.2 GB, 960197124096 bytes, 1875385008 sectors | ||
+ | Units = sectors of 1 * 512 = 512 bytes | ||
+ | Sector size (logical/physical): 512 bytes / 512 bytes | ||
+ | I/O size (minimum/optimal): 512 bytes / 512 bytes | ||
+ | Disk label type: dos | ||
+ | Disk identifier: 0x000c4c4c | ||
+ | |||
+ | Device Boot Start End Blocks Id System | ||
+ | /dev/nvme1n1p1 * 2048 195444735 97721344 fd Linux raid autodetect | ||
+ | /dev/nvme1n1p2 195444736 390887423 97721344 fd Linux raid autodetect | ||
+ | /dev/nvme1n1p3 390887424 1875384319 742248448 fd Linux raid autodetect | ||
+ | |||
+ | Disk /dev/nvme2n1: 960.2 GB, 960197124096 bytes, 1875385008 sectors | ||
+ | Units = sectors of 1 * 512 = 512 bytes | ||
+ | Sector size (logical/physical): 512 bytes / 512 bytes | ||
+ | I/O size (minimum/optimal): 512 bytes / 512 bytes | ||
+ | Disk label type: dos | ||
+ | Disk identifier: 0x000e7acc | ||
+ | |||
+ | Device Boot Start End Blocks Id System | ||
+ | /dev/nvme2n1p1 * 2048 195444735 97721344 fd Linux raid autodetect | ||
+ | /dev/nvme2n1p2 195444736 390887423 97721344 fd Linux raid autodetect | ||
+ | /dev/nvme2n1p3 390887424 1875384319 742248448 fd Linux raid autodetect | ||
+ | </code> | ||
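When replacing a failed swRAID member, the new disk needs the same partition layout as the surviving one, which ''sfdisk'' can dump and replay. The sketch below is a dry run only: ''clone_table_cmds'' is a hypothetical helper that just prints the commands, and ''/dev/nvmeXn1'' is a placeholder for the actual replacement device. Double-check device names before running anything, since ''sfdisk'' overwrites the partition table.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print the sfdisk commands that would clone
# the MBR partition table from a healthy array member onto its replacement.
clone_table_cmds() {
    local src=$1 dst=$2
    echo "sfdisk -d $src > /root/$(basename "$src").parttable"
    echo "sfdisk $dst < /root/$(basename "$src").parttable"
}

# /dev/nvmeXn1 is a placeholder for the replacement disk, not a real device.
clone_table_cmds /dev/nvme1n1 /dev/nvmeXn1
```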
===== Monitoring =====