===== Networking =====

The machine is only accessible from within the HEP network, as was the case for aurora. From outside the division one must first log in to teddi. This can be changed, but at the cost of running a security service on the machine at all times and of me monitoring the logins.
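
For example, a minimal sketch of reaching the machine from outside the division, assuming ''teddi'' can be used as an SSH jump host (the username and full hostnames are placeholders):

<code bash>
# jump through teddi to reach alicebeast from outside the division
# (replace "myuser" and add domain names as needed)
ssh -J myuser@teddi myuser@alicebeast
</code>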
===== Admin rights =====

Direct access to the root user is discouraged. Administrators should use the ''sudo'' command to become root.
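
For example:

<code bash>
# open a root shell via sudo; type "exit" to return to your own user
sudo -s
</code>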
  
There is a special user called ''alicedrift'' whose home folder is located in ''/ahome'', to be used in emergencies in case the regular home folders get corrupted. You are free to create a password for this user and share it among the group; I recommend you do this as soon as possible. Store the password in a secure place.
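
A minimal sketch of setting that password, assuming the ''alicedrift'' account already exists on the machine:

<code bash>
# set (or reset) the emergency user's password, then share it securely
sudo passwd alicedrift
</code>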
  
==== Adding an administrator ====
  * **swRAID**: a device managed by software RAID, a kernel feature for creating RAID arrays. It is less reliable than an actual hardware RAID, but it should give you enough time to take action when something breaks, for example to finish a running job before shutting down the machine. A quick status check is shown below.
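
A quick, standard way to see the state of all software RAID (md) arrays the kernel knows about:

<code bash>
# one-page summary of all md arrays, including any ongoing rebuild
cat /proc/mdstat
</code>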
  
All commands below must be run as root (''sudo -s'').

==== HW RAID disk failing ====
  
This can happen if one or more of the RAID disks are broken or starting to fail. You are supposed to take action as soon as possible.

The HW RAID is the big storage array holding lots of data. It should survive two broken disks before total failure (loss of data, unreadable data).

A list of commands to interact with the RAID is here: [[it_tips:storcliqr]]
  
> How to detect?

  * I have a script that informs me of the status of the RAID. I will send the output to you as well after some testing.
  * An alternative is to run <code bash>storcli /c0/v0 show all</code> and verify that no errors are reported.
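
As a sketch of what such a periodic check could look like (this is an assumption, not the actual script; it presumes ''storcli'' prints the virtual drive state as ''Optl'' when healthy, so adjust the pattern to your storcli version):

<code bash>
#!/bin/bash
# hypothetical cron job: mail root the full report if the HW RAID
# virtual drive /c0/v0 is not in the optimal state
if ! storcli /c0/v0 show | grep -q 'Optl'; then
    storcli /c0/v0 show all | mail -s "alicebeast: HW RAID not optimal" root
fi
</code>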
  
> What to do

  * Contact me.
  * Identify the broken disk.
  * Substitute the broken disk with a new one.
  * The system should automatically rebuild the array. One can check the status of the rebuild with: <code bash># show background activity
storcli /c0/v0 show bgi</code> <code bash># show array initialization or rebuild
storcli /c0/v0 show init</code>
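
To follow the rebuild without retyping the command, one possibility is:

<code bash>
# refresh the background-activity view every 60 seconds (Ctrl-C to stop)
watch -n 60 'storcli /c0/v0 show bgi'
</code>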
  
==== Home folder swRAID failing ====
  
> How to detect?

  * I have a script that informs me of the status of the RAID. I will send the output to you as well after some testing.
  * A command that can be used to check the arrays is: <code bash>for i in 5 6 7; do mdadm --detail /dev/md12$i; done</code>
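
If the full output is too verbose, the health-related fields can be filtered out (''State'' and ''Failed Devices'' are standard ''mdadm --detail'' fields):

<code bash>
# print only the array state and the failed-device count for md125..md127
for i in 5 6 7; do
    echo "=== /dev/md12$i ==="
    mdadm --detail /dev/md12$i | grep -E 'State :|Failed Devices'
done
</code>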
  
> What to do
> How to detect?
  
> What to do - FIXME: requires own page

1. Identify the broken RAID disk. If it is one of the U.2 disks, the procedure is slightly easier: one can skip steps FIXME:[number of steps]

2. Mark the RAID disk as faulty if the system did not do so automatically (see the command sketch after this list).

3. Remove the device from the array.

4. Shut down the machine.

5. Replace the broken disk.

6. Start the machine using sysrescuecd.

7. Create the proper partitions on the disk.
   * FIXME copy geometry and UUIDs here for reference

8. Initialize the swap partition on the new disk.

9. If one of the arrays is inactive, it must be restarted. Follow FIXME:[procedure for restarting]

10. Add the new partitions to the existing arrays. Wait for the arrays to rebuild.

11. Mount the array and change the UUID of the swap partition in /etc/fstab.

12. Rebuild the boot system:

12.1. Boot with the CentOS 7 rescue image in troubleshooting mode.
  * FIXME: provide such image or info how to boot

12.2. chroot into the system, identify the new disk, and run grub2-install /dev/newdisk

12.3. Reboot.

13. If it worked, one should see the GRUB boot menu; it might be required to boot again.

If it didn't work, the disk device chosen at step 12.2 was wrong; repeat step 12.
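
As a reference for steps 2, 3, 7, 8, 10, 11 and 12.2, here is a sketch of the underlying commands. The device names (''/dev/md125'', ''/dev/nvme0n1'', ...) are placeholders taken from the raw disks info below; replace them with the actual failed array, member partition, and new disk.

<code bash>
# step 2: mark the failing member as faulty (if the kernel has not already)
mdadm --manage /dev/md125 --fail /dev/nvme0n1p1

# step 3: remove it from the array
mdadm --manage /dev/md125 --remove /dev/nvme0n1p1

# step 7: if the replacement has the same size as a surviving disk,
# its MBR partition table can be cloned (here from nvme1n1 onto the new disk)
sfdisk -d /dev/nvme1n1 | sfdisk /dev/nvme0n1

# step 8: initialize the swap partition on the new disk
mkswap /dev/nvme0n1p3

# step 10: add the new partitions to the arrays, then watch the rebuild
mdadm --manage /dev/md125 --add /dev/nvme0n1p1
mdadm --manage /dev/md126 --add /dev/nvme0n1p2
cat /proc/mdstat

# step 11: read the new swap partition UUID to put into /etc/fstab
blkid /dev/nvme0n1p3

# step 12.2 (inside the rescue chroot): reinstall GRUB on the new disk
grub2-install /dev/nvme0n1
</code>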

==== Raw disks info ====

<code bash>
Disk /dev/nvme0n1: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000c07b1

        Device Boot      Start         End      Blocks   Id  System
/dev/nvme0n1p1   *        2048   195444735    97721344   fd  Linux raid autodetect
/dev/nvme0n1p2       195444736   390887423    97721344   fd  Linux raid autodetect
/dev/nvme0n1p3       390887424   422135807    15624192   82  Linux swap / Solaris

Disk /dev/nvme1n1: 960.2 GB, 960197124096 bytes, 1875385008 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000c4c4c

        Device Boot      Start         End      Blocks   Id  System
/dev/nvme1n1p1   *        2048   195444735    97721344   fd  Linux raid autodetect
/dev/nvme1n1p2       195444736   390887423    97721344   fd  Linux raid autodetect
/dev/nvme1n1p3       390887424  1875384319   742248448   fd  Linux raid autodetect

Disk /dev/nvme2n1: 960.2 GB, 960197124096 bytes, 1875385008 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000e7acc

        Device Boot      Start         End      Blocks   Id  System
/dev/nvme2n1p1   *        2048   195444735    97721344   fd  Linux raid autodetect
/dev/nvme2n1p2       195444736   390887423    97721344   fd  Linux raid autodetect
/dev/nvme2n1p3       390887424  1875384319   742248448   fd  Linux raid autodetect
</code>
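
This is standard ''fdisk'' output; it can be regenerated at any time (as root) with:

<code bash>
# print the partition tables of the three NVMe disks
fdisk -l /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
</code>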

===== Monitoring =====

:WIP: The machine will be monitored by a software and system monitor and by a hardware monitor. Access details for these systems will follow.
  