ALICE beast machine admin info

This page contains information on how to manage the ALICE beast machine.

For a description of the machine hardware and WIP click here

Brief storage view

 system SSD (M.2) 90GB
    /       # system root
    /opt    # folder for user software
    /var    # folder for logs and other runtime things
    /tmp    # temporary folder
 emergency admin user homes SSD (M.2) 90GB
    /ahome  
  >>>> There is still time to change these. Please speak up. I could, for example, allocate more space to the system and /opt, taking it from the space allocated to /ahome. <<<<

 user homes SSD (U.2) 690GB
    /home   # user home folders
    
 data disk (RAID6) 55TB
    /disk   # available for data, filesize >= 1MB 

Networking

The machine is only accessible from within the hep network, as was the case for aurora. If you are outside the division, you must log in to teddi first. This can be changed, but at the cost of a security service always running on the machine and of me monitoring the logins.

Admin rights

Direct access to root user is discouraged. Administrators should use the sudo command to become root.

There is a special user called alicedrift, whose home folder is located in /ahome so that it stays usable if the home folders get corrupted. You are free to set a password for this user and share it within the group. I recommend you do this as soon as possible. This account is to be used only in case of emergency. Store the password in a secure place.

Adding an administrator

To appoint a new administrator, it is enough to add them to the wheel Unix group.

:!: Fiddling with sudo configuration files is strongly discouraged. :!:

usermod -aG wheel <username>

Example:

usermod -aG wheel pflorido
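
After adding a user it can be useful to confirm that the membership took effect. The following is only a minimal sketch and not part of the original procedure: has_group and is_admin are hypothetical helper names, and the check simply parses the space-separated group list printed by id -nG.

```shell
# has_group GROUP "LIST": succeed if GROUP appears in the space-separated LIST.
# (has_group is a hypothetical helper name, not a system command.)
has_group() {
    wanted="$1"
    for g in $2; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# is_admin USERNAME: check whether USERNAME is currently in the wheel group.
is_admin() {
    has_group wheel "$(id -nG "$1")"
}
```

For example, is_admin pflorido && echo "pflorido is in wheel". A user who is already logged in needs to log in again before the new group is active in their session.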

Software planning

I suggest you install system software according to the distribution, using yum.

As for research software, I would recommend you install:

  • For small files that require performance:
    • in the /opt folder (SSD)
    • in a subfolder of /home (SSD U.2)
  • For large/huge files (> 1MB) such as Docker images, create a specific folder inside the /disk folder (RAID)
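
To see which files in an existing installation would be better placed on /disk, one could list everything at or above the size threshold. This is a sketch under assumptions: list_large_files is a hypothetical helper name, and "1MB" is read here as 1 MiB (1048576 bytes).

```shell
# list_large_files DIR: print regular files under DIR that are at least 1 MiB.
# (list_large_files is a hypothetical helper name.)
list_large_files() {
    # "-size +1048575c" means "more than 1048575 bytes", i.e. 1 MiB or larger.
    find "$1" -type f -size +1048575c
}
```

For example, list_large_files /opt prints the files under /opt that are candidates for moving to a folder in /disk.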

Software installation

For system/distribution software installation it is recommended to use the yum command and the rpm system. CentOS 7 is a Red Hat based distribution. The yum commands work much like the apt commands on Ubuntu. Here is a quick reference of the most useful commands.

Search for a package

yum search <string>

Example:

yum search tmux

Install a package

yum install <packagename>

Examples:

yum install tmux

Find which package contains a specific file

yum provides '*/<expression>'

Examples:

yum provides '*/libboost*'

Upgrade system packages

yum upgrade

You will be asked to accept the changes.
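
Before upgrading, it can help to check whether any updates are pending at all. yum check-update lists them without changing anything and reports the result through its exit status: 0 means nothing to update, 100 means updates are available, and other values indicate an error. The helper below is just a sketch that turns that status into a message. describe_check_update is a hypothetical name.

```shell
# describe_check_update STATUS: translate the exit status of "yum check-update".
# (describe_check_update is a hypothetical helper name.)
describe_check_update() {
    case "$1" in
        0)   echo "no updates available" ;;
        100) echo "updates available" ;;
        *)   echo "yum reported an error" ;;
    esac
}
```

It could be used as: yum -q check-update; describe_check_update $?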

Emergency procedures

:WIP:

This section describes how to handle some critical situations.

In what follows I will use terminology that I clarify below.

  • RAID: a device managed by a physical RAID controller card.
  • swRAID: a device managed by software RAID, which is a kernel feature for creating RAID devices. It is less reliable than an actual physical RAID, but it should give enough time to take action when something breaks. For example, you should be able to finish a running job before shutting down the machine.

All commands must be run as root (sudo -s).

HW RAID disk failing

This can happen if one or more of the RAID disks are broken or starting to fail. Take action as soon as possible.

The HW RAID is the big storage with lots of data. Being a RAID6 array, it should tolerate two broken disks before total failure (loss of data, unreadable data).

A list of commands to interact with the RAID is here: storcliqr

How to detect?
  • I have a script that informs me of the status of the RAID. I will send the output to you as well after some testing
  • An alternative is to run this command:
    storcli /c0/v0 show all

    and verify there are no errors reported.

What to do
  • Contact me
  • Identify the broken disk
  • Replace the broken disk with a new one
  • The system should automatically rebuild the array. One can check the status of the rebuild with
    # show background activity
    storcli /c0/v0 show bgi
    # show array initialization or rebuild
    storcli /c0/v0 show init

Home folder swRAID failing

There are two U.2 disks in software RAID serving user homes. Should one disk fail, the homes will continue to work, but you must replace the broken disk as soon as possible to avoid data loss.

This can happen if one of the U.2 disks containing home folders is faulty.

How to detect?
  • I have a script that informs me of the status of the RAID. I will send the output to you as well after some testing
  • A command that can be used to check the arrays is as follows:
    for i in 5 6 7; do mdadm --detail /dev/md12$i; done
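
As a quicker overview than reading the full mdadm --detail output, the kernel's /proc/mdstat shows a member map for each array, such as [UU] when all members are up and [U_] when one is missing. The sketch below filters that file for arrays with a missing member. degraded_arrays is a hypothetical helper name, and this is only one possible way of checking, not the monitoring script mentioned above.

```shell
# degraded_arrays: read /proc/mdstat-style text on stdin and print the name of
# every md array whose member map (for example [U_]) shows a missing member.
# (degraded_arrays is a hypothetical helper name.)
degraded_arrays() {
    awk '
        /^md/ { name = $1 }                 # remember the current array name
        /\[[U_]*_[U_]*\]/ { print name }    # an underscore marks a missing member
    '
}
```

Run it as degraded_arrays < /proc/mdstat. No output means that no array is reported as degraded.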
What to do

System swRAID failing

The system M.2 disk is in software RAID with two partitions on the U.2 disks. This allows the machine to keep working when the M.2 disk is faulty. However, the machine will not reboot if this disk breaks, because the U.2 disks cannot be used for booting. This is a limitation of the machine.

Nevertheless it's possible to entirely restore the disk image by copying one of the two copies in the U.2 drives. See description below.

How to detect?
What to do - FIXME: requires own page

1. Identify the broken raid disk. If it is one of the U.2 disks, then the procedure is slightly easier. One can skip steps FIXME:[number of steps]

2. Mark the RAID disk as faulty if the system did not do so automatically.

3. Remove the device from the array.

4. Shut down the machine.

5. Replace the broken disk.

6. Start the machine using sysrescuecd.

7. Create the proper partitions on the disk.

  • FIXME copy geometry and UUIDs here for reference

8. Initialize the swap partition on the new disk

9. If one of the arrays is inactive, it must be restarted. Follow FIXME:[procedure for restarting]

10. Add the new partitions to existing arrays. Wait for the arrays to rebuild.

11. Mount the array and change the UUID of the swap partition in /etc/fstab

12. Rebuilding the boot system

12.1. Boot with Centos7 rescue image in troubleshooting mode

  • FIXME: provide such image or info how to boot

12.2. Chroot into the system, identify the new disk, and run grub2-install /dev/newdisk

12.3. Reboot

13. If it worked, one should see the GRUB boot screen. It might be necessary to boot again.

If it didn't work, the disk device used at step 12.2 was wrong. Repeat step 12.
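
Steps 2, 3, 8 and 10 above map onto standard mdadm, mkswap and blkid invocations. The sketch below wraps them in hypothetical helper functions (run, retire_member, init_swap, add_member) with a dry-run switch, so the exact commands can be reviewed before anything is changed. Which array contains which partition is not documented here yet, so check with mdadm --detail before passing any array or partition name.

```shell
# Hypothetical helpers for steps 2, 3, 8 and 10. With DRY_RUN=1 (the default
# here) commands are only printed, so the sequence can be reviewed first.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 2-3: mark a member as faulty, then remove it from its array.
retire_member() {   # usage: retire_member ARRAY PARTITION
    run mdadm --manage "$1" --fail "$2"
    run mdadm --manage "$1" --remove "$2"
}

# Step 8: initialize swap on the new disk and print the UUID needed in step 11.
init_swap() {       # usage: init_swap PARTITION
    run mkswap "$1"
    run blkid -s UUID -o value "$1"
}

# Step 10: add a new partition to an existing array and show rebuild progress.
add_member() {      # usage: add_member ARRAY PARTITION
    run mdadm --manage "$1" --add "$2"
    run cat /proc/mdstat
}
```

Calling retire_member with an array and a partition prints the two mdadm commands. Setting DRY_RUN=0 beforehand executes them for real, which must be done as root.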

Raw disks info

Disk /dev/nvme0n1: 250.1 GB, 250059350016 bytes, 488397168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000c07b1
 
        Device Boot      Start         End      Blocks   Id  System
/dev/nvme0n1p1   *        2048   195444735    97721344   fd  Linux raid autodetect
/dev/nvme0n1p2       195444736   390887423    97721344   fd  Linux raid autodetect
/dev/nvme0n1p3       390887424   422135807    15624192   82  Linux swap / Solaris
 
Disk /dev/nvme1n1: 960.2 GB, 960197124096 bytes, 1875385008 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000c4c4c
 
        Device Boot      Start         End      Blocks   Id  System
/dev/nvme1n1p1   *        2048   195444735    97721344   fd  Linux raid autodetect
/dev/nvme1n1p2       195444736   390887423    97721344   fd  Linux raid autodetect
/dev/nvme1n1p3       390887424  1875384319   742248448   fd  Linux raid autodetect
 
Disk /dev/nvme2n1: 960.2 GB, 960197124096 bytes, 1875385008 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000e7acc
 
        Device Boot      Start         End      Blocks   Id  System
/dev/nvme2n1p1   *        2048   195444735    97721344   fd  Linux raid autodetect
/dev/nvme2n1p2       195444736   390887423    97721344   fd  Linux raid autodetect
/dev/nvme2n1p3       390887424  1875384319   742248448   fd  Linux raid autodetect
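
When recreating partitions on a replacement disk (step 7 above), the Start and End sectors in these listings can be cross-checked against the Blocks column, which counts 1 KiB blocks. A small sketch of that arithmetic follows. part_kib is a hypothetical helper name, and the 512-byte default comes from the sector size shown in the listings.

```shell
# part_kib START END [SECTOR_BYTES]: size in 1 KiB blocks of a partition that
# spans sectors START..END inclusive. (part_kib is a hypothetical helper name.)
part_kib() {
    sector_bytes="${3:-512}"
    echo $(( ($2 - $1 + 1) * $sector_bytes / 1024 ))
}
```

For example, part_kib 2048 195444735 prints 97721344, which matches the Blocks value listed for /dev/nvme0n1p1.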

Monitoring

:WIP: The machine will be monitored by a software and system monitor and a hardware monitor. Access to these systems and further details will follow.

alicebeast.txt · Last modified: 2021/01/19 17:23 by florido
