ALICE crunch server WIP

This page contains an overview of the machine and a day-to-day log of what is being done. More detailed documentation will follow later.

Overview

The machine has three groups of disks.

  • One M.2 SSD for boot and system, in RAID1 with the system copies on the U.2 disks below (three copies of the system in total; it remains to be seen whether this is of any use)
  • Two U.2 SSDs in RAID1 for user homes
  • A large-capacity RAID6 array for data, ~60TB

See the diagram below.


             .------------------------------------------.
             |              M.2 Disk 250GB              |
             |------------------------------------------|
  .----------| Main system disk - boot /                |                              SW RAID1
  |          | operating system                         |-------------------------------------------------------------.
  |          | /swap                                    |                                                             |
  |          | /ahome (admin home separated from users) |                                                             |
  |          '------------------------------------------'                                                             |
  |                                                                                                                   |
  |   SW RAID1                                                                                                        |
  |         .----------------------------------------.                  .----------------------------------------.    |
  |         |            U.2 disk 1 960GB            |                  |            U.2 disk 2 960GB            |    |
  |         |----------------------------------------|                  |----------------------------------------|    |
  |         | - User folders /home 700GB             |<---------------->| - User folders /home 700GB             |    |
  |         | -------------------------------------- |     SW RAID1     | -------------------------------------- |    |
  --------->| - copy of boot disk (bootable) / 100GB |                  | - copy of boot disk (bootable) / 100GB |<---'
            |   copy of /ahome 100GB                 |<---------------->|   copy of /ahome 100GB                 |
            '----------------------------------------'                  '----------------------------------------'



                          .------------------------------------------------.
               __________ |      RAID6 storage - 8x 12TB disks /disk       | __________ 
              [_|||||||_°]|------------------------------------------------|[_|||||||_°]
              [_|||||||_°]| - 7x 12TB in RAID6 -> 60TB (50-55 usable)      |[_|||||||_°]
              [_|||||||_°]| - 1x 12TB spare                                |[_|||||||_°]
                          '------------------------------------------------'
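
The capacity figures in the diagram follow from simple arithmetic: RAID6 keeps two disks' worth of parity, so 7 disks of 12TB give 5 x 12TB = 60TB. The sketch below reproduces that and also converts the result to binary units (TiB). That conversion may account for part of the gap between 60TB and the 50-55TB quoted as usable, but this is an assumption; filesystem overhead is not modelled. The function names are made up for illustration.

```shell
#!/bin/sh
# raid6_usable_tb DISKS SIZE_TB
# Usable capacity in decimal TB: RAID6 spends two disks' worth on parity.
raid6_usable_tb() {
    echo $(( ($1 - 2) * $2 ))
}

# tb_to_tib TB
# Convert decimal terabytes (10^12 bytes) to tebibytes (2^40 bytes).
tb_to_tib() {
    LC_ALL=C awk -v tb="$1" 'BEGIN { printf "%.1f\n", tb * 1e12 / (2 ^ 40) }'
}

raid6_usable_tb 7 12   # prints 60
tb_to_tib 60           # prints 54.6
```

The spare disk is left out of the calculation because it holds no data until it replaces a failed member.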

Users can log in to the machine the same way they log in to aurora.

Additionally, admins can log in to a separate monitor machine to check the status of hardware components. This machine will also send warnings in case of issues.

                                       ____   __             
                                      |    | |==|            
                  .------Admins------>|____| |  |  HW Monitor
                  |                   /::::/ |__|            
                  |                              |
              __  _                              |<monitoring>
     o       [__]|=|                             v
    /|\      /::/|_|                ________                    
    / \                            |==|=====|                   
         Own laptop / teddi        |  |     |                   
                  |                |  |     |                   
                  |                |  |     |  ALICE new machine
                  '--Researchers-->|  |     |                   
                                   |  |====°|                   
                                   |__|_____|                   
                                                                
                                                                

Florido will take care of OS installation and hardware maintenance/monitoring. This info will be shared with ALICE admins. ALICE admins will have root access and can control any parameter they want.

TODO

  • Configure networking: DONE :-D
  • Configure maintenance interface: DONE :-D
  • Configure HW monitoring server: ongoing FIXME
    • Configure machine os and services DONE :-D
    • Configure hardware monitoring FIXME
    • Configure user access to monitor FIXME
      • add NIS auth but with local folders? VS dedicated user with access to specific hw?
  • Install system: DONE :-D
  • Create RAID6: DONE :-D
    • Find best settings, discuss with users DONE :-D
  • Configure system: ongoing DONE :-D
    • Configure firewall DONE :-D
    • Test fallback operating system in case of failure FIXME
    • Configure config management system. Testing ansible DONE :-D
      • Enable snmp monitoring DONE :-D
        • add server to monitor DONE :-D
    • Create admin users DONE :-D
      • generic alice user DONE :-D
      • give admin permissions to relevant alice users DONE :-D
    • Format and connect RAID6 disk DONE :-D
    • Configure RAID monitoring DONE :-D
  • User services (tasks mainly to be done by admin users): DONE :-D
    • Create users DONE :-D
    • Create folders DONE :-D
    • Share folders across servers (requires installing new services) :?:
    • Install ALICE software DONE :-D by ALICE members
      • More details here :?:
    • Move data DONE :-D (helped with a better-performing script)
  • Documentation
    • Write a page on how to manage the system FIXME
      • Started on alicebeast
      • Must be only accessible by admins
      • Must include system description above here
      • Must describe common procedures
      • Ask if users want remote X11 - NO

Logbook

202101**

  • new entry

20210119

  • Configured RAID monitoring

20200616

  • Configured system to be managed remotely via ansible
    • Changed sshd_config
    • Added management keys to root user
    • Management currently done via my workstation, may change in the future.
  • Installed basic software
  • Created alice admin user
  • Installed basic build software
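
The log does not say which sshd_config settings were changed. As a rough sketch, assuming the change was about allowing key-only root logins for ansible (for example `PermitRootLogin prohibit-password`, which is an assumption and not taken from the log), the helper below reports the value sshd would pick for a keyword. It relies on the rule that the first occurrence of a keyword wins. The function name is hypothetical, and `Match` blocks and `Include` directives are ignored.

```shell
#!/bin/sh
# sshd_first_value FILE KEYWORD
# Print the first non-commented value of KEYWORD in an sshd_config-style file.
# Hypothetical helper: it does not handle Match blocks or Include directives.
sshd_first_value() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                              # skip comment lines
        tolower($1) == tolower(key) { print $2; exit }   # keywords are case-insensitive
    ' "$1"
}

# Example: inspect the live configuration if this user can read it.
if [ -r /etc/ssh/sshd_config ]; then
    sshd_first_value /etc/ssh/sshd_config PermitRootLogin
fi
```

On the server itself, `sshd -T` run as root prints the full effective configuration and is the more reliable check.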

20200612

  • Formatted RAID volume as xfs
    mkfs.xfs -f -d su=1m,sw=5 -L alicedisk /dev/sda
    • Shows a warning but should be harmless
    • su=1m: 1MB stripe unit (the strip size per disk)
    • sw=5: 5 data disks in the RAID (+2 parity)
    • -L alicedisk: xfs label
    • /dev/sda: the RAID as it is shown to the kernel as a device
  • created folder /disk
  • created entry in fstab for disk based on UUID:
    # RAID disk
    UUID=0d4a40e5-084e-404f-9219-6c3645929ec2 /disk                   xfs     rw,seclabel,relatime,attr2,inode64,sunit=2048,swidth=10240,noquota 0 0
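
The `sunit` and `swidth` values in the fstab entry are consistent with the mkfs.xfs parameters: the mount options are expressed in 512-byte sectors, so a 1MB stripe unit becomes 2048 and five data disks give 5 x 2048 = 10240. A minimal sketch of that conversion (the function name is made up for illustration):

```shell
#!/bin/sh
# xfs_stripe_opts SU_BYTES SW
# Convert mkfs.xfs su (in bytes) and sw (number of data disks) into the
# sunit/swidth mount options, which are counted in 512-byte sectors.
xfs_stripe_opts() {
    sunit=$(( $1 / 512 ))
    swidth=$(( sunit * $2 ))
    echo "sunit=${sunit},swidth=${swidth}"
}

xfs_stripe_opts $((1024 * 1024)) 5   # prints sunit=2048,swidth=10240
```

On the mounted filesystem, `xfs_info /disk` should report the same geometry, there expressed in filesystem blocks rather than sectors.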

20200609

20200603

  • Configured server network. Current hostname: alice
  • Configured remote access to monitor using X2GO client-server technology. Works nicely.
  • Reset maintenance interface. Tested ok. Custom tools not working but web browser interface ok.
  • Installed system. Configured RAID1 on system disks. Upgraded system.
  • Configured basic firewall.
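
For the software RAID1 on the system disks, the kernel exposes array state in /proc/mdstat, where a member map such as [UU] means all members are present and an underscore (for example [U_]) marks a missing one. The sketch below is a minimal check built on that, assuming the mirrors are Linux md arrays (the log says software RAID1 but does not name the tool). The function name is hypothetical.

```shell
#!/bin/sh
# md_degraded FILE
# Print the name of every md array in an mdstat-formatted file whose
# member map contains an underscore, i.e. has a missing or failed member.
md_degraded() {
    awk '
        /^md/ { name = $1 }                    # array header, e.g. "md0 : active raid1 ..."
        /\[[U_]*_[U_]*\]/ { print name }       # member map with at least one "_"
    ' "$1"
}

# Example: check the running system if the file is available.
if [ -r /proc/mdstat ]; then
    md_degraded /proc/mdstat
fi
```

No output means that no array reports a missing member. This covers only the software mirrors, not the RAID6 volume, which the kernel sees as a single device.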

20200602

  • Created monitor machine
  • Placed monitor machine in C165
  • Connected monitor machine to server
  • Configured RAID6. Experimental. More info needed from users. Disks working but disk lights not blinking, contacted Compliq.
  • Configured remote access to monitor. Not working, better solution required
  • Configured remote access to server maintenance interface. Not working, requires reset. First attempt failed due to network config missing.

20200529

  • Server arrived. Placed in C165.
  • Inspected hardware and disk bays
  • Provided electric connectivity
  • Provided network connectivity
  • First boot successful
alicecrunchwip.txt · Last modified: 2021/01/19 17:09 by florido
