alicecrunchwip

Differences

This shows you the differences between two versions of the page.

alicecrunchwip [2020/06/03 15:22] florido [Logbook]
alicecrunchwip [2020/06/18 09:58] florido [TODO]
Line 12: Line 12:
  
 See the description in the diagram below.
 +
 +<code>
 +
 +
 +             .------------------------------------------.
 +             |              M.2 Disk 250GB              |
 +             |------------------------------------------|
 +  .----------| Main system disk - boot /                |                              SW RAID1
 +  |          | operating system                         |-------------------------------------------------------------.
 +  |          | /swap                                    |                                                             |
 +  |          | /ahome (admin home separated from users) |                                                             |
 +  |          '------------------------------------------'                                                             |
 +  |                                                                                                                   |
 +  |   SW RAID1                                                                                                        |
 +  |         .----------------------------------------.                  .----------------------------------------.    |
 +  |         |            U.2 disk 1 960GB            |                  |            U.2 disk 2 960GB            |    |
 +  |         |----------------------------------------|                  |----------------------------------------|    |
 +  |         | - User folders /home 700GB             |<---------------->| - User folders /home 700GB             |    |
 +  |         | -------------------------------------- |     SW RAID1     | -------------------------------------- |    |
 +  --------->| - copy of boot disk (bootable) / 100GB |                  | - copy of boot disk (bootable) / 100GB |<---'
 +            |   copy of /ahome 100GB                 |<---------------->|   copy of /ahome 100GB                 |
 +            '----------------------------------------'                  '----------------------------------------'
 +
 +
 +
 +                          .------------------------------------------------.
 +               __________ |      RAID6 storage - 8x 12TB disks /disk       | __________
 +              [_|||||||_°]|------------------------------------------------|[_|||||||_°]
 +              [_|||||||_°]| - 7x 12TB in RAID6 -> 60TB (50-55 usable)      |[_|||||||_°]
 +              [_|||||||_°]| - 1x 12TB spare                                |[_|||||||_°]
 +                          '------------------------------------------------'
 +</code>
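The capacity figures in the RAID6 box follow from RAID6 reserving two disks' worth of space for parity. A minimal sketch to reproduce them, assuming the 12TB on the disk labels means decimal terabytes and that tools such as ''df'' report binary TiB; the helper name ''raid6_capacity'' is hypothetical and not part of the machine's setup:

<code bash>
#!/usr/bin/env bash
# Hypothetical helper: data capacity of a RAID6 array.
# RAID6 keeps two disks' worth of parity, so data capacity = (disks - 2) * size.
# Assumes sizes in decimal TB (10^12 bytes); df and similar tools usually show
# binary TiB (2^40 bytes), which is why 60 TB appears as about 54.6 TiB.
raid6_capacity() {
    local disks=$1 size_tb=$2
    awk -v n="$disks" -v s="$size_tb" 'BEGIN {
        raw = (n - 2) * s              # decimal TB available for data
        tib = raw * 1e12 / (2 ^ 40)    # the same amount in binary TiB
        printf "%d TB raw, %.2f TiB\n", raw, tib
    }'
}

# 7 active 12 TB disks; the 8th disk is the hot spare and adds no capacity
raid6_capacity 7 12
</code>

With these inputs it prints ''60 TB raw, 54.57 TiB''; filesystem metadata takes a little more on top, which is consistent with the 50-55 usable estimate in the diagram.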
  
 One can log in to the machine in the same way as to aurora.
  
 Additionally, admins can log in to a separate monitoring machine to check the status of hardware components. This machine will also send warnings in case of issues.
 +
 +<code>
 +
 +                                       ____   __
 +                                      |    | |==|
 +                  .------Admins------>|____| |  |  HW Monitor
 +                  |                   /::::/ |__|
 +                  |                              |
 +              __  _                              |<monitoring>
 +     o       [__]|=|                             v
 +    /|\      /::/|_|                ________
 +    / \                            |==|=====|
 +         Own laptop / teddi        |  |     |
 +                  |                |  |     |
 +                  |                |  |     |  ALICE new machine
 +                  '--Researchers-->|  |     |
 +                                   |  |====°|
 +                                   |__|_____|
 +
 +
 +
 +</code>
  
 Florido will take care of OS installation and hardware maintenance/monitoring. This information will be shared with ALICE admins.
 ALICE admins will have root access and can control any parameter they want.
  
-===== Logbook =====
-
-**TODO:**
-  * Configure networking: DONE
-  * Configure maintenance interface: DONE
-  * Configure monitoring server: DONE
-  * Install system: DONE
-  * Create RAID6: ongoing
-    * Find best settings, discuss with users
-  * Configure system: ongoing
-    * Configure firewall
-    * Configure config management system. Maybe chance to test ansible
-    * Create admin users
-    * Format and connect RAID6 disk
 +===== TODO =====
 +
  
 +  * Configure networking: DONE :-D
 +  * Configure maintenance interface: DONE :-D
 +  * Configure HW monitoring server: ongoing FIXME
 +    * Configure machine OS and services DONE :-D
 +    * Configure hardware monitoring FIXME
 +    * Configure user access to monitor FIXME
 +      * Add NIS auth but with local folders, or a dedicated user with access to specific HW?
 +  * Install system: DONE :-D
 +  * Create RAID6: DONE :-D
 +    * Find best settings, discuss with users DONE :-D
 +  * Configure system: ongoing FIXME
 +    * Configure firewall DONE :-D
 +    * Test fallback operating system in case of failure FIXME
 +    * Configure config management system. Testing **ansible** DONE :-D
 +      * Enable SNMP monitoring FIXME
 +        * add server to monitor FIXME
 +    * Create admin users ongoing FIXME
 +      * generic alice user DONE :-D
 +      * give admin permissions to relevant alice users FIXME
 +    * Format and connect RAID6 disk DONE :-D
 +    * Configure RAID monitoring FIXME
 +  * User services (tasks mainly to be done by admin users): :?:
 +    * Create users :?:
 +    * Create folders :?:
 +    * Share folders across servers (requires installing new services) :?:
 +    * Install ALICE software :?:
 +      * More details here :?:
 +    * Move data :?:
 +  * Documentation
 +    * Write a page on how to manage the system :?:
 +      * Must be only accessible by admins
 +      * Must include system description above here
 +      * Must describe common procedures
 +      * Ask if users want remote X11
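For the //Configure RAID monitoring// item, one possible starting point for the SW RAID1 mirrors is to scan ''/proc/mdstat'', where a healthy two-disk mirror shows ''[UU]'' and a degraded one shows an underscore for the missing member, e.g. ''[U_]''. This is a minimal sketch, assuming the SW RAID1 is Linux md (mdadm); the function name ''degraded_md_arrays'' is hypothetical, it does not cover the hardware RAID6 (which would need a storcli-based check), and how the result is reported is left open:

<code bash>
#!/usr/bin/env bash
# Hypothetical check, assuming the SW RAID1 arrays are Linux md devices.
# Reads mdstat-formatted text (default /proc/mdstat) and prints the name of
# every array whose member status (e.g. [UU]) contains "_" for a missing disk.
# Returns 0 if all arrays are healthy, 1 if at least one is degraded.
degraded_md_arrays() {
    local mdstat=${1:-/proc/mdstat}
    awk '
        /^md[^ ]* : / { array = $1; next }    # array header, e.g. "md127 : active raid1 ..."
        match($0, /\[[U_]+\]/) {              # member status such as [UU] or [U_]
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0) { print array; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    ' "$mdstat"
}
</code>

On the server it would be called without arguments so that it reads the live ''/proc/mdstat''; a non-zero return could then trigger whatever warning mechanism the HW monitor ends up using.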
 ===== Logbook =====
  
Line 40: Line 115:
 ==== 20206** ====
  * new entry
 +
 +==== 2020616 ====
 +  * Configured system to be managed remotely via ansible
 +    * Changed sshd_config
 +    * Added management keys to the root user
 +    * Management is currently done from my workstation; this may change in the future.
 +  * Installed basic software
 +  * Created alice admin user
 +  * Installed basic build software
 +
 +==== 2020612 ====
 +  * Formatted RAID volume as xfs <code bash>mkfs.xfs -f -d su=1m,sw=5 -L alicedisk /dev/sda</code>
 +    * Shows a warning but should be harmless
 +
 +^ su=1m        | 1 MiB stripe unit (same as the 1024k RAID strip size) |
 +^ sw=5         | 5 data disks (+2 parity) |
 +^ -L alicedisk | xfs label |
 +^ /dev/sda     | the RAID volume as seen by the kernel |
 +
 +  * created folder ''/disk''
 +  * created entry in fstab for disk based on UUID:<code ini># RAID disk
 +UUID=0d4a40e5-084e-404f-9219-6c3645929ec2 /disk                   xfs     rw,seclabel,relatime,attr2,inode64,sunit=2048,swidth=10240,noquota 0 0</code>
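The ''sunit'' and ''swidth'' values in the fstab line are the ''su''/''sw'' options given to ''mkfs.xfs'', expressed in the 512-byte units that the xfs mount options use. A small sketch to double-check the conversion; the helper name ''xfs_stripe_opts'' is hypothetical, and the inputs are the values used above (1024k strip, 5 data disks):

<code bash>
#!/usr/bin/env bash
# Hypothetical helper: convert the mkfs.xfs stripe options into the
# sunit/swidth mount options, which xfs expresses in 512-byte units.
#   $1 = stripe unit in KiB (su=1m -> 1024, same as storcli Strip=1024)
#   $2 = number of data disks (sw=5 -> 7 RAID6 disks minus 2 parity)
xfs_stripe_opts() {
    local su_kib=$1 sw=$2
    local sunit=$(( su_kib * 1024 / 512 ))   # stripe unit in 512-byte units
    local swidth=$(( sunit * sw ))           # full stripe width in 512-byte units
    echo "sunit=${sunit},swidth=${swidth}"
}

xfs_stripe_opts 1024 5
</code>

This prints ''sunit=2048,swidth=10240'', matching the fstab entry. ''xfs_info /disk'' should report the same geometry in filesystem blocks instead (4 KiB by default), i.e. ''sunit=256'' and ''swidth=1280''.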
 +==== 2020609 ====
 +  * configured server network and hostname: beast
 +  * Installed the storcli RAID management tool from the vendor website
 +    * https://www.broadcom.com/products/storage/raid-controllers/megaraid-9460-8i
 +    * https://docs.broadcom.com/docs/007.1316.0000.0000_Unified_StorCLI_PUL.zip
 +    * Software installed in ''/opt/MegaRAID/storcli/''
 +    * For brevity, added the alias ''storcli'' to root's ''.bashrc''
 +  * Created RAID6 array volume on enclosure 133, slots 0-6, with 1024k strip size for very large files <code bash>storcli /c0 add vd r6 name=alicedisk drives=133:0-6 Strip=1024</code>
 +    * Note: initialization takes 13h
 +    * RAID volume name in storcli: ''/c0/v0''
 +  * Created hot spare on slot 7 (a disk that kicks in if one breaks) <code bash>storcli /c0/e133/s7 add hotsparedrive</code>
 +  * Wrote a few notes on storcli in [[it_tips:storcliqr|MegaRAID storcli QR]]
  
 ==== 2020603 ====
Line 46: Line 155:
  * Reset maintenance interface. Tested ok. Custom tools not working but web browser interface ok.
  * Installed system. Configured RAID1 on system disks. Upgraded system.
 +  * Configured basic firewall.
  
 ==== 2020602 ====
alicecrunchwip.txt · Last modified: 2020/06/18 09:58 by florido
