====== ALICE crunch server WIP ======
This page contains an overview of the machine and a day-to-day log of what is being done. More thorough documentation will follow later.
===== Overview =====
The machine has three sets of disks:
* One M.2 SSD for boot and system, in software RAID1 with the two U.2 disks below (three copies of the system; we will see whether this is of any use)
* Two U.2 SSDs in RAID1 for user homes
* One large-capacity RAID6 array for data, ~60TB
See the diagram below.
<code>
           .------------------------------------------.
           | M.2 Disk 250GB                           |
           |------------------------------------------|
.----------| Main system disk - boot /                | SW RAID1
|          | operating system                         |--------------------------------------------------------------.
|          | /swap                                    |                                                              |
|          | /ahome (admin home separated from users) |                                                              |
|          '------------------------------------------'                                                              |
|                                                                                                                    |
| SW RAID1                                                                                                           |
|          .----------------------------------------.                  .----------------------------------------.    |
|          | U.2 disk 1 960GB                       |                  | U.2 disk 2 960GB                       |    |
|          |----------------------------------------|                  |----------------------------------------|    |
|          | - User folders /home 700GB             |<---------------->| - User folders /home 700GB             |    |
|          | -------------------------------------- |     SW RAID1     | -------------------------------------- |    |
'--------->| - copy of boot disk (bootable) / 100GB |                  | - copy of boot disk (bootable) / 100GB |<---'
           |   copy of /ahome 100GB                 |                  |   copy of /ahome 100GB                 |
           '----------------------------------------'                  '----------------------------------------'
</code>
<code>
              .------------------------------------------------.
 __________   | RAID6 storage - 8x 12TB disks   /disk          |   __________
[_|||||||_°]  |------------------------------------------------|  [_|||||||_°]
[_|||||||_°]  | - 7x 12TB in RAID6 -> 60TB (50-55 usable)      |  [_|||||||_°]
[_|||||||_°]  | - 1x 12TB spare                                |  [_|||||||_°]
              '------------------------------------------------'
</code>
Users can log in to the machine in the same way as on aurora.
Additionally, admins can log in to a separate monitor machine to check the status of hardware components. This machine will also send warnings in case of issues.
<code>
                      ____    __
                     |    |  |==|
 .------Admins------>|____|  |  |  HW Monitor
 |                   /::::/  |__|
 |                             |
    __  _                      |
 o [__]|=|                     v
/|\/::/|_|              ________
/ \              17sp  |==|=====|
Own laptop / teddi     |  |     |
 |                     |  |     |
 |                     |  |     |  ALICE new machine
 '--Researchers------->|  |     |
                       |  |====°|
                       |__|_____|
</code>
Florido will take care of OS installation and hardware maintenance/monitoring. This info will be shared with ALICE admins.
ALICE admins will have root access and can control any parameter they want.
===== TODO =====
* Configure networking: DONE :-D
* Configure maintenance interface: DONE :-D
* Configure HW monitoring server: ongoing FIXME
* Configure machine os and services DONE :-D
* Configure hardware monitoring FIXME
* Configure user access to monitor FIXME
* add NIS auth but with local home folders? Or a dedicated user with access to specific hardware?
* Install system: DONE :-D
* Create RAID6: DONE :-D
* Find best settings, discuss with users DONE :-D
* Configure system: ongoing DONE :-D
* Configure firewall DONE :-D
* Test fallback operating system in case of failure FIXME
* Configure config management system. Testing **ansible** DONE :-D
* Enable snmp monitoring DONE :-D
* add server to monitor DONE :-D
* Create admin users ongoing DONE :-D
* generic alice user DONE :-D
* give admin permissions to relevant alice users DONE :-D
* Format and connect RAID6 disk DONE :-D
* Configure RAID monitoring DONE :-D
* User services (tasks mainly to be done by admin users): DONE :-D
* Create users DONE :-D
* Create folders DONE :-D
* Share folders across servers (requires installing new services) :?:
* Install ALICE software DONE :-D by ALICE members
* More details here :?:
* Move data DONE :-D helped with a more performant script
* Documentation
* Write a page on how to manage the system FIXME
* Started on [[:alicebeast]]
* Must be only accessible by admins
* Must include system description above here
* Must describe common procedures
* Ask if users want remote X11 - NO
===== Logbook =====
==== 202101** ====
* new entry
==== 20210119 ====
* Configured RAID monitoring
==== 20200616 ====
* Configured system to be managed remotely via ansible
* Changed sshd_config
* Added management keys to root user
* Management currently done via my workstation, may change in the future.
* Installed basic software
* Created alice admin user
* Installed basic build software
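The ansible-based management described above could look roughly like the sketch below. This is a hypothetical fragment, not the actual configuration: the inventory group ''crunch'', the package list, and the ''files/sshd_config'' path are all assumptions for illustration.

```yaml
# Hypothetical inventory (hosts.ini) -- real hostnames/keys are not on this page:
# [crunch]
# beast ansible_user=root

# playbook.yml: reproduce the basic setup steps logged above
- hosts: crunch
  tasks:
    - name: Ensure basic build software is present
      package:
        name: [gcc, make, git]
        state: present
    - name: Keep sshd_config under ansible control
      copy:
        src: files/sshd_config    # assumed local copy of the tuned config
        dest: /etc/ssh/sshd_config
        owner: root
        mode: "0600"
      notify: restart sshd
  handlers:
    - name: restart sshd
      service:
        name: sshd
        state: restarted
```

Keeping the sshd configuration in the playbook means the "Changed sshd_config" step above stays reproducible if the system has to be reinstalled.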
==== 20200612 ====
* Formatted RAID volume as xfs: ''mkfs.xfs -f -d su=1m,sw=5 -L alicedisk /dev/sda''
* Shows a warning but should be harmless
^ su=1m | 1MiB stripe unit (strip size per disk) |
^ sw=5 | stripe width: 5 data disks (+2 parity disks) |
^ -L alicedisk | xfs label |
^ /dev/sda | the raid shown as a device to the kernel |
* created folder ''/disk''
* created entry in fstab for the disk, based on UUID:
<code>
# RAID disk
UUID=0d4a40e5-084e-404f-9219-6c3645929ec2 /disk xfs rw,seclabel,relatime,attr2,inode64,sunit=2048,swidth=10240,noquota 0 0
</code>
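The ''su''/''sw'' mkfs options and the ''sunit=2048,swidth=10240'' mount options describe the same stripe geometry in different units: ''su'' is the per-disk strip size (matching ''Strip=1024'' in storcli), ''sw'' the number of data disks, and ''sunit''/''swidth'' express the same thing in 512-byte sectors. A quick check of the arithmetic:

```python
# xfs stripe geometry used above: su=1m, sw=5
SECTOR = 512                 # xfs sunit/swidth are counted in 512-byte sectors
su_bytes = 1024 * 1024       # su=1m : 1 MiB strip per disk (Strip=1024 in storcli)
sw = 5                       # sw=5  : data disks (7 active - 2 parity)

sunit = su_bytes // SECTOR   # per-disk stripe unit in sectors
swidth = sw * sunit          # full stripe width in sectors

print(sunit, swidth)         # 2048 10240 -- matches the fstab options above
```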
==== 20200609 ====
* configured server network and hostname: beast
* Installed storcli RAID management tool from Vendor website
* https://www.broadcom.com/products/storage/raid-controllers/megaraid-9460-8i
* https://docs.broadcom.com/docs/007.1316.0000.0000_Unified_StorCLI_PUL.zip
* Sw installed in ''/opt/MegaRAID/storcli/''
* For brevity added alias ''storcli'' in root .bashrc
* Created array volume with 1024k strip size for very large files: ''storcli /c0 add vd r6 name=alicedisk drives=133:0-6 Strip=1024''
* Note: initialization takes 13h
* Raid volume name in storcli:''/c0/v0''
* Created hotspare (disk that kicks in if one breaks): ''storcli /c0/e133/s7 add hotsparedrive''
* Written few notes on storcli in [[it_tips:storcliqr|MegaRAID storcli QR]]
==== 20200603 ====
* Configured server network. Current hostname: alice
* Configured remote access to monitor using X2GO client-server technology. Works nicely.
* Reset maintenance interface. Tested ok. Custom tools not working but web browser interface ok.
* Installed system. Configured RAID1 on system disks. Upgraded system.
* Configured basic firewall.
==== 20200602 ====
* Created monitor machine
* Placed monitor machine in C165
* Connected monitor machine to server
* Configured RAID6. Experimental. More info needed from users. Disks working but disk lights not blinking, contacted Compliq.
* Configured remote access to monitor. Not working, better solution required
* Configured remote access to server maintenance interface. Not working, requires reset. First attempt failed due to network config missing.
==== 20200529 ====
* Server arrived. Placed in C165.
* Inspected hardware and disk bays
* Provided electric connectivity
* Provided network connectivity
* First boot successful