====== ALICE crunch server WIP ======

This page contains an overview of the machine and an everyday log of what is being done. More accurate documentation will follow later.

===== Overview =====

The machine has three groups of disks:

  * One M.2 (SSD) for boot and system, in RAID1 with the U.2 disks below (3 copies of the system, but we will see if that is of any use)
  * Two U.2 (SSD) in RAID1 for user homes
  * A large-capacity RAID6 for data, ~60TB

See the description in the graph below.

<code>
.------------------------------------------.
| M.2 Disk 250GB                           |
|------------------------------------------|
| Main system disk - boot /                |
| operating system                         |
| /swap                                    |
| /ahome (admin home separated from users) |
'------------------------------------------'
          |
          | SW RAID1 (system mirrored to the 100GB copies on both U.2 disks)
          v
.----------------------------------------.                .----------------------------------------.
| U.2 disk 1 960GB                       |                | U.2 disk 2 960GB                       |
|----------------------------------------|                |----------------------------------------|
| - User folders /home 700GB             |<-------------->| - User folders /home 700GB             |
|----------------------------------------|    SW RAID1    |----------------------------------------|
| - copy of boot disk (bootable) / 100GB |<-------------->| - copy of boot disk (bootable) / 100GB |
|   copy of /ahome 100GB                 |                |   copy of /ahome 100GB                 |
'----------------------------------------'                '----------------------------------------'

 __________   .--------------------------------------------------.   __________
[_|||||||_°]  | RAID6 storage - 8x 12TB disks              /disk |  [_|||||||_°]
[_|||||||_°]  |--------------------------------------------------|  [_|||||||_°]
[_|||||||_°]  | - 7x 12TB in RAID6 -> 60TB (50-55 usable)        |  [_|||||||_°]
              | - 1x 12TB spare                                  |
              '--------------------------------------------------'
</code>

One can log in to the machine in the same way as with aurora. Additionally, admins can log in to a separate monitor machine to check the status of hardware components. This machine will also send warnings in case of issues.

<code>
  Admins ----------------------------------> HW Monitor

  Researchers (own laptop / teddi) --------> ALICE new machine
</code>

Florido will take care of OS installation and hardware maintenance/monitoring. This info will be shared with ALICE admins. ALICE admins will have access as root and can control any parameter they want.

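The layout above can be cross-checked on the running system. The sketch below is only an illustration: the md device name is an assumption, and the real device naming should be taken from ''/proc/mdstat''.

<code bash>
# Software RAID1 status (system mirror and /home mirror)
cat /proc/mdstat
mdadm --detail /dev/md0    # assumed md device name; check /proc/mdstat for the real ones

# Block devices, sizes and mount points, including the hardware RAID6 volume
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# Free space on the volumes described above
df -h / /home /disk
</code>

Note that the RAID6 volume is managed by the hardware controller, so it does not appear in ''/proc/mdstat''; it shows up as a single block device (see the storcli notes in the logbook).
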
===== TODO =====

  * Configure networking: DONE :-D
  * Configure maintenance interface: DONE :-D
  * Configure HW monitoring server: ongoing FIXME
    * Configure machine OS and services DONE :-D
    * Configure hardware monitoring FIXME
    * Configure user access to monitor FIXME
      * Add NIS auth but with local folders? VS a dedicated user with access to specific hw?
  * Install system: DONE :-D
  * Create RAID6: DONE :-D
    * Find best settings, discuss with users DONE :-D
  * Configure system: ongoing DONE :-D
    * Configure firewall DONE :-D
    * Test fallback operating system in case of failure FIXME
    * Configure config management system. Testing **ansible** DONE :-D
    * Enable snmp monitoring DONE :-D
      * Add server to monitor DONE :-D
    * Create admin users: ongoing DONE :-D
      * Generic alice user DONE :-D
      * Give admin permissions to relevant alice users DONE :-D
    * Format and connect RAID6 disk DONE :-D
    * Configure RAID monitoring DONE :-D
  * User services (tasks mainly to be done by admin users): DONE :-D
    * Create users DONE :-D
    * Create folders DONE :-D
    * Share folders across servers (requires installing new services) :?:
    * Install ALICE software DONE :-D (by ALICE members)
      * More details here :?:
    * Move data DONE :-D (helped by a more performant script)
  * Documentation
    * Write a page on how to manage the system FIXME
      * Started on [[:alicebeast]]
      * Must be accessible only by admins
      * Must include the system description above
      * Must describe common procedures
  * Ask if users want remote X11 - NO

===== Logbook =====

==== 202101** ====

  * new entry

==== 20210119 ====

  * Configured RAID monitoring (see the sketch at the end of this page)

==== 20200616 ====

  * Configured the system to be managed remotely via ansible
    * Changed sshd_config
    * Added management keys to the root user
    * Management is currently done via my workstation; this may change in the future.
  * Installed basic software
  * Created the alice admin user
  * Installed basic build software

==== 20200612 ====

  * Formatted the RAID volume as xfs: ''mkfs.xfs -f -d su=1m,sw=5 -L alicedisk /dev/sda''
    * Shows a warning, but it should be harmless

^ su=1m | 1MB stripe unit |
^ sw=5 | 5 data disks (+2 parity) |
^ -L alicedisk | xfs label |
^ /dev/sda | the RAID volume presented as a single device to the kernel |

  * Created folder ''/disk''
  * Created an entry in fstab for the disk, based on its UUID:

<code>
# RAID disk
UUID=0d4a40e5-084e-404f-9219-6c3645929ec2 /disk xfs rw,seclabel,relatime,attr2,inode64,sunit=2048,swidth=10240,noquota 0 0
</code>

==== 20200609 ====

  * Configured server network and hostname: beast
  * Installed the storcli RAID management tool from the vendor website
    * https://www.broadcom.com/products/storage/raid-controllers/megaraid-9460-8i
    * https://docs.broadcom.com/docs/007.1316.0000.0000_Unified_StorCLI_PUL.zip
    * Software installed in ''/opt/MegaRAID/storcli/''
    * For brevity, added an alias ''storcli'' in root's .bashrc
  * Created the array volume with a 1024k strip size for very large files: ''storcli /c0 add vd r6 name=alicedisk drives=133:0-6 Strip=1024''
    * Note: initialization takes 13h
    * RAID volume name in storcli: ''/c0/v0''
  * Created a hotspare (a disk that kicks in if one breaks): ''storcli /c0/e133/s7 add hotsparedrive''
  * Wrote a few notes on storcli in [[it_tips:storcliqr|MegaRAID storcli QR]]

==== 20200603 ====

  * Configured server network. Current hostname: alice
  * Configured remote access to the monitor using X2GO client-server technology. Works nicely.
  * Reset the maintenance interface. Tested ok. Custom tools not working, but the web browser interface is ok.
  * Installed the system. Configured RAID1 on the system disks. Upgraded the system.
  * Configured a basic firewall.

==== 20200602 ====

  * Created the monitor machine
  * Placed the monitor machine in C165
  * Connected the monitor machine to the server
  * Configured RAID6. Experimental; more info needed from users. Disks working but disk lights not blinking, contacted Compliq.
  * Configured remote access to the monitor. Not working, a better solution is required
  * Configured remote access to the server maintenance interface. Not working, requires a reset. The first attempt failed due to missing network config.

==== 20200529 ====

  * Server arrived. Placed in C165.
  * Inspected hardware and disk bays
  * Provided electric connectivity
  * Provided network connectivity
  * First boot successful
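
For reference, the RAID monitoring mentioned in the 20210119 entry can be built on periodic storcli checks. The sketch below is only an illustration using commands from the 20200609 entry; the ''storcli64'' binary name, the idea of running it from cron, and the wording of the comments are assumptions, not the actual configuration.

<code bash>
#!/bin/bash
# Hypothetical periodic RAID health check, e.g. run daily from cron or by the HW monitoring server.
# Paths follow the 20200609 logbook entry; adjust to the real installation.
STORCLI=/opt/MegaRAID/storcli/storcli64

# Virtual drive state: /c0/v0 should report "Optl" (optimal)
$STORCLI /c0/v0 show

# Physical drives on the controller, including the hotspare in slot 7
$STORCLI /c0/eall/sall show
</code>

The output can then be parsed for drive and virtual-drive states and forwarded as a warning by the monitoring machine when something is not optimal.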