===== Captain's log =====

Here I am keeping track of the work I am doing on the cluster, so that everybody can follow progress. The log reads top to bottom from the most recent change to the oldest.

--- //[[:Florido Paganelli]] 2014/06/12 19:19//
  * shared each node's tmp folder in ''/nodestmp/'' :!: documentation about it needs to be written :!:

--- //[[:Florido Paganelli]] 2014/06/02 17:40//
  * updated and rebooted nptest and pptest
  * installed Quantum ESPRESSO and the RTE
  * updated the salt configuration:
    * software needed by Nuclear Physics
    * module configuration for custom modulefiles
    * motd and banner
  * committed the salt config to git and backed it up

--- //[[:Florido Paganelli]] 2014/04/28 19:33//
  * updated the ARC interface grid certificate
  * joined the NorduGrid Sweden indexes
  * configured the syslog installation. Needs tweaking of the hostname on nodes with multiple interfaces; these nodes do not have special configuration in salt yet.

--- //[[:Florido Paganelli]] 2014/03/21 21:00//
  * made so many changes that I couldn't keep track of them. The testing nodes are now fully accessible from the internet.
  * rewrote the documentation for the cluster.

--- //[[:Florido Paganelli]] 2014/03/17 20:39//
  * added network configuration for n1 (nptest-iridium) and n2 (pptest-iridium)
  * rebooted the nodes, now accessible from the internet. eth0 shut down awaiting the iptables config
  * todo: block access to the other nodes, only allow course users, disable slurm, configure iptables

--- //[[:Florido Paganelli]] 2014/03/14 18:00//
  * added sshd configuration for n1 and n2; only me and the maintenance user can access
  * added envvars for umask
  * changed the motd

--- //[[:Florido Paganelli]] 2014/03/10 17:19//
  * updated salt-master on service-iridium due to incompatibilities with newer minions
  * created two simple queues in SLURM (see the ''slurm.conf'' sketch below)
  * reconfigured all SLURM nodes
  * reconfigured the ARC frontend

--- //[[:Florido Paganelli]] 2014/03/07 17:23//
  * installed the Terena certificate on the frontend.
  * configured the NorduGrid VOMS server

--- //[[:Florido Paganelli]] 2014/02/21 10:26//
  * installed and configured ARC. Job submission with EMI-ES is a failure, but maybe ARC 4.1 will solve everything.
  * configured ARC with SLURM and cache; however, it would be better to have two grid users (one for each division) and two queues.

--- //[[:Florido Paganelli]] 2014/02/18 13:55//
  * installed salt on arc-iridium
  * installed the other trust anchors on arc-iridium
  * configured the firewall (still needs cleanup)
  * installed and configured munge, slurm, autofs, nfs

--- //[[:Florido Paganelli]] 2014/02/17 19:07//
  * installed the NG repos on arc-iridium and the trust anchors
  * configured a-rex using InstantCA
  * initiated the process of requesting a host cert
  * asked Lunarc to open ports for the ARC services
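A minimal sketch of what the two simple SLURM queues from the 2014/03/10 entry might look like in ''slurm.conf''. The partition names and node ranges here are placeholders, not the actual cluster layout.

<code bash>
# Hypothetical sketch only: two minimal SLURM partitions, one per division.
# Partition names and node ranges are assumptions, not the real configuration.
cat >> /etc/slurm/slurm.conf <<'EOF'
PartitionName=np Nodes=n[1-2] Default=YES MaxTime=INFINITE State=UP
PartitionName=pp Nodes=n[3-4] MaxTime=INFINITE State=UP
EOF
# propagate the file to the nodes (e.g. via salt), then re-read it on the controller
scontrol reconfigure
</code>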
--- //[[:Florido Paganelli]] 2014/02/11 18:29//
  * updated hep-srv :!: the reboot didn't work, help request submitted to Lunarc
  * installed EPEL on arc-iridium
  * installed ARC on arc-iridium
  * installed munge on arc-iridium

--- //[[:Florido Paganelli]] 2014/02/07 18:31//
  * rebooted service-iridium
  * updated and rebooted the ssh gateway
  * updated and configured arc-iridium

--- //[[:Florido Paganelli]] 2014/02/04 18:31//
  * rebooted kvm-iridium
  * created the arc-iridium machine
  * updated kvm-iridium to CentOS 6.5
  * cloning a disk for arc-iridium
  * started the service-iridium update

--- //[[:Florido Paganelli]] 2014/01/31 16:02//
  * finalized the SLURM installation
  * fixed several iptables and munge issues
  * rebooted all nodes

--- //[[:Florido Paganelli]] 2014/01/30 17:45//
  * finished configuring SLURM, tests started
  * copied ssh keys
  * updated git
  * updated the iptables configuration for all nodes
  * changed the anaconda scripts (to be tested!) to include ssh key retrieval

--- //[[:Florido Paganelli]] 2014/01/29 16:55//
  * installed 'SLURM' from the FGI repositories maintained by Tiggi. This SLURM is built without MPI support and with mysql and lua support
  * updated the roadmap

--- //[[:Florido Paganelli]] 2014/01/28 17:28//
  * added 'MUNGE' to all nodes incl. 'service-iridium'. It is not clear where the frontend should run, but I would say on the same machine that runs ARC; that means it has to share the secret key (see the MUNGE sketch below).
  * generated the secret key on 'service-iridium'.

--- //[[:Florido Paganelli]] 2013/11/07 14:03//
  * applied the package additions to all nodes
  * updated the software packages for HEP in salt/common
  * updated the salt config to create directories in /nfs, applied to all nodes
  * restarted salt-minion on all nodes

--- //[[:Florido Paganelli]] 2013/11/06 17:42//
  * updated [[iridium_cluster:wip#roadmap]]
  * added a few packages to ''service-iridium'' ''/srv/salt/common/init.sls''

--- //[[:Florido Paganelli]] 2013/09/20 18:12//
  * the CPUs shipped by SouthPole have been replaced by Compliq on the frontend. Cluster up and running again.

--- //[[:Florido Paganelli]] 2013/09/20 15:19//
  * checked the serial number on the frontend.
  * waiting for Compliq to substitute the faulty CPUs; the frontend has been shut down.

--- //[[:Florido Paganelli]] 2013/08/22 12:40//
  * created the users' home folders on storage-iridium
  * reconfigured storage-iridium NIS
  * changed the tjatte setup
  * updated the testing phase purposes on [[iridium_cluster:howtos_users]]
  * sent update emails

--- //[[:Florido Paganelli]] 2013/08/21 21:48//
  * configured the gateway restricted shell
  * added new users to the cluster
  * solved an issue with password change. Passwords cannot be changed by users now. :TODO: solve this security issue using PAM on all machines.
  * added the missing users to tjatte for the testing round
  * wrote documentation on [[iridium_cluster:howtos_users]]

--- //[[:Florido Paganelli]] 2013/08/20 17:48//
  * morning meeting with Pico to sort out some technicalities. Decisions: change the way direct login is done; have a limited shell gateway.
  * afternoon meeting with the e-Science group: set a new roadmap
  * redesigned the cluster web pages

--- //[[Florido Paganelli]] 2013/08/19 21:14//
  * got an answer from Lunarc. A new plan will come after tomorrow's internal meeting with the Lund e-Science group.
  * started writing information on how to use the testing nodes in [[iridium_cluster:howtos_users]]
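A sketch of how the shared MUNGE secret from the 2014/01/28 entry could be generated and distributed. The paths follow the standard munge packaging; the distribution step and the salt target are assumptions, not the actual procedure used here.

<code bash>
# Hypothetical sketch: generate the munge key once and share it with the nodes.
/usr/sbin/create-munge-key                 # creates /etc/munge/munge.key on service-iridium
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
# copy the same key to every node (e.g. through the salt file server),
# then restart the daemon so all hosts share the secret
salt -v '*' service.restart munge
</code>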
--- //[[Florido Paganelli]] 2013/08/16 18:58//
  * configured clusterip mode for load balancing between the testing nodes n1 and n2
  * added SALT configuration for clusterip (needs more work, awaiting Luis)
  * changed atlas.sh
  * configured kvm-iridium to forward to CLUSTERIP
  * automated salt-call execution at boot time for each node. The node configures itself and reboots at installation time.
  * tested cvmfs on CentOS 6, seems to work
  * reinstalled n1 with cvmfs
  * ran some tests with the cvmfs runKV for missing libraries etc. on n1. The dump is in the pflorido user folder.

--- //[[Florido Paganelli]] 2013/08/14 20:23//
  * finalized the cvmfs configuration for SL6. It needs to be tested on CentOS 6
  * added profile.d scripts for the ATLAS environment initialization

--- //[[Florido Paganelli]] 2013/08/13 18:41//
  * changed the NFS configuration to NFSv4 for SALT key distribution
  * set up the SL6 installation automation script.
  * decisions on how to move the Nuclear Physics data into the cluster have been taken: access to a single node of the cluster is enabled to allow Luis to start tests. Open access to other researchers requires Lunarc intervention.
  * checked some items in [[#Fixmes]]

--- //[[Florido Paganelli]] 2013/08/12 18:27//
  * finalized the NFSv4 configuration on storage.
  * created the autofs salt configuration.
  * created the NIS salt configuration
  * versioned the salt folder with git.
  * started the cvmfs salt automatic configuration
  * changed the partitioning scheme on the SL6 nodes to accommodate cvmfs. This triggered the creation of an SL6 kickstart file.
  * testing the SL6 kickstart file

--- //[[Florido Paganelli]] 2013/08/09 18:00//
  * reconfiguring the NFS shares to be compliant with NFSv4.
  * meeting with Luis on SALT operations
  * preliminary discussion on dataset and software deployment.

--- //[[Florido Paganelli]] 2013/08/07 17:31//
  * added the SALT iptables configuration on service-iridium
  * created auto.master and auto.home NIS maps for automatic configuration of mounts. This is probably better done with salt, with a proper auto.home configuration on each node.
  * changed auto.master to include the NIS auto.home on n2 for automatic configuration of mounts.
  * added information on how to install autofs and cernvmfs on the [[iridium_cluster:wantedpackagesonnodes]] page
  * writing documentation on how to add new users. Discovered an issue with the autofs setup.

--- //[[Florido Paganelli]] 2013/08/06 18:45//
  * better NFS configuration on storage-iridium.
  * meeting with Luis on ssh host key sharing
  * configuration of storage-iridium and NFS for secure key sharing

--- //[[Florido Paganelli]] 2013/08/05 19:10//
  * installed Frontier-Squid on storage-iridium.
  * :!: the setup needed special SELinux configuration along the lines of what is explained here: http://grokbase.com/t/centos/centos/111zyzd2n0/squid-and-selinux (see the SELinux sketch below)
  * changed the firewall to enable squid proxy access
  * created the atlas.sh script on n2 to enable the ATLAS environment.
  * configured NIS, nfs, ntpd on n2 and reported what to configure on the [[iridium_cluster:wantedpackagesonnodes|wanted packages page]]
  * enabled the reverse DNS zone on service-iridium in order for the anaconda installer to set up node hostnames.

--- //[[Florido Paganelli]] 2013/08/02 18:20//
  * cloned sdb1 and sdb3 from the kvm-iridium virtimages pool with dd to storage-iridium /export/backupimages/. The attempt to use 'virsh vol-download' failed: it took one day to transfer 2GB. Let's not do that anymore.
  * removed the sdb4 partition on kvm-iridium virtimages (can't remember why it was there)
  * restarted sshgateway (iridium) and service-iridium
  * configured kickstart to install SL6
  * installed SL6 on n2
  * set up a specific partition for cernvmfs
  * installed cernvmfs
  * a squid proxy is needed for operations.
  * fixed an issue with the service-iridium hosts file, which had a wrong IP address
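For the custom SELinux handling mentioned in the 2013/08/05 entry (and flagged again in the Fixmes below), a common recipe is to build a local policy module from the logged AVC denials. A sketch, assuming the denials are in the standard audit log; the module name is an invented placeholder.

<code bash>
# Hypothetical sketch: turn squid-related AVC denials into a local SELinux module.
# The module name "squidlocal" is an assumption; adjust to whatever was actually used.
grep squid /var/log/audit/audit.log | audit2allow -M squidlocal
semodule -i squidlocal.pp        # load the generated policy module
semodule -l | grep squidlocal    # confirm it is installed
</code>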
--- //[[Florido Paganelli]] 2013/08/01 15:16//
  * enabled Luis on hep-monitor
  * issues with the virtualization layer and storage management. The hypervisor machine was updated and restarted.

--- //[[Florido Paganelli]] 2013/07/31 16:30//
  * enabled a user for Luis on hep-srv and NIS to log in to the nodes
  * meeting with Luis to plan the configuration with SALT. Plan rescheduled.

--- //[[Florido Paganelli]] 2013/07/31 11:35//
  * reconfigured all machines to use the second network card for boot
  * started the kickstart CentOS 6 installation on n2, n3, n4

--- //[[Florido Paganelli]] 2013/07/29 16:52//
  * created a user and a vnc server instance for Luis
  * reconfigured hep-srv to be able to resolve internal hostnames
  * started experimenting with kickstart

--- //[[Florido Paganelli]] 2013/07/26 17:08//
  * finished configuring the named DNS on service-iridium. Now the nodes can find the other machines.
  * issue with hep-srv: probably a broken network config. Sent an email to Rickard and Robert from Lunarc.
  * configured n1 to correctly join the domain
  * port-forwarded one of the nodes (n1)

--- //[[Florido Paganelli]] 2013/07/25 18:24//
  * installed CentOS 6 on node 1 (n1). Corrected the remote iso issue.
  * installed and configured NIS and the NFS shares on n1. Corrected various dhcp server errors.
  * started installing the named DNS server on service-iridium. Issues in configuration.

--- //[[Florido Paganelli]] 2013/07/24 18:29//
  * installed a dhcp server on service-iridium
  * DONE: the nodes' BIOS setup must be changed, the default ethernet boot is not the configured one.
  * DONE: the ethernet addresses need to be updated in the dhcp server.
  * DONE: issues in dispatching dns must be solved, maybe by installing bind or dnsmasq.
  * installed the PXE booting system on service-iridium
  * installed a tftp server on service-iridium. Issues with selinux.
  * changed the storage-iridium iptables to serve the nfs folders
  * created a directory for boot images on storage-iridium
  * successfully booted a node for installation. A mirror of CentOS 6 is needed to complete the install via nfs

--- //[[Florido Paganelli]] 2013/07/23 18:07//
  * created an xfs filesystem on the 30TB storage
  * created the directories to be shared among nodes
  * set up quotas as discussed with Luis. A defined description of the quotas must be added to the cluster description document.

--- //[[Florido Paganelli]] 2013/07/22 18:39//
  * profiling storage usage after a meeting with Luis
  * understanding xfs features
  * understanding logical volume management basics. LVM2 will be used.
  * created logical volumes on the 30TB storage
  * quotas will be managed by xfs on folders. To be done.

--- //[[Florido Paganelli]] 2013/07/10 16:39//
  * understanding Cernvm-fs deployment
  * planning the storage server space allocation
  * updating documents

--- //[[Florido Paganelli]] 2013/07/08 16:34//
  * configured the storage server storage-iridium operating system. The storage disks need configuration, which will be done once CERNVM is understood. TODO: use of xfs, configuration of the 10Gb interface
  * updated the pictures in [[#Tech_documents]]

--- //[[Florido Paganelli]] 2013/07/05 17:53//
  * configured the service machine service-iridium
  * configured iptables and NAT routing on sshgateway (see the iptables sketch at the end of the log)
  * configured the NIS server on service-iridium
  * configured the NIS client on sshgateway
  * configured iptables for NIS on service-iridium

--- //[[Florido Paganelli]] 2013/07/02 13:54//
  * set up this wiki to keep track of progress
  * created a machine, sshgateway iridium.lunarc.lu.se, to be used as the main ssh gateway and to host some of the services.
  * configured the frontend kvm-iridium machine networking to be ready for hosting.
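The NAT routing on sshgateway (2013/07/05) and the port-forward to n1 (2013/07/26) boil down to rules along these lines. This is a sketch only: the interface names, addresses and port are made-up placeholders, not the real ones.

<code bash>
# Hypothetical sketch: masquerade outbound traffic from the internal network and
# forward ssh on an alternate port to n1. Interfaces and IPs are placeholders.
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 2222 -j DNAT --to-destination 10.0.0.11:22
iptables -A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i eth0 -p tcp -d 10.0.0.11 --dport 22 -j ACCEPT
</code>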
==== Fixmes ====

FIXME FIXME FIXME FIXME FIXME

I decided to collect the FIXMEs here because it was difficult to track them down when spread across the logs.

  * autofs is mounting everything as the nobody:nobody user, as if all_squash were set. Check how to solve it. Some say this is an idmap issue, solved by setting the NIS domain name to iridium and having storage-iridium join the NIS domain. I guess NIS domain names are forwarded from nfs4 to the clients.
  * document the squid setup better! Especially the custom selinux modules.
  * squid proxy access should be limited to the nodes
  * optionally install a second failover squid on service-iridium.
  * better iptables configuration on all nodes and servers based on ''-m state'' rules. Done on:
    * //iridium//
    * //storage-iridium//
    * //kvm-iridium// (complex due to kvm)
    * //service-iridium//, only for tcp connections. Check what to do for udp.
  * check the xfs selinux rules on //storage-iridium//; those added with chcon might disappear :( Probably needs module creation, as for squid.
  * mixed config for NFSv3/NFSv4. It would be better to use nfs4 and limit the portmapper to NIS.
  * a stronger authorization check is needed on the portmap/rpcbind services (i.e. hosts.allow on all machines), to be done in the NIS SALT config
  * update the page [[iridium_cluster:wantedpackagesonnodes]] with the current salt configuration.
  * generation and distribution of host keys on the nodes at deployment time is needed. Might be done as Luis did for SALT.
  * change the default user groups in the cluster for all users and in the documentation
  * investigate on-demand vm creation
  * change the NIS passwd change system; disable yppasswd on all nodes. For the time being yppasswdd is shut down on service.
  * salt should check whether a group is installed. This will speed things up.
  * sort out why dhcp won't renew after trying for a long time.
  * add CA certificates to all machines. Probably a storage server to share the certs would do the trick, and only one crl check would be needed.
  * define a proper time for node updates
  * disable cvmfs autoupdate
  * salt-minion does not restart after an upgrade. Find a way to force it; the salt command might jam if done like this: salt -v 'n5.iridium' cmd.run "service salt-minion restart" (see the sketch below)
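One possible way around the restart jam noted in the last fixme is to fire the restart asynchronously, so the master does not sit waiting for a reply from a minion that is about to drop its connection. A sketch, not tested on this cluster; the jid is a placeholder for the job id printed by the first command.

<code bash>
# Hypothetical sketch: restart the minion without the master blocking on the reply.
# --async returns a job id immediately instead of waiting for the minion to answer.
salt --async 'n5.iridium' service.restart salt-minion
# later, check whether the job actually ran (replace <jid> with the printed job id)
salt-run jobs.lookup_jid <jid>
</code>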