===== Captain's log =====

Here I am keeping track of the work I am doing on the cluster, so that everybody can follow progress. The log reads top to bottom from the most recent change to the oldest.

--- //[[:Florido Paganelli]] 2014/06/12 19:19//
  * shared each node's tmp folder in ''/nodestmp/'' :!: documentation about it needs to be written :!:

--- //[[:Florido Paganelli]] 2014/06/02 17:40//
  * updated and rebooted nptest and pptest
  * installed Quantum ESPRESSO and the RTE
  * updated the salt configuration:
    * software needed by Nuclear Physics
    * module configuration for custom modulefiles
    * motd and banner
  * committed the salt config to git and backed it up

--- //[[:Florido Paganelli]] 2014/04/28 19:33//
  * updated the ARC interface grid certificate
  * joined the NorduGrid Sweden indexes
  * configured the syslog installation. Needs tweaking of the hostname on nodes with multiple interfaces; these nodes do not have special configuration in salt yet.

--- //[[:Florido Paganelli]] 2014/03/21 21:00//
  * made so many changes that I couldn't keep track of them. The testing nodes are now fully accessible from the internet.
  * rewrote the documentation for the cluster.

--- //[[:Florido Paganelli]] 2014/03/17 20:39//
  * added network configuration for n1 (nptest-iridium) and n2 (pptest-iridium)
  * rebooted the nodes, now accessible from the internet. eth0 shut down awaiting the iptables config
  * todo: block access to the other nodes, only allow course users, disable slurm, configure iptables

--- //[[:Florido Paganelli]] 2014/03/14 18:00//
  * added sshd configuration for n1 and n2; only me and the maintenance user can access
  * added envvars for umask
  * changed the motd

--- //[[:Florido Paganelli]] 2014/03/10 17:19//
  * updated salt-master on service-iridium due to incompatibilities with newer minions
  * created two simple queues in SLURM (see the ''slurm.conf'' sketch below)
  * reconfigured all SLURM nodes
  * reconfigured the ARC frontend

--- //[[:Florido Paganelli]] 2014/03/07 17:23//
  * installed the Terena certificate on the frontend.
  * configured the NorduGrid VOMS server

--- //[[:Florido Paganelli]] 2014/02/21 10:26//
  * installed and configured ARC. Job submission with EMI-ES is a failure, but maybe ARC 4.1 will solve everything.
  * configured ARC with SLURM and cache; however, it would be better to have two grid users (one for each division) and two queues.

--- //[[:Florido Paganelli]] 2014/02/18 13:55//
  * installed salt on arc-iridium
  * installed the other trust anchors on arc-iridium
  * configured the firewall (still needs cleanup)
  * installed and configured munge, slurm, autofs, nfs

--- //[[:Florido Paganelli]] 2014/02/17 19:07//
  * installed the NG repos on arc-iridium and the trust anchors
  * configured a-rex using InstantCA
  * initiated the process of requesting a host cert
  * asked Lunarc to open ports for the ARC services
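A minimal sketch of what the two simple SLURM queues from the 2014/03/10 entry might look like in ''slurm.conf''. The partition names and node ranges here are placeholders, not the actual cluster layout.

<code bash>
# Hypothetical sketch only: two minimal SLURM partitions, one per division.
# Partition names and node ranges are assumptions, not the real configuration.
cat >> /etc/slurm/slurm.conf <<'EOF'
PartitionName=np Nodes=n[1-2] Default=YES MaxTime=INFINITE State=UP
PartitionName=pp Nodes=n[3-4] MaxTime=INFINITE State=UP
EOF
# propagate the file to the nodes (e.g. via salt), then re-read it on the controller
scontrol reconfigure
</code>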
--- //[[:Florido Paganelli]] 2014/02/11 18:29//
  * updated hep-srv :!: the reboot didn't work, help request submitted to Lunarc
  * installed EPEL on arc-iridium
  * installed ARC on arc-iridium
  * installed munge on arc-iridium

--- //[[:Florido Paganelli]] 2014/02/07 18:31//
  * rebooted service-iridium
  * updated and rebooted the ssh gateway
  * updated and configured arc-iridium

--- //[[:Florido Paganelli]] 2014/02/04 18:31//
  * rebooted kvm-iridium
  * created the arc-iridium machine
  * updated kvm-iridium to CentOS 6.5
  * cloning a disk for arc-iridium
  * started the service-iridium update

--- //[[:Florido Paganelli]] 2014/01/31 16:02//
  * finalized the SLURM installation
  * fixed several iptables and munge issues
  * rebooted all nodes

--- //[[:Florido Paganelli]] 2014/01/30 17:45//
  * finished configuring SLURM, tests started
  * copied ssh keys
  * updated git
  * updated the iptables configuration for all nodes
  * changed the anaconda scripts (to be tested!) to include ssh key retrieval

--- //[[:Florido Paganelli]] 2014/01/29 16:55//
  * installed 'SLURM' from the FGI repositories maintained by Tiggi. This SLURM is built without MPI support and with mysql and lua support
  * updated the roadmap

--- //[[:Florido Paganelli]] 2014/01/28 17:28//
  * added 'MUNGE' to all nodes incl. 'service-iridium'. It is not clear where the frontend should run, but I would say on the same machine that runs ARC; that means it has to share the secret key (see the MUNGE sketch below).
  * generated the secret key on 'service-iridium'.

--- //[[:Florido Paganelli]] 2013/11/07 14:03//
  * applied the package additions to all nodes
  * updated the software packages for HEP in salt/common
  * updated the salt config to create directories in /nfs, applied to all nodes
  * restarted salt-minion on all nodes

--- //[[:Florido Paganelli]] 2013/11/06 17:42//
  * updated [[iridium_cluster:wip#roadmap]]
  * added a few packages to ''service-iridium'' ''/srv/salt/common/init.sls''

--- //[[:Florido Paganelli]] 2013/09/20 18:12//
  * the CPUs shipped by SouthPole have been replaced by Compliq on the frontend. Cluster up and running again.

--- //[[:Florido Paganelli]] 2013/09/20 15:19//
  * checked the serial number on the frontend.
  * waiting for Compliq to substitute the faulty CPUs; the frontend has been shut down.

--- //[[:Florido Paganelli]] 2013/08/22 12:40//
  * created the users' home folders on storage-iridium
  * reconfigured storage-iridium NIS
  * changed the tjatte setup
  * updated the testing phase purposes on [[iridium_cluster:howtos_users]]
  * sent update emails

--- //[[:Florido Paganelli]] 2013/08/21 21:48//
  * configured the gateway restricted shell
  * added new users to the cluster
  * solved an issue with password change. Passwords cannot be changed by users now. :TODO: solve this security issue using PAM on all machines.
  * added the missing users to tjatte for the testing round
  * wrote documentation on [[iridium_cluster:howtos_users]]

--- //[[:Florido Paganelli]] 2013/08/20 17:48//
  * morning meeting with Pico to sort out some technicalities. Decisions: change the way direct login is done; have a limited shell gateway.
  * afternoon meeting with the e-Science group: set a new roadmap
  * redesigned the cluster web pages

--- //[[Florido Paganelli]] 2013/08/19 21:14//
  * got an answer from Lunarc. A new plan will come after tomorrow's internal meeting with the Lund e-Science group.
  * started writing information on how to use the testing nodes in [[iridium_cluster:howtos_users]]
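A sketch of how the shared MUNGE secret from the 2014/01/28 entry could be generated and distributed. The paths follow the standard munge packaging; the distribution step and the salt target are assumptions, not the actual procedure used here.

<code bash>
# Hypothetical sketch: generate the munge key once and share it with the nodes.
/usr/sbin/create-munge-key                 # creates /etc/munge/munge.key on service-iridium
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
# copy the same key to every node (e.g. through the salt file server),
# then restart the daemon so all hosts share the secret
salt -v '*' service.restart munge
</code>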
--- //[[Florido Paganelli]] 2013/08/16 18:58//
  * configured clusterip mode for load balancing between the testing nodes n1 and n2
  * added SALT configuration for clusterip (needs more work, awaiting Luis)
  * changed atlas.sh
  * configured kvm-iridium to forward to CLUSTERIP
  * automated salt-call execution at boot time for each node. The node configures itself and reboots at installation time.
  * tested cvmfs on CentOS 6, seems to work
  * reinstalled n1 with cvmfs
  * ran some tests with the cvmfs runKV for missing libraries etc. on n1. The dump is in the pflorido user folder.

--- //[[Florido Paganelli]] 2013/08/14 20:23//
  * finalized the cvmfs configuration for SL6. It needs to be tested on CentOS 6
  * added profile.d scripts for the ATLAS environment initialization

--- //[[Florido Paganelli]] 2013/08/13 18:41//
  * changed the NFS configuration to NFSv4 for SALT key distribution
  * set up the SL6 installation automation script.
  * decisions on how to move the Nuclear Physics data into the cluster have been taken: access to a single node of the cluster is enabled to allow Luis to start tests. Open access to other researchers requires Lunarc intervention.
  * checked some items in [[#Fixmes]]

--- //[[Florido Paganelli]] 2013/08/12 18:27//
  * finalized the NFSv4 configuration on storage.
  * created the autofs salt configuration.
  * created the NIS salt configuration
  * versioned the salt folder with git.
  * started the cvmfs salt automatic configuration
  * changed the partitioning scheme on the SL6 nodes to accommodate cvmfs. This triggered the creation of an SL6 kickstart file.
  * testing the SL6 kickstart file

--- //[[Florido Paganelli]] 2013/08/09 18:00//
  * reconfiguring the NFS shares to be compliant with NFSv4.
  * meeting with Luis on SALT operations
  * preliminary discussion on dataset and software deployment.

--- //[[Florido Paganelli]] 2013/08/07 17:31//
  * added the SALT iptables configuration on service-iridium
  * created auto.master and auto.home NIS maps for automatic configuration of mounts. This is probably better done with salt, with a proper auto.home configuration on each node.
  * changed auto.master to include the NIS auto.home on n2 for automatic configuration of mounts.
  * added information on how to install autofs and cernvmfs on the [[iridium_cluster:wantedpackagesonnodes]] page
  * writing documentation on how to add new users. Discovered an issue with the autofs setup.

--- //[[Florido Paganelli]] 2013/08/06 18:45//
  * better NFS configuration on storage-iridium.
  * meeting with Luis on ssh host key sharing
  * configuration of storage-iridium and NFS for secure key sharing

--- //[[Florido Paganelli]] 2013/08/05 19:10//
  * installed Frontier-Squid on storage-iridium.
  * :!: the setup needed special SELinux configuration along the lines of what is explained here: http://grokbase.com/t/centos/centos/111zyzd2n0/squid-and-selinux (see the SELinux sketch below)
  * changed the firewall to enable squid proxy access
  * created the atlas.sh script on n2 to enable the ATLAS environment.
  * configured NIS, nfs, ntpd on n2 and reported what to configure on the [[iridium_cluster:wantedpackagesonnodes|wanted packages page]]
  * enabled the reverse DNS zone on service-iridium in order for the anaconda installer to set up node hostnames.

--- //[[Florido Paganelli]] 2013/08/02 18:20//
  * cloned sdb1 and sdb3 from the kvm-iridium virtimages pool with dd to storage-iridium /export/backupimages/. The attempt to use 'virsh vol-download' failed: it took one day to transfer 2GB. Let's not do that anymore.
  * removed the sdb4 partition on kvm-iridium virtimages (can't remember why it was there)
  * restarted sshgateway (iridium) and service-iridium
  * configured kickstart to install SL6
  * installed SL6 on n2
  * set up a specific partition for cernvmfs
  * installed cernvmfs
  * a squid proxy is needed for operations.
  * fixed an issue with the service-iridium hosts file, which had a wrong IP address
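For the custom SELinux handling mentioned in the 2013/08/05 entry (and flagged again in the Fixmes below), a common recipe is to build a local policy module from the logged AVC denials. A sketch, assuming the denials are in the standard audit log; the module name is an invented placeholder.

<code bash>
# Hypothetical sketch: turn squid-related AVC denials into a local SELinux module.
# The module name "squidlocal" is an assumption; adjust to whatever was actually used.
grep squid /var/log/audit/audit.log | audit2allow -M squidlocal
semodule -i squidlocal.pp        # load the generated policy module
semodule -l | grep squidlocal    # confirm it is installed
</code>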
--- //[[Florido Paganelli]] 2013/08/01 15:16//
  * enabled Luis on hep-monitor
  * issues with the virtualization layer and storage management. The hypervisor machine was updated and restarted.

--- //[[Florido Paganelli]] 2013/07/31 16:30//
  * enabled a user for Luis on hep-srv and NIS to log in to the nodes
  * meeting with Luis to plan the configuration with SALT. Plan rescheduled.

--- //[[Florido Paganelli]] 2013/07/31 11:35//
  * reconfigured all machines to use the second network card for boot
  * started the kickstart CentOS 6 installation on n2, n3, n4

--- //[[Florido Paganelli]] 2013/07/29 16:52//
  * created a user and a vnc server instance for Luis
  * reconfigured hep-srv to be able to resolve internal hostnames
  * started experimenting with kickstart

--- //[[Florido Paganelli]] 2013/07/26 17:08//
  * finished configuring the named DNS on service-iridium. Now the nodes can find the other machines.
  * issue with hep-srv: probably a broken network config. Sent an email to Rickard and Robert from Lunarc.
  * configured n1 to correctly join the domain
  * port-forwarded one of the nodes (n1)

--- //[[Florido Paganelli]] 2013/07/25 18:24//
  * installed CentOS 6 on node 1 (n1). Corrected the remote iso issue.
  * installed and configured NIS and the NFS shares on n1. Corrected various dhcp server errors.
  * started installing the named DNS server on service-iridium. Issues in configuration.

--- //[[Florido Paganelli]] 2013/07/24 18:29//
  * installed a dhcp server on service-iridium
  * DONE: the nodes' BIOS setup must be changed, the default ethernet boot is not the configured one.
  * DONE: the ethernet addresses need to be updated in the dhcp server.
  * DONE: issues in dispatching dns must be solved, maybe by installing bind or dnsmasq.
  * installed the PXE booting system on service-iridium
  * installed a tftp server on service-iridium. Issues with selinux.
  * changed the storage-iridium iptables to serve the nfs folders
  * created a directory for boot images on storage-iridium
  * successfully booted a node for installation. A mirror of CentOS 6 is needed to complete the install via nfs

--- //[[Florido Paganelli]] 2013/07/23 18:07//
  * created an xfs filesystem on the 30TB storage
  * created the directories to be shared among nodes
  * set up quotas as discussed with Luis. A defined description of the quotas must be added to the cluster description document.

--- //[[Florido Paganelli]] 2013/07/22 18:39//
  * profiling storage usage after a meeting with Luis
  * understanding xfs features
  * understanding logical volume management basics. LVM2 will be used.
  * created logical volumes on the 30TB storage
  * quotas will be managed by xfs on folders. To be done.

--- //[[Florido Paganelli]] 2013/07/10 16:39//
  * understanding Cernvm-fs deployment
  * planning the storage server space allocation
  * updating documents

--- //[[Florido Paganelli]] 2013/07/08 16:34//
  * configured the storage server storage-iridium operating system. The storage disks need configuration, which will be done once CERNVM is understood. TODO: use of xfs, configuration of the 10Gb interface
  * updated the pictures in [[#Tech_documents]]

--- //[[Florido Paganelli]] 2013/07/05 17:53//
  * configured the service machine service-iridium
  * configured iptables and NAT routing on sshgateway (see the iptables sketch at the end of the log)
  * configured the NIS server on service-iridium
  * configured the NIS client on sshgateway
  * configured iptables for NIS on service-iridium

--- //[[Florido Paganelli]] 2013/07/02 13:54//
  * set up this wiki to keep track of progress
  * created a machine, sshgateway iridium.lunarc.lu.se, to be used as the main ssh gateway and to host some of the services.
  * configured the frontend kvm-iridium machine networking to be ready for hosting.
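The NAT routing on sshgateway (2013/07/05) and the port-forward to n1 (2013/07/26) boil down to rules along these lines. This is a sketch only: the interface names, addresses and port are made-up placeholders, not the real ones.

<code bash>
# Hypothetical sketch: masquerade outbound traffic from the internal network and
# forward ssh on an alternate port to n1. Interfaces and IPs are placeholders.
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 2222 -j DNAT --to-destination 10.0.0.11:22
iptables -A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i eth0 -p tcp -d 10.0.0.11 --dport 22 -j ACCEPT
</code>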
==== Fixmes ====

FIXME FIXME FIXME FIXME FIXME

I decided to collect the FIXMEs here because it was difficult to track them down when spread across the logs.

  * autofs is mounting everything as the nobody:nobody user, as if all_squash were set. Check how to solve it. Some say this is an idmap issue, solved by setting the NIS domain name to iridium and having storage-iridium join the NIS domain. I guess NIS domain names are forwarded from nfs4 to the clients.
  * document the squid setup better! Especially the custom selinux modules.
  * squid proxy access should be limited to the nodes
  * optionally install a second failover squid on service-iridium.
  * better iptables configuration on all nodes and servers based on ''-m state'' rules. Done on:
    * //iridium//
    * //storage-iridium//
    * //kvm-iridium// (complex due to kvm)
    * //service-iridium//, only for tcp connections. Check what to do for udp.
  * check the xfs selinux rules on //storage-iridium//; those added with chcon might disappear :( Probably needs module creation, as for squid.
  * mixed config for NFSv3/NFSv4. It would be better to use nfs4 and limit the portmapper to NIS.
  * a stronger authorization check is needed on the portmap/rpcbind services (i.e. hosts.allow on all machines), to be done in the NIS SALT config
  * update the page [[iridium_cluster:wantedpackagesonnodes]] with the current salt configuration.
  * generation and distribution of host keys on the nodes at deployment time is needed. Might be done as Luis did for SALT.
  * change the default user groups in the cluster for all users and in the documentation
  * investigate on-demand vm creation
  * change the NIS passwd change system; disable yppasswd on all nodes. For the time being yppasswdd is shut down on service.
  * salt should check whether a group is installed. This will speed things up.
  * sort out why dhcp won't renew after trying for a long time.
  * add CA certificates to all machines. Probably a storage server to share the certs would do the trick, and only one crl check would be needed.
  * define a proper time for node updates
  * disable cvmfs autoupdate
  * salt-minion does not restart after an upgrade. Find a way to force it; the salt command might jam if done like this: salt -v 'n5.iridium' cmd.run "service salt-minion restart" (see the sketch below)
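One possible way around the restart jam noted in the last fixme is to fire the restart asynchronously, so the master does not sit waiting for a reply from a minion that is about to drop its connection. A sketch, not tested on this cluster; the jid is a placeholder for the job id printed by the first command.

<code bash>
# Hypothetical sketch: restart the minion without the master blocking on the reply.
# --async returns a job id immediately instead of waiting for the minion to answer.
salt --async 'n5.iridium' service.restart salt-minion
# later, check whether the job actually ran (replace <jid> with the printed job id)
salt-run jobs.lookup_jid <jid>
</code>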