Captain's log

Here I am keeping track of the work I am doing on the cluster, so that everybody can follow progress.

The logs read top to bottom, from the most recent change to the oldest.

Florido Paganelli 2014/06/12 19:19

  • shared each node's tmp folder in /nodestmp/<nodename> :!: documentation about it needs to be written :!:

Florido Paganelli 2014/06/02 17:40

  • updated and rebooted nptest and pptest
  • installed quantum espresso and RTE
  • updated salt configuration:
    • software needed by Nuclear Physics
    • module configuration for custom modulefiles
    • motd and banner
  • gitted and backed up salt config

Florido Paganelli 2014/04/28 19:33

  • updated arc interface grid certificate
  • joined NorduGrid Sweden indexes
  • configured syslog installation. Needs tweaking of hostname on nodes with multiple interfaces. These nodes do not have special configuration in salt yet.

Florido Paganelli 2014/03/21 21:00

  • made so many changes that I couldn't keep track of them. Testing nodes are now fully accessible from the internet.
  • rewrote documentation for the cluster.

Florido Paganelli 2014/03/17 20:39

  • added network configuration for n1 (nptest-iridium) and n2 (pptest-iridium)
  • rebooted nodes, now accessible from the internet. eth0 shut down while awaiting iptables config
  • todo: block access to other nodes, only allow course users, disable slurm, configure iptables

Florido Paganelli 2014/03/14 18:00

  • added sshd configuration for n1 and n2, only me and maintenance user can access
  • added envvars for umask
  • changed motd

Florido Paganelli 2014/03/10 17:19

  • updated salt-master on service-iridium due to incompatibilities with newer minions
  • created two simple queues in slurm
  • reconfigured all slurm nodes
  • reconfigured arc frontend

Florido Paganelli 2014/03/07 17:23

  • installed terena certificate on frontend.
  • configured nordugrid VOMS server

Florido Paganelli 2014/02/21 10:26

  • installed and configured arc. Job submission with EMI-ES fails, but maybe ARC 4.1 will solve it.
  • configured arc with slurm and cache; however, it would be better to have two grid users (one for each division) and two queues.

Florido Paganelli 2014/02/18 13:55

  • installed salt on arc-iridium
  • installed other trustanchors on arc-iridium
  • configured firewall (still needs cleanup)
  • installed and configured munge, slurm, autofs, nfs

Florido Paganelli 2014/02/17 19:07

  • installed NG repos on arc-iridium and trustanchors
  • configured a-rex using instantca
  • initiated process of requesting host cert
  • asked Lunarc to open ports for ARC services

Florido Paganelli 2014/02/11 18:29

  • updated hep-srv :!: reboot didn't work, help request submitted to Lunarc.
  • installed epel on arc-iridium
  • installed ARC on arc-iridium
  • installed munge on arc-iridium

Florido Paganelli 2014/02/07 18:31

  • rebooted service-iridium
  • updated and rebooted ssh gateway
  • updated and configured arc-iridium

Florido Paganelli 2014/02/04 18:31

  • rebooted kvm-iridium
  • created arc-iridium machine
  • updated kvm-iridium to Centos6.5
  • cloning disk for arc-iridium
  • started service-iridium update

Florido Paganelli 2014/01/31 16:02

  • finalized slurm installation
  • fixed several iptables and munge issues
  • rebooted all nodes

Florido Paganelli 2014/01/30 17:45

  • finished configuring SLURM, tests started
  • copied ssh keys
  • updated git
  • updated iptables configuration for all nodes
  • changed anaconda scripts (to be tested!) to include ssh key retrieval

Florido Paganelli 2014/01/29 16:55

  • installed 'SLURM' from FGI repositories maintained by Tiggi. This slurm is built without MPI support and with mysql and lua support
  • updated roadmap

Florido Paganelli 2014/01/28 17:28

  • added 'MUNGE' to all nodes incl. 'service-iridium'. Not clear where the frontend should run, but I would say on the same machine that runs ARC. That means it has to share the secret key.
  • generated the secret key on 'service-iridium'.
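
A minimal sketch of how such a secret key can be generated, assuming the common approach of filling the key file with 1024 random bytes. The function name and the scratch directory are illustrative; on a real node the file would be /etc/munge/munge.key, owned by the munge user.

```shell
#!/bin/sh
# Generate a MUNGE-style secret key file at the given path.
# The scratch path used below is for illustration only (assumption).
make_munge_key() {
    keyfile=$1
    # 1024 random bytes from the kernel random source
    dd if=/dev/urandom of="$keyfile" bs=1 count=1024 2>/dev/null
    # the key must be readable only by its owner
    chmod 400 "$keyfile"
}

scratch=$(mktemp -d)
make_munge_key "$scratch/munge.key"
ls -l "$scratch/munge.key"
```

The same key file then has to be copied unchanged to every machine that must authenticate with the others, which is why the frontend has to share it.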

Florido Paganelli 2013/11/07 14:03

  • applied package adds to all nodes
  • updated software package for HEP in salt/common
  • updated salt config to create directories in /nfs, applied to all nodes
  • restarted salt-minion on all nodes

Florido Paganelli 2013/11/06 17:42

  • updated roadmap
  • added few packages to service-iridium /srv/salt/common/init.sls

Florido Paganelli 2013/09/20 18:12

  • CPUs shipped by SouthPole have been replaced by Compliq on the frontend. Cluster Up and Running again.

Florido Paganelli 2013/09/20 15:19

  • checked serial number on frontend.
  • Waiting for compliq to substitute faulty CPUs; the frontend has been shut down.

Florido Paganelli 2013/08/22 12:40

  • created users home folders on storage-iridium
  • reconfigured storage-iridium NIS
  • changed tjatte setup
  • updated testing phase purposes on howtos_users
  • sent update emails

Florido Paganelli 2013/08/21 21:48

  • configured gateway restricted shell
  • added new users to the cluster
  • solved issue with password change. Passwords cannot be changed by users now, :TODO: solve this security issue using PAM on all machines.
  • added missing users to tjatte for testing round
  • written documentation on howtos_users

Florido Paganelli 2013/08/20 17:48

  • morning meeting with Pico to sort out some technicalities. Decisions: change the way direct login is done. Have a limited shell gateway.
  • Afternoon meeting with e-Science group: set new roadmap
  • Redesigned cluster web pages

Florido Paganelli 2013/08/19 21:14

  • got answer from Lunarc. New plan will come after tomorrow's internal meeting with the Lund e-Science group.
  • started writing information on how to use testing nodes in howtos_users

Florido Paganelli 2013/08/16 18:58

  • configured clusterip mode for load balancing between testing nodes n1 and n2
  • added SALT configuration for clusterip (needs more work, waiting for Luis)
  • changed atlas.sh
  • configured kvm-iridium to forward to CLUSTERIP
  • automated salt-call execution at boot time for each node. Node configures itself and reboots at installation time.
  • tested cvmfs on Centos6, seems to work
  • reinstalled n1 with cvmfs
  • ran some tests with cvmfs runKV for missing libraries etc. on n1. Dump is in pflorido user folder.
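
A sketch of what the clusterip rules for n1 and n2 could look like, using the iptables CLUSTERIP target. The shared address, interface, port and multicast MAC are placeholders, not the values used on the cluster. The function only prints each rule so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Print the iptables CLUSTERIP rule for one node of a two-node cluster.
# VIP, IFACE, PORT and CLUSTER_MAC are illustrative placeholders (assumptions).
VIP="192.0.2.10"
IFACE="eth0"
PORT="22"
CLUSTER_MAC="01:00:5E:00:00:20"   # multicast MAC, identical on all nodes
TOTAL_NODES=2

clusterip_rule() {
    local_node=$1   # 1 for n1, 2 for n2
    echo "iptables -I INPUT -d $VIP -i $IFACE -p tcp --dport $PORT" \
         "-j CLUSTERIP --new --hashmode sourceip" \
         "--clustermac $CLUSTER_MAC" \
         "--total-nodes $TOTAL_NODES --local-node $local_node"
}

clusterip_rule 1   # rule for n1
clusterip_rule 2   # rule for n2
```

With the sourceip hash mode every client address is served consistently by the same node. Each node would also need the shared address configured on its interface.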

Florido Paganelli 2013/08/14 20:23

  • finalized cvmfs configuration for sl6. needs to be tested on centos6
  • added profile.d scripts for atlas environment initialization

Florido Paganelli 2013/08/13 18:41

  • nfs configuration changed to nfs4 for SALT keys distribution
  • sl6 installation automation script set up.
  • decisions on how to move Nuclear Physics data into the cluster have been taken: access to a single node of the cluster enabled to allow Luis to start tests. Open access to other researchers requires Lunarc intervention.
  • checked some items in Fixmes

Florido Paganelli 2013/08/12 18:27

  • finalized nfs4 configuration on storage.
  • created autofs salt configuration.
  • created NIS salt configuration
  • versioned salt folder with git.
  • started cvmfs salt automatic configuration
  • changed partitioning scheme on sl6 nodes to accommodate cvmfs. This triggered creation of an sl6 kickstart file.
  • testing sl6 kickstart file

Florido Paganelli 2013/08/09 18:00

  • reconfiguring NFS shares to be compliant with nfs4.
  • meeting w Luis on SALT operations
  • preliminary discussion on dataset and software deployment.

Florido Paganelli 2013/08/07 17:31

  • added SALT iptables configuration on service-iridium
  • created auto.master and auto.home NIS maps for automatic configuration of mounts. This is probably better done with salt with proper auto.home configuration on each node.
  • changed auto.master to include NIS auto.home to n2 for automatic configuration of mounts.
  • added information on how to install autofs and cernvmfs on the wantedpackagesonnodes page
  • writing documentation on how to add new users. Discovered an issue with autofs setup.
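
A sketch of what the two maps could contain, assuming home directories are served by storage-iridium. The export path and the mount options are assumptions, not the actual configuration.

```shell
#!/bin/sh
# Print example autofs maps; export path and mount options are assumptions.
NFS_SERVER="storage-iridium"
HOME_EXPORT="/export/home"      # hypothetical export path

auto_master() {
    # Delegate /home to the NIS map named auto.home
    echo "/home  yp:auto.home"
}

auto_home() {
    # Wildcard entry: '*' matches the requested directory name, '&' is replaced by it
    echo "*  -fstype=nfs4,rw  ${NFS_SERVER}:${HOME_EXPORT}/&"
}

auto_master
auto_home
```

With the wildcard entry a new user needs no map change, only a home directory on the server.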

Florido Paganelli 2013/08/06 18:45

  • better NFS configuration on storage-iridium.
  • meeting with Luis on ssh host key sharing
  • configuration of storage-iridium and NFS for secure key sharing

Florido Paganelli 2013/08/05 19:10

  • installed Frontier-Squid on storage-iridium.
    • :!: the setup needed special SELinux configuration along the lines of what is explained here: http://grokbase.com/t/centos/centos/111zyzd2n0/squid-and-selinux
    • changed firewall to enable squid proxy access
    • created atlas.sh script on n2 to enable atlas environment.
    • configured NIS, nfs, ntpd, on n2 and reported what to configure in the wanted packages page
    • enabled reverse DNS zone on service-iridium in order for the anaconda installer to set up node hostnames.

Florido Paganelli 2013/08/02 18:20

  • cloned sdb1 and sdb3 from kvm-iridium virtimages pool with dd to storage-iridium /export/backupimages/. Attempt to use 'virsh vol-download' failed: it took one day to transfer 2GB. Let's not do that anymore.
  • removed sdb4 partition on kvm-iridium virtimages (can't remember why it was there)
  • restarted sshgateway (iridium) and service-iridium
  • configured kickstart to install SL6
  • installed SL6 on n2
  • set up a specific partition for cernvmfs
  • installed cernvmfs
    • squid proxy needed for operations.
  • fixed an issue with service-iridium hosts file, had wrong IP address

Florido Paganelli 2013/08/01 15:16

  • enabled Luis on hep-monitor
  • issues with virtualization layer, storage management. Hypervisor machine updated and restarted.

Florido Paganelli 2013/07/31 16:30

  • enabled a user for Luis on hep-srv and NIS to login to nodes
  • meeting with Luis to plan configuration with SALT. Plan rescheduled.

Florido Paganelli 2013/07/31 11:35

  • reconfigured all machines to use second network card for boot
  • started kickstart CentOS6 installation on n2,n3,n4

Florido Paganelli 2013/07/29 16:52

  • created a user and a vnc server instance for Luis
  • reconfigured hep-srv to be able to resolve internal hostnames
  • started experimenting with kickstart

Florido Paganelli 2013/07/26 17:08

  • finished configuring named DNS on service-iridium. Now nodes can find other machines.
  • issue with hep-srv: probably broken network config. Sent email to Rickard and Robert from Lunarc.
  • configured n1 to correctly join the domain
  • port-forwarded one of the nodes (n1)

Florido Paganelli 2013/07/25 18:24

  • installed centos6 on node 1 (n1). Corrected remote iso issue.
  • installed and configured nis and nfs shares on n1. Corrected various dhcp server errors.
  • started installing named DNS server on service-iridium. Issues in configuration.

Florido Paganelli 2013/07/24 18:29

  • installed dhcp server on service-iridium
    • DONE: nodes BIOS setup must be changed, the default ethernet boot is not the configured one.
    • DONE: ethernet addresses need to be updated in dhcp server.
    • DONE: issues in dispatching dns must be solved. Maybe installing bind or dnsmasq.
  • installed PXE booting system on service-iridium
  • installed tftp server on service-iridium. Issues with selinux.
  • changed storage-iridium iptables to serve nfs folders
  • created a directory for boot images on storage-iridium
  • successfully booted a node for installation. A mirror of Centos6 is needed to complete the install via nfs
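
A sketch of the kind of dhcpd.conf content that ties DHCP, tftp and PXE together. All addresses and the MAC are placeholders, not the real internal network. The functions only print the fragments.

```shell
#!/bin/sh
# Print a dhcpd.conf fragment for PXE-booting nodes.
# All addresses and the MAC below are placeholders (assumptions).
TFTP_SERVER="10.0.0.1"    # hypothetical internal address of service-iridium

dhcpd_fragment() {
    cat <<EOF
subnet 10.0.0.0 netmask 255.255.255.0 {
    range 10.0.0.100 10.0.0.200;
    next-server $TFTP_SERVER;      # where the node fetches the boot loader via tftp
    filename "pxelinux.0";         # PXELINUX boot loader served by tftp
}
EOF
}

host_entry() {
    # One fixed address per node, keyed on the MAC of the NIC used for booting
    name=$1; mac=$2; ip=$3
    echo "host $name { hardware ethernet $mac; fixed-address $ip; }"
}

dhcpd_fragment
host_entry n1 00:16:3e:00:00:01 10.0.0.11
```

The host entries are where the ethernet addresses have to be kept up to date, and they must match the network card each node actually boots from.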

Florido Paganelli 2013/07/23 18:07

  • created xfs filesystem on 30TB storage
  • created directories to be shared among nodes
  • set up quotas as discussed with Luis. A defined description of quotas must be added to the cluster description document.
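
A sketch of how a per-folder quota can be set with XFS project quotas, assuming the filesystem is mounted with the prjquota option. The mount point, folder, project name, project ID and limit are hypothetical. The function prints the steps instead of running them.

```shell
#!/bin/sh
# Print the steps to put an XFS project (directory) quota on one folder.
# Mount point, folder, project name/ID and limit are hypothetical examples.
# Assumes the filesystem is mounted with the prjquota option.
MNT="/export"

project_quota_steps() {
    name=$1; id=$2; dir=$3; limit=$4
    echo "echo '${id}:${dir}' >> /etc/projects"     # map project ID to directory
    echo "echo '${name}:${id}' >> /etc/projid"      # map project name to ID
    echo "xfs_quota -x -c 'project -s ${name}' $MNT"              # tag the directory tree
    echo "xfs_quota -x -c 'limit -p bhard=${limit} ${name}' $MNT" # set the hard block limit
}

project_quota_steps hep 10 /export/hep 1t
```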

Florido Paganelli 2013/07/22 18:39

  • profiling storage usage after meeting w Luis
  • understanding xfs features
  • understanding logical volume management basics. LVM2 will be used.
  • created logical volumes on 30TB storage
  • quotas will be managed by xfs on folders. To be done.

Florido Paganelli 2013/07/10 16:39

  • understanding Cernvm-fs deployment
  • planning storage server space allocation
  • updating documents

Florido Paganelli 2013/07/08 16:34

  • configured storage server storage-iridium operating system. Storage disks need configuration, will be done once CERNVM is understood. TODO: use of xfs, configuration of 10GB interface
  • updated pictures in Tech_documents

Florido Paganelli 2013/07/05 17:53

  • configured service machine service-iridium
  • configured iptables and NAT routing on sshgateway
  • configured NIS server on service-iridium
  • configured NIS client on sshgateway
  • configured iptables for NIS on service-iridium

Florido Paganelli 2013/07/02 13:54

  • Setup wiki to keep track of progress
  • Created a machine sshgateway iridium.lunarc.lu.se to be used as main ssh gateway and to host some of the services.
  • Configured frontend kvm-iridium machine networking to be ready for hosting.

Fixmes

FIXME FIXME FIXME FIXME FIXME

I decided to collect FIXMEs here because it was difficult to track them down when spread across logs.

  • autofs is mounting everything as the nobody:nobody user, as if all_squash were set. Check how to solve. Some say it is an idmap issue, solved by setting the NIS domain name to iridium and having storage-iridium join the NIS domain. I guess NIS domain names are forwarded from nfs4 to clients.
  • document squid setup better! especially custom selinux modules.
  • squid proxy access should be limited to nodes
  • Optionally install a second failover squid on service-iridium.
  • better iptables configuration on all nodes and servers based on -m state rules. Done on:
    • iridium
    • storage-iridium
    • kvm-iridium (complex due to kvm)
    • service-iridium only for tcp connections. Check what to do for udp.
  • check xfs selinux rules on storage-iridium: those added with chcon might disappear :( Probably needs a module to be created, as for squid.
  • Mixed config for NFS3/NFS4. Would be better to use nfs4 and limit portmapper to NIS.
  • needs stronger authorization check on portmap/rpcbind services (i.e. hosts.allow on all machines) to be done in NIS SALT
  • update page wantedpackagesonnodes with current salt configuration.
  • generation and distribution of host keys on the nodes at deployment time is needed. Might be done as Luis did for SALT.
  • change default user groups in the cluster for all users and in documentation
  • investigate on-demand vm creation
  • change NIS passwd change system. Disable yppasswd on all nodes. For the time being, shut down yppasswdd on service.
  • salt should check if a group is installed. This will speedup things.
  • sort out why dhcp won't renew after trying for long time.
  • add CA certificates to all machines. Probably storage server to share certs would do the trick and need only one crl check in place.
  • define a proper time for node updates
  • disable cvmfs autoupdate
  • salt-minion does not restart after upgrade. Find a way to force it. The salt command might jam if done this way:
    salt -v 'n5.iridium' cmd.run "service salt-minion restart"
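
One possible way around the jam, untested here, is to detach the restart from the salt job, so that the job returns before the minion goes down. The sketch below only prints the command. The function name and the 5-second delay are arbitrary choices.

```shell
#!/bin/sh
# Build a salt command that restarts the minion in the background, so the
# salt job can return before the minion stops. Untested sketch (assumption).
detached_restart() {
    target=$1
    # nohup and '&' detach the restart; the sleep gives the job time to report back
    inner="nohup sh -c 'sleep 5; service salt-minion restart' >/dev/null 2>&1 &"
    printf 'salt -v "%s" cmd.run "%s"\n' "$target" "$inner"
}

detached_restart n5.iridium
```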
iridium_cluster/captainslog.txt · Last modified: 2014/06/12 17:19 by florido