Captain's log

Here I am keeping track of the work I am doing on the cluster, for everybody to follow progress.

The logs can be read top to bottom, from the most recent change to the oldest.

Florido Paganelli 2014/06/12 19:19

  • shared each node's tmp folder in /nodestmp/<nodename> :!: documentation about it needs to be written :!:
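
A minimal sketch of how such a share could look while the real documentation is missing (subnet, node names and export options here are assumptions, not the actual setup):

    # on each node, export the local /tmp over NFS (subnet is an assumption)
    echo "/tmp  10.0.0.0/24(rw,no_root_squash)" >> /etc/exports
    exportfs -ra
    # on the machines that mount them, an indirect autofs map under /nodestmp
    cat > /etc/auto.nodestmp <<'EOF'
    n1  -fstype=nfs  n1:/tmp
    n2  -fstype=nfs  n2:/tmp
    EOF
    echo "/nodestmp  /etc/auto.nodestmp" >> /etc/auto.master
    service autofs reload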

Florido Paganelli 2014/06/02 17:40

  • updated and rebooted nptest and pptest
  • installed quantum espresso and RTE
  • updated salt configuration:
    • software needed by Nuclear Physics
    • module configuration for custom modulefiles
    • motd and banner
  • gitted and backed up salt config

Florido Paganelli 2014/04/28 19:33

  • updated arc interface grid certificate
  • joined NorduGrid Sweden indexes
  • configured syslog installation. Needs tweaking of hostname on nodes with multiple interfaces. These nodes do not have special configuration in salt yet.

Florido Paganelli 2014/03/21 21:00

  • made so many changes that I couldn't keep track of them all. The testing nodes are now fully accessible from the internet.
  • rewrote documentation for the cluster.

Florido Paganelli 2014/03/17 20:39

  • added network configuration for n1 (nptest-iridium) and n2 (pptest-iridium)
  • rebooted nodes, now accessible from the internet. eth0 shut down while awaiting the iptables config
  • todo: block access to other nodes, only allow course users, disable slurm, configure iptables

Florido Paganelli 2014/03/14 18:00

  • added sshd configuration for n1 and n2 so that only I and the maintenance user can access them (see the sketch after this list)
  • added envvars for umask
  • changed motd
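
A minimal sketch of the sshd restriction and the umask setting (the maintenance account name and the umask value are assumptions):

    # /etc/ssh/sshd_config on n1 and n2: allow only the two accounts
    echo "AllowUsers pflorido maintenance" >> /etc/ssh/sshd_config
    service sshd reload
    # umask via profile.d (the value 0027 is an assumption)
    echo "umask 0027" > /etc/profile.d/umask.sh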

Florido Paganelli 2014/03/10 17:19

  • updated salt-master on service-iridium due to incompatibilities with newer minions
  • created two simple queues in slurm (see the sketch after this list)
  • reconfigured all slurm nodes
  • reconfigured arc frontend
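
A sketch of what two simple slurm partitions amount to (partition names and node lists are assumptions):

    # appended to /etc/slurm/slurm.conf on the controller, then pushed to the nodes
    cat >> /etc/slurm/slurm.conf <<'EOF'
    PartitionName=hep  Nodes=n[3-6] Default=YES MaxTime=INFINITE State=UP
    PartitionName=np   Nodes=n[1-2] Default=NO  MaxTime=INFINITE State=UP
    EOF
    scontrol reconfigure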

Florido Paganelli 2014/03/07 17:23

  • installed terena certificate on frontend.
  • configured nordugrid VOMS server

Florido Paganelli 2014/02/21 10:26

  • installed and configured arc. Job submission with EMI-ES is failing, but maybe ARC 4.1 will solve it all.
  • configured arc with slurm and cache; however, it would be better to have two grid users (one for each division) and two queues.

Florido Paganelli 2014/02/18 13:55

  • installed salt on arc-iridium
  • installed other trustanchors on arc-iridium
  • configured firewall (still needs cleanup)
  • installed and configured munge, slurm, autofs, nfs

Florido Paganelli 2014/02/17 19:07

  • installed NG repos on arc-iridium and trustanchors
  • configured a-rex using instantca
  • initiated process of requesting host cert
  • asked Lunarc to open ports for ARC services

Florido Paganelli 2014/02/11 18:29

  • updated hep-srv :!: reboot didn't work, help request submitted to Lunarc.
  • installed epel on arc-iridium
  • installed ARC on arc-iridium
  • installed munge on arc-iridium

Florido Paganelli 2014/02/07 18:31

  • rebooted service-iridium
  • updated and rebooted ssh gateway
  • updated and configured arc-iridium

Florido Paganelli 2014/02/04 18:31

  • rebooted kvm-iridium
  • created arc-iridium machine
  • updated kvm-iridium to Centos6.5
  • cloning disk for arc-iridium
  • started service-iridium update

Florido Paganelli 2014/01/31 16:02

  • finalized slurm installation
  • fixed several iptables and munge issues
  • rebooted all nodes

Florido Paganelli 2014/01/30 17:45

  • finished configuring SLURM, tests started
  • copied ssh keys
  • updated git
  • updated iptables configuration for all nodes
  • changed anaconda scripts (to be tested!) to include ssh key retrieval

Florido Paganelli 2014/01/29 16:55

  • installed 'SLURM' from the FGI repositories maintained by Tiggi. This slurm build is without MPI support and with mysql and lua support
  • updated roadmap

Florido Paganelli 2014/01/28 17:28

  • added 'MUNGE' to all nodes incl. 'service-iridium'. Not clear where the frontend should run, but I would say on the same machine that runs ARC; that means it has to share the secret key (see the sketch after this list).
  • generated the secret key on 'service-iridium'.
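
A sketch of generating the key and sharing it with the other machines (the node list here is an assumption):

    # on service-iridium
    /usr/sbin/create-munge-key
    # copy the same key to every machine that runs munge/slurm/arc
    for node in arc-iridium n1 n2 n3 n4; do
        scp -p /etc/munge/munge.key root@$node:/etc/munge/munge.key
        ssh root@$node 'chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key; service munge restart'
    done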

Florido Paganelli 2013/11/07 14:03

  • applied package adds to all nodes
  • updated software package for HEP in salt/common
  • updated salt config to create directories in /nfs, applied to all nodes
  • restarted salt-minion on all nodes

Florido Paganelli 2013/11/06 17:42

  • updated roadmap
  • added few packages to service-iridium /srv/salt/common/init.sls

Florido Paganelli 2013/09/20 18:12

  • CPUs shipped by SouthPole have been replaced by Compliq on the frontend. Cluster up and running again.

Florido Paganelli 2013/09/20 15:19

  • checked serial number on frontend.
  • Waiting for Compliq to replace the faulty CPUs; the frontend has been shut down.

Florido Paganelli 2013/08/22 12:40

  • created users home folders on storage-iridium
  • reconfigured storage-iridium NIS
  • changed tjatte setup
  • updated testing phase purposes on howtos_users
  • sent update emails

Florido Paganelli 2013/08/21 21:48

  • configured gateway restricted shell
  • added new users to the cluster
  • solved issue with password changes. Passwords cannot be changed by users for now; :TODO: solve this security issue using PAM on all machines.
  • added missing users to tjatte for testing round
  • wrote documentation on howtos_users

Florido Paganelli 2013/08/20 17:48

  • morning meeting with Pico to sort out some technicalities. Decisions: change the way direct login is done. Have a limited shell gateway.
  • Afternoon meeting with e-Science group: set new roadmap
  • Redesigned cluster web pages

Florido Paganelli 2013/08/19 21:14

  • got answer from Lunarc. New plan will come after tomorrow's internal meeting with the Lund e-Science group.
  • started writing information on how to use testing nodes in howtos_users

Florido Paganelli 2013/08/16 18:58

  • configured clusterip mode for load balancing between testing nodes n1 and n2 (see the sketch after this list)
  • added SALT configuration for clusterip (needs more work, awaiting Luis)
  • changed atlas.sh
  • configured kvm-iridium to forward to CLUSTERIP
  • automated salt-call execution at boot time for each node. Node configures itself and reboots at installation time.
  • tested cvmfs on Centos6, seems to work
  • reinstalled n1 with cvmfs
  • ran some tests with cvmfs runKV for missing libraries etc. on n1. The dump is in the pflorido user folder.
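
A sketch of the CLUSTERIP rule referenced above (virtual IP, interface and cluster MAC are assumptions; the real values live in the salt configuration):

    # on n1 (--local-node 1); n2 uses the same rule with --local-node 2
    iptables -A INPUT -d 10.0.0.100 -i eth1 -p tcp --dport 22 \
        -j CLUSTERIP --new --hashmode sourceip \
        --clustermac 01:00:5E:00:00:20 --total-nodes 2 --local-node 1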

Florido Paganelli 2013/08/14 20:23

  • finalized cvmfs configuration for sl6; it needs to be tested on centos6 (see the sketch after this list)
  • added profile.d scripts for atlas environment initialization
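
A sketch of the kind of /etc/cvmfs/default.local this comes down to (repository list, proxy port and cache quota are assumptions):

    cat > /etc/cvmfs/default.local <<'EOF'
    CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
    CVMFS_HTTP_PROXY="http://storage-iridium:3128"
    CVMFS_QUOTA_LIMIT=20000
    EOF
    cvmfs_config setup
    cvmfs_config probe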

Florido Paganelli 2013/08/13 18:41

  • nfs configuration changed to nfs4 for SALT keys distribution
  • sl6 installation automation script set up.
  • decisions on how to move Nuclear Physics data into the cluster have been taken: access to a single node of the cluster enabled to allow Luis to start tests. Open access to other researchers requires Lunarc intervention.
  • checked some items in Fixmes

Florido Paganelli 2013/08/12 18:27

  • finalized nfs4 configuration on storage.
  • created autofs salt configuration.
  • created NIS salt configuration
  • versioned salt folder with git.
  • started cvmfs salt automatic configuration
  • changed partitioning scheme on sl6 nodes to accommodate cvmfs. This triggered creation of an sl6 kickstart file.
  • testing sl6 kickstart file

Florido Paganelli 2013/08/09 18:00

  • reconfiguring NFS shares to be compliant with nfs4.
  • meeting w Luis on SALT operations
  • preliminary discussion on dataset and software deployment.

Florido Paganelli 2013/08/07 17:31

  • added SALT iptables configuration on service-iridium
  • created auto.master and auto.home NIS maps for automatic configuration of mounts (see the sketch after this list). This is probably better done via salt, with a proper auto.home configuration on each node.
  • changed auto.master on n2 to include the NIS auto.home map for automatic configuration of mounts.
  • added information on how to install autofs and cernvmfs on the wantedpackagesonnodes page
  • writing documentation on how to add new users. Discovered an issue with autofs setup.
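
A sketch of how those maps could look on the NIS master (map contents are assumptions; auto.master and auto.home also have to be enabled in /var/yp/Makefile):

    # on service-iridium, the sources the NIS maps are built from
    cat > /etc/auto.master <<'EOF'
    /home   auto.home   --timeout=60
    EOF
    cat > /etc/auto.home <<'EOF'
    *   -fstype=nfs4   storage-iridium:/export/home/&
    EOF
    make -C /var/yp     # rebuild and push the maps to the clients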

Florido Paganelli 2013/08/06 18:45

  • better NFS configuration on storage-iridium.
  • meeting with Luis on ssh host key sharing
  • configuration of storage-iridium and NFS for secure key sharing

Florido Paganelli 2013/08/05 19:10

  • installed Frontier-Squid on storage-iridium.
    • :!: the setup needed special SELinux configuration along the lines of what is explained here: http://grokbase.com/t/centos/centos/111zyzd2n0/squid-and-selinux (see the sketch after this list)
    • changed firewall to enable squid proxy access
    • created atlas.sh script on n2 to enable atlas environment.
    • configured NIS, nfs, ntpd, on n2 and reported what to configure in the wanted packages page
    • enabled the reverse DNS zone on service-iridium in order for the anaconda installer to set up node hostnames.
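
A sketch of the custom SELinux module mentioned above, following the approach in the linked thread (the module name is arbitrary):

    # build a local policy module from the recorded AVC denials and load it
    grep squid /var/log/audit/audit.log | audit2allow -M squidlocal
    semodule -i squidlocal.pp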

Florido Paganelli 2013/08/02 18:20

  • cloned sdb1 and sdb3 from kvm-iridium virtimages pool with dd to storage-iridium /export/backupimages/. Attempt to use 'virsh vol-download' failed: it took one day to transfer 2GB. Let's not do that anymore
  • removed sdb4 partition on kvm-iridium virtimages (can't remember why it was there)
  • restarted sshgateway (iridium) and service-iridium
  • configured kickstart to install SL6
  • installed SL6 on n2
  • set up a specific partition for cernvmfs
  • installed cernvmfs
    • squid proxy needed for operations.
  • fixed an issue with service-iridium hosts file, had wrong IP address

Florido Paganelli 2013/08/01 15:16

  • enabled Luis on hep-monitor
  • issues with the virtualization layer and storage management. Hypervisor machine updated and restarted.

Florido Paganelli 2013/07/31 16:30

  • enabled a user for Luis on hep-srv and NIS to login to nodes
  • meeting with Luis to plan configuration with SALT. Plan rescheduled.

Florido Paganelli 2013/07/31 11:35

  • reconfigured all machines to use second network card for boot
  • started kickstart CentOS6 installation on n2,n3,n4

Florido Paganelli 2013/07/29 16:52

  • created a user and a vnc server instance for Luis
  • reconfigured hep-srv to be able to resolve internal hostnames
  • started experimenting with kickstart

Florido Paganelli 2013/07/26 17:08

  • finished configuring named DNS on service-iridium. Now nodes can find other machines (see the sketch after this list).
  • issue with hep-srv: probably broken network config. Sent email to Rickard and Robert from Lunarc.
  • configured n1 to correctly join the domain
  • port-forwarded one of the nodes (n1)
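
A sketch of the kind of zone declaration involved (the zone name, taken from node names like n5.iridium, and the file name are assumptions):

    # appended to /etc/named.conf on service-iridium
    cat >> /etc/named.conf <<'EOF'
    zone "iridium" IN {
        type master;
        file "iridium.zone";
    };
    EOF
    service named restart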

Florido Paganelli 2013/07/25 18:24

  • installed centos6 on node 1 (n1). Corrected remote iso issue.
  • installed and configured nis and nfs shares on n1. Corrected various dhcp server errors.
  • started installing named DNS server on service-iridium. Issues in configuration.

Florido Paganelli 2013/07/24 18:29

  • installed dhcp server on service-iridium (see the sketch after this list)
    • DONE: nodes BIOS setup must be changed, the default ethernet boot is not the configured one.
    • DONE: ethernet addresses need to be updated in the dhcp server.
    • DONE: issues in dispatching dns must be solved. Maybe installing bind or dnsmasq.
  • installed PXE booting system on service-iridium
  • installed tftp server on service-iridium. Issues with selinux.
  • changed storage-iridium iptables to serve nfs folders
  • created a directory for boot images on storage-iridium
  • successfully booted a node for installation. A mirror of Centos6 is needed to complete the install via nfs
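
A sketch of the PXE-related part of the dhcp server configuration (subnet, addresses and MACs are assumptions):

    # appended to /etc/dhcp/dhcpd.conf on service-iridium
    cat >> /etc/dhcp/dhcpd.conf <<'EOF'
    subnet 10.0.0.0 netmask 255.255.255.0 {
        range 10.0.0.100 10.0.0.199;
        next-server 10.0.0.2;          # tftp server (service-iridium)
        filename "pxelinux.0";
        host n1 { hardware ethernet 00:11:22:33:44:55; fixed-address 10.0.0.11; }
    }
    EOF
    service dhcpd restart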

Florido Paganelli 2013/07/23 18:07

  • created xfs filesystem on 30TB storage
  • created directories to be shared among nodes
  • set up quotas as discussed with Luis. A defined description of quotas must be added to the cluster description document.
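
A sketch of per-directory quotas via xfs project quotas (mount point, project name, id and limit are assumptions; the filesystem must be mounted with the prjquota option):

    # register the directory as an xfs project, then cap it
    echo "10:/export/atlas" >> /etc/projects
    echo "atlas:10"         >> /etc/projid
    xfs_quota -x -c 'project -s atlas' /export
    xfs_quota -x -c 'limit -p bhard=10t atlas' /export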

Florido Paganelli 2013/07/22 18:39

  • profiling storage usage after meeting w Luis
  • understanding xfs features
  • understanding logical volume management basics. LVM2 will be used.
  • created logical volumes on the 30TB storage (see the sketch after this list)
  • quotas will be managed by xfs on folders. To be done.
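
A minimal LVM2 + xfs sketch of what this comes down to (device, names and sizes are assumptions):

    pvcreate /dev/sdb                          # mark the raid device for LVM
    vgcreate vg_storage /dev/sdb               # one volume group on top of it
    lvcreate -n lv_export -L 25T vg_storage    # carve out a logical volume
    mkfs.xfs /dev/vg_storage/lv_export         # xfs, as decided above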

Florido Paganelli 2013/07/10 16:39

  • understanding Cernvm-fs deployment
  • planning storage server space allocation
  • updating documents

Florido Paganelli 2013/07/08 16:34

  • configured the operating system of the storage server storage-iridium. The storage disks need configuration; this will be done once CERNVM is understood. TODO: use of xfs, configuration of the 10GB interface
  • updated pictures in Tech_documents

Florido Paganelli 2013/07/05 17:53

  • configured service machine service-iridium
  • configured iptables and NAT routing on sshgateway
  • configured NIS server on service-iridium (see the sketch after this list)
  • configured NIS client on sshgateway
  • configured iptables for NIS on service-iridium
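
A sketch of the server/client side of the NIS setup (the domain name iridium is taken from the notes in Fixmes below; the rest is assumed):

    # on service-iridium (master)
    ypdomainname iridium
    echo "NISDOMAIN=iridium" >> /etc/sysconfig/network
    /usr/lib64/yp/ypinit -m
    service ypserv start
    # on sshgateway (client)
    ypdomainname iridium
    echo "domain iridium server service-iridium" >> /etc/yp.conf
    service ypbind start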

Florido Paganelli 2013/07/02 13:54

  • Setup wiki to keep track of progress
  • Created a machine sshgateway iridium.lunarc.lu.se to be used as main ssh gateway and to host some of the services.
  • Configured frontend kvm-iridium machine networking to be ready for hosting.

Fixmes

FIXME FIXME FIXME FIXME FIXME

I decided to collect FIXMEs here because it was difficult to track them down when spread across the logs.

  • autofs is mounting everything as the nobody:nobody user, as if all_squash were set. Check how to solve. Some say it is an idmapd issue, solved by setting the NIS domain name to iridium and having storage-iridium join the NIS domain. I guess NIS domain names are forwarded from nfs4 to the clients.
  • document squid setup better! especially custom selinux modules.
  • squid proxy access should be limited to nodes
  • Optionally install a second failover squid on service-iridium.
  • better iptables configuration on all nodes and servers based on -m state rules. Done on
    • iridium
    • storage-iridium
    • kvm-iridium (complex due to kvm)
    • service-iridium only for tcp connections. Check what to do for udp.
  • check xfs selinux rules on storage-iridium: labels added with chcon might disappear :( Probably needs module creation as for squid.
  • Mixed config for NFS3/NFS4. Would be better to use nfs4 and limit portmapper to NIS.
  • needs stronger authorization check on portmap/rpcbind services (i.e. hosts.allow on all machines) to be done in NIS SALT
  • update page wantedpackagesonnodes with current salt configuration.
  • generation and distribution of host keys on the nodes at deployment time is needed. Might be done as Luis did for SALT.
  • change default user groups in the cluster for all users and in documentation
  • investigate on-demand vm creation
  • change NIS passwd change system. Disable yppasswd on all nodes. For the time being, yppasswdd has been shut down on service-iridium.
  • salt should check if a group is installed. This will speed things up.
  • sort out why dhcp won't renew after trying for a long time.
  • add CA certificates to all machines. Probably storage server to share certs would do the trick and need only one crl check in place.
  • define a proper time for when nodes update
  • disable cvmfs autoupdate
  • salt-minion does not restart after an upgrade. Find a way to force it. The salt command might jam if done this way:
    salt -v 'n5.iridium' cmd.run "service salt-minion restart"