iridium_cluster:wip
===== Roadmap =====
  
:!: new schedule set September 9th, 2015 :!:
  
  * 1) Reconfigure service disks on frontend with a more flexible technology, e.g. lvm.
    * Due w38
  * 2) Reinstall service machines with lxc instead of kvm/virtualization.
    * Due w38
  * 3) Connect to the newly bought storage at Lunarc.
    * Due October 2015
  * 4) Connect to the default storage DDN at Lunarc.
    * Due October 2015
  * 5) Configure the new nodes to integrate into Lunarc.
    * Due January 2016
  * 6) Configure access to be integrated into Lunarc.
    * Due January 2016
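Item 1 above could be sketched with lvm roughly as follows; the device name, volume group name, and sizes are examples only, not the frontend's actual layout:

```shell
# Turn a spare service disk into an LVM-managed volume (example device /dev/sdb).
pvcreate /dev/sdb
vgcreate vg_service /dev/sdb
# Carve out a resizable volume for service data (size is a placeholder).
lvcreate -L 100G -n lv_services vg_service
mkfs.ext4 /dev/vg_service/lv_services
# Growing later is then a two-step operation instead of repartitioning:
#   lvextend -L +50G /dev/vg_service/lv_services
#   resize2fs /dev/vg_service/lv_services
```

The point of lvm here is exactly this flexibility: logical volumes can be grown or moved between disks without touching partition tables.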
  
==== DONE (as 9th Sep 2015) ====

  * 1) understand how to interact with existing grid storage present at Lunarc. Update: Lunarc suggested benchmarking the current connection. If it is not sufficient, interact with Jens to understand how a direct connection can be achieved. Lunarc only has data, no metadata or index.
    * Outcome: users prefer to interact directly with the data and move it onto the cluster storage. Use of FAX is not yet possible, both because it is an experimental technology and because FAX is only installed at Tier0 or Tier1 sites; the storage would have to be one of these.
  * 4) Start batch/grid job tests with users from particle physics and nuclear physics. In this phase we will be able to see how resource management should be done for optimal use of the cluster. Missing: ATLAS RTEs.
    * Outcome: not very successful, because the cluster is not a Tier2 and its connection to grid storage is not optimal. Disk problems also prevented optimal operation.
  * 5) Setup n1 and n2 as test nodes. This includes:
    * Configure direct connection of nodes to the internet. DONE
    * LDAP+OTP authentication on nodes. Status: sent domain name to Lunarc; received first bit of info from Lunarc. NOT DONE: impractical to use, and there was no time to set it up. To be rescheduled for the next iteration and the new Iridium nodes.
  
==== DONE (as 17th Mar 2014) ====

  * 3) install grid interfaces for researchers to run test grid jobs.

==== DONE (as 31st Jan 2014) ====

  * 2) install batch system.
    * add to salt config:
      * create folders for slurm
      * add slurm ports to iptables, wrt machines/services: 6817, 6818, 7321
    * set correct node values in ''slurm.conf''

==== DONE (as 29th Jan 2014) ====

  * Physical direct connection of nodes n1/n2 to the internet done by Robert
  * started batch system installation
    * created user for slurm
    * used n10 as guinea pig for configuration
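The iptables and ''slurm.conf'' steps above can be sketched as follows; the node names, counts, and memory values are hypothetical placeholders, not the real Iridium configuration:

```shell
# Open the SLURM ports listed above on each relevant machine.
for port in 6817 6818 7321; do
  iptables -A INPUT -p tcp --dport "$port" -j ACCEPT
done
service iptables save   # persist the rules on CentOS 6

# Example node/partition lines for slurm.conf (all values are placeholders):
# NodeName=n[1-12] CPUs=16 RealMemory=32000 State=UNKNOWN
# PartitionName=main Nodes=n[1-12] Default=YES State=UP
```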
  
==== DONE (as 15th Oct 2013) ====
  
  * Faulty PSU on node Chassis 1 replaced with Alarik spare. Waiting for replacement part to come.
  * Faulty motherboard on n11 was completely replaced.
  
----
  
===== Useful links =====

==== Hardware related ====
  * Information on the RAID controller on storage:
    * http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS
    * The utility mentioned there is painful to find on the LSI Logic website. I finally found it here:
      * http://www.lsi.com/support/Pages/download-results.aspx?keyword=megacli
    * Cheat sheet for the above utility:
      * http://erikimh.com/megacli-cheatsheet/
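For quick reference, a few typical read-only queries with the MegaCli utility linked above; the binary path is the usual 64-bit install location and may differ on our storage node:

```shell
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL    # adapter and firmware summary
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL  # logical drive status
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL        # physical disk status
```

These are all non-destructive queries, safe to run while checking for degraded arrays; see the cheat sheet above for rebuild and configuration commands.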
  
==== System related ====
  * CENTOS DHCP config: http://www.krizna.com/centos/install-configure-dhcp-server-centos-6/
  * :!: this is for NFS3. It would be better to use NFSv4 and limit portmapper to NIS. CENTOS NFS config: http://www.malaya-digital.org/setup-a-minimal-centos-6-64-bit-nfs-server/
  * lvm volume groups guide: https://www.centos.org/docs/5/html/Cluster_Logical_Volume_Manager/index.html
  * lxc containers creation: https://linuxcontainers.org/lxc/getting-started/
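A minimal container lifecycle following the lxc getting-started guide linked above; the container name and the download-template arguments (CentOS 6, amd64) are examples to adapt for the service machines:

```shell
# Create a container from the generic download template (example values).
lxc-create -n service1 -t download -- -d centos -r 6 -a amd64
lxc-start -n service1 -d       # start it in the background
lxc-attach -n service1         # get a shell inside the container
lxc-stop -n service1           # shut it down
```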
  
==== Software related ====
      * http://modules.sourceforge.net/
      * a howto: http://www.admin-magazine.com/HPC/Articles/Environment-Modules
    * SLURM: a batch system.
      * http://www.schedmd.com/slurmdocs/quickstart_admin.html
    * MUNGE: a way of authenticating across nodes, needed by SLURM
      * https://code.google.com/p/munge/wiki/InstallationGuide
iridium_cluster/wip.txt · Last modified: 2015/09/15 14:47 by florido