Center for High Throughput Computing

User News

Below is a list of important user news updates, sorted by date. Please stay up to date with the news that is relevant to you, as CHTC policy changes may affect your jobs.


Matlab licenses and compiling restored in CHTC

Tuesday, February 3, 2015, 2:00 PM

Dear Matlab users on CHTC compute systems,

Due to a Matlab license error, processes on CHTC systems that use Matlab licenses (such as compiling) were failing this morning. As of noon today, we have resolved the issue and all Matlab functionality should now be restored. If you are using CHTC's Matlab tools and see persistent issues, please let us know.

Thank you,
Your CHTC Team


Matlab compiling and licenses temporarily unavailable in CHTC

Tuesday, February 3, 2015, 9:30 AM

Greetings users of Matlab on CHTC compute systems,

We have confirmed a license error affecting use of the Matlab compilers on CHTC systems. Currently, any processes (including compiling) that rely on Matlab licenses will likely fail. We'll send a follow-up email when Matlab functionality has been restored, likely by the end of the day.

Thank you,
Your CHTC Team


All CHTC functionality restored!!

Thursday, January 29, 2015, 4:30 PM

Greetings!

The HTC submit node, submit-3, and all other CHTC compute systems are once again fully functional!!

Users may now return to all regular computational activity via the CHTC's HTC submit nodes and HPC cluster head nodes.

Thank you for your patience during the urgent reboot of all of CHTC's servers within the last day, and especially for the patience of those using submit-3.chtc.wisc.edu, which required some additional testing this afternoon.

Please continue to let CHTC staff know if/when you experience difficulties in using our compute systems. Hopefully, we won't have to bother you with emails for a while ...

Happy Computing,
Your CHTC Team


HTC submit-3 not fully functional yet

Thursday, January 29, 2015, 2:00 PM

Greetings,

After our reboot of submit-3 this morning, the server is back up and users can log in, but HTCondor (condor) commands and access to /squid and /mnt/gluster are NOT yet restored while some final tests are being performed.

We appreciate the patience of users while submit-3 functionality is being restored. We will send another email when the process has completed and users can access all submit-3 features, including HTCondor.

Thank you,
Your CHTC Team


HPC Cluster back online after reboot

Wednesday, January 28, 2015, 4:30 PM

The HPC Cluster is back online after the necessary reboot (see below).

As a reminder, ALL jobs from the HPC Cluster will need to be resubmitted, as they are lost from the SLURM queue upon reboot.

Thank you,
Your CHTC Team


IMPORTANT: All CHTC servers need immediate reboot TODAY!!!

Wednesday, January 28, 2015, 2:00 PM

ATTENTION CHTC Users!!

Due to newly released information about a critical vulnerability in the operating systems we use for CHTC compute servers,
ALL CHTC SERVERS NEED TO BE REBOOTED TODAY (see below).

For CHTC's HTC System (HTCondor Pool via submit nodes):

The process to reboot all servers has already begun, and will take place over the next 24 hours due to the large number of servers.

What HTC users can expect:

  • Temporary delays in access to submit servers during their reboot (planned for early tomorrow).
  • Interruption of running jobs as execute servers are automatically rebooted over the next 24 hours. Interrupted jobs WILL continue to be tracked and will be re-run by HTCondor.
  • Delays in the running of newly-submitted jobs until all reboots are complete.

For CHTC's HPC Cluster (via head node: aci-service-1.chtc.wisc.edu):

The HPC Cluster will be rebooted at 3pm today, and brought back ASAP after that point.

What HPC users can expect:

  • Loss of SSH access to cluster head nodes (aci-service-1/2) during the reboot.
  • JOBS WILL BE LOST AND NEED TO BE RESUBMITTED, as SLURM cannot recover jobs upon reboot.

CHTC staff will send emails when the reboot processes have completed and compute system functionality is restored. The security vulnerability applies to all RedHat-based Linux operating systems, including the Scientific Linux operating system we use in CHTC. The security of your work is of utmost importance to CHTC, and this specific vulnerability requires immediate action.

The security vulnerability and CHTC-wide reboot are completely unrelated to the previously described downtime for /mnt/gluster and high-memory servers in the HTC System that was necessary this morning. We apologize for any interruption to your CHTC research!

Thank you,
Your CHTC Team


/mnt/gluster and "mem" servers unavailable tomorrow, 8am-12pm

January 27, 2015

Greetings CHTC Users,

Due to late-notice maintenance in the building where some of our compute hardware is stored, the /mnt/gluster location and high-memory servers (mem1 and mem2) in CHTC's HTC System (HTCondor Pool) will be unavailable tomorrow, January 28, from 8am-12pm.

Potential effects that HTC users can anticipate:

  1. Users of /mnt/gluster and their jobs will be unable to access data in this location during the outage.
  2. Jobs already running on mem1 or mem2 will unfortunately be interrupted. HTCondor will keep such jobs in the queue and restart them on a new server as soon as a new job match can be made.
  3. All users who do not access /mnt/gluster or rely on our high-memory servers will be unaffected, and jobs submitted from CHTC submit nodes will otherwise continue to run as usual, with HTCondor automatically restarting jobs on new execute nodes if they are at all affected by the outage.
  4. CHTC's HPC Cluster will be completely unaffected by the temporary outage.

We apologize for the short notice (as it was short for us, as well). As always, please feel free to email chtc@xxxxxxxxxxx if you have any questions for us.

Best Wishes,
Your CHTC Team