Below is a list of important user news updates, sorted by date.
Please stay up to date with the news that is relevant to you,
as CHTC policy changes may affect your jobs.
For older updates not shown on this page, see our
user mailing list archives.
Brief outage of some HPC Cluster nodes yesterday
Friday, July 19, 2019
Due to power irregularities likely related to morning fires at two power stations in Madison, CHTC currently has multiple (but not all) servers down in the HTC System and HPC Cluster.
The HPC Cluster has roughly 50 nodes down due to a power-related issue with the cooling system in the cluster's server room (Discovery building). Jobs previously running on those servers will have failed. We may need to shut down more of the HPC Cluster if cooling issues persist, and we'll provide updates as things progress.
The HTC System's submit-2.chtc.wisc.edu submit server and multiple execute servers are down in several server rooms (Discovery and Computer Sciences), and some other group-specific submit servers may have been affected. Jobs running on the affected execute servers will have been interrupted, but will return to "Idle" status to re-run on another server. Jobs queued from submit-2 (or any other submit server with power loss) have been interrupted, but will similarly return to "Idle" to re-run once we have the submit-2 server stably rebooted. As with the HPC Cluster, we may need to take down additional execute or submit servers, and will provide updates as things progress.
Thank you for your patience. We hope you are safe following this morning's fires and that you are able to stay cool this weekend, given any persisting power outages in the city.
Please email us at email@example.com with any concerns or questions.
Your CHTC Team