Transfer Large Input Files Via Squid
Which Option is the Best for Your Files?
| Input Sizes | Output Sizes | Link to Guide | How to Transfer | Availability, Security |
| --- | --- | --- | --- | --- |
| 0 - 100 MB per file, up to 500 MB per job | 0 - 5 GB per job | Small Input/Output File Transfer via HTCondor | submit file; filename in | CHTC, UW Grid, and OSG; works for your jobs |
| 100 MB - 1 GB per repeatedly-used file | Not available for output | Large Input File Availability Via Squid | submit file; http link in | CHTC, UW Grid, and OSG; files are made *publicly-readable* via an HTTP address |
| 100 MB - TBs per job-specific file; repeatedly-used files > 1 GB | 4 GB - TBs per job | Large Input and Output File Availability Via Staging | job executable; copy or move within the job | a portion of CHTC; accessible only to your jobs |
SQUID Web Proxy
CHTC maintains a SQUID web proxy from which pre-staged input files and executables can be downloaded into jobs using CHTC's proxy HTTP address.
The SQUID web proxy is best for cases where many jobs will use the same large file (or a few files), including large software. It is a poor fit when each of many jobs needs a different large input file; in that case, use our large data staging location instead. If each job needs only part of a large file's data, you are always better off pre-splitting that file into smaller job-specific files. If each job needs a large set of many files, create a single .tar.gz file containing all of them; this file must still be smaller than 1 GB.
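The bundling advice above can be sketched as a small shell function. This is only an illustration: `bundle_for_squid` is a hypothetical helper, not a CHTC tool, and it assumes that the 1 GB limit means 2^30 bytes.

```shell
# Hypothetical helper (not a CHTC tool): bundle a directory into one .tar.gz
# and confirm the archive is under the 1 GB limit before copying it to /squid.
bundle_for_squid() {
    local src_dir="$1"    # directory of shared input files
    local archive="$2"    # archive to create, e.g. inputs.tar.gz
    local limit_bytes=$((1024 * 1024 * 1024))   # assumption: 1 GB = 2^30 bytes

    # Store paths relative to the parent so the archive unpacks into one directory
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1

    local size
    size=$(wc -c < "$archive")
    if [ "$size" -ge "$limit_bytes" ]; then
        echo "too large for SQUID: $archive is $size bytes" >&2
        return 1
    fi
    echo "ok: $archive is $size bytes"
}

# Usage: bundle_for_squid ./inputs inputs.tar.gz
```

If the function reports that the archive is too large, splitting the inputs into job-specific pieces or using the large data staging location is likely the better route.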
- Access to SQUID:
Access is granted upon request to email@example.com. A user on the CHTC submit servers will be granted a user directory within /squid, into which data should be transferred via the CHTC transfer server (transfer.chtc.wisc.edu). As with all CHTC file space, users should minimize the amount of data on the SQUID web proxy and should clean files from the /squid location regularly. CHTC staff reserve the right to remove any file from /squid when needed to preserve availability and performance for all users.
Files placed on the SQUID web proxy can be downloaded by jobs running anywhere, because the files are world-readable.
- Limitations and Policies:
- SQUID cannot be used for job output, as there is no way to change files in SQUID from within a job.
- SQUID is also only capable of delivering individual files up to 1 GB in size.
- A change you make to a file within your /squid directory may not take effect immediately on the SQUID web proxy if you keep the same filename. Therefore, it is important to use a new filename when replacing a file there.
- Jobs should still ALWAYS and ONLY be submitted from within the
- Only the "http" address should be listed in the "transfer_input_files" line of the submit file. File locations starting with "/squid" should NEVER be listed in the submit file.
- Users should only keep data in /squid that is being used for
currently-queued jobs; CHTC provides no backups of any data in
CHTC systems, and our staff reserve the right to remove any data
causing issues. It is the responsibility of users to keep copies
of all essential data in preparation for potential data loss or
file system corruption.
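Because a replaced file may be served stale when its name is unchanged, one way to follow the new-filename rule is to add a version tag to the name before copying it into /squid. The function below is a minimal sketch; `versioned_name` is a hypothetical helper, and it assumes a bare filename (no directory part) whose extension starts at the first dot.

```shell
# Hypothetical helper: insert a version tag before the extension so that each
# replacement file gets a name the proxy has not served before.
versioned_name() {
    local file="$1"               # bare filename, e.g. large_file.tar.gz
    local tag="$2"                # any unique tag, e.g. v2 or a date stamp
    local base="${file%%.*}"      # text before the first dot
    local ext="${file#"$base"}"   # everything from the first dot onward
    echo "${base}_${tag}${ext}"
}

# Usage: cp large_file.tar.gz "/squid/username/$(versioned_name large_file.tar.gz v2)"
```

The submit file then needs to reference the new name in its http address.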
- Data Security:
Files placed in SQUID can be edited only by the owner of the user directory within /squid, but they become world-readable on the SQUID web proxy so that jobs can download them (given the proper HTTP address). Large files that must remain private should therefore not be placed in your /squid user directory; use CHTC's large data staging space for them instead.
2. Using SQUID to Deliver Input Files
Request a directory in SQUID. Write to firstname.lastname@example.org describing the data you'd like to place in SQUID, and include your username and submit server hostname (e.g., submit-5.chtc.wisc.edu).
Place files within your /squid/username directory via a CHTC transfer server (if copying from your laptop/desktop) or directly on the submit server.
From your laptop/desktop:
[username@computer]$ scp large_file.tar.gz username@transfer.chtc.wisc.edu:/squid/username/
If the file already exists within your /home directory on a submit server:
[username@submit]$ cp large_file.tar.gz /squid/username/
Check the file from the submit server:
[username@submit]$ ls /squid/username/
Have HTCondor download the file into the job's working directory by using the http://proxy.chtc.wisc.edu/SQUID address in the transfer_input_files line of your submit file:
transfer_input_files = other_file1,other_file2,http://proxy.chtc.wisc.edu/SQUID/username/large_file.tar.gz
Important: Make sure to replace "username" with your username in the above address. All other files should be staged before job submission.
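To avoid leaving the literal "username" placeholder in the address, the submit file can be generated with the username filled in. The following is a minimal sketch under stated assumptions: `make_submit_file` is a hypothetical helper, and `run_job.sh`, the log/output names, and the resource requests are illustrative values rather than CHTC recommendations.

```shell
# Hypothetical helper: print a minimal submit file whose SQUID address is built
# from the given username and filename, so no placeholder is left behind.
make_submit_file() {
    local user="$1"         # your CHTC username
    local large_file="$2"   # file already placed in your /squid directory
    cat <<EOF
# Minimal sketch; run_job.sh and the resource requests are assumptions
executable = run_job.sh
log = job.log
output = job.out
error = job.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = other_file1,other_file2,http://proxy.chtc.wisc.edu/SQUID/${user}/${large_file}

request_cpus = 1
request_memory = 1GB
request_disk = 4GB

queue
EOF
}

# Usage: make_submit_file "$USER" large_file.tar.gz > job.sub
```

Note that only the http address appears in the generated file; no path beginning with /squid is listed.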
If your large file is a .tar.gz file that untars into other files, remember to remove those files before the end of the job; otherwise, HTCondor will treat them as new output that needs to be transferred back to the submit server. (HTCondor will not automatically transfer back directories.)
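The cleanup step can be sketched as a wrapper inside the job's executable. This is one possible approach, not an official CHTC script: `unpack_run_cleanup` is a hypothetical function that records the archive's top-level entries, extracts them, runs the given command, and then deletes what was extracted.

```shell
#!/bin/bash
# Hypothetical wrapper: unpack an archive, run a command, then remove the
# extracted entries so HTCondor does not treat them as new output.
unpack_run_cleanup() {
    local archive="$1"
    shift
    local listing status=0

    # Top-level names stored in the archive (files or directories)
    listing=$(tar -tzf "$archive" | sed 's|^\./||' | cut -d/ -f1 | sort -u)

    tar -xzf "$archive" || return 1

    # Run the real work, remembering its exit code for later
    "$@" || status=$?

    # Remove everything the archive unpacked, even if the command failed
    while IFS= read -r entry; do
        if [ -n "$entry" ]; then
            rm -rf -- "./$entry"
        fi
    done <<< "$listing"

    return "$status"
}

# Usage: unpack_run_cleanup large_file.tar.gz ./my_program input_arg
```

The wrapper returns the command's own exit code, so a failure in the real work is still visible to HTCondor after the cleanup has run.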