BYU

Office of Research Computing

Rclone

Rclone allows one to move files and directories to and from cloud storage via the command line. In combination with box.byu.edu, where BYU students and faculty get unlimited free storage, it can make storing and backing up archival data much easier. Rclone+Box will help users who routinely run up against storage space constraints and who wish to back up data that can only fit in compute. Those who wish to collaborate without making others get ORC accounts can upload to Box with Rclone, then share their data with collaborators (even if those collaborators don't have Box accounts).

This tutorial will show how to configure Rclone with Box, a few of the most useful commands, and a couple of worked examples. It is by no means comprehensive, so those wanting to learn more should reference the documentation, which is excellent.

Note that while the storage on Box is unlimited, expansive storage comes at a cost: Box is slow (especially with small files), so it can take a while to move big chunks of data; if you have many small files, we recommend backing up with Kopia, which aggregates small files and will significantly speed up transfer. Additionally, files stored on Box cannot exceed 32 GB in size, although rclone easily works around this with its chunker overlay.

To use Rclone on the supercomputer, you'll need to load the rclone environment module with module load rclone.

Configuration

Keep in mind that Rclone need only be configured once--as soon as you've finished the steps below, you should never need to do so again as long as you use it at least monthly.

rclone authorize box

To use Rclone with Box, one needs to get an authorization token, which acts somewhat like a password and allows Rclone to access your files. You'll need a visual environment with a web browser to get this token; we recommend using viz. Open a terminal and run module load rclone && rclone authorize box. Firefox should open, where you can log in with single sign on and get the token. If you're on the supercomputer, you should now be able to access box with rclone--you can confirm with rclone lsd box:.

After you've run rclone authorize box, you can move on to configuration; the easiest way is to use our default rclone.conf, but you can also run rclone config manually if you want more customization.

The Easy Way

Rclone stores its data in a configuration file located at ~/.config/rclone/rclone.conf. Since most people will use roughly the same configuration to access files on Box, we provide a default rclone.conf here:

[boxRaw]
type = box
token = PASTE_TOKEN_HERE

[box]
type = chunker
remote = boxRaw:
chunk_size = 30G
hash_type = sha1

This describes 2 "remotes": your Box storage (the [boxRaw] section), and a chunker remote that overlays your Box storage (the [box] section). This chunker remote allows files greater than 32GB to be stored on Box by transparently splitting such files (see the Chunker section below).

Now that you've authorized Box, ~/.config/rclone/rclone.conf should exist. Replace its contents with the file above, substituting the token from your original file for PASTE_TOKEN_HERE.

In this config file, the chunker remote uses the base Box directory; if you want to sequester it in a particular directory, just put the directory name after the colon, e.g. remote = boxRaw:ORCChunkerRemote.
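The replacement step can also be scripted. What follows is a minimal sketch rather than an official ORC tool: the helper names extract_token and write_rclone_conf are made up for illustration, and it assumes your existing file contains a single "token = ..." line.

```shell
# Hypothetical helpers (not part of rclone): rebuild rclone.conf from the
# default two-remote layout while keeping the token already stored in the file.

# Print the value of the first "token = ..." line in an rclone config file.
extract_token() {
    sed -n 's/^token = //p' "$1" | head -n 1
}

# write_rclone_conf TOKEN [CONF_PATH]
# Writes the [boxRaw] and [box] sections shown above, with TOKEN filled in.
write_rclone_conf() {
    local token="$1"
    local conf="${2:-$HOME/.config/rclone/rclone.conf}"
    mkdir -p "$(dirname "$conf")"
    cat > "$conf" <<EOF
[boxRaw]
type = box
token = $token

[box]
type = chunker
remote = boxRaw:
chunk_size = 30G
hash_type = sha1
EOF
    # The token grants access to your Box files, so keep the file private.
    chmod 600 "$conf"
}

# Example (uncomment to run against your real config; a backup is made first):
# conf="$HOME/.config/rclone/rclone.conf"
# token="$(extract_token "$conf")"
# cp "$conf" "$conf.bak"
# write_rclone_conf "$token" "$conf"
```

Keeping a .bak copy means you can restore the original file if the token was not picked up as expected.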

rclone config

This is an alternative to using our sample rclone.conf; it will allow you to choose different names and options than those we provide as default.

To access Rclone, log in to the supercomputer and load the rclone module:

module load rclone

Once that's done, run rclone config. This will give you a few options:

No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n

Enter n to make a new remote. Give it a name (e.g. boxRaw), then choose which storage service you'd like to configure (you can type box for box.byu.edu, drive for Google Drive, etc.).

It'll ask for Box App Client Id and Box App Client Secret; most users should simply hit enter to leave these blank. You'll then be asked if you want to "Edit advanced config" (most users should enter n):

Edit advanced config? (y/n)
y) Yes
n) No
y/n> n

Next, you will be asked whether to use auto config:

Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes
n) No
y/n> n

Since you are working on a remote machine, enter n. You will then be presented with a message prompting you to run rclone authorize "box" on your local machine:

For this to work, you will need rclone available on a machine that has a web browser available.
Execute the following on your machine:
    rclone authorize "box"
Then paste the result below:
result>

Run rclone authorize "box" in a command prompt on your local machine (see rclone authorize box above) and paste the authorization information (of the form {"access_token":"XXXXX","token_type":"bearer","refresh_token":"XXXXX","expiry":"2019-01-01T00:00:00-06:00"}) after result> on the remote terminal:

result> {"access_token":"XXXXX","token_type":"bearer","refresh_token":"XXXXX","expiry":"2019-01-01T00:00:00-06:00"}
--------------------
[boxRaw]
type = box
client_id = 
client_secret = 
token = {"access_token":"XXXXX","token_type":"bearer","refresh_token":"XXXXX","expiry":"2019-01-01T00:00:00-06:00"}
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y

After entering y, you're finished configuring Rclone to work with Box. You'll almost surely want to set up a chunker overlay (see Chunker below) and, if you're working with sensitive data, you may want to set up an encrypted overlay on top of that (see Crypt below).
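Before moving on, it can be worth confirming that the remote was actually saved. The sketch below relies on rclone listremotes, which prints each configured remote on its own line followed by a colon. The function name remote_exists is our own, and boxRaw is the example name used above.

```shell
# remote_exists NAME: succeed if rclone has a remote called NAME configured.
# `rclone listremotes` prints one remote per line with a trailing colon,
# so NAME is matched against whole lines of the form "NAME:".
remote_exists() {
    rclone listremotes | grep -qx "$1:"
}

# Example: only try to list directories if the remote is there.
# remote_exists boxRaw && rclone lsd boxRaw:
```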

Special Remotes

Chunker

The chunker overlay automatically splits files larger than a certain threshold on upload, which allows us to use Box to store files larger than 32GB. When downloading, the split files are automatically rejoined. We strongly recommend using chunker--it has no real downsides and preempts the frustration of cryptic upload errors when one tries to move a big file up to Box.

Big files are split into pieces transparently. If you download them outside of the chunker overlay, and thus don't get the benefit of automatic concatenation, you can use cat to put them together. With chunker's default settings, split files are named filename.rclone_chunk.001, filename.rclone_chunk.002, etc., so a simple cat filename.rclone_chunk.* > filename is all that's needed to combine them. Note that you don't need to do this if you access the files via your chunker remote.
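If you do end up rejoining chunks by hand, a quick size check guards against a missing piece. This is a minimal sketch: join_chunks is a hypothetical helper, not an rclone command, and it only confirms that the output is as large as the chunks combined. It does not verify checksums.

```shell
# join_chunks OUTPUT CHUNK...
# Concatenate the chunk files in the order given, then confirm that the
# result is exactly as large as all of the chunks combined.
join_chunks() {
    local out="$1"
    shift
    cat "$@" > "$out" || return 1

    local expected=0 size chunk
    for chunk in "$@"; do
        size=$(wc -c < "$chunk")
        expected=$((expected + size))
    done
    [ "$(wc -c < "$out")" -eq "$expected" ]
}
```

Pass the chunks with a glob: the shell expands globs in sorted order, so zero-padded chunk numbers arrive in sequence.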

You can copy the [box] section of the example config file in The Easy Way above into your ~/.config/rclone/rclone.conf to create a chunker remote.

Crypt

The crypt overlay encrypts files in a given remote directory such that they cannot (assuming a good password) be read even if someone obtains the files; a downside of this is that you can't simply download the files from box.byu.edu and use them directly. As such, we don't recommend the crypt overlay for most users. If you do decide to use it, make sure to check whether such a setup satisfies any regulations your data is governed by.

To set up a crypt remote, create a remote directory for it (e.g. box:crypt-remote), run rclone config, choose a name (e.g. boxCrypt), crypt storage type, then follow the prompts to finish configuring. File name obfuscation has slight usability disadvantages, but "standard" obfuscation is the most secure. You'll choose a password and a salt (please don't neglect the salt), and your crypt remote is configured.

Usage

This tutorial will only cover the basics due to the clarity and breadth of Rclone's exceptional documentation, which should be your first resource when learning its usage. Type rclone --help to see a good synopsis of each command. For help on a specific command, you can also use rclone <command> --help (e.g. rclone copy --help).

Listing files

Rclone gives a few methods for listing files; none of them are quite like Unix's ls, but rclone lsf --max-depth 1 remote:path/to/dir comes close. A few more examples:

# Recursively list all files at "box" remote
rclone ls box:

# Show directories in "mydir" at "box"
rclone lsd box:mydir

# Recursively list files in "mydir/dir1" at "box" with more detail
rclone lsl box:mydir/dir1
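The listing commands are easy to wrap in small helpers. The sketch below is illustrative only: the function names are our own, and box:mydir stands in for whatever path you use. It wraps the ls-like form mentioned above and adds a file counter, using the -R (recursive) and --files-only flags of rclone lsf.

```shell
# ls_remote REMOTE:PATH
# One-level listing, the closest match to plain `ls`.
ls_remote() {
    rclone lsf --max-depth 1 "$1"
}

# count_remote_files REMOTE:PATH
# Number of files (directories excluded) anywhere under PATH.
count_remote_files() {
    rclone lsf -R --files-only "$1" | wc -l | tr -d ' '
}

# Examples:
# ls_remote box:mydir
# count_remote_files box:mydir
```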

Moving and Copying

rclone copy and rclone move behave essentially like Unix's cp and mv, respectively; you can copy and move to or from the remote. Example usage:

# Copy remote file, mydata.txt, from "mydir" at "box"
rclone copy box:mydir/mydata.txt $HOME/data/

# Move a tarball from compute to "mydir/compute-backup" at "box"
rclone move ~/compute/my-tarball.tar.gz box:mydir/compute-backup
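For transfers you care about, it may help to preview and then verify the copy. The following is a sketch for copying directories, not a recommended ORC procedure: careful_copy is a hypothetical wrapper, while --dry-run, --progress, and rclone check --one-way are standard rclone options.

```shell
# careful_copy SRC DST
# 1. Preview the transfer with --dry-run (nothing is copied).
# 2. Copy for real, showing live progress.
# 3. Confirm that every source file now exists, unchanged, at the destination.
careful_copy() {
    local src="$1" dst="$2"
    rclone copy --dry-run "$src" "$dst" || return 1
    rclone copy --progress "$src" "$dst" || return 1
    # --one-way ignores extra files that exist only at the destination.
    rclone check --one-way "$src" "$dst"
}

# Example:
# careful_copy "$HOME/data" box:mydir/data
```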

Creating Directories

rclone mkdir behaves like Unix's mkdir; to create a new directory on a remote, you would use something like:

rclone mkdir box:mydir/myNewDirectory

Examples

Move Archival Data to Box

Say you have a directory, ~/compute/dataset, with data that needs to be kept, but you don't expect to do any work on it with the supercomputer and you're running out of space. You can either move it directly, or consolidate and compress it then move it. Moving it directly is easier and you'll be able to look at the data directly at box.byu.edu, but compressing then moving could be faster.

Generally, if you have a few big files you won't be slowed down too much by copying directly, but if you have many small files it will take a long time. Under ideal conditions, you can copy 4 files per second (across all processes--Box limits transfers by user). If you have a million files, that works out to 250,000 seconds, so the transfer will take at least about three days no matter how small each file is.
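To decide between the two approaches, you can count your files and apply the 4-files-per-second figure. The helpers below are a rough sketch with made-up names; the estimate is a lower bound that ignores file sizes entirely.

```shell
# count_files DIR
# Number of regular files anywhere under a local directory.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# estimate_hours FILE_COUNT [FILES_PER_SECOND]
# Lower bound on transfer time in whole hours (rounded down). The default
# rate of 4 is the ideal-case figure quoted above; real rates may be lower.
estimate_hours() {
    local file_count="$1"
    local files_per_second="${2:-4}"
    echo $(( file_count / files_per_second / 3600 ))
}

# Example:
# estimate_hours "$(count_files ~/compute/dataset)"
# estimate_hours 1000000    # prints 69, i.e. just under three days
```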

To move without compressing, simply use:

rclone move ~/compute/dataset box:mydir/dataset

There are two main ways to consolidate and compress then move data. This one is slower and more reliable:

tar -czf dataset.tar.gz ~/compute/dataset
# moveto keeps the given file name; plain "move" would treat the destination as a directory
rclone moveto dataset.tar.gz box:mydir/dataset.tar.gz

This one is faster and doesn't use significant disk space, but the work will be lost if the command is interrupted:

tar -czf - ~/compute/dataset | rclone rcat box:mydir/dataset.tar.gz
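If you take the slower, more reliable route, you may want to confirm the tarball is readable before the originals are removed. The sketch below assumes GNU or BSD tar; make_archive is a hypothetical helper name.

```shell
# make_archive SRC_DIR ARCHIVE
# Create a gzip-compressed tarball of SRC_DIR and confirm it can be read back.
make_archive() {
    local src="$1" archive="$2"
    # -C switches to the parent directory first, so the archive stores
    # "dataset/..." rather than the full path to the directory.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Listing the archive makes tar read the whole file, so a truncated or
    # corrupt tarball causes this step to fail.
    tar -tzf "$archive" > /dev/null
}

# Example: upload only if the archive was created and verified.
# make_archive ~/compute/dataset dataset.tar.gz &&
#     rclone move dataset.tar.gz box:mydir/
```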

Back up compute with Box

Before backing up directly with Rclone, consider using a full-featured backup tool like Kopia. For most use cases, it will be faster, use less space, and be more secure. The main advantage of using Rclone directly is that your directory structure will be mirrored on Box.

Perhaps you have a large set of data in ~/compute/dataset, which is too big to fit in your home directory, that you would like to back up weekly. Say you set up the following directory structure to store the backups:

box:
'-- backup
    '-- dataset
        '-- old
        '-- primary

...by running:

rclone mkdir box:backup
rclone mkdir box:backup/dataset
rclone mkdir box:backup/dataset/old
# primary will be created by the copy

The current backup will live at box:backup/dataset/primary, while older snapshots, organized by date, will go in box:backup/dataset/old/. To get started, let's copy dataset over to the current backup directory at box:backup/dataset/primary:

rclone copy ~/compute/dataset box:backup/dataset/primary

Keep in mind that Box is slow, so this may take some time. If you want to exit your ssh session while the copy is going, you may want to use screen or tmux to make the transfer.

Once the copy is done, you'll need to back up every week (or however frequently you would like to). This could go something like:

module load rclone
PRIMARY="box:backup/dataset/primary"
# --backup-dir must be on the same remote as the destination, so include the "box:" prefix
OLD="box:backup/dataset/old/dataset-$(date +%F_%H-%M)"
rclone sync "$HOME/compute/dataset" "$PRIMARY" --backup-dir "$OLD"

If you want to do this regularly, you can put it in a script and run it at your convenience; you can also use cron to run it automatically at regular intervals. To make the script (we'll call it do_rclone_backup.sh) execute weekly, use crontab -e to edit your crontab and enter something along the lines of 0 X * * Y bash /path/to/do_rclone_backup.sh (replacing X with an hour, 0-23, and Y with a day of the week, 0-6, where 0 is Sunday). Your backup script will now run once a week with no intervention from you. This tutorial goes into more depth in case you want to back up more or less frequently or would like to learn more about cron generally.
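One possible shape for do_rclone_backup.sh is sketched below. The function name run_backup and its arguments are illustrative. The paths are the ones used in this example, and the backup directory is written with the box: prefix so that it sits on the same remote as the destination.

```shell
# Sketch of do_rclone_backup.sh. Load the rclone module first
# (module load rclone), as in the commands above.

# run_backup SOURCE_DIR REMOTE_BASE
# Syncs SOURCE_DIR to REMOTE_BASE/primary. Files that the sync would
# overwrite or delete are moved to REMOTE_BASE/old/dataset-<timestamp>.
run_backup() {
    local source_dir="$1" remote_base="$2"
    local stamp
    stamp="$(date +%F_%H-%M)"
    rclone sync "$source_dir" "$remote_base/primary" \
        --backup-dir "$remote_base/old/dataset-$stamp"
}

# Uncomment to perform the backup when the script runs:
# run_backup "$HOME/compute/dataset" box:backup/dataset

# Example crontab entry (02:00 every Sunday):
# 0 2 * * 0 bash /path/to/do_rclone_backup.sh
```

Because each run uses a fresh timestamped directory under old/, earlier snapshots are never overwritten by later ones.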