Rclone

Rclone allows one to sync files and directories to and from cloud storage via the command line. In combination with box.byu.edu, where BYU students and faculty get unlimited free storage, it can make storing and backing up archival data much easier. Rclone+Box will help users who routinely run up against storage space constraints and who wish to back up data that can only fit in compute. Those who wish to collaborate without making others get FSL accounts can upload to Box with Rclone, then share their data with collaborators (even if those collaborators don't have Box accounts).

This tutorial shows how to configure Rclone with Box, covers a few of the most useful commands, and walks through a couple of worked examples. It is by no means comprehensive, so those wanting to learn more should consult the Rclone documentation, which is excellent.

Note that while storage on Box is unlimited, it comes at a cost: Box is slow, so moving big chunks of data takes a while. Additionally, files stored there cannot exceed 32 GB in size.

Configuration

Keep in mind that Rclone need only be configured once--as soon as you've finished the steps below, you should never need to do so again.

Port Forwarding

When setting up Box, Rclone needs access to a web browser; since the supercomputer doesn't have one, you'll need to forward one of its ports to your computer. To do so, add an -L option to your ssh command:

ssh -L localhost:53682:localhost:53682 username@ssh.fsl.byu.edu

This tells ssh to make a tunnel, allowing your local machine (and its browser) to access the port Rclone will use for configuration. This tunnel is no longer required after Rclone is configured, so you needn't add '-L localhost:53682:localhost:53682' on subsequent logins.

rclone config

To access Rclone itself, load the rclone module:

module load rclone

Once that's done, run rclone config. This will give you a few options:

No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n

Enter n to make a new remote. Give it a name (e.g. box), then choose which storage service you'd like to configure (you can type box for box.byu.edu, drive for Google Drive, etc.).

It'll ask for Box App Client Id and Box App Client Secret; most users should simply hit enter to leave these blank. You'll then be asked if you want to "Edit advanced config" (most users should enter n):

Edit advanced config? (y/n)
y) Yes
n) No
y/n> n

Next, you will be asked whether to use auto config:

Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes
n) No
y/n> y

Even though you are on a remote machine, you'll still say yes--your computer has access to the remote port that Rclone is about to use, so you are effectively using a local machine. When you enter y, you'll see the following:

If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth
Log in and authorize rclone for access
Waiting for code...

Ctrl-click the link or copy-paste it into your browser. If you're not logged in to Box, it will ask for your credentials; use yournetid@byu.edu for the email address. You'll then see a screen with a big blue Grant access to Box button--click it, and you should be greeted with a success message. Go back to the terminal and type y at the prompt:

Got code
--------------------
[box]
type = box
token = {"access_token":"XXXXXXXX", "token_type":"bearer", "refresh_token":"XXXXXXXX", "expiry":"2019-01-01T12:00:00.000000000-06:00"}
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y

...and you're finished configuring Rclone to work with Box.
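
Before moving any data, it can be worth confirming that the new remote is actually registered. The following is a minimal sketch, assuming the remote was named box as in the example above; remote_is_configured is a hypothetical helper name, not part of Rclone. It relies on rclone listremotes, which prints one "name:" per line.

```shell
# Hypothetical helper: succeeds if a remote with the given name
# appears in the output of `rclone listremotes`.
remote_is_configured() {
    rclone listremotes | grep -qxF "$1:"
}

# Example use (assumes the remote is called "box"):
#   remote_is_configured box && rclone lsd box:
```

If the check succeeds, rclone lsd box: should list the top-level directories of your Box account.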

Usage

This tutorial will only cover the basics due to the clarity and breadth of Rclone's exceptional documentation, which should be your first resource when learning its usage. Typing rclone --help will result in a deluge of information, but the "Available Commands" section of the help message gives a good synopsis of each command. For help on a specific command, you can also use rclone <command> --help (e.g. rclone copy --help).

Listing files

Rclone gives a few methods for listing files; none of them are quite like Unix's ls, but rclone lsf --max-depth 1 remote:path/to/dir comes close. A few more examples:

# Recursively list all files at "box" remote
rclone ls box:

# Show directories in "fsl" at "box"
rclone lsd box:fsl

# Recursively list files in "fsl/dir1" at "box" with more detail
rclone lsl box:fsl/dir1
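
If you use the lsf form mentioned above often, a small wrapper can save typing. This is only a sketch: rls is a hypothetical function name, and defaulting to the "box" remote is an assumption based on the name chosen during configuration.

```shell
# Hypothetical wrapper: list only the immediate contents of a remote
# directory, similar to a plain `ls`. `rclone lsf` marks directories
# with a trailing slash.
rls() {
    rclone lsf --max-depth 1 "${1:-box:}"
}

# Example use:
#   rls box:fsl
```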

Moving and Copying

rclone copy and rclone move behave essentially like Unix's cp and mv, respectively; you can copy and move to or from the remote. Example usage:

# Copy remote file, mydata.txt, from "fsl" at "box"
rclone copy box:fsl/mydata.txt $HOME/data/

# Move a tarball from compute to "fsl/compute-backup" at "box"
rclone move ~/compute/my-tarball.tar.gz box:fsl/compute-backup
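
Since transfers to Box are slow, it may help to preview a copy before starting it and to verify it afterwards. The sketch below combines the --dry-run flag and the rclone check command, both standard Rclone features; the helper name safe_copy and its argument order are illustrative, and the example assumes a directory source.

```shell
# Hypothetical helper: preview a copy, perform it, then verify it.
safe_copy() {
    sc_src=$1
    sc_dst=$2
    # Show what would be transferred without changing anything
    rclone copy --dry-run "$sc_src" "$sc_dst" || return 1
    # Do the actual transfer
    rclone copy "$sc_src" "$sc_dst" || return 1
    # Compare source and destination
    rclone check "$sc_src" "$sc_dst"
}

# Example use:
#   safe_copy ~/compute/dataset box:fsl/dataset
```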

Creating Directories

rclone mkdir behaves like Unix's mkdir; to create a new directory on a remote, you would use something like:

rclone mkdir box:fsl/myNewDirectory

Examples

Move Archival Data to Box

Say you have a directory with data that needs to be kept, but you don't expect to do any work on it with the supercomputer, and you're running out of space. You can either move it directly, or compress it and move it. Moving it directly is easier and you'll be able to look at the data directly at box.byu.edu, but compressing then moving could be much faster.

Generally, if you have a few big files (which must be under 32 GB, of course) you won't be slowed down too much by copying directly, but if you have many small files it will take a long time. Under ideal conditions, you can copy 4 files per second (across all processes--Box limits transfers by user). If you have a million files, that means it will take at least a few days to transfer them, no matter how small they each are.
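
The arithmetic behind that estimate can be sketched as follows. The rate of 4 files per second is the figure quoted above and is treated here as an adjustable assumption; estimate_hours and count_files are hypothetical helper names.

```shell
# Rough lower bound on transfer time, assuming the ~4 files/second
# ceiling described above.
FILES_PER_SECOND=4

# $1 = number of files; integer division rounds down
estimate_hours() {
    echo $(( $1 / FILES_PER_SECOND / 3600 ))
}

# Count regular files under a local directory
count_files() {
    find "$1" -type f | wc -l
}

# Example use:
#   estimate_hours "$(count_files ~/compute/dataset)"
# For 1,000,000 files: 1000000 / 4 = 250000 seconds, about 69 hours
```

At roughly 69 hours for a million files, a directory with many small files is usually a good candidate for the compress-then-move approach.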

To move without compressing, simply use:

rclone move ~/compute/dataset box:fsl/dataset

There are two main ways to compress then move data. This one is slower and more reliable:

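# -C stores paths relative to ~/compute rather than the full home path
tar -czf dataset.tar.gz -C ~/compute dataset
# moveto names the destination file; plain move would treat the path as a directory
rclone moveto dataset.tar.gz box:fsl/dataset.tar.gz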

This one is faster and doesn't use significant disk space, but the work will be lost if the command is interrupted:

tar -czf - ~/compute/dataset | rclone rcat box:fsl/dataset.tar.gz
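
Because Box rejects files over 32 GB, it may be worth checking a tarball's size before uploading it with the first approach. The following is a minimal sketch; fits_on_box is a hypothetical helper, and reading "32 GB" as 32 * 1000^3 bytes is an assumption (the conservative interpretation).

```shell
# Per-file limit as stated above; the exact byte count is an assumption.
MAX_BYTES=$((32 * 1000 * 1000 * 1000))

# Hypothetical helper: succeeds if the file is small enough to upload
fits_on_box() {
    fob_size=$(wc -c < "$1") || return 1
    [ "$((fob_size))" -le "$MAX_BYTES" ]
}

# Example use:
#   fits_on_box dataset.tar.gz || echo "dataset.tar.gz is too large for Box"
```

If the check fails, one option is to split the dataset into several smaller tarballs.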

Backup compute with Box

Perhaps you have a large set of data in ~/compute/dataset, which is too big to fit in your home directory, that you would like to back up weekly. Say you set up the following directory structure to store the backups:

box:fsl
'-- backup
    '-- dataset
        '-- old
        '-- primary

...by running:

rclone mkdir box:fsl/backup
rclone mkdir box:fsl/backup/dataset
rclone mkdir box:fsl/backup/dataset/old
# primary will be created by the copy

The current backup will live at box:fsl/backup/dataset/primary, while older snapshots, organized by date, will go in box:fsl/backup/dataset/old/. To get started, let's copy over dataset to the current backup directory at box:fsl:

rclone copy ~/compute/dataset box:fsl/backup/dataset/primary

Keep in mind that Box is slow, so this may take some time. If you want to exit your ssh session while the copy is going, you may want to use screen or tmux.

Once the copy is done, you'll need to back up every week (or however frequently you would like to). This could go something like:

PRIMARY=box:fsl/backup/dataset/primary
# --backup-dir must be on the same remote as the destination, so include the "box:" prefix
OLD=box:fsl/backup/dataset/old/dataset-$(date +%F-%R)
screen -dm rclone sync "$HOME/compute/dataset" "$PRIMARY" --backup-dir "$OLD"
# using `screen -dm ...` means that rclone will keep going even if you log out

If you want to do this regularly, you can put this in a script and run it at your convenience.
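
One way such a script might look is sketched below. The variable names SRC, PRIMARY, and OLD_ROOT and the function name backup_dataset are illustrative; the paths follow the directory structure used in this example and can be overridden from the environment.

```shell
# Sketch of a reusable backup script; adjust the paths to your own layout.
SRC=${SRC:-$HOME/compute/dataset}
PRIMARY=${PRIMARY:-box:fsl/backup/dataset/primary}
OLD_ROOT=${OLD_ROOT:-box:fsl/backup/dataset/old}

backup_dataset() {
    # Files that the sync would overwrite or delete are moved into a
    # dated directory on the same remote instead of being discarded.
    bd_stamp=$(date +%F-%R)
    rclone sync "$SRC" "$PRIMARY" --backup-dir "$OLD_ROOT/dataset-$bd_stamp"
}

# Example use (inside screen or tmux so it survives logout):
#   backup_dataset
```

Calling backup_dataset at the end of the script makes each run produce one dated snapshot directory under old.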