SHARK-ALICE Data Transfer

Last modified by Kerwin Olfers on 2023/12/13 11:55

Introduction

Below are instructions for various ways to transfer data between SHARK and ALICE. Rsync, used after first setting up SSH keys, is currently the recommended method. When in doubt, please contact labsupport@fsw.leidenuniv.nl.

Note: do not remove your data on SHARK before validating whether the data has arrived on ALICE properly.

Setting up SSH Keys

  1. Set up SSH public/private keys. This saves you from having to enter your password manually every time you connect from SHARK to ALICE. Each user on SHARK and ALICE has keys autogenerated for them, but a 'private' set of keys is recommended. To do that:
    • Open a terminal on SHARK and run:
      ssh-keygen
    • When prompted for the file location, leave it empty (just press enter) to store the keys in the default path. (If you already have keys that you wish to keep using, you have to set a custom path for the new keys instead!)
    • When prompted for a passphrase, leave it empty.
    • Now run the following command, replacing [USER] with your ALICE username, and enter your ALICE password when prompted:
      ssh-copy-id  [USER]@ssh-gw.alice.universiteitleiden.nl
      and then (one line!)
      ssh-copy-id -o "ProxyJump [USER]@ssh-gw.alice.universiteitleiden.nl" [USER]@login2.alice.universiteitleiden.nl
    • Make sure you can now log in from SHARK to ALICE without having to type in a password, by running:
      ssh -o "ProxyJump [USER]@ssh-gw.alice.universiteitleiden.nl" [USER]@login2.alice.universiteitleiden.nl
  2. [OPTIONAL] Set up an SSH config file, so that you can use a shorter ssh command in the future. To do this:
    1. Open the config file "~/.ssh/config" (or create a file with that name if it doesn't exist yet) in an editor and add the following text, replacing <user-name> with your ALICE username:

      Host alice2
      HostName login2.alice.universiteitleiden.nl
      User <user-name>
      ProxyJump <user-name>@ssh-gw.alice.universiteitleiden.nl
      ServerAliveInterval 60
    2. Now test whether you can connect to ALICE by running:
      ssh alice2
    3. If this works, then in all further steps in the guide below you can replace all instances of
      ssh -o "ProxyJump [USER]@ssh-gw.alice.universiteitleiden.nl" [USER]@login2.alice.universiteitleiden.nl
      by
      ssh alice2
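
The passwordless-login check above can also be scripted, for example to run it before starting a long transfer. The following is only a sketch under assumptions: the function name can_login_without_password is made up for illustration, and alice2 assumes the optional config entry described above. It uses the standard OpenSSH options BatchMode=yes (fail instead of prompting) and ConnectTimeout. Options given on the command line apply to the final host, so the hop through the gateway may still prompt if its key is missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the official setup):
# returns 0 only if logging in to the given host works without a prompt.
can_login_without_password() {
    local host="${1:-alice2}"   # assumed alias from ~/.ssh/config
    # BatchMode=yes makes ssh fail rather than ask for a password;
    # "true" is a no-op command that is run on the remote side.
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true
}

# Usage (run on SHARK):
#   can_login_without_password alice2 && echo "SSH keys work" || echo "login failed or needs a password"
```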

Rsync (Recommended)

The benefits of rsync are that it keeps track of which files were transferred correctly, shows progress, can resume a transfer after an interruption (e.g. if the connection was lost), and can write a log file. The downside is that rsync does not combine the files into a tar archive prior to transfer, so transferring the data will be slower (especially when there are large numbers of files) and will place a larger load on the network and cluster I/O.

  1. First, set up SSH keys and the SSH config file by following the two steps under "Setting up SSH Keys" above.
  2. Run the following command in a SHARK terminal, replacing the source [SHARK-FOLDER] and destination [ALICE-FOLDER] paths as relevant. It is strongly suggested to start with a small test folder containing one or two (text) files.
    rsync -azv --sparse -P [SHARK-FOLDER] alice2:[ALICE-FOLDER]
    example:
    rsync -azv --sparse -P ./test alice2:data1/
  3. For the source [SHARK-FOLDER] path, including a forward slash / at the end of the path means that only the contents of the source directory (e.g. dir1) will be synced to the destination ALICE folder (e.g. dir2), creating a /dir2/files structure. Not including the / means that the source folder itself will also be put inside the destination folder, creating a /dir2/dir1/files structure.
  4. If the transfer was interrupted for any reason (e.g. lost connection), you can just run the command again. Files that already were transferred successfully will be left alone.
  5. [Optional] you can store the progress output from rsync in a logfile:
    • run:
      rsync -azv -P --log-file="./rsync_log-$(date +"%Y-%m-%d".txt)" [SHARK-FOLDER] alice2:[ALICE-FOLDER]
    • A log file will be written in the current folder, named "rsync_log-" followed by the date of the transfer. You can inspect this log file afterwards to see whether the transfer was completed or interrupted. If you do multiple transfers on the same day, the log is appended to the same file.
    • Check the log file to see if everything went as expected. What you want to see at the end of the log is something like:
      2023/03/27 14:26:22 [702404] sent 92,249,442 bytes  received 55 bytes  61,499,664.67 bytes/sec
      2023/03/27 14:26:22 [702404] total size is 92,226,726  speedup is 1.00
    • What you do not want to see is something like:
      2023/03/27 14:16:42 [542408] rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1189) [sender=3.1.3]
  6. [Optional] In the above examples compression is turned on (the z in -azv) to reduce network load. You can disable it by dropping the z if you feel that it slows down the transfer too much.
  7. Rsync can be run with many configurations. The one used above is pretty much the default setting for archiving (that is, copying all folders and files). However, in some cases you may want to use different parameters to better suit your needs. For instance, symbolic and hard-linked files and folders may need special treatment that the regular -av parameters do not give. Please consult the help pages for rsync and/or contact SOLO or the cluster admins for help if needed.
  8. Transfer your data! If all has gone well so far, you are ready to copy over your data. If your data is relatively large and/or you expect to turn off your laptop/computer or switch networks at some point during the transfer, it is advised to run the transfer command in a Screen. Processes running in Screen will continue to run when their window is not visible, even if you get disconnected.
    • In the SHARK terminal run:
      screen -S [SOME-NAME]
      example
      screen -S datatransfer
    • Start your data transfer command
    • Now 'detach' the screen by pressing CTRL+a, then d
    • You can now safely disconnect from SHARK
    • Next time you wish to check up on the transfer, run:
      screen -r
    • If you have multiple screens, you can choose which one to connect to by appending its name:
      screen -r datatransfer
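
The rsync steps above (transfer, log file, checking for errors) can be combined in a small wrapper. This is a minimal sketch and not an official SOLO script: the function name sync_to_alice is hypothetical, alice2 assumes the optional SSH config entry, and the messages are illustrative. It reuses the flags from the commands above and relies on rsync's exit code, where 0 means success and 23 is the "some files/attrs were not transferred" case shown in the log example.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the rsync command shown above.
sync_to_alice() {
    local source_path="$1"       # e.g. ./test or ./test/ (trailing slash: contents only)
    local dest_path="$2"         # e.g. data1/
    local host="${3:-alice2}"    # assumed alias from ~/.ssh/config
    local log_file="./rsync_log-$(date +%Y-%m-%d).txt"

    rsync -azv --sparse -P --log-file="$log_file" "$source_path" "$host:$dest_path"
    local rc=$?

    case "$rc" in
        0)  echo "Transfer complete, log written to $log_file" ;;
        23) echo "Some files were not transferred (rsync code 23), check $log_file" >&2 ;;
        *)  echo "rsync stopped with exit code $rc, rerun the command to resume" >&2 ;;
    esac
    return "$rc"
}

# Usage (run on SHARK, ideally inside a screen session):
#   sync_to_alice ./test data1/
```

Because rsync leaves files that were already transferred successfully alone, calling the function again after a failure simply continues where the previous run stopped.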

Tar over SSH

This method is run fully from the terminal and has the benefit of packing the files together in a single (optionally compressed) tar stream for transfer. Generally that leads to faster transfer times.

  1. Try to copy over a single file from SHARK to ALICE:
    • cd [some-directory]
    • tar -czSf - [some-file] | ssh -o "ProxyJump [USER]@ssh-gw.alice.universiteitleiden.nl" [USER]@login2.alice.universiteitleiden.nl  tar -C data1 -xzSf -
    • note that you can shorten the above command a lot if you set up your ssh config file:
      tar -czSf - [some-file] | ssh alice2  tar -C data1 -xzSf -
    • This will copy /some-directory/some-file to ~/data1/some-file. If you wish to transfer the file to a different destination folder replace data1  with the intended destination path. Be aware that the destination folder must exist on ALICE prior to the transfer.
    • Be aware: this will overwrite any existing files on ALICE (in the destination folder) with the same name!

  2. If that works, then copy a test directory
    •  cd [some-directory]
    • Run the following command, replacing [USER] and [some-directory] with relevant information:
      tar -czSf - [some-directory] | ssh -o "ProxyJump [USER]@ssh-gw.alice.universiteitleiden.nl" [USER]@login2.alice.universiteitleiden.nl  tar -C data1 -xzf -
      example:
      tar -czSf - ./test/ | ssh -o "ProxyJump olferskjf@ssh-gw.alice.universiteitleiden.nl" olferskjf@login2.alice.universiteitleiden.nl  tar -C data1 -xzf -
      note that you can shorten the above command a lot if you set up your ssh config file:
      tar -czSf - ./test/ | ssh alice2 tar -C data1 -xzf -
    • This will copy /some-directory/data-directory and everything below it to ~/data1/data-directory. File permissions will be preserved; the user and group are reset to the local ALICE userid and default groupid. If you wish to transfer the folder to a different destination folder, replace data1 with the intended destination path. Be aware that the destination folder must exist on ALICE prior to the transfer.
    • Be aware: this will overwrite existing files on ALICE (in the destination folder) with the same name, or add the files to an existing folder with the same name on ALICE.

  3. Transfer your data! If all has gone well so far, you are ready to copy over your data. If your data is relatively large and/or you expect to turn off your laptop/computer or switch networks at some point during the transfer, it is advised to run the transfer command in a Screen. Processes running in Screen will continue to run when their window is not visible, even if you get disconnected.
    • In the SHARK terminal run:
      screen -S [SOME-NAME]
      example
      screen -S datatransfer
    • Start your data transfer command
    • Now 'detach' the screen by pressing CTRL+a, then d
    • You can now safely disconnect from SHARK
    • Next time you wish to check up on the transfer, run:
      screen -r
    • If you have multiple screens, you can choose which one to connect to by appending its name:
      screen -r datatransfer
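
Unlike rsync, the tar pipe gives no summary at the end, and a failure on the SHARK side can go unnoticed because the shell only reports the exit status of the last command in the pipe. The sketch below, written for bash, checks both sides. It is an illustration under assumptions: tar_to_alice is a hypothetical name, alice2 assumes the optional SSH config entry, and the destination path should not contain spaces because it is passed through ssh to the remote shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stream a file or folder to ALICE through tar
# and check the exit status of both ends of the pipe.
tar_to_alice() {
    local source_path="$1"          # file or folder, relative to the current directory
    local dest_dir="${2:-data1}"    # must already exist on ALICE
    local host="${3:-alice2}"       # assumed alias from ~/.ssh/config
    local -a pipe_status

    tar -czSf - "$source_path" | ssh "$host" tar -C "$dest_dir" -xzSf -
    pipe_status=("${PIPESTATUS[@]}")   # bash-specific: one exit code per pipe stage

    if [ "${pipe_status[0]}" -ne 0 ]; then
        echo "tar on SHARK failed (exit code ${pipe_status[0]})" >&2
        return 1
    fi
    if [ "${pipe_status[1]}" -ne 0 ]; then
        echo "ssh or tar on ALICE failed (exit code ${pipe_status[1]})" >&2
        return 1
    fi
    echo "Sent $source_path to $host:$dest_dir"
}

# Usage (run on SHARK):
#   cd [some-directory] && tar_to_alice ./test/ data1
```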

Optional: Manual tar file

  1. If you have a lot of (especially smaller) files to transfer, it may speed up the transfer if you first combine all these files into a single tar file. This will also allow you to quickly check later on whether the file was transferred without issues. For the fastest transfer you can use compression when creating the tar file (note that compressing the files will take some time as well).
    with compression: tar -czvf FILE-NAME.tar.gz FOLDER-NAME
    example: tar -czvf myfile.tar.gz /exports/fsw/kjfolfers/mydata/
    without compression: tar -cvf myfile.tar /exports/fsw/kjfolfers/mydata/
  2. Once you have the tar file, you can create a checksum value that you can use later to see if the data was transferred without mistakes. To do so run the command: 
    md5sum FILE-NAME
    example:
    md5sum myfile.tar.gz
  3. Note / Save the output of the command somewhere.
  4. You can now copy (cp) or move (mv) the tar file from SHARK to the mounted ALICE folder.
  5. After transferring the file, run the md5sum again on the copied or moved file (in the ALICE folder) and check whether it matches the one you obtained earlier.
  6. Finally to work with the data on ALICE, you can now extract the files from the tar using:
    compressed tar: tar -zxvkf FILE-NAME.tar.gz -C DESTINATION-FOLDER
    uncompressed tar: tar -xvf FILE-NAME.tar -C DESTINATION-FOLDER
    example: tar -zxvf myfile.tar.gz -C /home/olferskjf/data1/test/    (Note: the destination folder must already exist!)
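
Instead of comparing the two md5sum outputs by eye, the comparison can be scripted. This is a minimal sketch assuming both the original and the copy are readable from the same machine (for example through a mounted ALICE folder); the function name same_md5 is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if two files have the same MD5 checksum.
same_md5() {
    local original="$1"
    local copy="$2"
    local sum_original sum_copy
    # md5sum prints "<checksum>  <file name>"; keep only the checksum.
    sum_original=$(md5sum "$original" | cut -d ' ' -f 1)
    sum_copy=$(md5sum "$copy" | cut -d ' ' -f 1)
    # An empty checksum means md5sum could not read the file.
    [ -n "$sum_original" ] && [ "$sum_original" = "$sum_copy" ]
}

# Usage (both paths are placeholders):
#   same_md5 ORIGINAL-FILE COPIED-FILE && echo "checksums match"
```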

GUI Drag & Drop [Not Recommended]

This method allows for relatively easy dragging and dropping of files using the File Manager in X2Go, by virtually mounting a folder from ALICE onto SHARK. It does require running one command from the terminal to set everything up. Note that this method is not suitable for all types of data and is generally only advisable when you have a fairly small amount of data (in number of files and size) to transfer.

This method does not work for 'sparse files' and symlinked files. Generally that means that conda/pip environments and docker/singularity containers are not suitable for this method.
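
To get a quick impression of whether a folder contains symlinks before choosing this method, something like the sketch below can be used. It is an illustration only: list_symlinks is a hypothetical name, and the check covers symbolic links but not sparse files.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: list all symbolic links below a folder.
list_symlinks() {
    local folder="${1:-.}"
    find "$folder" -type l
}

# Usage: an empty result means no symlinks were found.
#   list_symlinks FOLDER-NAME
```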

  1. Log in to SHARK using X2Go.
  2. Log in to ALICE using whatever means you are comfortable with (e.g. terminal, X2Go, RDP, MobaXTerm)
  3. Determine the destination folder on ALICE where you wish to put the data. Note that if you are transferring large data sets, you will need to use your scratch folder (on 'data1'), as your home directory has a relatively small quota. For instance, I would use "/home/olferskjf/data1/test/". Note: make sure the directory actually exists or create it first (mkdir).
  4. Create a mounting point on SHARK. To do this simply create a new empty folder in your home using the File Manager (right click > create new) or using mkdir in the terminal. You could name it "alice_mount" for example, although the name does not matter.
  5. Assemble the mounting command needed to mount this folder on SHARK, for instance in a text editor (on your local machine or on SHARK). Edit the (single-line) command below by replacing the upper-case placeholders (ULCN-USERNAME, DESTINATION-FOLDER, MOUNTING-POINT) with the appropriate information:
    sshfs -o reconnect,ServerAliveInterval=15,ssh_command="ssh -J ULCN-USERNAME@ssh-gw.alice.universiteitleiden.nl" ULCN-USERNAME@login1.alice.universiteitleiden.nl:DESTINATION-FOLDER MOUNTING-POINT
    for example, the command could look like:
    sshfs -o reconnect,ServerAliveInterval=15,ssh_command="ssh -J olferskjf@ssh-gw.alice.universiteitleiden.nl" olferskjf@login1.alice.universiteitleiden.nl:/home/olferskjf/data1/test/ ./alice_mount/
  6. Run the mounting command in a terminal on SHARK. Please make sure that you are in your home directory in the terminal, or if you are not that you adjust the MOUNTING-POINT path to reflect that.
  7. Enter your ALICE password twice now. If you receive no errors, the mount has been successful. If you are in an X2Go session, you can also confirm this by looking at the File Manager. The mounted folder should show up under "Devices".
  8. Copy a small file from a SHARK folder to the mounted ALICE folder as a test, by dragging and dropping in the File Manager.
  9. Check if the transfer worked in your ALICE session (i.e. can you see the file there in the expected location)
  10. Now read optional variation below to see which ones are relevant for you.
  11. Transfer the files as needed.
  12. Unmount the folder once you are done, by running the command:
    fusermount -u [MOUNTING-POINT]
    example:
    fusermount -u ./alice_mount/
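
Unmounting can be made a little safer by first checking whether anything is actually mounted at the mounting point. The sketch below assumes the mountpoint utility is available on SHARK, which has not been verified here; the function name unmount_alice is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unmount the sshfs folder only if something is mounted there.
unmount_alice() {
    local mount_dir="${1:-./alice_mount}"
    if mountpoint -q "$mount_dir"; then
        fusermount -u "$mount_dir"
    else
        echo "$mount_dir is not mounted, nothing to do" >&2
        return 1
    fi
}

# Usage (run on SHARK from your home directory):
#   unmount_alice ./alice_mount/
```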