Working with large data files

Here we explain how to handle large data files in your ARC.

ARCs and the DataHUB come with a mechanism to sync and store large files called Large File Storage (LFS). LFS is an efficient way to store your large data files; files stored this way are called “LFS objects”. Rather than checking every file during every arc sync (ARC Commander) or DataHUB Sync (ARCitect), the tools first check whether anything changed at all, and only then scan for what changed. This saves time and computing power compared to always scanning all large files for possible changes.

The ARCitect offers to activate or deactivate the use of LFS:

  • in the “Download ARC” (1) menu via the “LFS” checkbox (2)

  • as well as in the “DataHUB Sync” menu (1) via the “Use Large File Storage” checkbox (2). This menu is available once an ARC has been opened in ARCitect.

In addition you can set a threshold (2) in megabytes (MB) for what you consider a large file in the “Commit” menu (1).

You can also easily check which files in your ARC are flagged as LFS, by looking in the ARCitect tree panel (1).

If you haven’t downloaded an LFS file, only its pointer file is available locally. This pointer file cannot be displayed in ARCitect, but if you open it with a text editor (e.g. Notepad), it looks something like this:

Terminal window
version https://git-lfs.github.com/spec/v1
oid sha256:dfc4d259bb70ab93915fe6fd91df33017b09f9208d94b48d7c9a789dd35d65bc
size 22973898
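
Because a pointer file is plain text with a fixed first line, you can also check from a terminal whether a file is still a pointer or already the real data. The following is a minimal sketch, not a feature of ARCitect or git-lfs: the helper names `is_lfs_pointer` and `lfs_pointer_size` are made up for illustration, and the demo works on a temporary copy of the pointer shown above.

```shell
#!/usr/bin/env bash
# Sketch: tell an LFS pointer file apart from the real data file.
# The function names are hypothetical helpers, not git-lfs commands.

# Succeeds (exit 0) if the first line of the file is the LFS pointer header.
is_lfs_pointer() {
  head -c 200 "$1" 2>/dev/null | head -n 1 \
    | grep -q '^version https://git-lfs\.github\.com/spec/v1'
}

# Prints the size in bytes that the pointer records for the real object.
lfs_pointer_size() {
  awk '$1 == "size" { print $2 }' "$1"
}

# Demo on a temporary copy of the pointer file shown above.
demo="$(mktemp)"
printf '%s\n' \
  'version https://git-lfs.github.com/spec/v1' \
  'oid sha256:dfc4d259bb70ab93915fe6fd91df33017b09f9208d94b48d7c9a789dd35d65bc' \
  'size 22973898' > "$demo"

if is_lfs_pointer "$demo"; then
  echo "pointer file, real object is $(lfs_pointer_size "$demo") bytes"
fi
```

In a real ARC you would call `is_lfs_pointer` on a path inside the ARC instead of the temporary demo file.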

Finally, you can download individual large files via right-click -> “Download LFS File” (1).

You can also download all large files in a directory by right-clicking the folder in the tree panel (1) and then selecting “Download LFS Files” (2).

By default, the ARC Commander tracks the following files via LFS:

  1. All files stored in an assay’s dataset folder, and
  2. All files with a size larger than 150 MB.

The 150 MB threshold can be adjusted using the ARC Commander. For instance, to decrease it to 5 MB (i.e. 5000000 bytes), run:

Terminal window
arc config set -g -n "general.gitlfsbytethreshold" -v "5000000"
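
Before changing the threshold, it can help to preview which files in your ARC are larger than the new value. The sketch below uses only standard shell tools; `list_large_files` is a hypothetical helper, not an ARC Commander command. It assumes decimal megabytes, matching the 5 MB = 5000000 bytes conversion above. Whether the ARC Commander treats a file of exactly the threshold size as large is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: preview files above a size threshold.
# list_large_files is a hypothetical helper, not an ARC Commander command.

threshold_mb=5
threshold_bytes=$((threshold_mb * 1000 * 1000))   # 5 MB = 5000000 bytes

# Lists regular files under directory $1 that are larger than $2 bytes,
# skipping git's internal .git directory.
list_large_files() {
  find "$1" -type f ! -path '*/.git/*' -size +"$2"c -print
}

# Run from the root folder of your ARC.
list_large_files . "$threshold_bytes"
```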

In addition to the defaults, you can also actively choose which files to track via LFS:

  1. Update your local ARC via arc sync
  2. Add large files or folders by copying or moving them to your ARC
  3. Track files via
Terminal window
git lfs track "<path/to/FolderWithLargeFiles/**>"
git add .gitattributes
  4. Sync your ARC to the DataHUB via arc sync
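
The `git lfs track` step records the pattern in the `.gitattributes` file as a line of the form `<pattern> filter=lfs diff=lfs merge=lfs -text`, which is why that file is added afterwards. To double-check which patterns are tracked, you can read them back from that file. This is a minimal sketch; `lfs_patterns` is a hypothetical helper name, and it assumes you run it from the root folder of your ARC.

```shell
#!/usr/bin/env bash
# Sketch: list the path patterns that .gitattributes assigns to LFS.
# lfs_patterns is a hypothetical helper, not a git-lfs command.

# Prints the first field (the pattern) of every line that sets filter=lfs.
lfs_patterns() {
  awk '/filter=lfs/ { print $1 }' "$1"
}

attrs=".gitattributes"
if [ -f "$attrs" ]; then
  lfs_patterns "$attrs"
else
  echo "no $attrs found in the current folder" >&2
fi
```

Running `git lfs track` without arguments prints the tracked patterns as well. To test a single path, `git check-attr filter -- <path/to/file>` reports `filter: lfs` if the path is covered.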

Downloading an ARC without large data files

Sometimes you may want to download your ARC to a smaller computer, where you do not need a full copy of your ARC including all its large data files. For instance, you may just want to work with smaller derived data sets or update ISA metadata. In this case, you can add the -n or --nolfs flag to your arc get command:

Terminal window
arc get --nolfs -r https://git.nfdi4plants.org/<YourUser>/<YourARC>

For example, have a look at the ARC https://git.nfdi4plants.org/shiltemann/physcomitrium-patens-light-signaling-2022/. In the DataHUB this ARC has a storage volume of ~84 GB (December 2023), most of which comes from the large RNASeq data files flagged as “LFS”.

You can download this ARC without the LFS objects via

Terminal window
arc get --nolfs -r https://git.nfdi4plants.org/shiltemann/physcomitrium-patens-light-signaling-2022/

Selectively download large files

If you later wish to selectively download one or more of the LFS objects of your ARC to that machine, you can do so via git lfs pull --include "<path/to/fileOrFolder>".

For example, the following command will download one of the large RNASeq data files.

Terminal window
git lfs pull --include "assays/RNASeq/dataset/R19/R19_1.fq.gz"
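
The `--include` option also accepts several paths as one comma-separated list, so multiple files or folders can be fetched in a single call. The sketch below builds such a list and prints the resulting command as a dry run. `join_by_comma` is a hypothetical helper, and the second file name is an assumption made for illustration; it may not exist in the example ARC.

```shell
#!/usr/bin/env bash
# Sketch: build a comma-separated --include list for git lfs pull.
# join_by_comma is a hypothetical helper, not a git-lfs command.

# Joins all arguments with commas.
join_by_comma() {
  local IFS=,
  echo "$*"
}

# The second path is a made-up example and may not exist in the ARC.
paths=(
  "assays/RNASeq/dataset/R19/R19_1.fq.gz"
  "assays/RNASeq/dataset/R19/R19_2.fq.gz"
)
include="$(join_by_comma "${paths[@]}")"

# Dry run: only print the command. Run the printed line inside your ARC.
printf 'git lfs pull --include "%s"\n' "$include"
```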

Download all large files in the ARC

If at some point you wish to download all LFS files of your ARC, you can use the following command:

Terminal window
git lfs pull --include "*"

Open your ARC in the DataHUB and navigate to the folder with LFS objects. Files uploaded with LFS are flagged as “LFS” (1).

To check the storage used by your ARC, navigate to your ARC in the DataHUB and click “Project Storage” in the right sidebar (1).