
3. Useful Commands

Liam Bindle edited this page Feb 23, 2022 · 12 revisions

This section lists useful bashdatacatalog-list and bashdatacatalog-execdir commands.


Listing commands

List all missing files

$ bashdatacatalog-list -am catalog.csv

List all files existing locally

$ bashdatacatalog-list -ae catalog.csv

List all missing files in a date range

$ bashdatacatalog-list -am -r "2015-01-01,2018-12-31" catalog.csv

List invalid files whose names match a pattern

$ bashdatacatalog-list -aw -p "file[123]" catalog.csv

Note: Listing invalid files can take a significant amount of time because a checksum must be calculated for every file checked. This is why the -p "PATTERN" argument, which restricts the check to matching file names, is often useful with -w.

List unnecessary files for a given date range

$ bashdatacatalog-list -u -r "2015-01-01,2018-12-31" catalog.csv

Note: Unnecessary files are untracked files, plus temporal files whose timestamps fall outside the provided date range.


Download commands (using various tools)

curl: Downloading all missing files (HTTP)

$ bashdatacatalog-list -am -f xargs-curl catalog.csv | xargs curl

wget: Downloading all missing files (HTTP)

$ bashdatacatalog-list -am -f url catalog.csv > url_download_list.txt
$ wget -i url_download_list.txt -x -nH -nv --cut-dirs=4   # you will need to modify --cut-dirs=N
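
The right value of N for --cut-dirs=N is the number of leading directory components in the remote URL that should not be recreated locally. A minimal sketch for counting them, assuming the remote data root is a URL prefix such as the hypothetical http://example.com/pub/data/ (count_cut_dirs is an illustrative helper, not part of bashdatacatalog):

```shell
# Count the directory components in a URL prefix (hypothetical helper).
count_cut_dirs() {
    url_path="${1#*://}"                       # drop the scheme, e.g. "http://"
    case "$url_path" in
        */*) url_path="${url_path#*/}" ;;      # drop the host name
        *)   url_path="" ;;                    # no path after the host
    esac
    # Count the non-empty components between slashes.
    printf '%s\n' "$url_path" |
        awk -F/ '{ n = 0; for (i = 1; i <= NF; i++) if ($i != "") n++; print n }'
}

# example.com/pub/data/ is a made-up remote data root.
count_cut_dirs "http://example.com/pub/data/"   # prints 2
```

With that prefix, --cut-dirs=2 would make wget save files relative to the remote data root instead of under pub/data/.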

rsync: Downloading all missing files (SSH)

$ bashdatacatalog-list -am -f rsync catalog.csv > file_list.txt
$ rsync -av --files-from=file_list.txt user@host:/remote-data-root/ .

Globus: Downloading all missing files (Globus)

$ bashdatacatalog-list -am -f globus="$(pwd),/remote-data-root/" catalog.csv > globus_batch.txt
$ globus transfer --batch globus_batch.txt SOURCE_ENDPOINT_ID DEST_ENDPOINT_ID

Miscellaneous commands

Removing unnecessary files, given a date range

$ bashdatacatalog-list -u -r "2015-01-01,2018-12-31" -f xargs-rm catalog.csv | xargs rm

Changing the group that owns all the files in a catalog

$ bashdatacatalog-list -ae catalog.csv | xargs chgrp groupname

Note: You might need to use sudo xargs chgrp groupname.

Changing the group that owns all the directories in a catalog

$ bashdatacatalog-execdir 'find -type d -exec chgrp groupname {} \;' catalog.csv
$ bashdatacatalog-execdir 'pwd' catalog.csv | sed "s#$(pwd)/*##g" | awk -F '/' '{for (i=1; i<=NF; i++) { for(j=1; j<=i; ++j) printf "%s/",$j ; printf "\n"} }' | sort | uniq | xargs chgrp groupname

Note: You might need to use sudo chgrp groupname.
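
The awk stage in the second command above expands each relative directory path into all of its parent prefixes, so that chgrp is also applied to the intermediate directories. A small sketch of that stage on its own, using made-up paths (expand_parent_dirs is an illustrative name):

```shell
# Same awk program as in the pipeline above, wrapped in a function.
expand_parent_dirs() {
    awk -F '/' '{for (i=1; i<=NF; i++) { for(j=1; j<=i; ++j) printf "%s/",$j ; printf "\n"} }' | sort | uniq
}

# a/b/c and a/d are hypothetical relative directory paths.
printf '%s\n' 'a/b/c' 'a/d' | expand_parent_dirs
# prints:
# a/
# a/b/
# a/b/c/
# a/d/
```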

Removing all unnecessary files in the tree, given a date range

$ bashdatacatalog-list -a -r "2015-01-01,2018-12-31" catalog.csv | sort > tracked_files.txt
$ find -L . -name .asset_patches -prune -o -type f \( ! -name '\.*' \) -print | sort > all_files_in_tree.txt
$ comm -13 tracked_files.txt all_files_in_tree.txt > unnecessary_files.txt
$ cat unnecessary_files.txt | xargs rm  # be careful!
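
Because xargs splits its input on whitespace by default, the last step above can misbehave for file names containing spaces. One possible, more cautious variant is sketched below. It reads the list line by line and only prints what it would delete unless asked to delete. The function name and its "delete" argument are assumptions made for illustration.

```shell
# Remove the files listed one per line in a text file (hypothetical helper).
# Usage: remove_listed_files LIST_FILE [delete]
# Without the second argument it only prints what would be removed.
remove_listed_files() {
    list_file="$1"
    mode="${2:-dry-run}"
    while IFS= read -r f; do
        [ -n "$f" ] || continue            # skip blank lines
        if [ "$mode" = "delete" ]; then
            rm -- "$f"                     # "--" guards against names starting with "-"
        else
            printf 'would remove: %s\n' "$f"
        fi
    done < "$list_file"
}
```

Running remove_listed_files unnecessary_files.txt first gives a preview. Running remove_listed_files unnecessary_files.txt delete then performs the removal.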

Total size, in bytes, of all static files that exist

$ bashdatacatalog-list -s DataCatalogs/MeteorologicalInputs.csv DataCatalogs/13.2/*.csv | xargs stat --printf="%s\n" | awk '{s+=$1} END {print s}'
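
The total printed above is a raw byte count. A small sketch for turning such a number into a more readable figure, using only awk (human_size is an illustrative helper, not part of bashdatacatalog):

```shell
# Convert a byte count read from stdin into B, KiB, MiB, GiB, or TiB.
human_size() {
    awk '{
        split("B KiB MiB GiB TiB", unit, " ")
        size = $1
        i = 1
        while (size >= 1024 && i < 5) { size /= 1024; i++ }
        printf "%.1f %s\n", size, unit[i]
    }'
}

echo 1536 | human_size   # prints 1.5 KiB
```

Appending | human_size to the pipeline above would report the total in these units.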