Instructions for GEOS-Chem Users
This page is a standalone summary of bashdatacatalog instructions for GEOS-Chem users.
There are two important terms you should understand before starting:
- data collection - A data collection is an input data directory. For example, HEMCO/CEDS/v2021-06 is a single data collection.
- catalog file - A file that groups data collections together. For example, the following catalog file specifies the data collections for emissions inputs for GEOS-Chem v13.2: EmissionsInputs.csv.
GEOS-Chem input data is divided into 4 catalogs: MeteorologicalInputs.csv, EmissionsInputs.csv, ChemistryInputs.csv, and InitialConditions.csv. Scientific updates in X.Y versions of GEOS-Chem often introduce new collections with updated emissions, chemistry inputs, and initial conditions, so certain catalogs are version-specific:
Catalog | GEOS-Chem X.Y Version-Specific? |
---|---|
MeteorologicalInputs.csv | No |
EmissionsInputs.csv | Yes |
ChemistryInputs.csv | Yes |
InitialConditions.csv | Yes |
In general, the workflow for downloading GEOS-Chem input data is:
- Download the catalog files you need. These catalogs define the data collections (input data) that GEOS-Chem needs.
- (Optional) If you are using any optional emissions or specialty simulations, enable the appropriate collections in your catalogs (column 3 of a catalog file).
- Fetch collection metadata by running bashdatacatalog-fetch. This downloads metadata for each active collection in your catalogs, which includes the files in every collection, their checksums, and other details.
- Generate a list of the files you need to download by running bashdatacatalog-list. There are a variety of output formats to choose from (e.g., a download list for cURL, wget, or Globus) along with a variety of options (e.g., filtering by a date range, or only listing missing files).
Refer to the Installation Instructions. In brief, run the following command, answer the prompts, and restart your terminal.
$ bash <(curl -s https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/install.sh)
Note: These setup instructions work for pre-existing (in-place setup) and new copies of the GEOS-Chem input data.
First, create a directory to house your catalog files. Navigate to the top-level of the GEOS-Chem data directory (the directory with HEMCO/, CHEM_INPUTS/, etc.) and create a new directory called InputDataCatalogs. Inside this directory, create subdirectories for the GEOS-Chem versions you work with.
liam@~$ cd /ExtData # navigate to GEOS-Chem data
liam@/ExtData$ mkdir InputDataCatalogs # new directory for catalog files
liam@/ExtData$ mkdir InputDataCatalogs/13.2 # " for 13.2-specific catalogs
liam@/ExtData$ mkdir InputDataCatalogs/13.3 # " for 13.3-specific catalogs
Next, download the catalogs for the GEOS-Chem versions you work with. Since MeteorologicalInputs.csv doesn't change between GEOS-Chem versions, put it in InputDataCatalogs/. Since EmissionsInputs.csv, ChemistryInputs.csv, and InitialConditions.csv do change between GEOS-Chem versions, put those in the version-specific subdirectories. Input data catalogs can be downloaded from http://geoschemdata.wustl.edu/ExtData/DataCatalogs/.
Examples of downloading catalog files:
Download the catalog for metfields:
liam@/ExtData$ cd InputDataCatalogs
liam@/ExtData/InputDataCatalogs$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/MeteorologicalInputs.csv
Download the 13.2-specific catalogs:
liam@/ExtData/InputDataCatalogs$ cd 13.2
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/ChemistryInputs.csv
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/EmissionsInputs.csv
liam@/ExtData/InputDataCatalogs/13.2$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.2/InitialConditions.csv
Download the 13.3-specific catalogs:
liam@/ExtData/InputDataCatalogs/13.2$ cd ../13.3
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/ChemistryInputs.csv
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/EmissionsInputs.csv
liam@/ExtData/InputDataCatalogs/13.3$ wget http://geoschemdata.wustl.edu/ExtData/DataCatalogs/13.3/InitialConditions.csv
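If you work with several GEOS-Chem versions, the downloads above follow a regular pattern. The following is a minimal sketch, not part of bashdatacatalog, that prints the corresponding wget commands so you can review them before running anything. The function names catalog_url and print_catalog_downloads are hypothetical helpers; the URL layout is taken from the examples above, and the printed commands assume you run them from the top-level of your GEOS-Chem data directory.

```shell
#!/bin/sh
# Base URL of the catalog files, as given in the instructions above.
CATALOG_BASE_URL="http://geoschemdata.wustl.edu/ExtData/DataCatalogs"

# Print the URL of one catalog file.
#   $1: catalog name, e.g. EmissionsInputs.csv
#   $2: GEOS-Chem X.Y version (leave empty for MeteorologicalInputs.csv)
catalog_url() {
    if [ -n "${2:-}" ]; then
        printf '%s/%s/%s\n' "$CATALOG_BASE_URL" "$2" "$1"
    else
        printf '%s/%s\n' "$CATALOG_BASE_URL" "$1"
    fi
}

# Print (but do not run) one wget command per catalog file.
#   $@: the GEOS-Chem X.Y versions you work with
print_catalog_downloads() {
    printf 'wget -P InputDataCatalogs %s\n' "$(catalog_url MeteorologicalInputs.csv)"
    for version in "$@"; do
        for name in ChemistryInputs.csv EmissionsInputs.csv InitialConditions.csv; do
            printf 'wget -P InputDataCatalogs/%s %s\n' \
                "$version" "$(catalog_url "$name" "$version")"
        done
    done
}

print_catalog_downloads 13.2 13.3
```

Here, wget -P DIR saves the downloaded file into DIR. Once the printed commands look right, you can pipe the output to sh to run them.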
Column 3 of a catalog file is a boolean flag that enables or disables each collection. By default, only the collections that are required for an out-of-the-box GC-Classic and GCHP simulation are enabled (they have a 1 in column 3). Column 3 is where you can customize active collections according to the simulations you plan on running. You will need to modify column 3 if you use:
- GEOS-FP or nested grid metfields,
- optional emissions, or
- specialty simulations
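To check which collections a catalog currently enables, you can filter on column 3. This is a small sketch using standard awk rather than a bashdatacatalog command; list_enabled is a hypothetical helper name, and the two sample rows are illustrative data written in the catalog layout (collection path, URL, enable flag, comment).

```shell
#!/bin/sh
# Print column 1 (the collection path) of every row whose column 3 is 1.
#   $@: one or more catalog files
list_enabled() {
    awk -F, '$3 == 1 { print $1 }' "$@"
}

# Illustrative sample catalog: path, URL, enable flag, comment.
sample=$(mktemp)
cat > "$sample" <<'EOF'
GEOS_0.25x0.3125/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP,0,
GEOS_2x2.5/MERRA2,http://geoschemdata.wustl.edu/ExtData/GEOS_2x2.5/MERRA2,1,
EOF

list_enabled "$sample"   # prints GEOS_2x2.5/MERRA2
rm -f "$sample"
```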
As an example, let's configure MeteorologicalInputs.csv so that the collection with GEOS-FP 0.25°x0.3125° global fields is activated. Open MeteorologicalInputs.csv with your preferred text or CSV editor and make the following modification:
GEOS_0.25x0.3125_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_AS/GEOS_FP,0,
GEOS_0.25x0.3125_CH/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_CH/GEOS_FP,0,
GEOS_0.25x0.3125_EU/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_EU/GEOS_FP,0,
-GEOS_0.25x0.3125/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP,0,
+GEOS_0.25x0.3125/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP,1,
GEOS_0.25x0.3125_NA/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_NA/GEOS_FP,0,
GEOS_0.5x0.625_AS/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.5x0.625_AS/GEOS_FP,0,
GEOS_0.5x0.625_AS/MERRA2,http://geoschemdata.wustl.edu/ExtData/GEOS_0.5x0.625_AS/MERRA2,0,
Note: Over time we will improve the documentation of what collections are needed when. In the meantime, refer to column 4 for notes on when a collection is needed. If you have questions, please open an issue on GitHub.
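The same kind of edit can be scripted. The sketch below uses standard awk, not a bashdatacatalog feature, to set column 3 to 1 for one collection; enable_collection is a hypothetical helper name. It writes the modified catalog to standard output so that you can inspect the result before replacing the original file.

```shell
#!/bin/sh
# Print a copy of a catalog in which column 3 is 1 for one collection.
#   $1: catalog file
#   $2: collection path exactly as written in column 1
# Rows that do not match are printed unchanged.
enable_collection() {
    awk -F, -v OFS=, -v name="$2" '$1 == name { $3 = 1 } { print }' "$1"
}

# Example on a scratch copy of two rows from MeteorologicalInputs.csv.
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
GEOS_0.25x0.3125/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP,0,
GEOS_0.25x0.3125_NA/GEOS_FP,http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125_NA/GEOS_FP,0,
EOF

enable_collection "$scratch" GEOS_0.25x0.3125/GEOS_FP
rm -f "$scratch"
```

Only the first row's flag changes, because the match on column 1 is exact. To apply the change to a real catalog, redirect the output to a temporary file and then move it over the original.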
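Important: Run bashdatacatalog commands from the top-level of your GEOS-Chem data directory (the directory with HEMCO/, CHEM_INPUTS/, etc.); otherwise, relative paths to data collections (column 1) will be interpreted incorrectly.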
Navigate to the top-level of your GEOS-Chem data directory. First, you need to fetch collection metadata (the information about what files exist in each collection). This is done with the bashdatacatalog-fetch command, which takes catalog files as its arguments. See bashdatacatalog-fetch -h for more details.
liam@~$ cd /ExtData # IMPORTANT: navigate to top-level of GEOS-Chem input data
liam@/ExtData$ bashdatacatalog-fetch InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv
Expected output:
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/FastJ_201204/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/FAST_JX/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/Linoz_200910/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/Olson_Land_Map_201203/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/UCX_201403/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOS_0.25x0.3125/GEOS_FP/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOS_0.5x0.625/MERRA2/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOS_2x2.5/MERRA2/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOSCHEM_RESTARTS/GC_13.0.0/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/GEOSCHEM_RESTARTS/v2021-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/ACET/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/AEIC/v2015-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/AFCID/v2018-04/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/ALD2/v2017-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/AnnualScalar/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/APEI/v2016-11/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/BIOFUEL/v2019-08/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/BROMINE/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/C2H6_2010/v2019-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/CEDS/v2021-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/CH3I/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/CMIP6/v2020-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/DICE_Africa/v2016-10/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/DMS/v2015-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/DUST_DEAD/v2019-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/EDGARv42/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/EDGARv43/v2016-11/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/GEIA/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/GFED4/v2015-10/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/GFED4/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/GMI/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/HTAP/v2015-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/IODINE/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/LIGHTNOX/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MASKS/v2018-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MASKS/v2019-05/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MEGAN/v2018-05/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MEGAN/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MODIS_CHLR/v2019-11/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MOH/v2019-12/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/MULTI_ICE/v2021-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NEI2005/v2014-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NEI2011/v2017-02-MM_for_GCHP/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NEI2016/v2021-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NH3/v2018-04/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NH3/v2019-08/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/NOAA_GMD/v2018-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OCEAN_O3_DRYDEP/v2020-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_BIOVOC/v2019-10/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_DUST/v2019-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_DUST/v2021-08/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_LIGHTNING/v2020-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_SEASALT/v2019-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OFFLINE_SOILNOX/v2019-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OLSON_MAP/v2019-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/OMOC/v2018-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/PARANOX/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/POET/v2017-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/RCP/v2020-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/RONO2/v2019-05/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/RRTMG/v2018-11/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/SfcFix/v2019-12/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/SOA/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/SOILNOX/v2014-07/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/STRAT/v2015-01/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/TIMEZONES/v2015-02/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/TrashEmis/v2015-03/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/UVALBEDO/v2019-06/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/VerticalScaleFactors/v2021-05/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/VOLCANO/v2019-08/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/VOLCANO/v2021-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/XIAO/v2014-09/'
Fetching metadata from 'http://geoschemdata.wustl.edu/ExtData/HEMCO/Yuan_XLAI/v2021-06/'
Note: Fetching should take about 1 minute. Recently, the GEOS-Chem data portal has had intermittent periods where it is slow to respond, so if fetching is taking a long time, try again later.
Fetching downloads the latest metadata for every active collection in your catalogs. You should run bashdatacatalog-fetch whenever you add or modify a catalog, as well as periodically so you get updates to your collections (e.g., new meteorological data that is processed and added to the meteorological collections).
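The catalog arguments in the fetch command are expanded by your shell's glob rules. As an alternative, here is a sketch that collects catalog files with find, so the list does not depend on shell globbing options. The name list_catalogs is a hypothetical helper, and the sketch assumes that catalog paths contain no whitespace.

```shell
#!/bin/sh
# Print every .csv file below a directory, one per line, in sorted order.
#   $1: directory that holds your catalog files
list_catalogs() {
    find "$1" -type f -name '*.csv' | sort
}

# Example with a scratch directory laid out like InputDataCatalogs.
scratch=$(mktemp -d)
mkdir -p "$scratch/InputDataCatalogs/13.2" "$scratch/InputDataCatalogs/13.3"
touch "$scratch/InputDataCatalogs/MeteorologicalInputs.csv" \
      "$scratch/InputDataCatalogs/13.2/EmissionsInputs.csv" \
      "$scratch/InputDataCatalogs/13.3/EmissionsInputs.csv"

(cd "$scratch" && list_catalogs InputDataCatalogs)
rm -rf "$scratch"
```

From the top-level of your data directory, the result can then be passed on, for example as bashdatacatalog-fetch $(list_catalogs InputDataCatalogs).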
Now that you have fetched, you can run bashdatacatalog-list commands. You can tailor this command to generate various types of file lists using its command-line arguments. See bashdatacatalog-list -h for details. A common use case is generating a list of required input files that are missing in your local file system.
liam@/ExtData$ bashdatacatalog-list -am -r 2018-06-30,2018-08-01 InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv
Here, -a means "all" files (temporal files and static files), -m means "missing" (list files that are absent locally), -r START,END is the date range of your simulation (you should add an extra day before/after your simulation), and the remaining arguments are the paths to your catalog files.
The command can be easily modified so that it generates a list of missing files that is compatible with xargs curl to download all the files you are missing:
liam@/ExtData$ bashdatacatalog-list -am -r 2018-06-30,2018-08-01 -f xargs-curl InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv | xargs curl
Here, -f xargs-curl means the output file list should be formatted for piping into xargs curl.
📖 You can find a list of useful list commands here: Useful Commands.
When using the -r option in bashdatacatalog-list (date range filtering), you need to subtract/add 1 day to the start/end date of your simulation. This is because time interpolation at midnight needs the files for both days.
For example, if you plan to run a simulation for 2019-01-01 to 2019-12-31, you should use an -r range like
liam@/ExtData$ bashdatacatalog-list -am -r 2018-12-31,2020-01-01 InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv
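The padding can be computed rather than worked out by hand. This sketch assumes GNU date (its -d option is not available in BSD/macOS date), and padded_range is a hypothetical helper name; it prints a START,END string in the form that -r expects.

```shell
#!/bin/sh
# Print "START,END" padded by one day on each side.
#   $1: first day of the simulation (YYYY-MM-DD)
#   $2: last day of the simulation (YYYY-MM-DD)
# Assumes GNU date. -u keeps the arithmetic in UTC so that daylight
# saving transitions cannot shift the result.
padded_range() {
    start=$(date -u -d "$1 - 1 day" +%Y-%m-%d) || return 1
    end=$(date -u -d "$2 + 1 day" +%Y-%m-%d) || return 1
    printf '%s,%s\n' "$start" "$end"
}

padded_range 2019-01-01 2019-12-31   # prints 2018-12-31,2020-01-01
```

The result can be substituted into the list command, for example as -r "$(padded_range 2019-01-01 2019-12-31)".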
Emission data collections that use the C flag in HEMCO (i.e., "recycling", where the first/last year of data is recycled for simulation dates before/after the data's temporal coverage) must have their first/last year of data classified as static assets rather than temporal assets. As a result, the OFFLINE_BIOVOC/v2019-10 and OFFLINE_SEASALT/v2019-01 collections each have about 100 GB of static data. This means that the minimum size of ExtData using the bashdatacatalog is ~300 GB (this is static data and data which isn't easily classified from the top-down). It's essentially a one-time overhead, and in practice it's a small price. Once you have this "static overhead" downloaded, subsequent downloads will only be the required data.
You can see the static overhead of each collection here: Summary of Input Data Collection Sizes (see the "Total Size (Static)" column).
The default catalogs only enable (column 3) the collections that are required for out-of-the-box GC-Classic and GCHP simulations. To enable data collections for optional emissions, search column 4 (collection comments) for keywords related to the emissions or specialty simulation. Column 4 of EmissionsInputs.csv has comments that indicate the corresponding switch in HEMCO_Config.rc. For example, for APEI emissions:
liam@/ExtData$ grep -r 'APEI' InputDataCatalogs/*.csv InputDataCatalogs/**/*.csv
13.2/EmissionsInputs.csv:HEMCO/MASKS/v2018-09,http://geoschemdata.wustl.edu/ExtData/HEMCO/MASKS/v2018-09,1,For China mask and APEI and NEI2016_MONMEAN and DICE_Africa
13.2/EmissionsInputs.csv:HEMCO/APEI/v2016-11,http://geoschemdata.wustl.edu/ExtData/HEMCO/APEI/v2016-11,0,Optional; APEI Canada
13.3/EmissionsInputs.csv:HEMCO/MASKS/v2018-09,http://geoschemdata.wustl.edu/ExtData/HEMCO/MASKS/v2018-09,1,For China mask and APEI and NEI2016_MONMEAN and DICE_Africa
13.3/EmissionsInputs.csv:HEMCO/APEI/v2016-11,http://geoschemdata.wustl.edu/ExtData/HEMCO/APEI/v2016-11,0,Optional; APEI Canada
This shows that the HEMCO/APEI/v2016-11 collection is currently disabled (0 in column 3), but you can enable it by setting column 3 to 1.
Note: Technically, only one catalog needs to enable it, but it would be good practice to enable it in both catalogs above.
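To see at a glance which matching collections are enabled, the search can also be made column-aware. As a sketch, the following uses standard awk to report each matching row's file, status, collection, and comment; find_collections is a hypothetical helper name, not a bashdatacatalog command, and the sample rows are copied from the grep output above.

```shell
#!/bin/sh
# Report rows whose collection path or comment mentions a keyword.
#   $1: keyword (matched case-insensitively, as a plain substring)
#   $2...: catalog files
# Output columns (tab-separated): file, enabled|disabled, collection, comment.
find_collections() {
    keyword=$1
    shift
    awk -F, -v kw="$keyword" '
        BEGIN { kw = tolower(kw) }
        {
            # Rejoin column 4 onward in case a comment contains commas.
            comment = $4
            for (i = 5; i <= NF; i++) comment = comment "," $i
            if (index(tolower($1 " " comment), kw) > 0) {
                status = ($3 == 1) ? "enabled" : "disabled"
                printf "%s\t%s\t%s\t%s\n", FILENAME, status, $1, comment
            }
        }' "$@"
}

# Example on two rows taken from 13.2/EmissionsInputs.csv.
sample=$(mktemp)
cat > "$sample" <<'EOF'
HEMCO/MASKS/v2018-09,http://geoschemdata.wustl.edu/ExtData/HEMCO/MASKS/v2018-09,1,For China mask and APEI and NEI2016_MONMEAN and DICE_Africa
HEMCO/APEI/v2016-11,http://geoschemdata.wustl.edu/ExtData/HEMCO/APEI/v2016-11,0,Optional; APEI Canada
EOF

find_collections apei "$sample"
rm -f "$sample"
```

Both sample rows are reported: the MASKS collection as enabled and the APEI collection as disabled.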
If you find an error, please report it by opening an issue. Indexing all of the GEOS-Chem data has been a major organizational effort, and it is likely we have a few minor errors. It shouldn't take long to fix once you have reported the issue.
Feel free to open an issue if you have any questions or need help.
Consider giving the bashdatacatalog a Star ⭐ if you find it useful. This increases visibility and helps justify maintaining this repository.