Workflow and Data Management

A primary use of WSIM is to generate composite indicators of overall water surplus and deficit at a variety of time integration periods. Production of these indicators typically requires many different calls to command-line tools and generation of numerous intermediate files. To simplify the process of composite indicator production, WSIM provides a workflow tool that can be used to produce a directed acyclic graph (DAG) representation of intermediate files and the commands used to create them. The DAG can then be output to a Makefile, which can be executed to produce the desired outputs.
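The relationship between DAG nodes and Makefile rules can be illustrated with a minimal sketch. It is not WSIM's actual implementation: the Step class, the to_make_rule function, the file paths, and the placeholder commands prepare_forcing and run_model are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One node of the DAG: a file, the files it needs, and how to build it."""
    target: str
    dependencies: List[str] = field(default_factory=list)
    commands: List[str] = field(default_factory=list)


def to_make_rule(step: Step) -> str:
    """Render a step as a Makefile rule. Recipe lines must begin with a tab."""
    header = f"{step.target} : {' '.join(step.dependencies)}".rstrip()
    recipe = [f"\t{command}" for command in step.commands]
    return "\n".join([header, *recipe]) + "\n"


# Placeholder commands ("prepare_forcing", "run_model") are hypothetical.
steps = [
    Step("forcing/forcing_201701.nc",
         ["source/raw_201701.nc"],
         ["prepare_forcing source/raw_201701.nc forcing/forcing_201701.nc"]),
    Step("results/results_201701.nc",
         ["forcing/forcing_201701.nc"],
         ["run_model forcing/forcing_201701.nc results/results_201701.nc"]),
]

makefile_text = "\n".join(to_make_rule(step) for step in steps)
print(makefile_text)
```

Because each rule names its own dependencies, the order in which the steps are written does not matter; Make reconstructs the graph from the rules.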

Note

Generated Makefiles are not intended to be human-readable, and may contain hundreds of thousands of lines describing targets and dependencies.

Notes about Make

Make provides several advantages over, e.g., a shell script:

  • It allows a user to build only the outputs that are desired. For example, a user may only want observed data, or may only want long-lead-time forecasts.

  • It allows the process to be stopped and restarted at any point, which is useful if the process is being run on non-dedicated hardware.

  • It provides an easy way to run portions of the process in parallel. Because Make understands the interdependencies of the various commands, it can figure out which steps are independent and can be run in parallel, without manual specification.
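To show how independent steps can be identified from the dependency graph alone, the following sketch groups files into batches whose members do not depend on one another. It uses the graphlib module from the Python standard library (Python 3.9 or later); the dependency map and its simplified file names are hypothetical. Make performs this kind of scheduling itself when run with its -j option, so the sketch is for illustration only.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each file maps to the files it is built from.
dependencies = {
    "forcing_201701.nc": set(),
    "state_201701.nc": set(),
    "results_201701.nc": {"forcing_201701.nc", "state_201701.nc"},
    "rp_201701.nc": {"results_201701.nc"},
    "anom_201701.nc": {"results_201701.nc"},
}


def parallel_batches(graph):
    """Group nodes into batches whose members could be built at the same time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Every node returned by get_ready() has all of its dependencies built.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for number, batch in enumerate(parallel_batches(dependencies), start=1):
    print(number, batch)
```

In this example the two input files form the first batch, the results file the second, and the return period and anomaly files the third.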

There are dozens of alternative (and more modern) workflow tools that provide additional functionality, such as provisioning of worker machines, graphical monitoring of progress, etc. WSIM’s representation of processing steps as a DAG is designed such that additional output modules could be written to execute the WSIM workflow with alternatives to Make, if desired.

Starting a New Model Instance

The Python script makemake.py can be used to generate a Makefile that contains all steps necessary to run the model for one or more iterations, including downloading process inputs and performing model spin-up procedures.

Note

The steps for multiple model iterations can be included in a single file by calling makemake.py with the --start and --stop arguments. By default, forecasting steps will only be included in the most recent model iteration. This can be adjusted, in turn, with the --forecasts argument. However, including many forecasts in a single Makefile can cause Make to start slowly.

An example usage of makemake.py is as follows:

python3 workflow/makemake.py \
  --bindir /wsim \
  --config workflow/config/config_cfs.py \
  --source ~/wsim/source \
  --workspace ~/wsim/workspaces/oct26 \
  --start 201701

This will create the file ~/wsim/workspaces/oct26/Makefile with instructions for performing the following tasks:

  • downloading and preparing source data needed by WSIM to the source directory

  • performing a model spinup in the workspace directory

  • creating all outputs from the January 2017 model run

Subsequent model iterations (February 2017, March 2017, etc.) can be run by writing a new Makefile into the same workspace, overwriting the previous Makefile.

Once a Makefile has been generated, any WSIM output can be generated by calling Make from the workspace directory and providing the path of the output file as an argument, e.g.,

make ~/wsim/workspaces/oct26/composite/composite_1mo_201701.nc

Or, more succinctly:

make `pwd`/composite/composite_1mo_201701.nc

This will run all commands necessary to generate composite_1mo_201701.nc, including downloading historical data, spinning up the land surface model process, fitting statistical distributions on historical data, running the land surface model for the January 2017 timestep, and summarizing the January 2017 results in terms of composite indicators.

Make can also be called with the names of multiple targets, separated by spaces.

Naming every target individually quickly becomes impractical, however, because even the number of composite indicator files is large (depending on the configuration, perhaps 1 month of observed indicators plus 9 months of forecast indicators, each at 7 time integration windows, for 70 files in total). To provide an easy way to “build everything”, the Makefile also includes the following targets:

  • all_composites: all composite indicator files (observed and forecast) for all time-integration windows

  • all_adjusted_composites: all adjusted composite indicator files (observed and forecast) for all time-integration windows

  • all_monthly_composites: all composite indicator files (observed and forecast) without time integration

  • all_adjusted_monthly_composites: all adjusted composite indicator files (observed and forecast) without time integration
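When Make is driven from a script rather than typed by hand, the command line can be assembled programmatically. The sketch below is one possible approach, not part of WSIM: the make_command function is a hypothetical helper, and it assumes that file targets are given as absolute paths inside the workspace (as in the examples above) while aggregate target names are passed through unchanged. The -C and -j options are standard Make options for changing directory and setting the number of parallel jobs.

```python
import os
import shlex


def make_command(workspace, targets, jobs=1):
    """Build the argument list for invoking Make in a workspace (hypothetical helper)."""
    workspace = os.path.abspath(os.path.expanduser(workspace))
    command = ["make", "-C", workspace, "-j", str(jobs)]
    for target in targets:
        # Assumption: file targets are absolute paths inside the workspace;
        # aggregate names such as all_composites are passed through unchanged.
        if target.endswith(".nc"):
            target = os.path.join(workspace, target)
        command.append(target)
    return command


command = make_command("~/wsim/workspaces/oct26", ["all_composites"], jobs=4)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) to execute the build.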

Note

In order to run a model iteration, outputs from the previous model iteration must either be present on disk, or be covered within the model spinup period. In the example above, the January 2017 timestep is the first iteration after the 1948-2016 spinup period, so Makefile instructions are available for all needed iterations. If we wanted to perform the June 2017 iteration instead, we would need to run the January 2017 through May 2017 iterations first, or include instructions for all of them in the Makefile with the arguments --start 201701 --stop 201706.
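The set of iterations covered by a --start and --stop pair can be enumerated with a short sketch. The iterations function is a hypothetical helper written for illustration; it assumes both values are YYYYMM strings and that the range is inclusive, which may not match how makemake.py processes its arguments internally.

```python
def iterations(start, stop):
    """Return every YYYYMM string from start to stop, inclusive."""
    year, month = int(start[:4]), int(start[4:])
    stop_year, stop_month = int(stop[:4]), int(stop[4:])
    result = []
    while (year, month) <= (stop_year, stop_month):
        result.append(f"{year:04d}{month:02d}")
        # Advance one month, rolling over into January of the next year.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return result


print(iterations("201701", "201706"))
```

Each iteration in the list depends on the state left by the one before it, which is why the intermediate months cannot be skipped.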

Configuration needed by the Makefile generator is provided by a Python file with information such as:

  • raw data locations

  • availability of historical data

  • time period to use for fitting historical norms (e.g., 1950-2009)

  • time integration windows (3 months, 24 months, etc.)

  • statistics to compute for time-integrated results (e.g., Bt_RO_max, Ws_ave, etc.)

  • forecast target dates

  • forecast ensemble members

Note

Example configurations are available in the config_cfs.py, config_nldas.py, and config_gldas20_noah.py files.

Data Workspace

By default, the workflow tool assumes that files derived by WSIM will be organized into a workspace folder. The workspace contains all files generated by WSIM for a particular model instance, including spin-up, forcing, results, and summary data. The workspace folder contains a Makefile at its root and the subdirectories listed below:

The forcing directory

The forcing directory contains model inputs in the netCDF format used by wsim_lsm. These may be of two types:

  • Forcing files for observed data are stored with filenames of the format forcing_YYYYMM.nc.

  • Forcing files for forecast data are stored with filenames of the format forcing_YYYYMM_trgtYYYYMM_fcstNAME.nc, where the two dates refer to (1) the month the forecast was generated and (2) the month to which the forecast applies, and NAME refers to the ensemble member name. For example, forcing data for a forecast ensemble member 2015052812 generated in May 2015 and predicting conditions in August 2015 would be named forcing_201505_trgt201508_fcst2015052812.nc.
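The two filename formats can be checked and taken apart with a regular expression. This is a minimal sketch based only on the naming convention described above; the parse_forcing_filename function and the dictionary keys it returns are hypothetical names, not part of WSIM.

```python
import re

# Observed: forcing_YYYYMM.nc
# Forecast: forcing_YYYYMM_trgtYYYYMM_fcstNAME.nc
FORCING_PATTERN = re.compile(
    r"^forcing_(?P<yearmon>\d{6})"
    r"(?:_trgt(?P<target>\d{6})_fcst(?P<member>[^.]+))?\.nc$"
)


def parse_forcing_filename(filename):
    """Split a forcing filename into its parts; target and member are None for observed data."""
    match = FORCING_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Not a forcing filename: {filename}")
    return match.groupdict()


print(parse_forcing_filename("forcing_201505_trgt201508_fcst2015052812.nc"))
print(parse_forcing_filename("forcing_201701.nc"))
```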

The state directory

The state directory contains files storing model states in the netCDF format used by wsim_lsm. Model state files follow the same naming convention as the forcing files, with forcing replaced by state in the filename.

The results and results_integrated directories

The results directory contains files storing model results in the netCDF format generated by wsim_lsm. Model result files follow a similar naming convention to the forcing files, with an extension to indicate time-integrated data. Result filenames for time-integrated data have the format results_Xmo_YYYYMM_trgtYYYYMM_fcstNAME.nc (the trgt and fcst sections are omitted for results generated from observed rather than forecast data).

Time-integrated results are stored in the results_integrated directory rather than the results directory, because these files have different variable names from the 1-month files (e.g., PETmE_sum instead of PETmE).

The rp and rp_integrated directories

Files in the rp directory contain model results expressed as a return period. File and variable naming conventions are equivalent to the results files, with results replaced by rp in the filename.

Time-integrated return periods are stored in the rp_integrated directory rather than the rp directory, because these files have different variable names from the 1-month files (e.g., PETmE_sum_rp instead of PETmE_rp).

The anom and anom_integrated directories

Files in the anom directory contain model results expressed as standardized anomalies. File and variable naming conventions are equivalent to the results files, with results replaced by anom in the filename.

Time-integrated anomalies are stored in the anom_integrated directory rather than the anom directory, because these files have different variable names from the 1-month files (e.g., PETmE_sum_sa instead of PETmE_sa).

The spinup directory

The spinup directory contains various files generated during the model spin-up process, including climate norms, forcing files of climate norms, model states generated by forcing with climate norms, etc. Spin-up files are described here.

The composite directories

Files in the composite directory contain composite indicators of overall surplus and deficit. File names have the format composite_Xmo_YYYYMM_trgtYYYYMM.nc, with the trgt section omitted for composites generated from observed rather than forecast data. Composite indicators are not generated for individual forecast ensemble members.
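The naming convention can be expressed as a small function, sketched below. The composite_filename function is a hypothetical helper, and the list of seven integration windows and the nine forecast target months are assumed values chosen only to illustrate how many files one model iteration can produce; the actual values come from the configuration.

```python
def composite_filename(window, yearmon, target=None):
    """Build a composite filename; target is omitted for observed data."""
    name = f"composite_{window}mo_{yearmon}"
    if target is not None:
        name += f"_trgt{target}"
    return name + ".nc"


# Assumed example values, not taken from any WSIM configuration.
windows = [1, 3, 6, 12, 24, 36, 60]
yearmon = "201701"
targets = ["201702", "201703", "201704", "201705", "201706",
           "201707", "201708", "201709", "201710"]

observed = [composite_filename(w, yearmon) for w in windows]
forecast = [composite_filename(w, yearmon, t) for w in windows for t in targets]

print(observed[0])
print(forecast[0])
print(len(observed) + len(forecast))
```

With these assumed values, one observed month and nine forecast months at seven windows give 70 composite files.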

Files in the composite_anom directory are the same as those in the composite directory, except that the composites are expressed as standardized anomalies instead of return periods.

Files in the composite_anom_rp directory contain the values from composite_anom expressed as a return period relative to historical values of composite_anom.

Files in the composite_adjusted directory contain adjusted composites based on the values in composite_anom_rp.

The _summary directories

Six directories contain files of model outputs summarized across the members of a forecast ensemble:

  • rp_summary

  • rp_integrated_summary

  • anom_summary

  • anom_integrated_summary

  • results_summary

  • results_integrated_summary

Files are named according to the same convention as the composite directories.