Workflow and Data Management¶
A primary use of WSIM is to generate composite indicators of overall water surplus and deficit at a variety of time integration periods. Production of these indicators typically requires many different calls to command-line tools and generation of numerous intermediate files. To simplify the process of composite indicator production, WSIM provides a workflow tool that can be used to produce a directed acyclic graph (DAG) representation of intermediate files and the commands used to create them. The DAG can then be output to a Makefile, which can be executed to produce the desired outputs.
Note
Generated Makefiles are not intended to be human-readable, and may contain hundreds of thousands of lines describing targets and dependencies.
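As a rough illustration of the idea, the sketch below models each processing step as a node with a target, its dependencies, and the commands that create it, then renders the nodes as Make rules. The Step class, the to_makefile function, the command names, and the results_1mo_201701.nc filename are all hypothetical and are not taken from WSIM's own workflow code.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node of the graph: a file to produce, the files it needs, and how to make it."""
    target: str
    dependencies: list = field(default_factory=list)
    commands: list = field(default_factory=list)


def to_makefile(steps):
    """Render each step as a Make rule: 'target : dependencies' plus tab-indented commands."""
    rules = []
    for step in steps:
        header = (step.target + " : " + " ".join(step.dependencies)).rstrip()
        recipe = "".join("\t" + command + "\n" for command in step.commands)
        rules.append(header + "\n" + recipe)
    return "\n".join(rules)


# Hypothetical two-step pipeline; the command names are placeholders, not WSIM tools.
steps = [
    Step(target="forcing/forcing_201701.nc",
         commands=["prepare_forcing --output $@"]),
    Step(target="results/results_1mo_201701.nc",
         dependencies=["forcing/forcing_201701.nc"],
         commands=["run_model --forcing $< --output $@"]),
]

makefile_text = to_makefile(steps)
print(makefile_text)
```

Keeping the graph separate from the rendering function is what would allow a different output module to target another workflow tool.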
Notes about Make¶
Make provides several advantages over, e.g., a shell script:
It allows a user to build only the outputs that are desired. For example, a user may only want observed data, or may only want long-lead-time forecasts.
It allows the process to be stopped and restarted at any point, which is useful if the process is being run on non-dedicated hardware.
It provides an easy way to run portions of the process in parallel. Because Make understands the interdependencies of the various commands, it can figure out which steps are independent and can be run in parallel, without manual specification.
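The parallelism point can be sketched with Python's standard graphlib module: given only the dependencies, steps can be grouped into batches whose members do not depend on each other. The target names in this example are made up for illustration, and the sketch only mimics the kind of reasoning Make performs; it is not how Make is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each target maps to the set of targets it needs.
dependencies = {
    "forcing_201701": set(),
    "distribution_fits": set(),
    "results_201701": {"forcing_201701"},
    "return_periods_201701": {"results_201701", "distribution_fits"},
}


def parallel_batches(graph):
    """Group targets into batches; targets within one batch are mutually independent."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

With Make itself, parallel execution is requested through the -j option, which sets the number of jobs allowed to run at once.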
There are dozens of alternative (and more modern) workflow tools that provide additional functionality, such as provisioning of worker machines, graphical monitoring of progress, etc. WSIM’s representation of processing steps as a DAG is designed such that additional output modules could be written to execute the WSIM workflow with alternatives to Make, if desired.
Starting a New Model Instance¶
The Python script makemake.py can be used to generate a Makefile that contains all steps necessary to run the model for one or more iterations, including downloading process inputs and performing model spin-up procedures.
Note
The steps for multiple model iterations can be included in a single file by calling makemake.py with the --start and --stop arguments. By default, forecasting steps will only be included in the most recent model iteration. This can be adjusted, in turn, with the --forecasts argument. However, including many forecasts in a single Makefile can cause Make to start slowly.
An example usage of makemake.py is as follows:
python3 workflow/makemake.py \
--bindir /wsim \
--config workflow/config/config_cfs.py \
--source ~/wsim/source \
--workspace ~/wsim/workspaces/oct26 \
--start 201701
This will create the file ~/wsim/workspaces/oct26/Makefile with instructions for performing the following tasks:
downloading and preparing source data needed by WSIM to the source directory
performing a model spinup in the workspace directory
creating all outputs from the January 2017 model run
Subsequent model iterations (February 2017, March 2017, etc.) can be run by writing a new Makefile into the same workspace, overwriting the previous Makefile.
Once a Makefile has been generated, any WSIM output can be generated by calling Make from the workspace directory and providing the path of the output file as an argument, e.g.,
make ~/wsim/workspaces/oct26/composite/composite_1mo_201701.nc
Or, more succinctly:
make `pwd`/composite/composite_1mo_201701.nc
This will run all commands necessary to generate composite_1mo_201701.nc, including downloading historical data, spinning up the land surface model process, fitting statistical distributions on historical data, running the land surface model for the January 2017 timestep, and summarizing the January 2017 results in terms of composite indicators.
Make can also be called with the names of multiple targets:
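As a minimal sketch, a multi-target invocation could be assembled as shown below. The make_command helper and the /home/user path are hypothetical, and the 3-month target name is an assumption based on the composite_Xmo_YYYYMM.nc naming convention used in the workspace.

```python
import shlex
from pathlib import PurePosixPath


def make_command(workspace, targets):
    """Build a 'make' argument list with one absolute path per requested target."""
    root = PurePosixPath(workspace)
    return ["make"] + [str(root / target) for target in targets]


# The 3-month file name is assumed from the composite_Xmo_YYYYMM.nc convention.
command = make_command(
    "/home/user/wsim/workspaces/oct26",
    ["composite/composite_1mo_201701.nc", "composite/composite_3mo_201701.nc"],
)

# Print the command as it would be typed in a shell, from any directory.
print(shlex.join(command))
```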
However, even the number of composite indicator files alone is large: depending on the configuration, perhaps (1 month of observed indicators + 9 months of forecast indicators) × 7 time integration windows, or 70 files. To provide an easy way to “build everything”, the Makefile also includes the following targets:
all_composites
all composite indicator files (observed and forecast) for all time-integration windows
all_adjusted_composites
all adjusted composite indicator files (observed and forecast) for all time-integration windows
all_monthly_composites
all composite indicator files (observed and forecast) without time integration
all_adjusted_monthly_composites
all adjusted composite indicator files (observed and forecast) without time integration
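To give a sense of how many files a target such as all_composites stands for, the sketch below enumerates composite filenames for one model iteration. It reads the file count given earlier as (1 observed + 9 forecast) files for each of 7 windows. The specific window lengths, and the assumption that forecast targets are the 9 months following the iteration, are illustrative guesses rather than values taken from a WSIM configuration.

```python
def add_months(yearmon, n):
    """Shift a YYYYMM string forward by n months."""
    year, month = divmod(int(yearmon[:4]) * 12 + int(yearmon[4:]) - 1 + n, 12)
    return f"{year:04d}{month + 1:02d}"


def composite_files(yearmon, windows, forecast_months):
    """List observed and forecast composite filenames for every integration window."""
    names = []
    for window in windows:
        names.append(f"composite_{window}mo_{yearmon}.nc")  # observed
        for lead in range(1, forecast_months + 1):           # forecasts
            target = add_months(yearmon, lead)
            names.append(f"composite_{window}mo_{yearmon}_trgt{target}.nc")
    return names


# Assumed window lengths; the actual set comes from the configuration file.
windows = [1, 3, 6, 12, 24, 36, 60]
files = composite_files("201701", windows, forecast_months=9)
print(len(files))
```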
Note
In order to run a model iteration, outputs from the previous model iteration must either be present on disk or be covered by the model spinup period. In the example above, the January 2017 timestep is the first iteration after the 1948-2016 spinup period, so Makefile instructions are available for all needed iterations. If we wanted to perform the June 2017 iteration instead, we would need to run the January 2017 through May 2017 iterations first, or include instructions for all of them in the Makefile with the arguments --start 201701 --stop 201706.
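The set of iterations covered by a --start/--stop pair can be sketched as a simple month range. The iterations function below is a hypothetical helper, not part of makemake.py, and it assumes that both endpoints are included.

```python
def iterations(start, stop):
    """List YYYYMM strings from start to stop, inclusive of both endpoints."""
    first = int(start[:4]) * 12 + int(start[4:]) - 1
    last = int(stop[:4]) * 12 + int(stop[4:]) - 1
    return [f"{m // 12:04d}{m % 12 + 1:02d}" for m in range(first, last + 1)]


# Iterations needed to reach June 2017 after a spinup period ending in 2016.
needed = iterations("201701", "201706")
print(needed)
```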
Configuration needed by the Makefile generator is provided by a Python file with information such as:
raw data locations
availability of historical data
time period to use for fitting historical norms (e.g., 1950-2009)
time integration windows (3 months, 24 months, etc.)
statistics to compute for time-integrated results (e.g., Bt_RO_max, Ws_ave, etc.)
forecast target dates
forecast ensemble members
Note
Example configurations are available in the config_cfs.py, config_nldas.py, and config_gldas20_noah.py files.
Data Workspace¶
By default, the workflow tool assumes that files derived by WSIM will be organized into a workspace folder. The workspace contains all files generated by WSIM for a particular model instance, including spin-up, forcing, results, and summary data. The workspace folder contains a Makefile at its root and the subdirectories listed below:
The forcing directory¶
The forcing directory contains model inputs in the netCDF format used by wsim_lsm. These may be of two types:
Forcing files for observed data are stored with filenames of the format forcing_YYYYMM.nc.
Forcing files for forecast data are stored with filenames of the format forcing_YYYYMM_trgtYYYYMM_fcstNAME.nc, where the two dates refer to (1) the month the forecast was generated and (2) the month to which the forecast applies, and NAME refers to the ensemble member name. For example, forcing data for a forecast ensemble member 2015052812 generated in May 2015 and predicting conditions in August 2015 would be named forcing_201505_trgt201508_fcst2015052812.nc.
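The two naming patterns can be expressed as a small helper. The forcing_filename function is a hypothetical illustration of the convention, not a function provided by WSIM.

```python
def forcing_filename(yearmon, target=None, member=None):
    """Build a forcing filename for observed data, or for one forecast ensemble member."""
    if (target is None) != (member is None):
        raise ValueError("target and member must be given together")
    if target is None:
        return f"forcing_{yearmon}.nc"
    return f"forcing_{yearmon}_trgt{target}_fcst{member}.nc"


observed = forcing_filename("201505")
forecast = forcing_filename("201505", target="201508", member="2015052812")
print(observed)
print(forecast)
```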
The state directory¶
The state directory contains files storing model states in the netCDF format used by wsim_lsm. Model state files follow the same naming convention as the forcing files, with forcing replaced by state in the filename.
The results and results_integrated directories¶
The results directory contains files storing model results in the netCDF format generated by wsim_lsm. Model result files follow a similar naming convention to the forcing files, with an extension to indicate time-integrated data. Result filenames for time-integrated data have the format results_Xmo_YYYYMM_trgtYYYYMM_fcstNAME.nc. (The trgt and fcst sections are omitted for results generated from observed rather than forecast data.)
Time-integrated results are stored in the results_integrated directory rather than the results directory, because these files have different variable names from the 1-month files (e.g., PETmE_sum instead of PETmE).
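A filename in this convention can be split into its parts with a regular expression, as in the sketch below. The parse_name and results_directory helpers are hypothetical, and treating a missing window or a window of 1 as "not time-integrated" is an assumption made for the example.

```python
import re

# Optional pieces: the Xmo window, the trgt date, and the fcst member name.
PATTERN = re.compile(
    r"(?P<kind>[a-z_]+?)"
    r"(?:_(?P<window>\d+)mo)?"
    r"_(?P<yearmon>\d{6})"
    r"(?:_trgt(?P<target>\d{6}))?"
    r"(?:_fcst(?P<member>\w+))?"
    r"\.nc"
)


def parse_name(filename):
    """Split a workspace filename into kind, window, dates, and ensemble member."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename}")
    parts = match.groupdict()
    if parts["window"] is not None:
        parts["window"] = int(parts["window"])
    return parts


def results_directory(window):
    """Pick the directory, assuming only windows longer than 1 month are integrated."""
    return "results" if window in (None, 1) else "results_integrated"


parsed = parse_name("results_3mo_201505_trgt201508_fcst2015052812.nc")
print(parsed, results_directory(parsed["window"]))
```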
The rp and rp_integrated directories¶
Files in the rp directory contain model results expressed as a return period. File and variable naming conventions are equivalent to the results files, with results replaced by rp in the filename.
Time-integrated return periods are stored in the rp_integrated directory rather than the rp directory, because these files have different variable names from the 1-month files (e.g., PETmE_sum_rp instead of PETmE_rp).
The anom and anom_integrated directories¶
Files in the anom directory contain model results expressed as standardized anomalies. File and variable naming conventions are equivalent to the results files, with results replaced by anom in the filename.
Time-integrated standardized anomalies are stored in the anom_integrated directory rather than the anom directory, because these files have different variable names from the 1-month files (e.g., PETmE_sum_sa instead of PETmE_sa).
The spinup directory¶
The spinup directory contains various files generated during the model spin-up process, including climate norms, forcing files of climate norms, model states generated by forcing with climate norms, etc.
Spin-up files are described here.
The composite directories¶
Files in the composite directory contain composite indicators of overall surplus and deficit. File names have the format composite_Xmo_YYYYMM_trgtYYYYMM.nc, with the trgt section omitted for composites generated from observed rather than forecast data. Composite indicators are not generated for individual forecast ensemble members.
Files in the composite_anom directory are the same as those in the composite directory, except that the composites are expressed as standardized anomalies instead of return periods.
Files in the composite_anom_rp directory contain the values from composite_anom expressed as a return period relative to historical values of composite_anom.
Files in the composite_adjusted directory contain adjusted composites based on the values in composite_anom_rp.
The _summary directories¶
Six directories contain files of model outputs summarized across the members of a forecast ensemble:
rp_summary
rp_integrated_summary
anom_summary
anom_integrated_summary
results_summary
results_integrated_summary
Files are named according to the same convention as the composite directories.