Preprocessing of CMIP models
This notebook demonstrates how to use pyremo
and the underlying pyintorg package to easily preprocess global model data from CMIP models that comply with CF conventions. The workflow is basically the same as before but is reimplemented in Python for better integration and for more flexibility concerning input and output data. The notebook shows how to use the
remap API, which understands CF conventions. The API takes and returns xarray datasets, which has several advantages. For example, it automatically vectorizes computations along the level and time axes, so we no longer have to care about different calendars. Additionally, we can conserve important metadata from the input model and keep it in the forcing data and, consequently, in the REMO
model output. Another critical advantage is lazy computation, which allows dask to easily parallelize the computation along the time axis and the levels.
Accessing CMIP input data
For this notebook, we take CMIP6 MPI-ESM1-2-HR
model data directly from the DKRZ filesystem. However, the input is quite flexible and could also come from any other source that xarray accepts as input.
[1]:
%load_ext autoreload
%autoreload 2
[2]:
# from dask_jobqueue import SLURMCluster
import dask
import xarray as xr
from dask.distributed import Client, progress
import pyremo as pr
dask.config.set(**{"array.slicing.split_large_chunks": False})
[2]:
<dask.config.set at 0x7ffff360ec70>
We will create a dask client here to make efficient use of the DKRZ node resources. In this example, we run the notebook on a shared
node at DKRZ.
[3]:
client = Client()
client
/work/ch0636/g300046/conda_envs/pyremo-dev/lib/python3.9/site-packages/distributed/node.py:177: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 41161 instead
warnings.warn(
[3]:
Client
Client-f74e17a5-f332-11ec-9691-080038c03b61
Connection method: Cluster object | Cluster type: distributed.LocalCluster
Dashboard: /user/g300046/levante-spawner-preset//proxy/41161/status

LocalCluster (8bad6cb5)
Status: running | Using processes: True | Workers: 7 | Total threads: 28 | Total memory: 50.00 GiB
Scheduler: tcp://127.0.0.1:39643
Each worker: 4 threads, 7.14 GiB memory
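The commented-out SLURMCluster import above hints at an alternative: instead of a LocalCluster on a shared node, the dask workers can also be submitted as SLURM jobs via dask-jobqueue. A minimal sketch (the queue name and resources are placeholders, not DKRZ-specific recommendations):

from dask_jobqueue import SLURMCluster

# each SLURM job hosts one multi-threaded worker; adjust to your site
cluster = SLURMCluster(queue="compute", cores=28, memory="50GB", walltime="01:00:00")
cluster.scale(jobs=2)  # submit two worker jobs
client = Client(cluster)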
Now, we collect all input files from the filesystem. For the example, we just take ten years. We combine the input files into a dataset dictionary that complies with CF conventions in the variable names. For the preprocessing of the (dynamic) variables, we require at least the following from the input model:

* ta: the atmospheric temperature
* ps: the surface pressure
* ua, va: the wind components
* orog: the global model's orography
* sftlf: the land-sea mask of the global model

The data should have at least 6-hourly resolution, although the preprocessing workflow itself is independent of the temporal resolution.
[4]:
from glob import glob
ta_file = "/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/ta/gn/v20190710/ta_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197901010600-198001010000.nc"
ps_file = "/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/ps/gn/v20190710/ps_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197501010600-198001010000.nc"
hus_file = "/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/hus/gn/v20190710/hus_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197901010600-198001010000.nc"
ua_file = "/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/ua/gn/v20190815/ua_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197901010600-198001010000.nc"
va_file = "/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/va/gn/v20190815/va_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197901010600-198001010000.nc"
orog_file = "/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/fx/orog/gn/v20190710/orog_fx_MPI-ESM1-2-HR_historical_r1i1p1f1_gn.nc"
sftlf_file = "/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/fx/sftlf/gn/v20190710/sftlf_fx_MPI-ESM1-2-HR_historical_r1i1p1f1_gn.nc"
# tos_file = "/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Oday/tos/gn/v20190710/tos_Oday_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_19750101-19791231.nc"
tos_files = glob(
"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Oday/tos/gn/v20190710/*"
)
tos_files.sort()
datasets = {
"ta": ta_file,
"ps": ps_file,
"hus": hus_file,
"ua": ua_file,
"va": va_file,
"orog": orog_file,
"sftlf": sftlf_file,
}
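As a quick optional check (not part of the original workflow), we can verify that all input files are actually accessible before opening them:

import os

# optional sanity check: all input files should exist on the filesystem
for name, path in datasets.items():
    assert os.path.isfile(path), f"missing input file for {name}: {path}"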
Now, we define a slightly optimized access function that works well on the DKRZ filesystem. Since the input dataset is quite large, we want to access it lazily.
[5]:
def open_mfdataset(
    files,
    use_cftime=True,
    parallel=True,
    data_vars="minimal",
    chunks={"time": 1},
    coords="minimal",
    compat="override",
    **kwargs
):
    """Optimized function for opening CMIP6 6hrLev 3D datasets.

    Based on https://github.com/pydata/xarray/issues/1385#issuecomment-561920115
    """

    def drop_all_coords(ds):
        # drop all non-dimension coordinates to avoid costly alignment checks
        return ds.reset_coords(drop=True)

    # open undecoded and decode only once at the end, which is much faster
    ds = xr.open_mfdataset(
        files,
        parallel=parallel,
        decode_times=False,
        combine="by_coords",
        preprocess=drop_all_coords,
        decode_cf=False,
        chunks=chunks,
        data_vars=data_vars,
        coords=coords,
        compat=compat,
        **kwargs
    )
    return xr.decode_cf(ds, use_cftime=use_cftime)
The dataset for the SST is optional. In this example, we take the SST from MPI-ESM1-2-HR as well. The SST in CMIP usually has daily resolution and probably comes on a different grid.
[6]:
# ref_ds = open_mfdataset(ta_file)
tos_ds = open_mfdataset(tos_files)
Creating the global dataset
With the input data at hand, we can now create a global dataset that contains all necessary input data for the dynamic preprocessing. We use the gfile function here, which automatically checks the input data and converts units if necessary. If an SST dataset is given, it also resamples and regrids the SST to the atmospheric grid and a 6-hourly temporal resolution. In this example, we
will only preprocess one month (1979-01).
[7]:
from pyremo import preproc
gfile = preproc.gfile(
datasets,
tos=tos_ds.tos,
time_range=slice("1979-01-01T06:00:00", "1979-02-01T00:00:00"),
)
using ap_bnds for akgm
using b_bnds for bkgm
converting sftlf units to fractional
converting tos units to K
converting orography to geopotential
The gfile
dataset is comparable to the gfiles that were explicitly created in the old preprocessing workflow. But here we don't have to store this dataset on disk; we simply create it directly from the CMIP input, lazily. However, we could also still read in a gfile dataset from older gfiles on the filesystem that were created in a different way. The preprocessing workflow does not depend on where the data comes from; it only depends on the gfile dataset following the right conventions
concerning units and variable names. In addition, the gfile function also adds vertical coordinates and static variables, e.g., orog and sftlf, to the dataset. Let's have a look:
[8]:
gfile
[8]:
<xarray.Dataset>
Dimensions:    (lat: 192, lev_input: 95, time: 124, lon: 384, lev_vertices: 96)
Coordinates:
  * lat        (lat) float64 -89.28 -88.36 -87.42 -86.49 ... 87.42 88.36 89.28
  * lev_input  (lev_input) float64 9.816e-06 2.31e-05 ... 0.9826 0.9961
  * time       (time) object 1979-01-01 06:00:00 ... 1979-02-01 00:00:00
  * lon        (lon) float64 0.0 0.9375 1.875 2.812 ... 356.2 357.2 358.1 359.1
Dimensions without coordinates: lev_vertices
Data variables:
    ta         (time, lev_input, lat, lon) float32 dask.array<chunksize=(1, 95, 192, 384), meta=np.ndarray>
    ps         (time, lat, lon) float32 dask.array<chunksize=(1, 192, 384), meta=np.ndarray>
    hus        (time, lev_input, lat, lon) float32 dask.array<chunksize=(1, 95, 192, 384), meta=np.ndarray>
    ua         (time, lev_input, lat, lon) float32 dask.array<chunksize=(1, 95, 192, 384), meta=np.ndarray>
    va         (time, lev_input, lat, lon) float32 dask.array<chunksize=(1, 95, 192, 384), meta=np.ndarray>
    orog       (lat, lon) float32 dask.array<chunksize=(192, 384), meta=np.ndarray>
    sftlf      (lat, lon) float32 dask.array<chunksize=(192, 384), meta=np.ndarray>
    akgm       (lev_vertices) float64 dask.array<chunksize=(96,), meta=np.ndarray>
    bkgm       (lev_vertices) float64 dask.array<chunksize=(96,), meta=np.ndarray>
    tos        (time, lat, lon) float32 dask.array<chunksize=(124, 192, 384), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  0.0
    contact:                cmip6-mpi-esm@dkrz.de
    ...                     ...
    title:                  MPI-ESM1-2-HR output prepared for CMIP6
    variable_id:            ta
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by MPI-M is licensed un...
    cmor_version:           3.5.0
    tracking_id:            hdl:21.14100/e7cf2db1-51d1-4692-a1ff-60e2d204b533
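As a small optional check of the unit conversions reported above, we can compute directly on the (lazy) dataset; for example, after the conversion to fractional units, the land-sea mask should not exceed one:

# sftlf was converted to fractional units, so its maximum should be <= 1
float(gfile.sftlf.max())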
Target grid information
Now, we need to collect some information about the target grid. Here, we choose the EUR-11
grid directly from pyremo:
[9]:
domain_info = pr.domain_info("EUR-11")
domain_info
[9]:
{'short_name': 'EUR-11',
'region': 4,
'long_name': 'Remo Europe high-res.',
'nlon': 433,
'nlat': 433,
'll_lon': -28.925,
'll_lat': -23.925,
'dlon': 0.11,
'dlat': 0.11,
'pollon': -162.0,
'pollat': 39.25,
'ur_lon': nan,
'ur_lat': nan}
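To illustrate what these numbers describe: the rotated-pole coordinate axes of the target grid follow from the lower-left corner and the grid spacing. A small sketch for illustration only (pyremo constructs these grids internally):

import numpy as np

# rotated-pole axes: start at the lower-left corner, step by the grid spacing
rlon = domain_info["ll_lon"] + domain_info["dlon"] * np.arange(domain_info["nlon"])
rlat = domain_info["ll_lat"] + domain_info["dlat"] * np.arange(domain_info["nlat"])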
The preprocessing also needs the orography (FIB) and land-sea mask (BLA) of the REMO target grid. We take those from the surface library, which can be accessed as a dataset:
[10]:
surflib = pr.data.surflib("EUR-11", crop=False)
surflib
[10]:
<xarray.Dataset>
Dimensions:       (rlon: 435, rlat: 435)
Coordinates:
  * rlon          (rlon) float64 -29.04 -28.93 -28.82 ... 18.48 18.59 18.71
  * rlat          (rlat) float64 -24.04 -23.93 -23.82 ... 23.48 23.59 23.71
Data variables: (12/14)
    rotated_pole  int32 1
    FIB           (rlat, rlon) float32 ...
    BLA           (rlat, rlon) float32 ...
    AZ0           (rlat, rlon) float32 ...
    ALB           (rlat, rlon) float32 ...
    VGRAT         (rlat, rlon) float32 ...
    ...            ...
    FOREST        (rlat, rlon) float32 ...
    FAO           (rlat, rlon) float32 ...
    WSMX          (rlat, rlon) float32 ...
    BETA          (rlat, rlon) float32 ...
    WMINLOK       (rlat, rlon) float32 ...
    WMAXLOK       (rlat, rlon) float32 ...
Attributes:
    CDI:          Climate Data Interface version 1.9.8 (https://mpimet.mpg.de...
    Conventions:  CF-1.6
    history:      Wed Jul 01 12:03:07 2020: cdo -f nc copy lib_EUR-11_frac li...
    CDO:          Climate Data Operators version 1.9.8 (https://mpimet.mpg.de...
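For a quick visual check (optional), the target orography can be plotted directly from the surface library dataset:

# quick look at the target orography on the rotated grid
surflib.FIB.plot()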
For the vertical interpolation, we have to provide the vertical coordinate table containing the hybrid sigma coefficients (ak and bk) for the target grid.
[11]:
vc = pr.vc.tables["vc_27lev"]
vc
[11]:
          ak        bk
0       0.000000  0.000000
1    5000.000000  0.000000
2   10000.000000  0.000000
3   13600.000000  0.000000
4   14736.355469  0.020319
5   15689.207031  0.036975
6   16266.609375  0.059488
7   16465.003906  0.087895
8   16297.621094  0.122004
9   15791.597656  0.161441
10  14985.269531  0.205703
11  13925.519531  0.254189
12  12665.292969  0.306235
13  11261.230469  0.361145
14   9771.406250  0.418202
15   8253.210938  0.476688
16   6761.339844  0.535887
17   5345.914062  0.595084
18   4050.717773  0.653565
19   2911.569336  0.710594
20   1954.805176  0.765405
21   1195.889893  0.817167
22    638.148926  0.864956
23    271.626465  0.907716
24     72.063583  0.944213
25      0.000000  0.972985
26      0.000000  0.992282
27      0.000000  1.000000
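To illustrate how these coefficients are used: the pressure at a level interface follows the hybrid sigma-pressure relation p = ak + bk * ps. A small sketch, assuming vc is a pandas DataFrame as displayed above and using a hypothetical reference surface pressure:

# pressure (Pa) at the 28 level interfaces for a reference surface pressure
ps_ref = 101325.0  # hypothetical standard surface pressure in Pa
p_interfaces = vc["ak"].values + vc["bk"].values * ps_ref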
Creating the forcing dataset
With the gfile
dataset and the target grid information, we can now create the forcing dataset. We use the remap function here, which assembles essentially the same workflow as the former Fortran source code: it does all horizontal and vertical interpolations as well as pressure corrections, including height corrections, etc. Please note that the efficiency of the
computation might depend on how we access the input data and, of course, on the length of the time axis in the gfile dataset. For a long time axis, the computation should be done lazily!
[12]:
ads = pr.preproc.remap(gfile, domain_info, vc, surflib)
[13]:
ads
[13]:
<xarray.Dataset>
Dimensions:                     (time: 124, rlon: 433, rlat: 433, lev: 27, lev_i: 28)
Coordinates:
  * time                        (time) object 1979-01-01 06:00:00 ... 1979-02...
  * rlon                        (rlon) float64 -28.93 -28.82 ... 18.48 18.59
  * rlat                        (rlat) float64 -23.93 -23.82 ... 23.48 23.59
  * lev                         (lev) int64 1 2 3 4 5 6 7 ... 22 23 24 25 26 27
  * lev_i                       (lev_i) int64 1 2 3 4 5 6 ... 23 24 25 26 27 28
    lon                         (rlat, rlon) float64 -10.32 -10.23 ... 68.65
    lat                         (rlat, rlon) float64 21.28 21.32 ... 67.86 67.8
Data variables: (12/16)
    T                           (time, lev, rlat, rlon) float32 dask.array<chunksize=(1, 27, 433, 433), meta=np.ndarray>
    U                           (time, lev, rlat, rlon) float32 dask.array<chunksize=(1, 27, 433, 433), meta=np.ndarray>
    V                           (time, lev, rlat, rlon) float32 dask.array<chunksize=(1, 27, 433, 433), meta=np.ndarray>
    PS                          (time, rlat, rlon) float32 dask.array<chunksize=(1, 433, 433), meta=np.ndarray>
    RF                          (time, lev, rlat, rlon) float32 dask.array<chunksize=(1, 27, 433, 433), meta=np.ndarray>
    TSW                         (time, rlat, rlon) float32 dask.array<chunksize=(124, 433, 433), meta=np.ndarray>
    ...                          ...
    TSI                         (time, rlat, rlon) float32 dask.array<chunksize=(124, 433, 433), meta=np.ndarray>
    hyai                        (lev_i) float64 0.0 5e+03 1e+04 ... 0.0 0.0 0.0
    hybi                        (lev_i) float64 0.0 0.0 0.0 ... 0.973 0.9923 1.0
    hyam                        (lev) float64 2.5e+03 7.5e+03 ... 0.0 0.0
    hybm                        (lev) float64 0.0 0.0 0.0 ... 0.9826 0.9961
    rotated_latitude_longitude  int32 0
Attributes: (12/47)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  0.0
    contact:                cmip6-mpi-esm@dkrz.de
    ...                     ...
    title:                  MPI-ESM1-2-HR output prepared for CMIP6
    variable_id:            ta
    variant_label:          r1i1p1f1
    license:                CMIP6 model data produced by MPI-M is licensed un...
    cmor_version:           3.5.0
    tracking_id:            hdl:21.14100/e7cf2db1-51d1-4692-a1ff-60e2d204b533
The forcing dataset seems to look fine! Please note that all global attributes from the input gfile dataset are also copied to the forcing dataset. To check, we will explicitly compute the first timestep.
[17]:
%time ads_ = ads.isel(time=0).compute()
# progress(ads_)
CPU times: user 1.38 s, sys: 262 ms, total: 1.64 s
Wall time: 16.3 s
[18]:
ads_.PS.plot()
[18]:
<matplotlib.collections.QuadMesh at 0x7ffecdce0280>
[19]:
ads_.TSW.plot()
[19]:
<matplotlib.collections.QuadMesh at 0x7ffe70289040>
Note that the whole ads
forcing dataset has not been computed explicitly yet, only lazily. If we triggered the whole computation, we might not be able to hold the result in memory. However, since we want a single file per timestep on disk anyway, we can start the computation only at the point when we write to disk. To organize the output, pyremo provides the to_netcdf function,
which creates one NetCDF file per timestep and uses a REMO-compliant naming convention. To start the computation and the file writing, you can simply call:
[20]:
%time output = preproc.to_netcdf(ads, path='/scratch/g/g300046/xa')
CPU times: user 3min 14s, sys: 20.1 s, total: 3min 34s
Wall time: 36min 21s
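As a final optional check, we can list a few of the files that to_netcdf wrote to the output path:

from glob import glob

# show the first few forcing files written to the output directory
sorted(glob("/scratch/g/g300046/xa/*"))[:5]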