{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing of CMIP models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook should demonstrate how to use `pyremo` and the underlying [pyintorg](https://git.gerics.de/python/pyintorg) package to preprocess global model data from CMIP models that comply to CF conventions. The workflow is basically the same as before but is reimplmented in python for better integration into the workflow and for more flexibility of in and ouput data. This notebook basically show how to use the [remap](https://pyremo.readthedocs.io/en/latest/generated/pyremo.preproc.remap.html#pyremo.preproc.remap) API that understands CF conventions. The API takes and returns xarray datasets which has several advantages. For example, it will automatically vectorize computations along levels and time axes so we don't have to care about different calendars any more. Additionally, we can conserve important meta data from the input model and keep it in the forcing data, and consequently, in the REMO model ouput. Another critical advantage is the lazy computation which will allow dask to easiy parallelize the computation along the time axis and levels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing CMIP input data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this notebook, we take CMIP6 `MPI-ESM1-2-HR` model data directly from the DKRZ filesystem. However, the input is quite flexible and could come also from any other source the xarray accepts as input. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import dask\n", "import xarray as xr\n", "from dask.distributed import Client\n", "\n", "import pyremo as pr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will create a dask client here to make efficient use of the DKRZ node resources. In this example, we run the noteook on a `shared` node at DKRZ." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-03-28 22:03:53,287 - distributed.scheduler - WARNING - Failed to format dashboard link, unknown value: 'JUPYTERHUB_SERVICE_PREFIX'\n" ] } ], "source": [ "client = Client(dashboard_address=\"localhost:8787\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from glob import glob\n", "\n", "ta_file = \"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/ta/gn/v20190710/ta_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197901010600-198001010000.nc\"\n", "ps_file = \"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/ps/gn/v20190710/ps_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197501010600-198001010000.nc\"\n", "hus_file = \"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/hus/gn/v20190710/hus_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197901010600-198001010000.nc\"\n", "ua_file = \"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/ua/gn/v20190815/ua_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197901010600-198001010000.nc\"\n", "va_file = \"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/6hrLev/va/gn/v20190815/va_6hrLev_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197901010600-198001010000.nc\"\n", "orog_file = \"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/fx/orog/gn/v20190710/orog_fx_MPI-ESM1-2-HR_historical_r1i1p1f1_gn.nc\"\n", "sftlf_file = \"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/fx/sftlf/gn/v20190710/sftlf_fx_MPI-ESM1-2-HR_historical_r1i1p1f1_gn.nc\"\n", "# tos_file = \"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Oday/tos/gn/v20190710/tos_Oday_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_19750101-19791231.nc\"\n", "tos_files = glob(\n", " \"/pool/data/CMIP6/data/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Oday/tos/gn/v20190710/*\"\n", ")\n", "tos_files.sort()\n", "\n", "datasets = {\n", " \"ta\": ta_file,\n", " \"ps\": ps_file,\n", " \"hus\": hus_file,\n", " \"ua\": ua_file,\n", " \"va\": va_file,\n", " \"orog\": orog_file,\n", " \"sftlf\": sftlf_file,\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we define a slightly optimized access function that works well at the DKRZ filesystem. Since we have a quite large input dataset, we want to access it lazily." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def open_mfdataset(\n", " files,\n", " use_cftime=True,\n", " parallel=True,\n", " data_vars=\"minimal\",\n", " chunks={\"time\": 1},\n", " coords=\"minimal\",\n", " compat=\"override\",\n", " drop=None,\n", " **kwargs,\n", "):\n", " \"\"\"optimized function for opening CMIP6 6hrLev 3d datasets\n", "\n", " based on https://github.com/pydata/xarray/issues/1385#issuecomment-561920115\n", "\n", " \"\"\"\n", "\n", " def drop_all_coords(ds):\n", " # ds = ds.drop(drop)\n", " return ds.reset_coords(drop=True)\n", "\n", " ds = xr.open_mfdataset(\n", " files,\n", " parallel=parallel,\n", " decode_times=False,\n", " combine=\"by_coords\",\n", " preprocess=drop_all_coords,\n", " decode_cf=False,\n", " chunks=chunks,\n", " data_vars=data_vars,\n", " coords=\"minimal\",\n", " compat=\"override\",\n", " use_cftime=use_cftime,\n", " **kwargs,\n", " )\n", " return xr.decode_cf(ds, use_cftime=use_cftime)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset for the SST is optional. In this example, we use the SST also from `MPI-ESM1-2-HR`. The SST for CMIP usually has a daily resolution and also probably a different grid." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# ref_ds = open_mfdataset(ta_file)\n", "tos_ds = open_mfdataset(tos_files)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the global dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the input data at hand, we can now create a global dataset that contains all neccessary input data for the dynamic preprocessing. We use the [gfile](https://pyremo.readthedocs.io/en/latest/generated/pyremo.preproc.gfile.html#pyremo.preproc.gfile) function here that will automatically check the input data and convert units if neccessary. If the SST dataset is given, it will also resample and regrid the SST to the atmospheric grid and a 6 hourly temporal resolution. In this example, we will only preprocess one month (`1979-01`)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "inverting vertical axis\n", "inverting vertical axis\n", "inverting vertical axis\n", "inverting vertical axis\n", "using ap_bnds for akgm\n", "using b_bnds for bkgm\n", "inverting vertical coordinates\n", "adding sst...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/work/ch0636/g300046/conda_envs/pyremo/lib/python3.10/site-packages/esmpy/interface/loadESMF.py:94: VersionWarning: ESMF installation version 8.8.0, ESMPy version 8.8.0b0\n", " warnings.warn(\"ESMF installation version {}, ESMPy version {}\".format(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "converting sftlf units to fractional\n", "converting tos units to K\n", "converting orography to geopotential\n" ] } ], "source": [ "from pyremo import preproc\n", "\n", "gfile = preproc.get_gcm_dataset(\n", " datasets,\n", " tos=tos_ds.tos,\n", " time_range=slice(\n", " \"1979-01-01T06:00:00\", \"1979-02-01T00:00:00\"\n", " ), # choose a common time range from all files\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `gfile` dataset is comprable to the gfiles that were explictly created in the old preprocessing workflow. But here we don't have to store this dataset on disk but simply create it directly from the CMIP input lazily. However, we could also still read in the gfile dataset from older gfiles on the filesystem that were created in a different way. The preprocessing workflow does not depend on where the data comes from, it only depends on the gfile dataset having the right conventions concerning units and variable names. In addition, the gfile function also adds vertical coordinates and static variables, e.g., `orog` and `sftlf` to the dataset. Let's have a look:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset> Size: 14GB\n",
"Dimensions: (lev: 95, lat: 192, lon: 384, time: 124, lev_vertices: 96)\n",
"Coordinates:\n",
" * lev (lev) float64 760B 9.816e-06 2.31e-05 3.079e-05 ... 0.9826 0.9961\n",
" * lat (lat) float64 2kB -89.28 -88.36 -87.42 -86.49 ... 87.42 88.36 89.28\n",
" * lon (lon) float64 3kB 0.0 0.9375 1.875 2.812 ... 357.2 358.1 359.1\n",
" * time (time) object 992B 1979-01-01 06:00:00 ... 1979-02-01 00:00:00\n",
"Dimensions without coordinates: lev_vertices\n",
"Data variables:\n",
" ta (time, lev, lat, lon) float32 3GB dask.array<chunksize=(1, 47, 96, 192), meta=np.ndarray>\n",
" ps (time, lat, lon) float32 37MB dask.array<chunksize=(1, 192, 384), meta=np.ndarray>\n",
" hus (time, lev, lat, lon) float32 3GB dask.array<chunksize=(1, 47, 96, 192), meta=np.ndarray>\n",
" ua (time, lev, lat, lon) float32 3GB dask.array<chunksize=(1, 47, 96, 192), meta=np.ndarray>\n",
" va (time, lev, lat, lon) float32 3GB dask.array<chunksize=(1, 47, 96, 192), meta=np.ndarray>\n",
" orog (lat, lon) float32 295kB dask.array<chunksize=(192, 384), meta=np.ndarray>\n",
" sftlf (lat, lon) float32 295kB 1.0 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0\n",
" akgm (lev_vertices) float64 768B dask.array<chunksize=(96,), meta=np.ndarray>\n",
" bkgm (lev_vertices) float64 768B dask.array<chunksize=(96,), meta=np.ndarray>\n",
" mask (lat, lon) bool 74kB dask.array<chunksize=(192, 384), meta=np.ndarray>\n",
" tos (time, lat, lon) float32 37MB dask.array<chunksize=(124, 192, 384), meta=np.ndarray>\n",
"Attributes: (12/47)\n",
" Conventions: CF-1.7 CMIP-6.2\n",
" activity_id: CMIP\n",
" branch_method: standard\n",
" branch_time_in_child: 0.0\n",
" branch_time_in_parent: 0.0\n",
" contact: cmip6-mpi-esm@dkrz.de\n",
" ... ...\n",
" title: MPI-ESM1-2-HR output prepared for CMIP6\n",
" variable_id: ta\n",
" variant_label: r1i1p1f1\n",
" license: CMIP6 model data produced by MPI-M is licensed un...\n",
" cmor_version: 3.5.0\n",
" tracking_id: hdl:21.14100/e7cf2db1-51d1-4692-a1ff-60e2d204b533<xarray.Dataset>\n",
"Dimensions: (rlon: 435, rlat: 435)\n",
"Coordinates:\n",
" * rlon (rlon) float64 -29.04 -28.93 -28.82 ... 18.48 18.59 18.71\n",
" * rlat (rlat) float64 -24.04 -23.93 -23.82 ... 23.48 23.59 23.71\n",
"Data variables: (12/14)\n",
" rotated_pole int32 1\n",
" FIB (rlat, rlon) float32 295.7 286.6 278.2 ... 16.8 31.95 44.6\n",
" BLA (rlat, rlon) float32 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 0.9855\n",
" AZ0 (rlat, rlon) float32 0.005009 0.005004 ... 0.03324 0.03231\n",
" ALB (rlat, rlon) float32 0.4709 0.475 0.4726 ... 0.1388 0.1438\n",
" VGRAT (rlat, rlon) float32 0.0 0.0 0.0 0.0 ... 0.3305 0.3181 0.3064\n",
" ... ...\n",
" FOREST (rlat, rlon) float32 0.0 0.0 0.0 0.0 ... 0.02978 0.0625 0.045\n",
" FAO (rlat, rlon) float32 2.0 2.0 2.0 2.0 2.0 ... 6.0 4.0 4.0 4.0\n",
" WSMX (rlat, rlon) float32 0.1 0.1 0.1 0.1 ... 0.252 0.2313 0.2235\n",
" BETA (rlat, rlon) float32 0.01 0.01 0.01 ... 0.7407 0.4051 0.6344\n",
" WMINLOK (rlat, rlon) float32 0.1 0.1 0.1 0.1 ... 0.1613 0.1641 0.1613\n",
" WMAXLOK (rlat, rlon) float32 0.1 0.1 0.1 0.1 ... 0.5375 0.2623 0.2604\n",
"Attributes:\n",
" CDI: Climate Data Interface version 1.9.8 (https://mpimet.mpg.de...\n",
" Conventions: CF-1.6\n",
" history: Wed Jul 01 12:03:07 2020: cdo -f nc copy lib_EUR-11_frac li...\n",
" CDO: Climate Data Operators version 1.9.8 (https://mpimet.mpg.de...| \n", " | ak | \n", "bk | \n", "
|---|---|---|
| 0 | \n", "0.000000 | \n", "0.000000 | \n", "
| 1 | \n", "5000.000000 | \n", "0.000000 | \n", "
| 2 | \n", "10000.000000 | \n", "0.000000 | \n", "
| 3 | \n", "13600.000000 | \n", "0.000000 | \n", "
| 4 | \n", "14736.355469 | \n", "0.020319 | \n", "
| 5 | \n", "15689.207031 | \n", "0.036975 | \n", "
| 6 | \n", "16266.609375 | \n", "0.059488 | \n", "
| 7 | \n", "16465.003906 | \n", "0.087895 | \n", "
| 8 | \n", "16297.621094 | \n", "0.122004 | \n", "
| 9 | \n", "15791.597656 | \n", "0.161441 | \n", "
| 10 | \n", "14985.269531 | \n", "0.205703 | \n", "
| 11 | \n", "13925.519531 | \n", "0.254189 | \n", "
| 12 | \n", "12665.292969 | \n", "0.306235 | \n", "
| 13 | \n", "11261.230469 | \n", "0.361145 | \n", "
| 14 | \n", "9771.406250 | \n", "0.418202 | \n", "
| 15 | \n", "8253.210938 | \n", "0.476688 | \n", "
| 16 | \n", "6761.339844 | \n", "0.535887 | \n", "
| 17 | \n", "5345.914062 | \n", "0.595084 | \n", "
| 18 | \n", "4050.717773 | \n", "0.653565 | \n", "
| 19 | \n", "2911.569336 | \n", "0.710594 | \n", "
| 20 | \n", "1954.805176 | \n", "0.765405 | \n", "
| 21 | \n", "1195.889893 | \n", "0.817167 | \n", "
| 22 | \n", "638.148926 | \n", "0.864956 | \n", "
| 23 | \n", "271.626465 | \n", "0.907716 | \n", "
| 24 | \n", "72.063583 | \n", "0.944213 | \n", "
| 25 | \n", "0.000000 | \n", "0.972985 | \n", "
| 26 | \n", "0.000000 | \n", "0.992282 | \n", "
| 27 | \n", "0.000000 | \n", "1.000000 | \n", "
<xarray.Dataset>\n",
"Dimensions: (time: 124, rlon: 433, rlat: 433, lev: 27,\n",
" lev_i: 28)\n",
"Coordinates:\n",
" * time (time) object 1979-01-01 06:00:00 ... 1979-02...\n",
" * rlon (rlon) float64 -28.93 -28.82 ... 18.48 18.59\n",
" * rlat (rlat) float64 -23.93 -23.82 ... 23.48 23.59\n",
" * lev (lev) int64 1 2 3 4 5 6 7 ... 22 23 24 25 26 27\n",
" * lev_i (lev_i) int64 1 2 3 4 5 6 ... 23 24 25 26 27 28\n",
" lon (rlat, rlon) float64 -10.32 -10.23 ... 68.65\n",
" lat (rlat, rlon) float64 21.28 21.32 ... 67.86 67.8\n",
"Data variables: (12/16)\n",
" T (time, lev, rlat, rlon) float32 dask.array<chunksize=(1, 27, 433, 433), meta=np.ndarray>\n",
" U (time, lev, rlat, rlon) float32 dask.array<chunksize=(1, 27, 433, 433), meta=np.ndarray>\n",
" V (time, lev, rlat, rlon) float32 dask.array<chunksize=(1, 27, 433, 433), meta=np.ndarray>\n",
" PS (time, rlat, rlon) float32 dask.array<chunksize=(1, 433, 433), meta=np.ndarray>\n",
" RF (time, lev, rlat, rlon) float32 dask.array<chunksize=(1, 27, 433, 433), meta=np.ndarray>\n",
" TSW (time, rlat, rlon) float32 dask.array<chunksize=(124, 433, 433), meta=np.ndarray>\n",
" ... ...\n",
" TSI (time, rlat, rlon) float32 dask.array<chunksize=(124, 433, 433), meta=np.ndarray>\n",
" hyai (lev_i) float64 0.0 5e+03 1e+04 ... 0.0 0.0 0.0\n",
" hybi (lev_i) float64 0.0 0.0 0.0 ... 0.973 0.9923 1.0\n",
" hyam (lev) float64 2.5e+03 7.5e+03 ... 0.0 0.0\n",
" hybm (lev) float64 0.0 0.0 0.0 ... 0.9826 0.9961\n",
" rotated_latitude_longitude int32 0\n",
"Attributes: (12/48)\n",
" Conventions: CF-1.7 CMIP-6.2\n",
" activity_id: CMIP\n",
" branch_method: standard\n",
" branch_time_in_child: 0.0\n",
" branch_time_in_parent: 0.0\n",
" contact: cmip6-mpi-esm@dkrz.de\n",
" ... ...\n",
" variable_id: ta\n",
" variant_label: r1i1p1f1\n",
" license: CMIP6 model data produced by MPI-M is licensed un...\n",
" cmor_version: 3.5.0\n",
" tracking_id: hdl:21.14100/e7cf2db1-51d1-4692-a1ff-60e2d204b533\n",
" CORDEX_domain: EUR-11