# netCDF Datacube Pipeline

The `netcdf_datacube` pipeline extracts per-pixel VI time series as CF-1.8 compliant
netCDF datacubes clipped to polygon regions. Unlike the phenology pipeline, which
spatially aggregates pixels into a single ROI-mean time series per region, this pipeline
**preserves the full spatial dimension** — every pixel within the polygon boundary is
retained for downstream scientific analysis.

---

## When to Use This Pipeline

Use the datacube pipeline when you need:
- Per-pixel VI values for input into spatial models (e.g., biodiversity, habitat, carbon models)
- CF-1.8 compliant datacubes for interoperability with other geospatial tools
- A clipped spatial subset of the source HLS VI tiles in a standardized format

Use the [phenology pipeline](cli_reference.md) when you need aggregated time series,
smoothed curves, phenological metrics, or plots.

---

## Selecting the Pipeline

Set `PIPELINE` in `config.local.env`:

```bash
PIPELINE="netcdf_datacube"   # CF-1.8 datacubes (this pipeline)
PIPELINE="phenology"         # ROI-mean time series, metrics, plots
```

All input variables (`NETCDF_DIR`, `VI`, `SHAPEFILE`, `SHAPEFILE_FIELD`,
`VALID_RANGE_*`, `WORKERS`, `START_DATE`, `END_DATE`) are shared between both pipelines
and are set in `config.local.env` (see [Overview — Setup](overview.md#setup)).

---

## Processing Model

Extraction runs in three stages:

```
Pre-filter (main process, sequential)
  For each tile in --netcdf-dir:
    Read x/y coordinate min/max only (4 scalar values, no data decompression)
    Reproject ROI bounding box to tile CRS → bounding-box intersection test
    Exclude non-overlapping tiles before any worker is spawned

Phase 1 — Parallel tile extraction
  For each bbox-passing tile:
    Open NetCDF → clip to polygon boundary → apply valid-range mask → write temp file

Phase 2 — Merge and write (main process)
  Detect CRS for each temp file
  Group tiles by CRS
  Apply merge strategy → write final datacube(s)

Cleanup — temp files deleted (always, even on error)
```

The pre-filter is a conservative bounding-box test — tiles that intersect the bbox but
not the exact polygon geometry are still passed to workers, which handle them via the
normal `NoDataInBounds` path. For typical LVIS flight boxes that overlap only 1–2 of
the 15 tiles in the NetCDF directory, the pre-filter reduces worker dispatch from 15
tiles to 1–3, significantly reducing Phase 1 I/O contention on the source drive.

Temp files are stored in `{output_dir}/{region_label}/_tmp/` and are always removed
after Phase 2 completes, whether it succeeded or failed.

**Temp file compression:** Temp files are written **without zlib compression**. They are
written once by Phase 1 workers, read once by Phase 2, then deleted — compression would
add substantial CPU overhead (especially with `dask scheduler='synchronous'` on an
external drive) for no lasting benefit. Final output datacubes use `zlib complevel=4`.

**Temp disk space:** Temp files are uncompressed float32, so peak additional disk usage
during a region's processing is approximately:
`n_overlapping_tiles × clipped_y × clipped_x × n_time_steps × 4 bytes`
For typical LVIS flight boxes (1–3 overlapping tiles, clips much smaller than full tiles)
this is usually 2–15 GB. Ensure `--output-dir` has sufficient free space.

---

## Multi-Tile and Multi-CRS Handling

When a polygon spans multiple HLS MGRS tiles, the pipeline automatically detects the
number of CRS zones involved and selects the appropriate merge strategy.

### Same-CRS Tiles (polygon within one UTM zone)

Adjacent HLS MGRS tiles in the same UTM zone share an **identical 30-m pixel grid**.
No resampling is needed — tiles are mosaiced via a memory-bounded direct write loop
(one time step at a time, bounded to ~200 MB peak regardless of datacube size):

- **Time union**: the output datacube covers all acquisition dates from all tiles;
  pixels from tiles that have no data on a given date are NaN
- **No resampling** — pixel values are unmodified
- The ~163-pixel MGRS overlap zone is filled by the last tile written (scientifically
  equivalent to first-wins for co-acquired HLS pixels in the overlap zone)

### Cross-CRS Tiles (polygon spans a UTM zone boundary)

When a polygon crosses a UTM zone boundary, tiles are in different coordinate systems.
The dominant CRS (the CRS group with the most total pixels within the polygon) is
selected as the output CRS. Minority tiles are reprojected to the dominant CRS using
**bilinear resampling**, then mosaiced via the same memory-bounded write loop.

Bilinear reprojection between adjacent UTM zones introduces sub-pixel mixing comparable
to the sensor point spread function — scientifically acceptable for VI analysis at 30 m.

The output file's global attributes document the merge:

```
target_crs:        EPSG:32634
resampling_method: bilinear
```

### Merge Options

| `config.local.env` variable | CLI flag | Default | Effect |
|---|---|---|---|
| `MERGE_SAME_CRS=true` | *(default)* | on | Merge same-CRS tiles into one datacube per region |
| `MERGE_SAME_CRS=false` | `--no-merge-same-crs` | — | One file per tile, native CRS |
| `MERGE_CROSS_CRS=true` | *(default)* | on | Reproject + merge cross-CRS tiles |
| `MERGE_CROSS_CRS=false` | `--no-merge-cross-crs` | — | One file per tile, native CRS, no reprojection |

Setting both to `false` always produces one netCDF per tile per region.

---

## Output

### File Naming

| Condition | Output filename | Location |
|---|---|---|
| Single tile, or merged output | `{VI}_{region_label}_datacube.nc` | `{output_dir}/{shapefile_stem}/{region_label}/` |
| Per-tile output (merge disabled) | `{VI}_{region_label}_{tile_id}_datacube.nc` | `{output_dir}/{shapefile_stem}/{region_label}/` |
| No shapefile (full extent) | `{VI}_full_extent_datacube.nc` | `{output_dir}/full_extent/` |
| Log file | `netcdf_datacube_{YYYYMMDD_HHMMSS}.log` | `{output_dir}/` |

### File Structure Example

```
outputs/                                              ← --output-dir
├── netcdf_datacube_20260312_153100.log               ← log file at output-dir root
└── Parks_and_OpenSpace/                              ← shapefile stem
    ├── Mesa_Verde/                                   ← one subfolder per region
    │   ├── NDVI_Mesa_Verde_datacube.nc               ← merged (2 same-CRS tiles)
    │   └── EVI2_Mesa_Verde_datacube.nc
    └── Ridgway_State_Park/
        ├── NDVI_Ridgway_State_Park_datacube.nc
        └── EVI2_Ridgway_State_Park_datacube.nc
```

When `--no-merge-same-crs` is set:

```
outputs/
└── Parks_and_OpenSpace/
    └── Mesa_Verde/
        ├── NDVI_Mesa_Verde_T13SDA_datacube.nc
        ├── NDVI_Mesa_Verde_T13SDB_datacube.nc
        └── EVI2_Mesa_Verde_T13SDA_datacube.nc
```

### Output NetCDF Structure

Each output file is a CF-1.8 compliant netCDF4 with:

**Dimensions:**
- `time` — acquisition dates (datetime64, decoded from "days since 1970-01-01")
- `y` — northing in meters (UTM)
- `x` — easting in meters (UTM)

**Variables:**
- `{VI}` — float32, dimensions `(time, y, x)`, NaN where no valid data; zlib-compressed (level 4)
- `spatial_ref` — scalar, CRS container variable (CF-1.8 grid mapping convention)

**Global attributes:**

| Attribute | Always present | Description |
|---|---|---|
| `Conventions` | yes | `'CF-1.8'` |
| `history` | yes | Creation timestamp, region name, valid_range |
| `tiles` | yes | Source tile IDs (comma-separated) |
| `region` | yes | Region label |
| `vi` | yes | VI name (e.g. `'NDVI'`) |
| `target_crs` | cross-CRS merge only | Target CRS (e.g. `'EPSG:32634'`) |
| `resampling_method` | cross-CRS merge only | `'bilinear'` |

**Variable attributes:**
- `long_name`: e.g. `'NDVI vegetation index'`
- `valid_min` / `valid_max`: applied valid range values
- `grid_mapping`: `'spatial_ref'`

---

## CLI Reference

```
python src/netcdf_datacube_extract.py --help
```

| Argument | Default | Description |
|---|---|---|
| `--netcdf-dir PATH` | *(required)* | Directory containing `T{TILE}_{VI}.nc` files |
| `--vi VI [VI ...]` | `NDVI` | Vegetation indices: `NDVI` `EVI2` `NIRv` |
| `--shapefile PATH [PATH ...]` | — | Polygon shapefile(s). Omit for full extent. |
| `--shapefile-field FIELD [FIELD ...]` | — | Attribute field(s) to split by. Count must match `--shapefile`. |
| `--valid-range-ndvi MIN,MAX` | `-1,1` | Valid range for NDVI |
| `--valid-range-evi2 MIN,MAX` | `-1,2` | Valid range for EVI2 |
| `--valid-range-nirv MIN,MAX` | `-0.5,1` | Valid range for NIRv |
| `--output-dir PATH` | *(required)* | Root output directory |
| `--workers N` | `8` | Parallel worker processes for Phase 1 tile extraction |
| `--start-date YYYY-MM-DD` | — | Include only observations on or after this date |
| `--end-date YYYY-MM-DD` | — | Include only observations on or before this date |
| `--no-merge-same-crs` | off | Disable same-CRS tile merging; write per-tile files |
| `--no-merge-cross-crs` | off | Disable cross-CRS reprojection + merge; write per-tile files |
| `--log-level LEVEL` | `INFO` | `DEBUG` `INFO` `WARNING` `ERROR` |

---

## Notes on CRS Detection

CRS comparison uses `pyproj.CRS.from_wkt(wkt).to_epsg(min_confidence=20)` rather than
raw WKT string comparison. EPSG integer comparison is more robust — two WKT strings
that represent the same CRS may differ in formatting or authority prefix.

**HLS 2.0 CRS quirks:** Two separate issues affect HLS v2.0 source data.

*Non-standard datum name:* HLS v2.0 GeoTIFFs (particularly Sentinel-2 S30 granules)
embed `"Not specified (based on WGS 84 spheroid)"` rather than the official `"World
Geodetic System 1984"` as the datum name. pyproj's default `to_epsg()` (min_confidence
70) rejects these WKTs even though the CRS is functionally identical to EPSG:326xx or
EPSG:327xx. Using `min_confidence=20` reliably resolves them to the correct EPSG integer
so that same-UTM-zone tiles are always grouped together. When EPSG resolution fails even
at `min_confidence=20`, the fallback is `crs_obj.name` (e.g., `"UTM Zone 34, Southern
Hemisphere"`) rather than the raw WKT string — this prevents GDAL version differences
from producing non-equal string keys and causing false cross-CRS grouping.

*Southern hemisphere UTM convention:* HLS v2.0 GeoTIFFs for tiles south of the equator
store coordinates using a UTM North zone (EPSG:326xx, false_northing=0) with negative
northings, instead of the standard UTM South convention (EPSG:327xx,
false_northing=10,000,000). This is corrected in `03_hls_netcdf_build.py` (HLS_VI_Pipeline
step 03): when a northern UTM EPSG code is detected and the pixel y-centroid is negative,
the CRS is replaced with the UTM South equivalent (EPSG + 100) and y-coordinates are
shifted by +10,000,000 m before the NetCDF is written. NetCDF files produced by a
corrected pipeline run carry EPSG:327xx CRS with positive northings (6–9 million m range).
If you are using older NetCDF files that still carry the UTM North / negative-northing
convention, all VI_Phenology spatial operations remain internally consistent (tile and ROI
are projected in the same coordinate space), but output datacubes will carry the UTM North
hemisphere label; rebuilding the source NetCDFs with step 03 fully resolves this.

---

## Next Steps

The datacubes produced by this pipeline can be used as input for two other pipelines:

**[Pixel Phenology Pipeline](pixel_phenology.md)** — compute 19 per-pixel phenological
metric maps from these datacubes. Point `PIXEL_INPUT_DATACUBES` at the output directory
(e.g. `${OUTPUT_DIR}/LVIS_flightboxes_final`) to pick up all regions automatically.

**[Phenology Pipeline — Datacube Input Mode](cli_reference.md#input)** — run the phenology
pipeline without re-running tile discovery or clipping. Set `PHENOLOGY_INPUT_DATACUBES` in
`config.local.env` to the same directory. Faster for iterating on smoothing settings,
thresholds, or plot styles.

---

## Notes on Same-Day Multi-Sensor Acquisitions

HLS combines Landsat 8, Landsat 9, Sentinel-2A, and Sentinel-2B. When two or more of
these sensors observe the same MGRS tile on the same calendar day, the source NetCDF
contains duplicate time steps with identical timestamps (the HLS time axis uses integer
day precision — there is no sub-day information to distinguish them).

The datacube pipeline resolves this automatically in Phase 1, before writing the temp
file for each tile. For each group of same-day duplicates, the acquisitions are merged
**pixel by pixel**: the first non-NaN value at each `(y, x)` location wins. This
preserves valid data from all same-day acquisitions across the spatial footprint — no
valid pixel is discarded and no values are averaged.

As a result, the output datacube's time axis may have **fewer time steps** than the
source NetCDF tile — one step per unique calendar date rather than one step per
acquisition. The `tiles` global attribute records the source tile IDs, but the number
of acquisitions per sensor is not separately tracked.