netCDF Datacube Pipeline
The netcdf_datacube pipeline extracts per-pixel VI time series as CF-1.8 compliant
netCDF datacubes clipped to polygon regions. Unlike the phenology pipeline, which
spatially aggregates pixels into a single ROI-mean time series per region, this pipeline
preserves the full spatial dimension — every pixel within the polygon boundary is
retained for downstream scientific analysis.
When to Use This Pipeline
Use the datacube pipeline when you need:
Per-pixel VI values for input into spatial models (e.g., biodiversity, habitat, carbon models)
CF-1.8 compliant datacubes for interoperability with other geospatial tools
A clipped spatial subset of the source HLS VI tiles in a standardized format
Use the phenology pipeline when you need aggregated time series, smoothed curves, phenological metrics, or plots.
Selecting the Pipeline
Set PIPELINE in config.local.env:
PIPELINE="netcdf_datacube" # CF-1.8 datacubes (this pipeline)
PIPELINE="phenology" # ROI-mean time series, metrics, plots
All input variables (NETCDF_DIR, VI, SHAPEFILE, SHAPEFILE_FIELD,
VALID_RANGE_*, WORKERS, START_DATE, END_DATE) are shared between both pipelines
and are set in config.local.env (see Overview — Setup).
Processing Model
Extraction runs in three stages:
Pre-filter (main process, sequential)
For each tile in --netcdf-dir:
Read x/y coordinate min/max only (4 scalar values, no data decompression)
Reproject ROI bounding box to tile CRS → bounding-box intersection test
Exclude non-overlapping tiles before any worker is spawned
Phase 1 — Parallel tile extraction
For each bbox-passing tile:
Open NetCDF → clip to polygon boundary → apply valid-range mask → write temp file
Phase 2 — Merge and write (main process)
Detect CRS for each temp file
Group tiles by CRS
Apply merge strategy → write final datacube(s)
Cleanup — temp files deleted (always, even on error)
The pre-filter is a conservative bounding-box test — tiles that intersect the bbox but
not the exact polygon geometry are still passed to workers, which handle them via the
normal NoDataInBounds path. For typical LVIS flight boxes that overlap only 1–2 of
the 15 tiles in the NetCDF directory, the pre-filter reduces worker dispatch from 15
tiles to 1–3, significantly reducing Phase 1 I/O contention on the source drive.
Temp files are stored in {output_dir}/{region_label}/_tmp/ and are always removed
after Phase 2 completes, whether it succeeded or failed.
Temp file compression: Temp files are written without zlib compression. They are
written once by Phase 1 workers, read once by Phase 2, then deleted — compression would
add substantial CPU overhead (especially with dask scheduler='synchronous' on an
external drive) for no lasting benefit. Final output datacubes use zlib complevel=4.
Temp disk space: Temp files are uncompressed float32, so peak additional disk usage
during a region’s processing is approximately:
n_overlapping_tiles × clipped_y × clipped_x × n_time_steps × 4 bytes
For typical LVIS flight boxes (1–3 overlapping tiles, clips much smaller than full tiles)
this is usually 2–15 GB. Ensure --output-dir has sufficient free space.
Multi-Tile and Multi-CRS Handling
When a polygon spans multiple HLS MGRS tiles, the pipeline automatically detects the number of CRS zones involved and selects the appropriate merge strategy.
Same-CRS Tiles (polygon within one UTM zone)
Adjacent HLS MGRS tiles in the same UTM zone share an identical 30-m pixel grid. No resampling is needed — tiles are mosaiced via a memory-bounded direct write loop (one time step at a time, bounded to ~200 MB peak regardless of datacube size):
Time union: the output datacube covers all acquisition dates from all tiles; pixels from tiles that have no data on a given date are NaN
No resampling — pixel values are unmodified
The ~163-pixel MGRS overlap zone is filled by the last tile written (scientifically equivalent to first-wins for co-acquired HLS pixels in the overlap zone)
Cross-CRS Tiles (polygon spans a UTM zone boundary)
When a polygon crosses a UTM zone boundary, tiles are in different coordinate systems. The dominant CRS (the CRS group with the most total pixels within the polygon) is selected as the output CRS. Minority tiles are reprojected to the dominant CRS using bilinear resampling, then mosaiced via the same memory-bounded write loop.
Bilinear reprojection between adjacent UTM zones introduces sub-pixel mixing comparable to the sensor point spread function — scientifically acceptable for VI analysis at 30 m.
The output file’s global attributes document the merge:
target_crs: EPSG:32634
resampling_method: bilinear
Merge Options
|
CLI flag |
Default |
Effect |
|---|---|---|---|
|
(default) |
on |
Merge same-CRS tiles into one datacube per region |
|
|
— |
One file per tile, native CRS |
|
(default) |
on |
Reproject + merge cross-CRS tiles |
|
|
— |
One file per tile, native CRS, no reprojection |
Setting both to false always produces one netCDF per tile per region.
Output
File Naming
Condition |
Output filename |
Location |
|---|---|---|
Single tile, or merged output |
|
|
Per-tile output (merge disabled) |
|
|
No shapefile (full extent) |
|
|
Log file |
|
|
File Structure Example
outputs/ ← --output-dir
├── netcdf_datacube_20260312_153100.log ← log file at output-dir root
└── Parks_and_OpenSpace/ ← shapefile stem
├── Mesa_Verde/ ← one subfolder per region
│ ├── NDVI_Mesa_Verde_datacube.nc ← merged (2 same-CRS tiles)
│ └── EVI2_Mesa_Verde_datacube.nc
└── Ridgway_State_Park/
├── NDVI_Ridgway_State_Park_datacube.nc
└── EVI2_Ridgway_State_Park_datacube.nc
When --no-merge-same-crs is set:
outputs/
└── Parks_and_OpenSpace/
└── Mesa_Verde/
├── NDVI_Mesa_Verde_T13SDA_datacube.nc
├── NDVI_Mesa_Verde_T13SDB_datacube.nc
└── EVI2_Mesa_Verde_T13SDA_datacube.nc
Output NetCDF Structure
Each output file is a CF-1.8 compliant netCDF4 with:
Dimensions:
time— acquisition dates (datetime64, decoded from “days since 1970-01-01”)y— northing in meters (UTM)x— easting in meters (UTM)
Variables:
{VI}— float32, dimensions(time, y, x), NaN where no valid data; zlib-compressed (level 4)spatial_ref— scalar, CRS container variable (CF-1.8 grid mapping convention)
Global attributes:
Attribute |
Always present |
Description |
|---|---|---|
|
yes |
|
|
yes |
Creation timestamp, region name, valid_range |
|
yes |
Source tile IDs (comma-separated) |
|
yes |
Region label |
|
yes |
VI name (e.g. |
|
cross-CRS merge only |
Target CRS (e.g. |
|
cross-CRS merge only |
|
Variable attributes:
long_name: e.g.'NDVI vegetation index'valid_min/valid_max: applied valid range valuesgrid_mapping:'spatial_ref'
CLI Reference
python src/netcdf_datacube_extract.py --help
Argument |
Default |
Description |
|---|---|---|
|
(required) |
Directory containing |
|
|
Vegetation indices: |
|
— |
Polygon shapefile(s). Omit for full extent. |
|
— |
Attribute field(s) to split by. Count must match |
|
|
Valid range for NDVI |
|
|
Valid range for EVI2 |
|
|
Valid range for NIRv |
|
(required) |
Root output directory |
|
|
Parallel worker processes for Phase 1 tile extraction |
|
— |
Include only observations on or after this date |
|
— |
Include only observations on or before this date |
|
off |
Disable same-CRS tile merging; write per-tile files |
|
off |
Disable cross-CRS reprojection + merge; write per-tile files |
|
|
|
Notes on CRS Detection
CRS comparison uses pyproj.CRS.from_wkt(wkt).to_epsg(min_confidence=20) rather than
raw WKT string comparison. EPSG integer comparison is more robust — two WKT strings
that represent the same CRS may differ in formatting or authority prefix.
HLS 2.0 CRS quirks: Two separate issues affect HLS v2.0 source data.
Non-standard datum name: HLS v2.0 GeoTIFFs (particularly Sentinel-2 S30 granules)
embed "Not specified (based on WGS 84 spheroid)" rather than the official "World Geodetic System 1984" as the datum name. pyproj’s default to_epsg() (min_confidence
70) rejects these WKTs even though the CRS is functionally identical to EPSG:326xx or
EPSG:327xx. Using min_confidence=20 reliably resolves them to the correct EPSG integer
so that same-UTM-zone tiles are always grouped together. When EPSG resolution fails even
at min_confidence=20, the fallback is crs_obj.name (e.g., "UTM Zone 34, Southern Hemisphere") rather than the raw WKT string — this prevents GDAL version differences
from producing non-equal string keys and causing false cross-CRS grouping.
Southern hemisphere UTM convention: HLS v2.0 GeoTIFFs for tiles south of the equator
store coordinates using a UTM North zone (EPSG:326xx, false_northing=0) with negative
northings, instead of the standard UTM South convention (EPSG:327xx,
false_northing=10,000,000). This is corrected in 03_hls_netcdf_build.py (HLS_VI_Pipeline
step 03): when a northern UTM EPSG code is detected and the pixel y-centroid is negative,
the CRS is replaced with the UTM South equivalent (EPSG + 100) and y-coordinates are
shifted by +10,000,000 m before the NetCDF is written. NetCDF files produced by a
corrected pipeline run carry EPSG:327xx CRS with positive northings (6–9 million m range).
If you are using older NetCDF files that still carry the UTM North / negative-northing
convention, all VI_Phenology spatial operations remain internally consistent (tile and ROI
are projected in the same coordinate space), but output datacubes will carry the UTM North
hemisphere label; rebuilding the source NetCDFs with step 03 fully resolves this.
Next Steps
The datacubes produced by this pipeline can be used as input for two other pipelines:
Pixel Phenology Pipeline — compute 19 per-pixel phenological
metric maps from these datacubes. Point PIXEL_INPUT_DATACUBES at the output directory
(e.g. ${OUTPUT_DIR}/LVIS_flightboxes_final) to pick up all regions automatically.
Phenology Pipeline — Datacube Input Mode — run the phenology
pipeline without re-running tile discovery or clipping. Set PHENOLOGY_INPUT_DATACUBES in
config.local.env to the same directory. Faster for iterating on smoothing settings,
thresholds, or plot styles.
Notes on Same-Day Multi-Sensor Acquisitions
HLS combines Landsat 8, Landsat 9, Sentinel-2A, and Sentinel-2B. When two or more of these sensors observe the same MGRS tile on the same calendar day, the source NetCDF contains duplicate time steps with identical timestamps (the HLS time axis uses integer day precision — there is no sub-day information to distinguish them).
The datacube pipeline resolves this automatically in Phase 1, before writing the temp
file for each tile. For each group of same-day duplicates, the acquisitions are merged
pixel by pixel: the first non-NaN value at each (y, x) location wins. This
preserves valid data from all same-day acquisitions across the spatial footprint — no
valid pixel is discarded and no values are averaged.
As a result, the output datacube’s time axis may have fewer time steps than the
source NetCDF tile — one step per unique calendar date rather than one step per
acquisition. The tiles global attribute records the source tile IDs, but the number
of acquisitions per sensor is not separately tracked.