Sonoma County, 13 August 2024, 14:20 local. A mountain-station camera caught a smoke column 3.4 km away: under two pixels wide in a 4K frame, with near-zero contrast against the yellow-grey slope. A human watcher would have missed it for another 8-12 minutes. A YOLOv8 model trained on 47,000 HPWREN frames flagged it in 14 seconds with probability 0.71. A CAL FIRE crew rolled within seven minutes. The fire stopped at 0.4 hectares.
Why deep learning came to the fire domain late
Burned area segmentation from satellite data has for decades relied on empirical spectral indices. The classic dNBR (differenced Normalized Burn Ratio), proposed by Key and Benson based on Landsat TM bands 4 and 7, remains the industry standard for the USGS Burned Area Essential Climate Variable. The principle is simple: after a fire the near-infrared reflectance drops and the shortwave-infrared rises, so the change in their normalized ratio between pre- and post-fire images yields a severity index. A dNBR threshold then splits pixels into “burned” and “not burned”.
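The index arithmetic is compact enough to sketch directly. The snippet below computes NBR and dNBR for a single pixel; the reflectance values and the 0.27 cut-off are illustrative placeholders, not calibrated constants (operational thresholds are scene- and biome-dependent).

```python
# NBR = (NIR - SWIR) / (NIR + SWIR); dNBR = NBR_prefire - NBR_postfire.
# Reflectances and the 0.27 cut-off below are illustrative placeholders.

def nbr(nir, swir):
    """Normalized Burn Ratio for one pixel's reflectances."""
    return (nir - swir) / (nir + swir)

def dnbr(pre_nir, pre_swir, post_nir, post_swir):
    """Differenced NBR: pre-fire NBR minus post-fire NBR."""
    return nbr(pre_nir, pre_swir) - nbr(post_nir, post_swir)

# Healthy vegetation: high NIR, low SWIR. After fire: NIR drops, SWIR rises.
value = dnbr(0.45, 0.15, 0.20, 0.30)   # 0.5 - (-0.2) = 0.7
burned = value > 0.27                  # threshold is scene-dependent in practice
```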
The problem with index methods lies in nonlinear cases. Cloud shadow, mowed grass, a water body with low NIR reflectance, a fragmented mosaic of burned and preserved understory all produce false positives. Pinto et al. (2020) showed that dNBR on Sentinel-2 in Mediterranean forests achieves recall of 0.72-0.81 with precision of 0.68-0.76 for fires from 5 to 50 hectares, with errors concentrating on heterogeneous landscapes (Pinto et al., 2020, Remote Sensing). For Ukraine, where the forest-steppe mosaic of fields, shelterbelts, marshes, and river valleys creates exactly this kind of heterogeneity, pure index-based approaches have a fundamental limit.
Deep learning bypasses this limit by learning to recognize context. A CNN looking at a 256×256 pixel patch sees not only the spectral signature at the center, but also texture, shape, and geographic neighborhood. Saha et al. (2020) showed that convolutional networks markedly improve change accuracy on Sentinel-2 in complex landscapes (Saha et al., 2020, ISPRS Journal of Photogrammetry and Remote Sensing). Knopp et al. (2020) applied a ResUNet architecture to burned area segmentation and reached F1 of about 0.90 on a Southern European test set, 8-12 percentage points above the dNBR baseline given identical inputs (Knopp et al., 2020, Remote Sensing).
Why did deep learning arrive in the fire domain only around 2018-2020, when the ImageNet revolution dates back to 2012? Three reasons. First, the absence of large annotated fire datasets before the GFED5 Initiative and Sentinel-2 Burned Area products. Second, labeling difficulty: “burned ground” is not a cat or a dog; expert assessment requires spectral analysis plus field validation. Third, regulatory inertia: USGS, ESA, and NASA published standard products built on indices, and any new paradigm had to pass a long validation cycle.
Burned area segmentation architectures
U-Net and its derivatives. The architecture proposed by Ronneberger et al. (2015) for medical segmentation turned out to be almost ideal for remote sensing: an encoder-decoder with skip connections works well on tasks demanding pixel accuracy on a heterogeneous background (Ronneberger et al., 2015, MICCAI). For burned areas, ResUNet (Knopp et al., 2020) adds residual blocks, which allows more stable training of deeper variants. The next generation refines the recipe further: Attention U-Net adds adaptive attention gates to the skip connections, while nnU-Net automates architecture and training configuration for each dataset.
Transformer models for segmentation. Vision Transformer (ViT, Dosovitskiy et al., 2020), Swin Transformer, and SegFormer became mainstream in general computer vision segmentation after 2021. For burned areas these models give mixed results so far: they outperform U-Net on large homogeneous patches but require much more training data. Wang et al. (2024) showed that SegFormer-B2 on the MTBS dataset reaches F1 of about 0.93 but needs five times more training patches to reach the same generalization as ResUNet (Wang et al., 2024, Remote Sensing).
Hybrid dNBR + ML approaches. A pragmatic compromise: use dNBR as a first filter for candidates, then a convolutional network on 64×64 patches makes the final classification. Cardoso-Pereira et al. (2023) applied this approach to the Brazilian Amazon and improved precision from 0.71 to 0.87 with a small drop in recall (Cardoso-Pereira et al., 2023, Remote Sensing). We consider this approach the best starting stack for countries with limited compute resources, including Ukraine.
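The two-stage idea is simple enough to sketch. Below, a loose dNBR threshold nominates candidate pixels and a stand-in classifier makes the final call. The stub `cnn_classify`, the grid, and both thresholds are invented for illustration; in a real pipeline the stub is a trained network looking at the full 64×64 patch around each candidate.

```python
# Two-stage hybrid sketch: a cheap dNBR threshold nominates candidates,
# and a learned classifier (stubbed here) makes the final decision.

def dnbr_candidates(dnbr_grid, threshold=0.1):
    """Stage 1: (row, col) of pixels whose dNBR exceeds a loose threshold."""
    return [(r, c)
            for r, row in enumerate(dnbr_grid)
            for c, v in enumerate(row)
            if v > threshold]

def cnn_classify(center, dnbr_grid):
    """Stage 2 stand-in for a CNN score; a real model uses patch context."""
    r, c = center
    return dnbr_grid[r][c]

def hybrid_burned_mask(dnbr_grid, final_threshold=0.3):
    """Keep only candidates the second stage confirms."""
    return [p for p in dnbr_candidates(dnbr_grid)
            if cnn_classify(p, dnbr_grid) > final_threshold]

grid = [[0.05, 0.45],
        [0.15, 0.02]]
mask = hybrid_burned_mask(grid)
# The loose filter keeps (0, 1) and (1, 0); the second stage keeps only (0, 1).
```

The design point: the cheap first stage discards most of the image, so the expensive model runs on a small fraction of pixels, which is what makes the scheme attractive under limited compute.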
Multi-temporal models. Fire is a process in time, not a photograph. Models that take a series of Sentinel-2 images (for example, five frames at 5-day intervals) see not only the final dNBR but also the change trajectory. ConvLSTM and Transformer-in-time architectures show a 3-7 percentage point F1 advantage over single-shot models. Khan et al. (2023) gave a detailed review of multi-temporal architectures for fire remote sensing (Khan et al., 2023, IEEE JSTARS).
Active detection: hot spots
Burned area segmentation is post-factum work. Active hot spot detection is real-time work, while a fire is still burning. Here deep learning still plays a supporting role: the foundation remains physically based contextual algorithms such as VNP14 (Schroeder et al., 2014) for VIIRS and MOD14 (Giglio et al.) for MODIS.
Why has DL not displaced physical algorithms? Because active detection is a low signal-to-noise task with a high cost of false positives. One false positive passed to a decision-support system can dispatch a crew to an empty plot; the economic cost of such a dispatch in the United States is estimated at $5,000-15,000. Classical algorithms have a well-grounded error model: we know exactly when VNP14 produces a false positive (sun glint on water, solar-heated bare surfaces, hot industrial objects). The black box of a CNN does not offer such transparency.
Hybrid approaches, however, show promise. Govil et al. (2020) demonstrated that a CNN classifier on top of VNP14 candidates lowers the commission error by 22-28% by filtering known spurious patterns (Govil et al., 2020, Remote Sensing). In this scheme the physical algorithm remains the primary detector, and the CNN performs contextual validation.
A separate case is the detection of small fires on Sentinel-2. Thanks to its spectral channels (including SWIR at 1610 and 2190 nm), Sentinel-2 can theoretically detect active fires from about 30-50 MW of radiative power, but ESA had no standard operational product for this for a long time. CNN models such as the algorithm of Liu et al. (2021) reach recall of 0.78 for fires of 0.1-1 hectare, a sharp leap over MODIS/VIIRS at such sizes (Liu et al., 2021, IEEE TGRS).
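The physical intuition behind such detectors can be sketched as a toy threshold test. The function and every threshold below are invented placeholders, not the Liu et al. (2021) algorithm; they only illustrate the signal a CNN learns to exploit: active combustion sharply raises SWIR radiance, with the longer-wavelength band responding most.

```python
# Toy active-fire flag from Sentinel-2 SWIR reflectances (bands near
# 1610 and 2190 nm). All thresholds are made-up illustrative values.

def swir_hotspot(b11, b12):
    """Flag a pixel as a possible active fire when both SWIR bands are
    anomalously bright and the 2190 nm band dominates the 1610 nm band."""
    return b12 > 0.5 and b11 > 0.3 and b12 / max(b11, 1e-6) > 1.2

flags = [swir_hotspot(0.35, 0.80),   # fire-like pixel: both bands bright
         swir_hotspot(0.10, 0.08)]   # ordinary vegetation: both bands dark
```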
Smoke detection in video streams
Camera networks for fire detection, from AlertWildfire in the United States to private networks across the Mediterranean, generate tens of thousands of frames per minute. A human operator cannot physically review all of them, so CNN detectors, above all the YOLO family, have become central to this triage.
YOLO variants. The YOLO architecture (You Only Look Once), starting with version 5 and especially in versions 7 and 8, dominates real-time detection. Sahyoun et al. (2024) deployed a modified YOLOv8 on Jetson Nano edge devices for forest cameras in the UAE and reached mean average precision mAP@0.5 of about 0.89 at 22 frames per second (Sahyoun et al., 2024, IEEE Access). This matters: the model can run on the camera’s embedded processor without sending gigabytes of video to the cloud.
3D CNN and temporal models. Smoke differs from clouds and fog not statically but dynamically: smoke rises, drifts, and expands. Hu et al. (2018) systematically dissected the limits of static CNN smoke detectors and showed that adding a temporal component (3D convolutions on a stack of 8-16 frames) drops the false positive rate from 12% to 4% at the same recall (Hu et al., 2018, Sensors).
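The dynamic cue that temporal models exploit can be shown with a toy frame-differencing score. This is not Hu et al.'s 3D CNN, only the intuition it builds on: a rising, drifting plume changes between frames, while haze and distant clouds are nearly static. Frames and the values in them are invented.

```python
# Score a stack of grayscale rows by mean absolute frame-to-frame change.
# A 3D CNN learns a far richer version of this motion cue.

def motion_score(frames):
    """Mean absolute pixel change between consecutive frames."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.extend(abs(a - b) for a, b in zip(prev, cur))
    return sum(diffs) / len(diffs)

static_haze = [[0.5, 0.5, 0.5]] * 4        # unchanging region: score 0
drifting_smoke = [[0.2, 0.5, 0.8],
                  [0.3, 0.6, 0.7],
                  [0.5, 0.7, 0.5],
                  [0.6, 0.8, 0.4]]         # intensity front drifting sideways

is_smoke = motion_score(drifting_smoke) > motion_score(static_haze)
```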
Vision Transformer for smoke. Recent years brought a wave of work using ViT models on video streams. Khan et al. (2024) applied Swin Transformer to a CCTV archive of fires in Southeast Asia and achieved F1 = 0.93 against 0.87 for a YOLOv8 baseline (Khan et al., 2024, Ecological Informatics). The cost: inference on a single frame takes 180 ms on an RTX 4090 GPU, which makes edge deployment impractical.
Unmanned platforms. Vetrivel et al. (2018) developed an architecture for post-disaster assessment on drone RGB video; the approach also applies to the fire domain (Vetrivel et al., 2018, ISPRS Journal of Photogrammetry and Remote Sensing). Tang et al. (2015) showed an early example of integrating wireless sensor networks for fire detection (Tang et al., 2015, Sensors). Govil and colleagues formalized the input data format from CCTV sensors for an operational AI filter at ICIP-2020 (Govil et al., 2020, ICIP).
Evaluation metrics and how they mislead
The basic metric set for classification tasks is precision, recall, F1, and accuracy. For segmentation we add Intersection over Union (IoU or Jaccard index) and the Dice coefficient. For active detection, false alarm rate per unit area or unit time.
The problem: these metrics are sensitive to class balance and the classification threshold. In a typical burned area dataset, the ratio of “burned” to “not burned” is 1:50 to 1:200. A trivial classifier that always predicts “not burned” reaches accuracy of 99.5% with recall = 0. So accuracy for fires is a useless metric, and the precision-recall curve must always be examined.
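The arithmetic of the trivial classifier is worth seeing once. The counts below are synthetic, chosen to mirror the 1:200 ratio above.

```python
# Why accuracy misleads under heavy class imbalance: the trivial
# "never burned" classifier on a 1:200 burned-to-unburned scene.

def metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# 201 pixels: 1 burned, 200 unburned; the model always predicts "not burned".
p, r, f1, acc = metrics(tp=0, fp=0, fn=1, tn=200)
# acc is about 0.995 while recall is 0: accuracy alone hides the total miss.
```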
The second problem: test sets often fail to represent landscape heterogeneity. A model trained on Mediterranean coniferous forests may show excellent F1 on other Mediterranean data and complete failure in the boreal forests of Canada or the mosaic forest-steppe of Ukraine. Mountrakis et al. (2023) systematically analyzed cross-biome model transfer and showed that a typical F1 drop in cross-biome application is 18-32 percentage points (Mountrakis et al., 2023, Remote Sensing).
The third problem: the operational value of a model does not reduce to F1. A model with F1 = 0.87 but a mean detection latency of 8 minutes is operationally more valuable than a model with F1 = 0.91 and a latency of 45 minutes. Goodrich et al. (2024) proposed a combined Time-Weighted F1 metric that weights accuracy by operational responsiveness (Goodrich et al., 2024, Environmental Modelling and Software).
Federated learning and multi-source fusion
The classical training scheme: collect all data centrally, train one model, deploy it. For the fire domain this scheme has two problems: data are often restricted by ownership constraints (CCTV operators, government agencies), and models trained on one region transfer poorly to others.
Federated learning (FL), proposed by McMahan et al. (2017), solves the first problem: local models train on each participant’s data, and only weight updates are sent to a central server (McMahan et al., 2017, AISTATS). Data never leave the owner organization.
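The core server-side step, federated averaging, is compact: a data-size-weighted mean of the clients' weight vectors, in the spirit of McMahan et al. (2017). The weight vectors and client sizes below are toy values.

```python
# Federated averaging sketch: each site trains locally and ships only
# weights; the server aggregates them weighted by local dataset size.

def fedavg(client_weights, client_sizes):
    """Data-size-weighted average of per-client weight vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

# Two sites with 100 and 300 local samples; raw imagery never moves.
global_w = fedavg([[1.0, 0.0], [2.0, 4.0]], [100, 300])
# The larger site pulls the average toward its weights: [1.75, 3.0].
```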
For the fire domain, FL is especially relevant in international collaborations. Ahmed et al. (2024) described a pilot FL system between Greece, Portugal, and Spain for cross-validation of smoke detection models (Ahmed et al., 2024, Engineering Applications of AI). The first results show a 6-9 percentage point F1 improvement in cross-biome generalization compared with local models.
A separate case of multi-source fusion is combining data of different modalities: satellite imagery, ground cameras, meteorological fields, historical fire maps. Graph Neural Networks and Transformer models such as Cross-Attention Fusion show 4-12 percentage point gains over the sum of individual modalities. Zhang et al. (2024) gave a technical review of such architectures for disaster remote sensing (Zhang et al., 2024, IEEE TGRS).
Platforms and tooling
Google Earth Engine remains the most accessible platform for applying ML to satellite data. Documentation at developers.google.com/earth-engine; the API supports TensorFlow and PyTorch inference through cloud workers. For operational fire monitoring we recommend a combination of GEE for image preprocessing and local inference infrastructure for critical decisions.
NASA Earth System Pathfinder and EOSDIS provide archival and near-real-time data from MODIS, VIIRS, GOES, ICESat-2, and other missions. Access through earthdata.nasa.gov. For deep learning the Common Metadata Repository and the Cloud-Optimized GeoTIFF format are critical.
Sentinel Hub (an AWS-based platform) is a commercial alternative to GEE for working with Sentinel-1, 2, 3, and 5P data. It is convenient for prototyping DL models thanks to a unified interface.
Open-source libraries: torchgeo (PyTorch integration with geospatial data), raster-vision (a framework for semantic segmentation of satellite data), mmsegmentation (general-purpose segmentation models that can be adapted).
Regional comparison of AI deployment
| Country | DL for burned areas | DL for CCTV | Edge AI on drones | FL / cross-region |
|---|---|---|---|---|
| USA | USGS BAECV+R&D, academic labs | ALERTCalifornia AI, several commercial providers | USFS UAS pilots | Limited, mostly research |
| Canada | CWFIS R&D, NRCan | Spot deployments, BC + Alberta | University lab pilots | No |
| EU | JRC EFFIS, Sentinel-2 BA, academic | Greece, Portugal, Spain | Horizon Europe pilots | Yes, Horizon projects |
| Australia | CSIRO, BoM, universities | Spot deployments, NSW + VIC | Research | Limited |
| Brazil | INPE, academic | IBAMA pilots, limited | No (budget) | Through WGCapD/CEOS |
The general picture: deep learning in the fire domain has reached technological maturity in the research cycle, but operational deployment in government agencies remains limited. The reasons are regulatory (models need to be validated and explainable), budgetary (GPU infrastructure is expensive), and human (few ML engineers know remote sensing).
When DL helps and when it does not
An honest assessment: DL helps when (a) there is a large body of annotated data, (b) the task is nonlinear and contextual, (c) GPU infrastructure for training and inference is available, and (d) the operational cost of a false positive is not catastrophic.
DL does not help, or makes things worse, when (a) data are limited and uneven, (b) the task is well formalized by a physical model, (c) the regulatory environment requires transparent decisions, or (d) latency is critical and edge compute is limited.
Concrete examples:
Helps: burned area segmentation on Sentinel-2 in mosaic landscapes; smoke detection on CCTV against complex backgrounds; cross-sensor fusion for fire event prioritization; post-factum spread modeling based on historical patterns.
Does not help: baseline VIIRS/MODIS active hot spot detection (physical algorithms are more accurate and explainable); meteorological field forecasting for FWI (numerical models such as WRF and the ECMWF forecast system remain unsurpassed); PM2.5 estimation from smoke (here hybrid XGBoost/LSTM models coupled with physical plume models such as HYSPLIT outperform pure DL).
Liu et al. (2024) gave a critical meta-assessment of DL in remote sensing and stressed the “modeling on demand” trap: as soon as a team gets a GPU, the temptation arises to apply DL to every task, even where physics or classical statistics work more simply and more accurately (Liu et al., 2024, Remote Sensing of Environment).
Limits and open problems
Domain shift. A model trained on 2018-2022 data starts to degrade on 2024-2026 data due to changes in sensor calibration, climate shifts, and new fire behavior regimes. The continual learning strategy with periodic fine-tuning on fresh data is not yet standardized.
Interpretability. SHAP, Grad-CAM, and other explanation techniques give partial intuition but do not satisfy regulatory requirements such as the EU AI Act for high-risk systems. Hoffman et al. (2024) stressed that traditional XAI methods fall short for civil protection decision-support systems (Hoffman et al., 2024, Information Fusion).
Energy use and ecological cost. Training a large Transformer for segmentation can consume 10-50 MWh of electricity. Strubell et al. (2019) drew attention to the carbon footprint of DL models early on (Strubell et al., 2019, ACL). For operational fire detection it is critical to balance model accuracy against energy cost.
Adversarial vulnerability. CNN models are vulnerable to adversarial attacks: small targeted pixel modifications can force a model to miss a fire or generate a false positive. In the context of critical infrastructure this is a separate class of risks that needs to be addressed at the system architecture level.
Data and annotation. Datasets such as MTBS, EUR-MED Burned Area, FIRESENSE, and others mostly cover landscapes of North America, the Mediterranean, and Australia. For the forest-steppe of Ukraine, the boreal forests of Polissia, and the peat bogs of the Chornobyl zone, very few open annotated datasets exist. This makes model development for the Ukrainian context partly a custom task.
Ukrainian context: WildFiresUA and academic cooperation with DNU
WildFiresUA chose a hybrid strategy on purpose: physical models (FLEXPART for smoke, HYSPLIT for trajectories, WRF for meteorological fields) as the foundation, supplemented by machine learning at two levels. At the operational level, XGBoost and LSTM models forecast PM2.5 concentrations at ground sensor points over a 6-72 hour horizon. At the research level, our team in partnership with the academic laboratory at DNU studies the application of CNNs to burned area segmentation on Sentinel-2 for the Ukrainian forest-steppe.
Why a hybrid? Because it is an honest compromise between accuracy and operational speed. For PM2.5 forecasting in Kyiv over the next 24 hours, an LSTM trained on the history of sensor data and meteorological fields delivers an RMSE of about 8-12 micrograms per cubic meter, while a pure numerical approach on FLEXPART with emission estimation requires 5-10 times more compute time at similar accuracy. For burned area segmentation in Kharkiv or Zaporizhzhia oblast, the classic dNBR misses 18-25% of contours due to landscape heterogeneity, so a second-level CNN filter delivers a real gain.
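For context on the RMSE figure, the sketch below computes RMSE and compares a hypothetical forecaster against the persistence baseline ("the next value equals the last observed one") that any learned model must beat. All concentration values are invented.

```python
# RMSE in ug/m3 for a toy PM2.5 series, model vs. persistence baseline.
import math

def rmse(pred, obs):
    """Root-mean-square error between two equal-length series."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

observed = [20.0, 35.0, 50.0, 40.0]
persistence = [15.0, 20.0, 35.0, 50.0]   # previous value carried forward
model = [22.0, 33.0, 46.0, 42.0]         # hypothetical forecaster output

model_rmse = rmse(model, observed)            # sqrt(7), about 2.65
baseline_rmse = rmse(persistence, observed)   # about 11.99
```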
We do NOT use DL for active hot spot detection: there we rely on VIIRS FIRMS and EUMETSAT MTG-I1 products with physical algorithms. We do NOT use DL for meteorological field forecasting: there we use WRF with data from Ukrhydrometcentre and ECMWF. We do NOT use DL for emission estimation from fires: there we use Wiedinmyer-style emission factors tied to area and fuel type.
This division is not dogma but the result of a sober analysis of where DL adds real value and where it only adds error risk and a loss of transparency. Research experiments in partnership with DNU let us gradually expand the DL perimeter where the evidence base is strong enough.
Conclusion
Deep learning in the fire domain has moved from an experimental to a mature technology in the part covering burned area segmentation and CCTV smoke detection. For active hot spot detection, meteorological field forecasting, and emission estimation, physical algorithms remain superior. Hybrid architectures that combine physical models as a primary layer with DL as a contextual filter offer the best compromise between accuracy, transparency, and operational speed.
For Ukraine, where landscape heterogeneity makes simple index methods limited and resources for large-scale DL operations are constrained, the WildFiresUA hybrid strategy with the academic partnership at DNU is the optimal approach. We will continue to expand DL where it really improves operational detection and to hold back where it adds noise without value.