Weakly supervised learning for drowning detection in over-water construction from videos

Wenkang Guo; Gangyan Xu; Changsheng Qu; Heng Li; Haosen Chen; Lei Hou; Guomin Zhang; Yushu Yang

doi:10.70401/jbde.2026.0039

Weakly supervised learning for drowning detection in over-water construction from videos

Wenkang Guo

Gangyan Xu

Changsheng Qu

Heng Li

Haosen Chen

Lei Hou

Guomin Zhang

Yushu Yang

1,*

Affiliation +

*Correspondence to: Yushu Yang, Department of Building and Real Estate, Hong Kong Polytechnic University, Hong Kong, China. E-mail: ys-yushu.yang@connect.polyu.hk

J Build Des Environ. 2026;4:2025143. 10.70401/jbde.2026.0039

Received: December 23, 2025Accepted: May 06, 2026Published: May 08, 2026

This article belongs to the Special lssue Health and Safety Management in Construction: Innovations and Challenges

Abstract

Open-water drowning is a leading serious-injury/fatality risk at public waterfronts and in construction over or adjacent to water, where long stand-off views, glare, waves, and occlusions hinder timely detection. We propose TimeSformer+MIL, a weakly supervised temporal framework for event-level drowning monitoring designed for deployment in safety-critical construction settings and aligned with supervisory workflows. The system standardizes video streams into short clips, extracts spatiotemporal evidence with a divided space–time TimeSformer, and aggregates clip scores via top-k multiple-instance learning with a lightweight consistency prior to stabilize weak labels and support calibration. By avoiding person detection and multi-object tracking, the pipeline reduces engineering complexity and failure modes common in cluttered, low-light, or small-scale scenes, improving reliability without increasing operator load. We curate an open-water dataset spanning construction and public-waterfront contexts and evaluate with event-focused metrics aligned to risk governance: recall at target false-positives per hour, alert latency relative to rescue windows, and calibration-aware ranking and thresholded decisions for alerting and escalation. Across clip lengths and aggregation strategies, the approach delivers robust discrimination and translates it into stable, low-latency alerts that meet rescue-time targets while limiting alarm fatigue. For architectural practice and construction safety management, the framework offers a practical path to augment human surveillance with machine attention, functioning as a verifiable administrative control within the hierarchy of controls and integrating with site safety processes to accelerate incident recognition and strengthen risk governance in dynamic open-water settings.

Keywords

Drowning detection, open-water, TimeSformer, multiple instance learning, transformer-based video modeling, event-level evaluation

1. Introduction

Drowning in open waters is a serious construction safety hazard for work over or adjacent to water. According to a World Health Organization report, hundreds of thousands of people die from drowning each year, with open spaces such as beaches, rivers, and lakes being high-risk environments^[1]. Beyond public recreation areas, water-related hazards are a persistent cause of severe and fatal incidents in construction and infrastructure operations, including marine construction, bridge and pier works, cofferdam and excavation near waterways, port logistics, and flood-control maintenance^[2]. In these settings, workers are exposed to unstable footing, variable weather, and limited visibility, and rescue latency is a key determinant of survivability^[3]. Regulatory frameworks reflect this reality: OSHA 29 CFR 1926.106 requires USCG-approved Personal flotation devices (PFDs), ring buoys (distance between ring buoys ≤ 200 ft; line length ≥ 90 ft), and an immediately available lifesaving skiff when employees work over or adjacent to water where a drowning danger exists; official interpretation letters further clarify that the skiff is required irrespective of continuous fall protection to ensure prompt rescue^[4].Recent enforcement cases also underscore gaps between policy and practice, and high-profile infrastructure incidents have renewed scrutiny of on-water rescue readiness and communication during jobs over. Compared to indoor swimming pools, open waters are more significantly affected by tides, wind, and waves, visibility, crowding, and weather conditions. Traditional human patrols and fixed camera monitoring often suffer from limited coverage, restricted viewing angles, and high response latency, making it difficult to promptly detect and respond to sudden drowning incidents^[5-7]. Furthermore, the scarcity of real drowning samples and the sensitive nature of annotations make constructing high-quality training datasets subject to ethical and privacy constraints, further exacerbating the uncertainty of algorithms in practical deployment.

Existing research has largely focused on controlled indoor swimming pool scenarios. In such environments, the water is clear, the lighting is stable, the background is simple, the camera’s viewing angle is fixed, and even underwater cameras are used, resulting in distinct human silhouettes and postures^[8-11]. Taking advantage of these conditions, mainstream methods often employ motion analysis based on human posture or skeletons, or perform end-to-end classification across the entire video, achieving high accuracy in close-range imaging conditions with good visibility. However, these methods face significant challenges in portability, robustness, and usability in open water environments. High-frequency noise caused by surface reflections and wave spray, object fragmentation and keypoint loss due to floating objects and occlusions, the fact that people occupy only a few pixels when captured from a distance, and significant domain shift due to the diversity of devices and shooting angles all combine to weaken the effectiveness of pose estimation and whole-video classification. Furthermore, simple video-level binary classification struggles to output “which person and when” is drowning, thus failing to provide actionable alert information for rescue efforts.

In response to the challenges of open-water surveillance, we propose a temporally focused, instance-agnostic detection framework that emphasizes engineering feasibility and online usability. Methodologically, we adopt a single-stage temporal discrimination strategy: videos are standardized (frame rate, resolution) and segmented into short, overlapping clips; a TimeSformer-based classifier then produces clip-level drowning scores, which are aggregated with sliding windows and hysteresis to yield stable, low-latency alerts. This design eliminates dependency on prior object detection and multi-object tracking, reducing engineering complexity and failure modes in cluttered scenes, low light, occlusions, or small-scale targets. To further exploit weak supervision and reduce annotation burden, we cast training as a multiple instance learning (MIL) problem: each raw video (bag) comprises temporally segmented clips (instances), with only bag-level labels available for many samples^[12,13]. We optimize the TimeSformer with MIL-style pooling over instance scores (e.g., max-/top-k/attention pooling) to align bag-level supervision with instance-level discrimination, while preserving the single-stage, detector-free pipeline. In practice, MIL pooling encourages the model to localize drowning-relevant segments within positive bags and to suppress spurious peaks in negatives, which directly benefits online stability and reduces false alarms. At the data level, we construct a standardized pipeline for open-media videos, including unified frames per second (FPS)/resolution, de-duplication and quality control, and segmentation into 10-30 s windows^[14,15]. We adopt event-level weak supervision because confirmations are faster to obtain across sites and entail fewer privacy and liability concerns than frame- or box-level annotations. We implement this with top-k MIL and use semi-automatic candidate mining to mitigate the scarcity of positive samples and the cost of temporal annotations. To assess real-world value, we emphasize threshold-independent metrics at the event level (e.g., average precision (AP) and area under the ROC curve (AUC)) and characterize alert latency under the online hysteresis policy; we also report clip-/video-level precision-recall (PR)/AUC. We further compare system outputs with judgments from professional rescuers or experienced administrators to quantify operating trade-offs and to identify deployment-relevant failure modes.

The main contributions of this paper are as follows:

• We target the open-water gap where pool-trained frame classifiers break down. By casting monitoring as instance-level, time-series event detection, answering who and when, we handle tiny/distant targets and intermittent visibility (glare, waves, occlusion), and evaluate at the event level with latency and false-alarm budgets; the same design considerations echo those in construction-site safety where timely, low-FP alerts are critical.

• We deliver a deployable, instance-agnostic temporal framework: standardize streams; score short clips with a TimeSformer; aggregate via sliding windows and hysteresis to yield stable, low-latency alerts without object detection or multi-object tracking (MOT). This reduces engineering complexity and failure modes in cluttered, low-light, or small-scale scenarios, a practical trait for safety monitoring in dynamic, visually noisy environments (e.g., waterfronts or construction perimeters).

• We establish an open-media benchmark and data pipeline: event-level ground truth and metrics (recall, PR/AUC); unified FPS/resolution, de-duplication, and 10-30 s segmentation with weak-label mining. This supports reproducible comparisons and operational analysis, and the benchmarking rubric, event-level metrics at target operating points, translates naturally to adjacent safety domains such as construction for auditing and risk governance.

2. Literature Review

2.1 Indoor and outdoor drowning detection

With more sensing platforms now available, recent studies on drowning monitoring can be grouped into four observation regimes: pool-focused vision, vision in natural waters, aerial views from unmanned aerial vehicles (UAVs), and non-visual channels. In pools, cameras above or below the waterline typically encode signs of distress as partially visible bodies near the surface, upright postures, and proximity to lanes or geofenced zones—conditions that benefit from short viewing ranges and stable lighting^{[5,6,10,11,16]}. In natural settings such as beaches, rivers, or offshore areas, cues shift toward brief head/arm glimpses under long stand-off, strong glare, waves, and frequent occlusions, so methods rely more on robust spatio–temporal evidence from local parts^[7,17,18]. UAV-based work adds top-down, wide-area coverage for rapid search and triage^[19]. Beyond vision, several threads target “distress intent” or physiological proxies: radio frequency identification (RFID) wearables that fall silent during prolonged submersion, multi-sensor bands that fuse oxygen/respiration/immersion signals, underwater acoustic save our souls (SOS) detection from commodity devices, and yarn-based strain/flow sensors that emit coded alerts on water entry^[9,20-22]. Taken together, these regimes span a cue continuum from appearance near the water surface to explicit distress signals and physiological markers.

2.2 Vision-based drowning detection method

Most vision pipelines start from single-frame detectors and adapt them to water scenes. Variants of the YOLO family (v5/v7/v8/v11) are common, pairing lightweight convolutions (e.g., Ghost or dynamic forms), attention-style modules, and revised feature pyramids (e.g., BiFPN-like bidirectional fusion) to balance accuracy, speed, and embedded deployment^{[5,6,11,17,18,23]}. In crowded lanes or multi-swimmer scenes, deformable operations, auxiliary heads, and loss redesign sharpen the boundary between “swimming” and “drowning” under clutter^[8]. Two-stage designs also appear: a light detector flags human candidates (often near-vertical poses), followed by a compact verifier to meet real-time limits on embedded boards^[10]. For natural waters, methods emphasize small-object sensitivity and occlusion robustness, combining activation replacements (e.g., FReLU), conv–self-attention hybrids, refined IoU losses, and structured pruning to sustain performance at distance while improving FPS^[7]. In parallel, nonvisual approaches treat events as thresholded states or learned detections in RFID/wearable or underwater-acoustic streams^[9,20,21]. Synthetic data has been explored to enrich rare appearances and hard cases^[8]. Overall, the field leans toward efficient detectors with water-aware modules, wrapped in pipelines that trade modest complexity for practical latency and deployment.

Modern detector-based systems (e.g., YOLO variants) perform strongly in controlled, short-range views. Performance can degrade in long-range scenes with glare, heavy occlusion, and spray when detections or associations become unstable, and errors at the detection stage propagate to downstream alert logic. Our instance-agnostic temporal formulation avoids this dependency and simplifies the deployment pipeline, so we focus comparisons on clip-level temporal classifiers rather than detector–tracker stacks.

2.3 Data sources and evaluation protocols

Across the literature, a typical data path emerges: controlled pools, often with underwater or near-wall viewpoints and staged behaviors, followed by selective forays into natural waters or UAV imagery. Many pool studies build self-collected sets under stable light and short ranges, with volunteers enacting variants of the instinctive drowning response or proxy motions (e.g., ladder-climbing, back-float/backstroke) to increase class separability^[5,11,23]. Embedded constraints are considered in some works, targeting real-time on devices such as Jetson Nano or Raspberry Pi^[6,10]. Natural-water and offshore efforts rely on bespoke collections or task-specific corpora to capture distant, small targets and cluttered backgrounds, while UAV data extends area coverage^[7,17-19].

This data ecology exposes two structural limits for open-water use and for “data truthfulness”. First, the domain gap: illumination statistics, specular highlights, wave patterns, occlusion rates, viewing geometry, and crowd density differ markedly between pools and open waters, so models trained on controlled, short-range imagery may not generalize reliably. Second, the action source: staged episodes by cooperative volunteers, bounded by ethics and safety, need not match the timing, surfacing rhythm, or partial-visibility patterns of real, non-cooperative incidents, which weakens event-level validity and claims about alert timeliness^[6,23]. Synthetic imagery can diversify appearances, but its distributional fidelity to real open-water incidents remains to be established against in-situ benchmarks^[8]. These observations motivate datasets that prioritize authentic open-water scenes, cross-domain splits, and event-level reporting (e.g., detection latency and false alarms per hour), so algorithmic gains translate into reliable alerting in the wild.

3. Methodology

We adopt a consistent notation throughout. Table 1 summarizes all symbols used in this section.

Table 1. Notation used in the methodology (kept only for components implemented in our training code).

Symbol	Description	Default/Range
V = {Ft}^N_t=1	A video as a frame sequence	-
N	Total number of frames in V	-
K	Number of temporal segments (clips) in V	K_train = 4, K_val = 8
T	Frames per clip	{8,16}
C_k	k-th clip, C_k = {F_I_k,i}^T_i=1	-
Φ_θ	TimeSformer encoder (divided space-time)	-
h_k ∈ R^D	Clip representation (backbone output)	-
y_k	Logit for clip k (pre-sigmoid)	-
s_k	Calibrated probability s_k = σ(y_k/τ)	-
τ	Temperature for calibration, τ = max{1, exp(ℓ_τ)}	learned
y_V	Video-level label (weak supervision)	{0,1}
k_MIL	MIL top-k size for a bag with N_V clips	k_MIL = [k_ratioN_V]
k_ratio	MIL top-k ratio	0.2
λ	Weight of the consistency term	0.1

Split	Pos (#clips)	Neg (#clips)	Total (#clips)
Train	74	68	142
Val	18	22	40
Test	9	13	22
All	101	102	203

Category	Pos	Neg	All	% of All
Lake	44	62	106	52.2%
River	32	26	58	28.6%
Coastal	13	14	27	13.3%
Mixed/Unknown	12	0	12	5.9%
Total	101	102	203	100.0%

Condition	Pos	Neg	All	% of All
Bright	88	91	179	88.2%
Dusk/Night	7	3	10	4.9%
Backlight/Glare	6	8	14	6.9%
Total	101	102	203	100.0%

Model (T/backbone/agg/loss/weight)	Acc	AUC	F1@0.5(%)	AP(%)
Main Model	0.950	0.9750	94.73	96.80
T = 8, r3d_18_3d, n/a, BCE, pos_w = 1.0	0.900	0.9722	90.00	96.80
T = 8, r3d_18_3d, n/a, BCE, pos_w = 2.0	0.900	0.9773	89.47	97.48
T = 8, r3d_18_3d, n/a, Focal (α = 0.25, γ = 2)	0.875	0.9268	85.71	91.77
T = 8, resnet50_2d, mean, BCE, pos_w = 1.0	0.925	0.9369	91.43	90.57
T = 8, resnet50_2d, mean, BCE, pos_w = 2.0	0.950	0.9697	94.44	97.58
T = 8, resnet50_2d, mean, Focal (α = 0.25, γ = 2)	0.925	0.9495	91.89	94.95
T = 8, resnet50_2d, topk (k = 0.2), BCE, pos_w = 1.0	0.900	0.9520	88.24	95.72
T = 8, resnet50_2d, topk (k = 0.2), BCE, pos_w = 2.0	0.925	0.9773	91.89	97.59
T = 8, resnet50_2d, topk (k = 0.2), Focal (α = 0.25, γ = 2)	0.925	0.9545	92.31	94.30
T = 16, r3d_18_3d, n/a, BCE, pos_w = 1.0	0.925	0.9722	92.31	96.46
T = 16, r3d_18_3d, n/a, BCE, pos_w = 2.0	0.900	0.9470	89.47	92.57
T = 16, r3d_18_3d, n/a, Focal (α = 0.25, γ = 2)	0.875	0.9066	85.71	91.77
T = 16, resnet50_2d, mean, BCE, pos_w = 1.0	0.925	0.9672	92.31	93.90
T = 16, resnet50_2d, mean, BCE, pos_w = 2.0	0.925	0.9268	91.43	93.17
T = 16, resnet50_2d, mean, Focal (α = 0.25, γ = 2)	0.950	0.9722	94.44	97.09
T = 16, resnet50_2d, topk (k = 0.2), BCE, pos_w = 1.0	0.925	0.9419	91.89	92.44
T = 16, resnet50_2d, topk (k = 0.2), BCE, pos_w = 2.0	0.925	0.9141	91.43	89.37
T = 16, resnet50_2d, topk (k = 0.2), Focal (α = 0.25, γ = 2)	0.925	0.9720	91.43	95.74

Method	Accuracy(%)	AP(%)	AUC	F1@0.5(%)
CNN-LSTM	62.5	88.4	0.922	62.1
I3D	85.0	57.9	0.687	61.5
TimeSformer(noMIL)	92.5	96.4	0.965	91.9
YOLO11-LiB (pool-trained, cross-domain)	66.6	61.7	0.724	60.8
YOLOv7 (FAEA, pool-trained, cross-domain)	64.1	58.3	0.705	57.5
MS-YOLO (pool-trained, cross-domain)	61.8	54.9	0.681	53.9
Ours (TimeSformer+MIL)	95.0	96.8	0.975	94.7

Journal of Building Design and Environment

Weakly supervised learning for drowning detection in over-water construction from videos

Wenkang Guo

Gangyan Xu

Changsheng Qu

Heng Li

Haosen Chen

Lei Hou

Guomin Zhang

Yushu Yang

Abstract

Keywords

References

Copyright

Publisher’s Note

Share And Cite

Science Exploration Style

Download

Export Citation

Article Metrics

Article Updates

Related Articles

Contents

Science Exploration Style

Share Link

Subscribe

Journal of Building Design and Environment

Navigation

Follow us