KVS_Relay_Documentation/AWS Kinesis Video Streams (KVS) Relay Documentation and Troubleshooting.md
2025-10-17 00:53:01 +00:00

# KVS Stream Drop Prevention and Recovery
## Overview
Intermittent stream drops in **AWS Kinesis Video Streams (KVS)** during live relay operations have been traced to upstream source instability, credential expiry, and network issues. This document proposes a layered approach to keeping the KVS relay running. It covers an incident summary, initial analysis, root cause findings, and a strategy for prevention, automated recovery, and monitoring of the GStreamer-based pipelines that relay camera feeds.

---
## Incident Summary
During live relay operations, intermittent stream drops were observed across multiple camera feeds, particularly affecting **HIKVision-2MP-Bullet-1**, relayed to AWS KVS via GStreamer pipelines. The issue was identified through monitoring `kvssink` logs on the EC2 relay instance, which revealed repeated stream interruptions.
**Example Log Snippet:**
```
INFO - Stopping kvssink for HIKVision-2MP-Bullet-1-Stream
2025-10-17 00:22:48 [137012699014976] INFO - stopKinesisVideoStreamSync(): Synchronously stopping Kinesis Video Stream 000057a4952be080.
DEBUG - streamClosedHandler invoked for upload handle: 18446744073709551615
INFO - Stopped kvssink for HIKVision-2MP-Bullet-1-Stream
ERROR: from element /GstPipeline:pipeline0/GstURIDecodeBin:uridecodebin0/GstSoupHTTPSrc:source: Internal data stream error.
Additional debug info:
gst_base_src_loop (): streaming stopped, reason error (-5)
ERROR: pipeline doesn't want to preroll.
```
The logs indicate that the upstream video source (HLS) stopped providing data, resulting in a **GStreamer “Internal data stream error”** and subsequent pipeline termination.

---
## Initial Analysis
From the observed behavior, three recurring failure patterns were identified:
1. **Source-side Interruption (Most Common):**
   The `souphttpsrc` or `uridecodebin` elements ceased receiving packets, typically due to the HLS playlist becoming unreachable or stalling mid-stream.
   - **Symptom**: `streaming stopped, reason error (-5)`
   - **Result**: KVS stream stopped gracefully but did not auto-recover.
2. **KVS Session Expiry Due to Stale IAM Credentials:**
   Temporary IAM role credentials were not refreshed before expiry, causing KVS upload sessions to fail silently.
   - **Symptom**: `AccessDeniedException` or stalled uploads without reconnection.
   - **Root Cause**: Credentials not renewed via the EC2 metadata service.
3. **Network or Encoder Stalls:**
   EC2 outbound connection drops or encoder freezes caused the stream buffer to drain, leading GStreamer to halt.
   - **Symptom**: Pipeline entered `NULL` state and failed to preroll.
   - **Root Cause**: OS-level TCP timeouts or intermittent upstream loss.
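
The three patterns above can be told apart mechanically from the symptom strings they leave in the relay log. The following is a minimal sketch, assuming the log lines contain the exact substrings quoted in this document; the function name and category labels are illustrative and not part of GStreamer or `kvssink`. Applied line by line to the relay log (for example in a `while read` loop piped into `sort | uniq -c`), it gives a rough tally of which pattern dominates.

```bash
#!/bin/bash
# Map one relay log line to a failure pattern from the list above.
# Category labels (source, credentials, stall, unknown) are illustrative.
classify_relay_log_line() {
  local line="$1"
  case "$line" in
    *"reason error (-5)"*|*"Internal data stream error"*)
      echo "source" ;;        # pattern 1: upstream HLS source stopped
    *"AccessDeniedException"*)
      echo "credentials" ;;   # pattern 2: stale IAM credentials
    *"doesn't want to preroll"*)
      echo "stall" ;;         # pattern 3: pipeline failed to preroll
    *)
      echo "unknown" ;;
  esac
}
```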
---
## Root Cause Summary
After extensive testing, the primary cause of stream drops was determined to be **transient network or HLS source interruptions**, not KVS service instability. The GStreamer pipelines in use had no automatic reconnection, so once a pipeline stopped, the relay process stayed down until someone restarted it manually. This made the relay process a **single point of failure**, with occasional IAM credential expiry and network instability adding further outages.

---
## Proposed Solution Components
### 1. Robust GStreamer Pipeline with Auto-Reconnect
To mitigate stream drops caused by transient network or source interruptions, configure the pipeline with HTTP retries and a timeout on the source element, and place a bounded, leaky queue in front of the decoder.
**Proposed Pipeline Configuration:**
```bash
gst-launch-1.0 -v \
  souphttpsrc location="https://example-playback-url.m3u8" retries=5 timeout=10 ! \
  hlsdemux ! \
  queue max-size-buffers=0 max-size-bytes=0 max-size-time=5000000000 leaky=downstream ! \
  decodebin ! \
  videoconvert ! \
  x264enc tune=zerolatency bitrate=2000 speed-preset=superfast key-int-max=30 ! \
  video/x-h264,profile=baseline ! \
  kvssink stream-name="HIKVision-2MP-Bullet-1-Stream" aws-region="eu-west-1" storage-size=512
```
**Rationale:**
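- `retries` sets how many times `souphttpsrc` retries a failed HTTP request, and `timeout` (in seconds) bounds how long it waits on a stalled connection.
- `queue` with `leaky=downstream` drops the oldest buffered data once the 5-second buffer (`max-size-time`) is full, keeping the relay close to real time during congestion.
- These settings absorb brief HTTP failures inside the pipeline. A longer outage still ends the pipeline with an error, so process-level restarts (Section 3) remain necessary.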
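
For failures that outlast the source retries, the whole command has to be started again. The sketch below shows one way to do that with exponential backoff. `run_with_backoff` is a hypothetical helper, and it assumes the relay command exits with a non-zero status when it fails. Under systemd (Section 3), `Restart=always` with `RestartSec` plays a similar role, so a wrapper like this is mainly useful for ad hoc runs such as `run_with_backoff 10 5 60 gst-launch-1.0 ...`.

```bash
#!/bin/bash
# Re-run a command until it succeeds or the attempt budget is spent,
# doubling the delay after each failure (capped at max_delay seconds).
# Usage: run_with_backoff <max_attempts> <base_delay> <max_delay> <command...>
run_with_backoff() {
  local max_attempts="$1" delay="$2" max_delay="$3"
  shift 3
  local attempt=1
  while true; do
    # Run the command; stop as soon as it exits with status 0.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
    if [ "$delay" -gt "$max_delay" ]; then
      delay="$max_delay"
    fi
  done
}
```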
---
### 2. Automated Credential Management
To prevent KVS session failures due to expired IAM credentials, implement a continuous credential refresh mechanism.
**Proposed Script:**
```bash
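#!/bin/bash
# Write refreshed credentials to the file the relay service loads; a bare
# `export` would only change this script's own environment. systemd reads
# an EnvironmentFile at service start, so new values apply on the next restart.
ENV_FILE="/etc/kvs/credentials.env"
IMDS="http://169.254.169.254/latest"
umask 077  # keep the credentials file readable by its owner only
while true; do
  TOKEN=$(curl -sf -X PUT "$IMDS/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
  CREDENTIALS=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" "$IMDS/meta-data/iam/security-credentials/LiveStreamRole")
  if [ -n "$CREDENTIALS" ]; then
    jq -r '"AWS_ACCESS_KEY_ID=\(.AccessKeyId)\nAWS_SECRET_ACCESS_KEY=\(.SecretAccessKey)\nAWS_SESSION_TOKEN=\(.Token)"' <<< "$CREDENTIALS" > "$ENV_FILE.tmp" &&
      mv "$ENV_FILE.tmp" "$ENV_FILE" &&
      echo "$(date): Refreshed IAM credentials" >> kvs_credential.log
  else
    echo "$(date): Credential refresh failed" >> kvs_credential.log
  fi
  sleep 3600 # Refresh every hour
done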
```
**Rationale:**
- Ensures credentials remain valid through periodic refreshes.
- Logs refresh events for auditability.
- Prevents `AccessDeniedException` errors during KVS uploads.
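
A fixed hourly sleep can be complemented by checking how long the current credentials remain valid. The credentials document returned by the instance metadata service includes an `Expiration` timestamp; the sketch below reads it and reports whether a refresh is due. The helper names are hypothetical, the JSON is parsed with `sed` to keep the example free of extra dependencies, and GNU `date` is assumed. A refresh loop could call `needs_refresh "$CREDENTIALS" 900` to act when fewer than 15 minutes remain.

```bash
#!/bin/bash
# Print the seconds remaining until the Expiration timestamp in a
# credentials document from the instance metadata service.
# $1: credentials JSON; $2: current time in epoch seconds (defaults to now).
seconds_until_expiry() {
  local json="$1" now="${2:-$(date +%s)}"
  local expiration expiry_epoch
  # Extract the Expiration value, e.g. 2025-10-17T06:53:01Z.
  expiration=$(printf '%s\n' "$json" |
    sed -n 's/.*"Expiration"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
  [ -n "$expiration" ] || return 1
  # GNU date parses the ISO 8601 timestamp (assumption: GNU coreutils).
  expiry_epoch=$(date -u -d "$expiration" +%s) || return 1
  echo $((expiry_epoch - now))
}

# Succeed when the credentials expire within the given margin, or when
# the expiry time cannot be determined.
# $1: credentials JSON; $2: margin in seconds; $3: optional "now" epoch.
needs_refresh() {
  local remaining
  remaining=$(seconds_until_expiry "$1" "${3:-}") || return 0
  [ "$remaining" -le "$2" ]
}
```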
---
### 3. Systemd-Based Process Management
Replace manual watchdog scripts with a **systemd** service to manage GStreamer pipelines as persistent daemons, ensuring automatic restarts and centralized control.
**Proposed Systemd Service File:**
```ini
[Unit]
Description=GStreamer KVS Relay Service for HIKVision-2MP-Bullet-1
Wants=network-online.target
After=network-online.target
[Service]
ExecStart=/usr/bin/gst-launch-1.0 -v \
souphttpsrc location="https://example-playback-url.m3u8" retries=5 timeout=10 ! \
hlsdemux ! \
queue max-size-buffers=0 max-size-bytes=0 max-size-time=5000000000 leaky=downstream ! \
decodebin ! \
videoconvert ! \
x264enc tune=zerolatency bitrate=2000 speed-preset=superfast key-int-max=30 ! \
video/x-h264,profile=baseline ! \
kvssink stream-name="HIKVision-2MP-Bullet-1-Stream" aws-region="eu-west-1" storage-size=512
Restart=always
RestartSec=5
EnvironmentFile=/etc/kvs/credentials.env
StandardOutput=append:/var/log/kvs_relay.log
StandardError=append:/var/log/kvs_relay.log
[Install]
WantedBy=multi-user.target
```
**Rationale:**
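- `Restart=always` makes `systemd` restart the pipeline whenever the process exits, after a 5-second pause (`RestartSec=5`).
- Appending stdout and stderr to a single log file simplifies debugging.
- `EnvironmentFile` loads credentials at service start without embedding them in the unit file; the file should be readable only by root.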
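
`Restart=always` only acts when the process exits. A pipeline that stalls without exiting would stay up but silent. One possible complement, sketched here under the assumption that a healthy relay keeps writing to its log, is to treat a log file that has not changed for a while as a stall signal. `log_is_stale`, the `kvs-relay.service` unit name, and the 120-second threshold are all illustrative, and GNU `stat` is assumed.

```bash
#!/bin/bash
# Succeed (exit 0) when the file has not been modified for more than
# max_age seconds, or when it does not exist.
# $1: log file path; $2: max_age in seconds.
log_is_stale() {
  local file="$1" max_age="$2"
  local now mtime
  [ -f "$file" ] || return 0
  now=$(date +%s)
  mtime=$(stat -c %Y "$file") || return 0   # GNU stat: mtime in epoch seconds
  [ $((now - mtime)) -gt "$max_age" ]
}

# Possible use from a timer or cron job (hypothetical unit name):
# if log_is_stale /var/log/kvs_relay.log 120; then
#   systemctl restart kvs-relay.service
# fi
```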
---
### 4. Alternative Input Feed Exploration
To reduce reliance on unstable HLS feeds, evaluate **AWS Elemental MediaConnect** or **SRT (Secure Reliable Transport)** as alternative input protocols.
**Proposed Steps:**
- Test **MediaConnect** for managed, low-latency video ingestion.
- Convert RTSP camera feeds to SRT for better resilience.
- Update the GStreamer pipeline to read from an SRT source. Recent GStreamer releases provide `srtsrc`, which supersedes the older `srtclientsrc`:
```bash
srtsrc uri="srt://camera-source:8888" ! ...
```
**Rationale:**
- MediaConnect offers fault-tolerant ingestion.
- SRT handles network jitter and packet loss better than HLS.
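
When evaluating several input protocols side by side, the source element can be chosen from the feed URI. This is a small sketch, assuming `srtsrc`, `rtspsrc`, and `souphttpsrc` are installed (`srtsrc` is the SRT source element in recent GStreamer releases). `source_for_uri` is a hypothetical helper and only emits the first element of the pipeline, since the demuxing or depayloading elements that follow differ per protocol.

```bash
#!/bin/bash
# Print the leading GStreamer source element for a feed URI.
# Returns non-zero for schemes this sketch does not cover.
source_for_uri() {
  local uri="$1"
  case "$uri" in
    srt://*)            printf 'srtsrc uri="%s"\n' "$uri" ;;
    rtsp://*|rtsps://*) printf 'rtspsrc location="%s"\n' "$uri" ;;
    http://*|https://*) printf 'souphttpsrc location="%s"\n' "$uri" ;;
    *)
      echo "unsupported URI scheme: $uri" >&2
      return 1 ;;
  esac
}
```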
---
## Expected Outcomes
- **Near-zero downtime**: Source retries, credential refresh, and automatic restarts remove the dependence on manual intervention after a drop.
- **Improved reliability**: Buffered pipelines and alternative protocols absorb transient source and network issues.
- **Proactive monitoring**: CloudWatch alarms shorten the time between a drop and its detection.
- **Scalability**: One systemd unit per feed, combined with shared credential management, extends the setup to multiple cameras.
---
## Conclusion
This proposed solution addresses KVS stream drops by tackling their root causes (unstable HLS feeds, credential expiry, and non-recovering pipelines) through resilient GStreamer pipelines, automated credential management, systemd-based process control, and CloudWatch monitoring. Alternative input protocols further reduce dependence on the HLS source, so the relay should run continuously with minimal downtime.