Annotate Your Egocentric Video With MediaPipe
Annotate egocentric videos using MediaPipe hand tracking. Learn how to detect, stabilize, and export hand landmarks with preprocessing, smoothing, and JSON output for real-world applications like gesture recognition and action analysis.
Your hands are washing dishes. The camera watches from above. And a model traces every knuckle, every fingertip, every wrist movement, in real time, on every frame. That is egocentric hand detection. And it is more useful than it sounds.
Researchers use it for action recognition. Surgeons use it for skill assessment. AR developers use it for gesture control. If you have a first-person video and want to know exactly where hands are at every moment, MediaPipe is where you start.
What Is MediaPipe?
MediaPipe is Google's open-source ML framework for perception tasks. It runs on-device, needs no GPU, and ships pre-trained models for faces, poses, and hands.
The HandLandmarker model detects up to two hands per frame. For each hand it returns 21 landmarks: points mapped to every joint and fingertip. Two hands give you 42 points total. Each point carries three values: x, y, and z. X and Y are normalized between 0 and 1 relative to frame size. Z is depth relative to the wrist; negative means closer to the camera.
That is your raw signal. Everything else is built on top of it.
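In code, those values surface as attributes on each landmark object. A quick sketch, assuming a HandLandmarkerResult named result from the Tasks API detection shown later:

# result is a HandLandmarkerResult from detect_for_video (see the core code below)
hand = result.hand_landmarks[0]   # first detected hand: a list of 21 landmarks
wrist = hand[0]                   # landmark 0 is the wrist
print(wrist.x, wrist.y, wrist.z)  # x, y normalized to [0, 1]; z is depth relative to the wrist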
Why Egocentric Video Is Harder
Standard hand detection assumes a fixed background and a hand entering frame from the side. Egocentric video breaks both assumptions.
The camera is mounted on the body or placed overhead. Hands dominate the frame. They overlap. They move fast. Water, soap, and reflections scatter light unpredictably. Motion blur is constant. Detection fails more often, and when it does, the skeleton vanishes for several frames.
This matters because missed frames break downstream analysis. If you are counting wash cycles or classifying gestures, a flickering skeleton is noise. You need to engineer around it.
The Pipeline: From Video to Annotated Output
The full system has four stages: load, detect, draw, save. Here is the logic.
Pipeline Overview
Stage 1: Load the video with OpenCV
OpenCV reads the video frame by frame. You extract FPS, width, and height upfront. These three values control everything downstream: timestamps for MediaPipe, dimensions for the video writer, and pixel conversion for landmarks.
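A minimal sketch of that setup (the file path is illustrative):

import cv2

cap = cv2.VideoCapture("egocentric_video.mp4")  # hypothetical input path
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))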
Stage 2: Detect with HandLandmarker
MediaPipe's new Tasks API replaced the old mp.solutions.hands. You now load a .task model file and configure a HandLandmarkerOptions object. Two modes matter here.
IMAGE mode processes each frame independently. VIDEO mode tracks across frames using timestamps. VIDEO mode is more stable: it uses temporal context to keep the skeleton alive between frames. The timestamp must be strictly increasing and expressed in milliseconds. Calculate it as (frame_index / fps) * 1000.
Stage 3: Draw the skeleton
For each detected hand, convert normalized x and y to pixel coordinates by multiplying by frame width and height. Draw lines between connected landmark pairs. Draw circles at each of the 21 points. Label the hand as Left or Right with its confidence score.
MediaPipe labels hands from the camera's perspective, not the subject's. In egocentric video this often means the labels are flipped. Keep that in mind.
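Here is one way Stage 3 might look. The draw_hand helper and its arguments are illustrative names, and the connection list is pulled from MediaPipe's legacy solutions module, which still ships alongside the Tasks API:

import cv2
import mediapipe as mp

HAND_CONNECTIONS = mp.solutions.hands.HAND_CONNECTIONS  # pairs of landmark indices

def draw_hand(frame, landmarks, label, score, width, height):
    # Normalized coordinates -> pixel coordinates
    pts = [(int(lm.x * width), int(lm.y * height)) for lm in landmarks]
    for start, end in HAND_CONNECTIONS:
        cv2.line(frame, pts[start], pts[end], (0, 255, 0), 2)
    for point in pts:
        cv2.circle(frame, point, 4, (0, 0, 255), -1)
    # Label near the wrist; remember the Left/Right label may be flipped
    cv2.putText(frame, f"{label} {score:.2f}", pts[0],
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)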
Stage 4: Write the output
OpenCV's VideoWriter saves the annotated frames. Use XVID codec with .avi for maximum compatibility. After processing, run FFmpeg to convert to H.264 .mp4 for browser playback.
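A sketch of the writer plus conversion step, assuming fps, width, and height from Stage 1 (file names are illustrative):

import subprocess
import cv2

fourcc = cv2.VideoWriter_fourcc(*"XVID")
writer = cv2.VideoWriter("annotated.avi", fourcc, fps, (width, height))
# inside the frame loop: writer.write(annotated_frame)
writer.release()

# Re-encode to H.264 .mp4 for browser playback
subprocess.run(
    ["ffmpeg", "-y", "-i", "annotated.avi", "-c:v", "libx264",
     "-pix_fmt", "yuv420p", "annotated.mp4"],
    check=True,
)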
The Core Detection Code
import cv2
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python.vision import HandLandmarker, HandLandmarkerOptions, RunningMode

options = HandLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="hand_landmarker.task"),
    running_mode=RunningMode.VIDEO,
    num_hands=2,
    min_hand_detection_confidence=0.3,
    min_hand_presence_confidence=0.3,
    min_tracking_confidence=0.3,
)

cap = cv2.VideoCapture("egocentric_video.mp4")  # hypothetical input path
fps = cap.get(cv2.CAP_PROP_FPS)
frame_idx = 0

with HandLandmarker.create_from_options(options) as detector:
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # MediaPipe expects RGB; OpenCV reads BGR
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)
        # Strictly increasing timestamp in milliseconds, required by VIDEO mode
        timestamp_ms = int((frame_idx / fps) * 1000)
        result = detector.detect_for_video(mp_image, timestamp_ms)
        frame_idx += 1

Setting the confidence thresholds to 0.3 instead of the default 0.5 catches more detections in difficult lighting. Lower thresholds mean more false positives, but in a controlled egocentric setup, that tradeoff favors recall over precision.
Before vs. After: What Preprocessing Does
Raw egocentric frames are often dark, blurry, and streaked with glare from wet surfaces. Two preprocessing steps fix most of it.
Sharpening applies a 3x3 convolution kernel that amplifies edge contrast. This partially reverses motion blur and makes finger boundaries cleaner for the detector.
CLAHE (Contrast Limited Adaptive Histogram Equalization) normalizes brightness across small regions of the frame. Wet surfaces create bright hotspots. CLAHE compresses those while lifting dark areas. The detector sees a more uniform image.
Run preprocessing on the frame before passing it to MediaPipe. Draw the skeleton on the original frame, not the processed one. This keeps the output visually clean.
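A minimal preprocessing sketch; the sharpening kernel values and CLAHE parameters below are common defaults, not prescribed by the pipeline:

import cv2
import numpy as np

# 3x3 sharpening kernel: boosts the center pixel against its neighbors
SHARPEN_KERNEL = np.array([[0, -1, 0],
                           [-1, 5, -1],
                           [0, -1, 0]])
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

def preprocess(frame):
    # Sharpen first, then equalize contrast on the lightness channel only
    sharp = cv2.filter2D(frame, -1, SHARPEN_KERNEL)
    lab = cv2.cvtColor(sharp, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

Feed preprocess(frame) to the detector and keep the untouched frame around for drawing.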
Smoothing the Skeleton
VIDEO mode helps with temporal consistency. But the skeleton still jitters — landmarks jump a few pixels between frames even when the hand barely moves.
A 5-frame rolling average fixes this. For each of the 21 landmarks on each hand, store the last 5 pixel positions in a deque. Replace the current position with the average. The skeleton becomes stable without introducing visible lag.
from collections import deque

# One rolling buffer of the last five pixel positions per landmark, per hand
smooth_buf = {
    "Left": [deque(maxlen=5) for _ in range(21)],
    "Right": [deque(maxlen=5) for _ in range(21)],
}
This one addition makes the output look professional rather than noisy.
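The averaging step itself might look like this sketch, where smoothed is a hypothetical helper called once per landmark with its pixel coordinates:

def smoothed(hand_label, landmark_id, px, py):
    # Append the newest position, then return the 5-frame rolling mean
    buf = smooth_buf[hand_label][landmark_id]
    buf.append((px, py))
    xs, ys = zip(*buf)
    return int(sum(xs) / len(xs)), int(sum(ys) / len(ys))

Call it before drawing each point; the deque discards positions older than five frames automatically, so no extra bookkeeping is needed.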
Saving Landmarks as JSON
The video is the visual output. The JSON is the data output. Every frame gets an entry with its index, timestamp, and the full x, y, z coordinates for all detected landmarks.
{
  "frame": 42,
  "timestamp_ms": 1400,
  "hands": {
    "Left": [
      {"landmark_id": 0, "landmark_name": "WRIST",
       "x": 0.512, "y": 0.743, "z": -0.012,
       "pixel_x": 614, "pixel_y": 892}
    ],
    "Right": null
  }
}
Null means the hand was not detected that frame. This structure is queryable: load it into pandas, filter by frame range, plot joint trajectories, or feed it into a classifier. The JSON turns a video annotation task into a structured dataset.
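As a sketch of that workflow, assuming the file holds a list of frame entries shaped like the example above (the path is illustrative):

import json
import pandas as pd

with open("landmarks.json") as f:  # hypothetical output path
    frames = json.load(f)

# Wrist trajectory of the left hand, skipping frames where it was not detected
rows = [
    {"frame": fr["frame"],
     "x": fr["hands"]["Left"][0]["pixel_x"],
     "y": fr["hands"]["Left"][0]["pixel_y"]}
    for fr in frames
    if fr["hands"]["Left"] is not None
]
trajectory = pd.DataFrame(rows)
print(trajectory.head())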
Conclusion
MediaPipe makes hand landmark detection accessible. But accessible does not mean trivial. Egocentric video introduces real challenges that default settings do not handle well: motion blur, reflections, occlusion, and label flipping.
The pipeline described here addresses all of them. Preprocessing cleans the input. Lower thresholds catch more detections. VIDEO mode adds temporal stability. Smoothing removes jitter. JSON export turns annotations into a usable dataset.
The result is a system that takes a raw egocentric video and produces two outputs: a fully annotated video with color-coded skeletons, and a structured JSON file with up to 42 landmark coordinates per frame. That is the foundation for anything built on top: gesture recognition, action classification, skill assessment, or augmented reality.
FAQs
Q1. Why does hand detection fail in egocentric videos?
Egocentric videos involve motion blur, occlusions, reflections (water/soap), and fast hand movement. These factors reduce detection stability and cause landmark flickering across frames.
Q2. How can I improve MediaPipe hand tracking accuracy in egocentric setups?
Apply preprocessing (sharpening + CLAHE), lower confidence thresholds, use VIDEO mode for temporal consistency, and apply smoothing (rolling average) to reduce jitter.
Q3. Why should I export landmarks as JSON along with the video?
JSON provides structured data (x, y, z coordinates per frame), enabling analysis, gesture classification, trajectory tracking, and integration with ML pipelines.