Abstract
We propose DenseMarks, a new learned representation for human heads that enables high-quality dense correspondences. A Vision Transformer network predicts a 3D embedding for each pixel, corresponding to a location in a canonical 3D unit cube. We train the network on pairwise point matches from diverse talking-head videos, guided by a contrastive loss that encourages matched points to have close embeddings.
Multi-task learning with face landmark and segmentation constraints, combined with spatial continuity through latent cube features, results in an interpretable canonical space. The representation enables finding common semantic parts, face/head tracking, and stereo reconstruction; it is robust to pose variations and covers the entire head, including hair.
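As an illustration of this training signal, the following PyTorch sketch implements an InfoNCE-style contrastive loss over matched point pairs; the function name, the negative-squared-distance logits, and the temperature value are illustrative choices rather than the exact formulation.

```python
import torch
import torch.nn.functional as F

def matched_point_contrastive_loss(emb_a, emb_b, temperature=0.07):
    # emb_a, emb_b: (N, 3) canonical-cube embeddings of N matched points;
    # row i of emb_a corresponds to row i of emb_b, all other rows act as negatives.
    logits = -torch.cdist(emb_a, emb_b) ** 2 / temperature  # (N, N) pairwise logits
    targets = torch.arange(emb_a.shape[0], device=emb_a.device)
    # Symmetric InfoNCE: matched pairs are pulled together, non-matches pushed apart.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Using squared Euclidean distance rather than cosine similarity matches the cube interpretation, where embeddings are positions in a bounded space rather than directions.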
Video Presentation
Tracking Results
Method Overview
DenseMarks learns per-pixel representations of human heads by mapping each pixel to a 3D canonical unit cube. The method uses a Vision Transformer backbone to predict dense embeddings, supervised by point correspondences from video tracking and enhanced with landmark and segmentation constraints.
The key innovation is the 3D canonical space representation, which enables robust correspondences across different individuals and poses, including challenging regions like hair and accessories that are typically difficult for traditional landmark-based approaches.
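As an illustration, a prediction head of this kind could map dense ViT features into the unit cube as follows; the module name, the 1×1 convolution, and the sigmoid squashing are illustrative choices rather than the exact architecture.

```python
import torch
import torch.nn as nn

class CanonicalCubeHead(nn.Module):
    # Maps dense backbone features to per-pixel coordinates in the unit cube [0, 1]^3.
    def __init__(self, feat_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, 3, kernel_size=1)

    def forward(self, feats):
        # feats: (B, C, H, W) per-pixel features from the ViT backbone,
        # upsampled to image resolution; sigmoid keeps predictions inside the cube.
        return torch.sigmoid(self.proj(feats))
```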
Point Matching
The requirement for the canonical space is that the same semantic point has a fixed location in the cube regardless of the person's identity. We test this on a number of points with distinct semantics: points on the hair, ear centers, the forehead center, and eyebrow corners.
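Given the dense embedding maps, a point match can be read off directly: take the embedding of a semantic point in one image and find the pixel with the nearest embedding in the other. A minimal sketch (function name and tensor layout are our assumptions):

```python
import torch

def match_point(query_emb, target_embs):
    # query_emb: (3,) embedding of a semantic point picked in image A.
    # target_embs: (H, W, 3) dense embedding map of image B.
    h, w, _ = target_embs.shape
    dists = ((target_embs.reshape(-1, 3) - query_emb) ** 2).sum(dim=-1)
    idx = dists.argmin().item()
    return divmod(idx, w)  # (row, col) of the corresponding pixel in image B
```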
[Figure: point-matching comparison across DINOv3, Sapiens, HyperFeats, Fit3D, and DenseMarks (ours).]
Dense Warping
To show that the predicted embeddings are semantically consistent across the whole image, not only at specific points or regions, we warp images by embeddings: for each pixel of the target image, we replace its color with the color of the source pixel whose embedding is its nearest neighbor.
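A minimal sketch of this procedure, assuming per-pixel embedding maps for both images; the function name and tensor layout are ours, and the full pairwise distance matrix is for illustration only:

```python
import torch

def warp_by_embeddings(src_img, src_embs, tgt_embs):
    # src_img: (H, W, 3) source colors; src_embs, tgt_embs: (H, W, 3) embedding maps.
    h, w, _ = tgt_embs.shape
    # For every target pixel, find the source pixel with the closest embedding.
    # The (HW x HW) distance matrix is illustrative; at full resolution it
    # should be computed in chunks or on downsampled maps.
    dists = torch.cdist(tgt_embs.reshape(-1, 3), src_embs.reshape(-1, 3))
    nearest = dists.argmin(dim=1)
    return src_img.reshape(-1, 3)[nearest].reshape(h, w, 3)
```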
[Figure: dense warping results. Source, Target, and warps produced by DINOv3, Sapiens, Fit3D, and DenseMarks (ours).]
Projection
Compare images to visualize their projections in the canonical cube space. Select multiple images to see how different samples map to the same 3D canonical space.
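For offline inspection, such projections can also be rendered by scattering the predicted embeddings inside the cube, colored by the corresponding pixel colors. A minimal matplotlib sketch with assumed inputs:

```python
import matplotlib.pyplot as plt

def plot_canonical_cube(embs, colors):
    # embs: (N, 3) per-pixel embeddings inside [0, 1]^3; colors: (N, 3) RGB in [0, 1].
    ax = plt.figure().add_subplot(projection='3d')
    ax.scatter(embs[:, 0], embs[:, 1], embs[:, 2], c=colors, s=1)
    ax.set(xlim=(0, 1), ylim=(0, 1), zlim=(0, 1), xlabel='x', ylabel='y', zlabel='z')
    plt.show()
```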
Stereo Reconstruction
We show that two or more images can be triangulated purely from the embeddings predicted by our model, demonstrated on a sample with known camera poses and intrinsics. This illustrates the model's capability for multi-view stereo and dense geometry estimation. With known cameras and DenseMarks embeddings, the whole triangulation takes just a few seconds on a CPU.
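A minimal sketch of the triangulation step, assuming pixel correspondences have already been obtained by nearest-neighbor matching of the embedding maps across views; the linear (DLT) triangulation below is one standard way to do this and is shown for illustration:

```python
import numpy as np

def triangulate_match(P1, P2, x1, x2):
    # P1, P2: (3, 4) camera projection matrices (intrinsics @ [R | t]).
    # x1, x2: (2,) pixel coordinates of the same canonical-cube embedding,
    # found by nearest-neighbor matching between the two embedding maps.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Linear (DLT) triangulation: the homogeneous 3D point is the null vector of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

Since every pixel yields an embedding, running this over all mutual nearest-neighbor matches produces a dense point cloud from just the matched coordinates and camera matrices.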