Abstract

We propose DenseMarks, a new learned representation for human heads that enables high-quality dense correspondences. A Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a canonical unit cube. We train the network on pairwise point matches from diverse talking-head videos, guided by a contrastive loss that encourages matched points to have close embeddings.

Multi-task learning with face landmark and segmentation constraints, combined with spatial continuity enforced through latent cube features, results in an interpretable canonical space. The representation supports finding common semantic parts, face/head tracking, and stereo reconstruction, is robust to pose variations, and covers the full head, including hair.

DenseMarks Teaser

Video Presentation

Tracking Results

Comparison: Head tracker vs. Head tracker + DenseMarks. Rows: Input, Predicted tracking, Template texture, Overlay.

Method Overview

DenseMarks learns per-pixel representations of human heads by mapping each pixel to a 3D canonical unit cube. The method uses a Vision Transformer backbone to predict dense embeddings, supervised by point correspondences from video tracking and enhanced with landmark and segmentation constraints.
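As a rough illustration of this supervision, the sketch below computes an InfoNCE-style contrastive loss over embeddings sampled at matched pixel locations in two frames. The helper names, sampling scheme, and temperature value are assumptions made for the example, not the released training code.

```python
# Sketch: contrastive supervision on pairwise point matches (PyTorch).
# All names and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def sample_at(emb_map: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample a (3, H, W) embedding map at N points given as
    normalized coordinates in [-1, 1]; returns (N, 3)."""
    grid = uv.view(1, -1, 1, 2)                           # (1, N, 1, 2)
    out = F.grid_sample(emb_map.unsqueeze(0), grid,
                        align_corners=False)              # (1, 3, N, 1)
    return out[0, :, :, 0].T

def matched_points_loss(emb_a, emb_b, uv_a, uv_b, temperature=0.07):
    """InfoNCE-style loss: a matched pair should be closer in embedding
    space than any other pairing of the sampled points."""
    za = sample_at(emb_a, uv_a)                           # (N, 3)
    zb = sample_at(emb_b, uv_b)                           # (N, 3)
    logits = -torch.cdist(za, zb) / temperature           # (N, N) similarities
    target = torch.arange(za.shape[0], device=za.device)  # diagonal = matches
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.T, target))
```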

The key innovation is the 3D canonical space representation, which enables robust correspondences across different individuals and poses, including challenging regions like hair and accessories that are typically difficult for traditional landmark-based approaches.

Point Matching

The key requirement for the canonical space is that the same semantic point has a fixed location in the cube regardless of the person's identity. We test this on a number of points with distinct semantics: points on hair, ear centers, the forehead center, and eyebrow corners.
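With such embeddings, transferring a semantic point from one image to another reduces to a nearest-neighbor lookup, roughly as in the sketch below; the (3, H, W) embedding maps are assumed to come from the network, and the helper name is hypothetical.

```python
# Sketch: transfer a point between images via nearest-neighbor search in
# embedding space. Both embedding maps are assumed to be (3, H, W) tensors.
import torch

def match_point(emb_query, emb_target, y, x):
    """Find the target pixel whose embedding is closest to the embedding
    of pixel (y, x) in the query image; returns (y, x) in the target."""
    q = emb_query[:, y, x]                          # (3,)
    _, h, w = emb_target.shape
    flat = emb_target.reshape(3, -1)                # (3, H*W)
    dist = ((flat - q[:, None]) ** 2).sum(dim=0)    # squared distances
    idx = int(dist.argmin())
    return idx // w, idx % w
```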

Test points: left side of long hair, center of right ear, center of left ear, forehead center, left eyebrow corner.
Matching results: DINOv3, Sapiens, HyperFeats, Fit3D, DenseMarks.

Dense Warping

To demonstrate that the predicted embeddings are semantically consistent over the whole image, not only at specific points or regions, we warp one image onto another using the embeddings: for each pixel of the target image, we replace its color with the color of its nearest neighbor, in embedding space, among the source pixels.
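A minimal sketch of this procedure, assuming per-pixel embedding maps are available for both images and ignoring background masking:

```python
# Sketch: repaint every target pixel with the color of its nearest source
# pixel in embedding space. src_rgb, src_emb, tgt_emb are (3, H, W) tensors.
import torch

def warp_by_embeddings(src_rgb, src_emb, tgt_emb, chunk=4096):
    _, ht, wt = tgt_emb.shape
    src_e = src_emb.reshape(3, -1).T                # (Hs*Ws, 3) embeddings
    src_c = src_rgb.reshape(3, -1).T                # (Hs*Ws, 3) colors
    tgt_e = tgt_emb.reshape(3, -1).T                # (Ht*Wt, 3) embeddings
    out = torch.empty((tgt_e.shape[0], 3), dtype=src_c.dtype)
    for i in range(0, tgt_e.shape[0], chunk):       # chunk to bound memory
        d = torch.cdist(tgt_e[i:i + chunk], src_e)  # distances to all sources
        out[i:i + chunk] = src_c[d.argmin(dim=1)]   # take nearest color
    return out.T.reshape(3, ht, wt)
```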

Source and target images; warping results (UV and RGB) for DINOv3, Sapiens, Fit3D, DenseMarks.

Projection

Select images to visualize their projections into the canonical cube space. Selecting multiple images shows how different samples map to the same 3D canonical space.
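Since each embedding is itself a point in the unit cube, such a projection can also be rendered offline by scattering the per-pixel embeddings in 3D, colored with the corresponding pixel colors. The sketch below assumes NumPy arrays: an embedding map emb of shape (3, H, W), an image rgb of the same shape with values in [0, 1], and a foreground mask; the names are illustrative.

```python
# Sketch: scatter per-pixel embeddings inside the canonical unit cube,
# colored by the input image. Assumes NumPy arrays and rgb values in [0, 1].
import matplotlib.pyplot as plt

def plot_cube_projection(emb, rgb, mask, step=4):
    pts = emb[:, ::step, ::step].reshape(3, -1).T   # subsample for speed
    col = rgb[:, ::step, ::step].reshape(3, -1).T
    keep = mask[::step, ::step].reshape(-1) > 0     # head pixels only
    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(*pts[keep].T, c=col[keep], s=1)
    ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.set_zlim(0, 1)
    plt.show()
```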

Rotatable Cube Projection

Stereo Reconstruction

We show that two or more images can be triangulated purely from our model's embeddings, using an example with known camera poses and intrinsics. This demonstrates the model's potential for multi-view stereo and dense reconstruction. With known cameras and DenseMarks embeddings, the whole triangulation takes just a few seconds on a CPU.
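A sketch of the triangulation step, assuming two views with known 3x4 projection matrices and pixel correspondences obtained by nearest-neighbor search over the embeddings (as in the matching sketch above); the actual triangulation is done by OpenCV's cv2.triangulatePoints.

```python
# Sketch: triangulate a point cloud from two views given embedding matches.
# P1, P2: 3x4 projection matrices (K @ [R | t]); pts1, pts2: (N, 2) pixel
# coordinates of mutually matched pixels found in embedding space.
import cv2
import numpy as np

def triangulate_matches(P1, P2, pts1, pts2):
    """Returns an (N, 3) point cloud in world coordinates."""
    X_h = cv2.triangulatePoints(P1, P2,
                                pts1.T.astype(np.float64),
                                pts2.T.astype(np.float64))  # (4, N) homogeneous
    return (X_h[:3] / X_h[3]).T                             # dehomogenize
```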

Input images

Camera Placement

Rotatable Point Cloud