Abstract
We propose DenseMarks, a new learned representation for human heads that enables high-quality dense correspondences. A Vision Transformer network predicts a 3D embedding for each pixel, corresponding to a location in a canonical 3D unit cube. We train the network on pairwise point matches from diverse talking-head videos, guided by a contrastive loss that encourages matched points to have close embeddings.
Multi-task learning with face landmark and segmentation constraints, combined with spatial continuity through latent cube features, results in an interpretable canonical space. The representation enables finding common semantic parts, face/head tracking, and stereo reconstruction; it is robust to pose variations and covers the entire head, including hair.
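As an illustration of this training signal, the following PyTorch sketch implements an InfoNCE-style contrastive loss over matched point pairs; the function name, the negative-squared-distance logits, and the temperature value are illustrative choices rather than the exact formulation.

```python
import torch
import torch.nn.functional as F

def matched_point_contrastive_loss(emb_a, emb_b, temperature=0.07):
    # emb_a, emb_b: (N, 3) canonical-cube embeddings of N matched points;
    # row i of emb_a corresponds to row i of emb_b, all other rows act as negatives.
    logits = -torch.cdist(emb_a, emb_b) ** 2 / temperature  # (N, N) pairwise logits
    targets = torch.arange(emb_a.shape[0], device=emb_a.device)
    # Symmetric InfoNCE: matched pairs are pulled together, non-matches pushed apart.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Using squared Euclidean distance rather than cosine similarity matches the cube interpretation, where embeddings are positions in a bounded space rather than directions.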
Video Presentation
Tracking Results
Method Overview
DenseMarks learns per-pixel representations of human heads by mapping each pixel to a 3D canonical unit cube. The method uses a Vision Transformer backbone to predict dense embeddings, supervised by point correspondences from video tracking and enhanced with landmark and segmentation constraints.
The key innovation is the 3D canonical space representation, which enables robust correspondences across different individuals and poses, including challenging regions like hair and accessories that are typically difficult for traditional landmark-based approaches.
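As an illustration, a prediction head of this kind could map dense ViT features into the unit cube as follows; the module name, the 1×1 convolution, and the sigmoid squashing are illustrative choices rather than the exact architecture.

```python
import torch
import torch.nn as nn

class CanonicalCubeHead(nn.Module):
    # Maps dense backbone features to per-pixel coordinates in the unit cube [0, 1]^3.
    def __init__(self, feat_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, 3, kernel_size=1)

    def forward(self, feats):
        # feats: (B, C, H, W) per-pixel features from the ViT backbone,
        # upsampled to image resolution; sigmoid keeps predictions inside the cube.
        return torch.sigmoid(self.proj(feats))
```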
Point Matching
The requirement for the canonical space is that the same semantic point has a fixed location in the cube regardless of the person's identity. We test this on a number of points with distinct semantics: points on the hair, ear centers, the forehead center, and eyebrow corners.
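Given the dense embedding maps, a point match can be read off directly: take the embedding of a semantic point in one image and find the pixel with the nearest embedding in the other. A minimal sketch (function name and tensor layout are our assumptions):

```python
import torch

def match_point(query_emb, target_embs):
    # query_emb: (3,) embedding of a semantic point picked in image A.
    # target_embs: (H, W, 3) dense embedding map of image B.
    h, w, _ = target_embs.shape
    dists = ((target_embs.reshape(-1, 3) - query_emb) ** 2).sum(dim=-1)
    idx = dists.argmin().item()
    return divmod(idx, w)  # (row, col) of the corresponding pixel in image B
```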
[Figure: point-matching comparison across DINOv3, Sapiens, HyperFeats, Fit3D, and DenseMarks (ours).]
Dense Warping
To show that the predicted embeddings are semantically consistent across the whole image, not only at specific points or regions, we warp images by embeddings: for each pixel of the target image, we replace its color with the color of the source pixel whose embedding is its nearest neighbor.
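A minimal sketch of this procedure, assuming per-pixel embedding maps for both images; the function name and tensor layout are ours, and the full pairwise distance matrix is for illustration only:

```python
import torch

def warp_by_embeddings(src_img, src_embs, tgt_embs):
    # src_img: (H, W, 3) source colors; src_embs, tgt_embs: (H, W, 3) embedding maps.
    h, w, _ = tgt_embs.shape
    # For every target pixel, find the source pixel with the closest embedding.
    # The (HW x HW) distance matrix is illustrative; at full resolution it
    # should be computed in chunks or on downsampled maps.
    dists = torch.cdist(tgt_embs.reshape(-1, 3), src_embs.reshape(-1, 3))
    nearest = dists.argmin(dim=1)
    return src_img.reshape(-1, 3)[nearest].reshape(h, w, 3)
```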
[Figure: dense warping results. Source, Target, and warps produced by DINOv3, Sapiens, Fit3D, and DenseMarks (ours).]
Projection
Compare images to visualize their projections in the canonical cube space. Select multiple images to see how different samples map to the same 3D canonical space.
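For offline inspection, such projections can also be rendered by scattering the predicted embeddings inside the cube, colored by the corresponding pixel colors. A minimal matplotlib sketch with assumed inputs:

```python
import matplotlib.pyplot as plt

def plot_canonical_cube(embs, colors):
    # embs: (N, 3) per-pixel embeddings inside [0, 1]^3; colors: (N, 3) RGB in [0, 1].
    ax = plt.figure().add_subplot(projection='3d')
    ax.scatter(embs[:, 0], embs[:, 1], embs[:, 2], c=colors, s=1)
    ax.set(xlim=(0, 1), ylim=(0, 1), zlim=(0, 1), xlabel='x', ylabel='y', zlabel='z')
    plt.show()
```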
Stereo Reconstruction
We show that two or more images can be triangulated purely from the embeddings predicted by our model, demonstrated on a sample with known camera poses and intrinsics. This illustrates the model's capability for multi-view stereo and dense geometry estimation. With known cameras and DenseMarks embeddings, the whole triangulation takes just a few seconds on a CPU.
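A minimal sketch of the triangulation step, assuming pixel correspondences have already been obtained by nearest-neighbor matching of the embedding maps across views; the linear (DLT) triangulation below is one standard way to do this and is shown for illustration:

```python
import numpy as np

def triangulate_match(P1, P2, x1, x2):
    # P1, P2: (3, 4) camera projection matrices (intrinsics @ [R | t]).
    # x1, x2: (2,) pixel coordinates of the same canonical-cube embedding,
    # found by nearest-neighbor matching between the two embedding maps.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Linear (DLT) triangulation: the homogeneous 3D point is the null vector of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

Since every pixel yields an embedding, running this over all mutual nearest-neighbor matches produces a dense point cloud from just the matched coordinates and camera matrices.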