HumanSplat: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatars
University of Michigan
Robotics

*Indicates Equal Contribution

Abstract

Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting–based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplat, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Unlike prior human avatar methods that rely on accurate human pose obtained through motion capture systems or offline refinement, which are impractical in in-the-wild scenarios, our approach uses only human mesh estimates from a state-of-the-art human pose estimator to better reflect real-world conditions. Therefore, instead of using the human pose only as a deformation prior, HumanSplat backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and avatar baselines that decouple pose estimation from avatar reconstruction.

High-level overview of HumanSplat
HumanSplat advances both SMPL estimation and novel-view rendering by enabling joint optimization of the human mesh and Gaussian splats, taking SMPL estimates from HMR 2.0 as input. In contrast, prior works, such as GART, rely on accurate SMPL from motion capture or refined SMPL to deform Gaussians for avatar reconstruction. These methods also deliberately decouple SMPL from Gaussian splats to improve rendering quality, but this prevents refinement of human pose estimation and ultimately results in sub-optimal novel-view performance. HumanSplat addresses this limitation and allows human pose and Gaussian splats to mutually refine each other. As a result, it achieves more accurate SMPL estimation and consistently higher rendering quality.

Method Overview

Method overview illustration
Method Overview. HumanSplat takes SMPL estimates as input and couples the SMPL mesh with the Gaussians through the proposed CAMEL. This allows both the SMPL parameters and the Gaussian representation to be optimized using rendered color and depth images.
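To make the idea of refining pose from image-level losses concrete, the toy sketch below recovers a global translation by gradient descent on a reprojection residual plus a depth residual. It is an illustration under strong assumptions, not the HumanSplat pipeline: a pinhole projection of a few 3D points stands in for the differentiable Gaussian renderer, the reprojection term stands in for the photometric and segmentation losses, and every name (`project`, `loss_and_grad`, `refine_translation`, `w_depth`, the step size) is hypothetical.

```python
import numpy as np

def project(points, t, focal=1.0):
    """Pinhole projection of points shifted by a global translation t."""
    p = points + t
    z = p[:, 2]
    uv = focal * p[:, :2] / z[:, None]
    return uv, z

def loss_and_grad(points, t, uv_obs, z_obs, w_depth=1.0, focal=1.0):
    """Mean squared image + depth residual and its analytic gradient w.r.t. t."""
    uv, z = project(points, t, focal)
    r_uv = uv - uv_obs                      # image-plane residual, shape (N, 2)
    r_z = z - z_obs                         # depth residual, shape (N,)
    n = len(points)
    loss = (np.sum(r_uv ** 2) + w_depth * np.sum(r_z ** 2)) / n
    # d(uv)/d(t_xy) = focal / z ;  d(uv)/d(t_z) = -uv / z ;  d(z)/d(t_z) = 1
    g_xy = 2.0 * focal * np.sum(r_uv / z[:, None], axis=0)
    g_z = np.sum(-2.0 * np.sum(r_uv * uv, axis=1) / z + 2.0 * w_depth * r_z)
    return loss, np.array([g_xy[0], g_xy[1], g_z]) / n

def refine_translation(points, uv_obs, z_obs, steps=300, lr=0.4):
    """Plain gradient descent on the global translation, starting from zero."""
    t = np.zeros(3)
    for _ in range(steps):
        _, grad = loss_and_grad(points, t, uv_obs, z_obs)
        t -= lr * grad
    return t

# Synthetic "observations" generated from a known translation (assumed values).
rng = np.random.default_rng(0)
points = rng.uniform(-0.5, 0.5, size=(50, 3)) + np.array([0.0, 0.0, 3.0])
t_true = np.array([0.1, -0.05, 0.2])
uv_obs, z_obs = project(points, t_true)
t_est = refine_translation(points, uv_obs, z_obs)
```

In this toy setting the depth term is what pins down the translation along the camera axis, while the image-plane term mostly constrains the lateral components. That loosely mirrors why depth is listed alongside the photometric and segmentation losses in the description above.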
CAMEL illustration
Cloth-Aware Mesh-Embedded Loss (CAMEL) illustration. CAMEL loosely couples the SMPL mesh with the Gaussian representation. The key motivation is to better model clothing by allowing local non-rigid deformations. CAMEL constrains Gaussians to remain close to the human mesh while enforcing surface alignment and ensuring full mesh coverage. n is the normal of the mesh vertex and δ represents the tolerance margin between the cloth and the body.
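The page does not give the CAMEL formula, so the following is only a minimal sketch of how a loss of this kind could look, under assumptions of our own. Each Gaussian center is matched to its nearest mesh vertex, its signed offset along that vertex normal n is penalized only when it falls outside the band [0, δ], and a Chamfer-style term encourages every vertex to have a Gaussian nearby. The function name `camel_like_penalty`, the hinge form, and the default `delta` are hypothetical, and the surface-alignment term mentioned in the caption is not modeled here.

```python
import numpy as np

def camel_like_penalty(centers, vertices, normals, delta=0.02):
    """Toy mesh-proximity and coverage terms; NOT the paper's CAMEL definition.

    centers:  (G, 3) Gaussian centers
    vertices: (V, 3) mesh vertex positions
    normals:  (V, 3) unit vertex normals
    delta:    assumed tolerance margin between cloth and body
    """
    # Pairwise squared distances between Gaussian centers and mesh vertices.
    diff = centers[:, None, :] - vertices[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)                  # (G, V)
    nearest = np.argmin(d2, axis=1)                  # closest vertex per Gaussian
    offset = centers - vertices[nearest]             # (G, 3)
    # Signed distance of each Gaussian along the normal of its nearest vertex.
    signed = np.sum(offset * normals[nearest], axis=-1)
    # Hinge penalty: zero inside [0, delta], quadratic outside the band.
    above = np.maximum(signed - delta, 0.0)          # too far outside the body
    below = np.maximum(-signed, 0.0)                 # sunk beneath the surface
    proximity = np.mean(above ** 2 + below ** 2)
    # Coverage: mean squared distance from each vertex to its nearest Gaussian.
    coverage = np.mean(np.min(d2, axis=0))
    return proximity, coverage
```

The free band is the point of the hinge: Gaussians that sit slightly off the body, as clothing does, incur no penalty, whereas a plain squared-distance term would pull them onto the skin surface.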

HumanSplat Deforms on Shohei

HumanSplat deforms on Shohei's motion.

HumanSplat Deforms on Brady

HumanSplat deforms on Brady's motion.

BibTeX

@article{zongkung2025humansplat,
  title={HumanSplat: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatars},
  author={Zong, Yeheng* and Kung, Pou-Chun* and Pan, Yike and Isaacson, Seth and Chen, Yizhou and Vasudevan, Ram and Skinner, Katherine A.},
  journal={In Submission 2025},
  year={2025},
  url={https://scottyehengz.github.io/HumanSplat/}
}