Towards Viewpoint Invariant 3D Human Pose Estimation

Albert Haque
Stanford University

Boya Peng*
Stanford University

Zelun Luo*
Stanford University

Alexandre Alahi
Stanford University

Serena Yeung
Stanford University

Fei-Fei Li
Stanford University


Abstract

We propose a viewpoint invariant model for 3D human pose estimation from a single depth image. To achieve viewpoint invariance, our deep discriminative model embeds local regions into a learned viewpoint invariant feature space. Formulated as a multi-task learning problem, our model is able to selectively predict partial poses in the presence of noise and occlusion. Our approach leverages a convolutional and recurrent network with a top-down error feedback mechanism to self-correct previous pose estimates in an end-to-end manner. We evaluate our model on a previously published depth dataset and a newly collected human pose dataset containing 100K annotated depth images from extreme viewpoints. Experiments show that our model achieves competitive performance on frontal views while achieving state-of-the-art performance on alternate viewpoints.

Paper

Full Paper: arXiv:1603.07076


ITOP Dataset

Sample frames: Side View, Side View (labeled), Top View, Top View (labeled).

Note: Make sure you have enough disk space for both the compressed and uncompressed versions. File sizes below are listed as (compressed, uncompressed).

View | Split | Frames | People | Actions | Images | Depth Maps | Point Clouds | Labels
Side | Train | 39,795 | 16 | 15 | jpg (964 MB, 1.1 GB) | HDF5 (884 MB, 5.7 GB) | HDF5 (7.4 GB, 18 GB) | HDF5 (17 MB, 2.9 GB)
Side | Test | 10,501 | 4 | 15 | jpg (247 MB, 276 MB) | HDF5 (234 MB, 1.6 GB) | HDF5 (2.0 GB, 4.6 GB) | HDF5 (3.6 MB, 771 MB)
Top | Train | 39,795 | 16 | 15 | jpg (882 MB, 974 MB) | HDF5 (876 MB, 5.7 GB) | HDF5 (7.1 GB, 18 GB) | HDF5 (31 MB, 2.9 GB)
Top | Test | 10,501 | 4 | 15 | jpg (236 MB, 261 MB) | HDF5 (235 MB, 1.6 GB) | HDF5 (1.9 GB, 4.6 GB) | HDF5 (8.9 MB, 771 MB)

Data Schema

Each file contains several HDF5 datasets at the root level. Dimensions, attributes, and data types are listed below. The Key column gives the HDF5 dataset name. Let \(n\) denote the number of images.
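As a quick check, the root-level datasets in any of these files can be listed with h5py. This is only a minimal sketch; the file name below is a placeholder for whichever file you downloaded.

import h5py

# Placeholder file name; substitute any of the HDF5 files listed above.
with h5py.File('ITOP_labels.h5', 'r') as f:
    # Every dataset lives at the root level of the file.
    for key, dataset in f.items():
        print(key, dataset.shape, dataset.dtype)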

Transformation
To convert a point cloud to a \(240\times 320\) depth image, the following transformation is used. Let \(x_{img}\) and \(y_{img}\) denote the \((x,y)\) coordinates in the image plane. Given the real world coordinates \((x,y,z)\) of a point in the raw point cloud, we compute \( x_{img} = \frac{x}{Cz} + 160 \) and \( y_{img} = -\frac{y}{Cz} + 120 \), where \( C = 3.50\times 10^{-3} = 0.0035 \) is the intrinsic camera calibration parameter. Each point then contributes the depth value \(z\) at pixel \( (x_{img}, y_{img}) \), giving the depth map entry \( (x_{img}, y_{img}, z) \).
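The projection can be written out directly. Below is a minimal NumPy sketch of the formula above; the function name and the assumption that background points have been masked (so that \(z > 0\)) are ours, not part of the dataset.

import numpy as np

C = 3.50e-3  # intrinsic camera calibration parameter from the formula above

def to_image_coordinates(points):
    # Project real world (x, y, z) points, in meters, onto the 240x320 image plane.
    # Assumes z > 0; zero-depth (background) points should be masked beforehand.
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    x_img = x / (C * z) + 160
    y_img = -y / (C * z) + 120
    return x_img, y_img

Rounding \(x_{img}\) and \(y_{img}\) to the nearest integer gives the pixel whose value is \(z\).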
Joint ID (index) Mapping
joint_id_to_name = {
  0: 'Head',        8: 'Torso',
  1: 'Neck',        9: 'R Hip',
  2: 'R Shoulder',  10: 'L Hip',
  3: 'L Shoulder',  11: 'R Knee',
  4: 'R Elbow',     12: 'L Knee',
  5: 'L Elbow',     13: 'R Foot',
  6: 'R Hand',      14: 'L Foot',
  7: 'L Hand',
}

Depth Maps

Key | Dimensions | Data Type | Description
id | \( (n,) \) | uint8 | Frame identifier in the form XX_YYYYY, where XX is the person's ID number and YYYYY is the frame number.
data | \( (n,240,320) \) | float16 | Depth map (i.e. mesh) corresponding to a single frame. Depth values are in real world meters (m).
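For a quick visual sanity check, a single depth map can be converted to an 8-bit image. This is only a sketch; the file name and the 0.5 m to 5.0 m display range are placeholders, not part of the dataset.

import h5py
import numpy as np

# Placeholder file name; substitute the depth map file you downloaded.
with h5py.File('ITOP_depth_maps.h5', 'r') as f:
    depth = f['data'][0].astype(np.float32)   # (240, 320), real world meters

# Map metric depth to an 8-bit image for visualization.
near, far = 0.5, 5.0                          # assumed display range in meters
vis = np.clip((depth - near) / (far - near), 0.0, 1.0)
vis = (255 * (1.0 - vis)).astype(np.uint8)    # nearer pixels appear brighter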

Point Clouds

Key | Dimensions | Data Type | Description
id | \( (n,) \) | uint8 | Frame identifier in the form XX_YYYYY, where XX is the person's ID number and YYYYY is the frame number.
data | \( (n,76800,3) \) | float16 | Point cloud containing 76,800 points (240x320). Each point is represented by a 3D tuple measured in real world meters (m).
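To relate the two representations, a point cloud frame can be scattered back into a \(240\times 320\) depth map using the transformation described above. In this sketch the file name and the choice to skip zero-depth points are assumptions.

import h5py
import numpy as np

C = 0.0035  # intrinsic camera calibration parameter (see Transformation above)

# Placeholder file name; substitute the point cloud file you downloaded.
with h5py.File('ITOP_point_clouds.h5', 'r') as f:
    points = f['data'][0].astype(np.float32)   # (76800, 3), real world meters

depth = np.zeros((240, 320), dtype=np.float32)
x, y, z = points[:, 0], points[:, 1], points[:, 2]
valid = z > 0                                  # assumption: skip zero-depth (background) points
u = np.round(x[valid] / (C * z[valid]) + 160).astype(int)
v = np.round(-y[valid] / (C * z[valid]) + 120).astype(int)
inside = (u >= 0) & (u < 320) & (v >= 0) & (v < 240)
depth[v[inside], u[inside]] = z[valid][inside]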

Labels

Key | Dimensions | Data Type | Description
id | \( (n,) \) | uint8 | Frame identifier in the form XX_YYYYY, where XX is the person's ID number and YYYYY is the frame number.
is_valid | \( (n,) \) | uint8 | Flag giving the result of the human labeling effort. A boolean value (represented by an integer) where one (1) denotes clean, human-approved data and zero (0) denotes noisy human body part labels. If is_valid is zero, do not use any of the provided human joint locations for that frame.
visible_joints | \( (n,15) \) | int16 | Binary mask indicating whether each human joint is visible or occluded, denoted by \( \alpha \) in the paper. If \( \alpha_j=1 \), the \( j^{th} \) joint is visible (i.e. not occluded); if \( \alpha_j=0 \), the \( j^{th} \) joint is occluded.
image_coordinates | \( (n,15,2) \) | int16 | Two-dimensional \( (x,y) \) points giving the location of each joint in the depth image or depth map.
real_world_coordinates | \( (n,15,3) \) | float16 | Three-dimensional \( (x,y,z) \) points giving the location of each joint in real world meters (m).
segmentation | \( (n,240,320) \) | int8 | Pixel-wise assignment of body part labels. The background class (i.e. no body part) is denoted by \( -1 \).
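A minimal sketch of how these datasets fit together: filter frames by is_valid, then read image_coordinates only for joints marked visible. It assumes the joint_id_to_name dictionary defined above and a placeholder file name.

import h5py
import numpy as np

# Placeholder file name; substitute the labels file you downloaded.
with h5py.File('ITOP_labels.h5', 'r') as f:
    is_valid = f['is_valid'][:]            # (n,)
    visible = f['visible_joints'][:]       # (n, 15), alpha in the paper
    coords = f['image_coordinates'][:]     # (n, 15, 2)

frame = np.flatnonzero(is_valid == 1)[0]   # first human-approved frame
for j in range(15):
    if visible[frame, j] == 1:             # joint j is visible (not occluded)
        x, y = coords[frame, j]
        print(joint_id_to_name[j], (int(x), int(y)))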

Citation

If you would like to cite our work, please use the following.

Haque A, Peng B, Luo Z, Alahi A, Yeung S, Fei-Fei L. (2016). Towards Viewpoint Invariant 3D Human Pose Estimation. European Conference on Computer Vision (ECCV). Amsterdam, Netherlands. Springer.

@inproceedings{haque2016viewpoint,
    title={Towards Viewpoint Invariant 3D Human Pose Estimation},
    author={Haque, Albert and Peng, Boya and Luo, Zelun and Alahi, Alexandre and Yeung, Serena and Fei-Fei, Li},
    booktitle = {European Conference on Computer Vision (ECCV)},
    month = {October},
    year = {2016}
}