Selecting Your Real-Time Pose Estimation Models

Driven in part by social media, human pose estimation has gained traction in applications such as gaming, activity recognition and augmented reality. In recent years a multitude of new models has been released, using deep learning networks (ResNet, MobileNet, VGG, etc.) as backbones and trained on datasets such as COCO, MPII and Body25.

Choosing an approach can be confusing for those starting out with real-time pose estimation. This article outlines what to consider when selecting a model.

1. Define Purpose

The first and overriding factor to consider is the purpose of your project. Video is used when we need to track objects, segment objects and recognise actions or events.

2. Define Requirements

A clear focus on purpose will define the type of datasets, data acquisition problems, post-processing issues and other limitations of the environment the solution will be deployed in.

Some questions to brainstorm include:

a. What is the use case or domain you are using your model in?

b. Single-person or multi-person detection? If multiple, how dense is the crowd?

c. Are you expecting a continuous flow of video, or discrete clips?

d. Do you need to process the video immediately, or can it be post-processed?

e. What challenges and errors do you foresee in data acquisition and processing?

f. What are your expectations for performance?

g. What are your system requirements?

h. Any other considerations?

The answers will determine which models and approaches you select for the pose estimation project. The selection criteria fall broadly into 3 categories, which are interlinked and depend on the answers to the questions above.

3. Select Models

i. Accuracy and Inference Speed

Accuracy and speed are often discussed together as a trade-off. Accuracy is measured in terms of MOTA and/or mAP, while speed is measured in Frames per Second (FPS).

MOTA : Multi-Object Tracking Accuracy, which measures how well a tracker detects multiple objects in video and associates these detections over time according to object identities

mAP : Mean Average Precision, calculated by taking the mean AP over all classes and/or over all Intersection over Union (IoU) thresholds
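As a rough illustration of how the two metrics are computed, the sketch below uses the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT, and treats mAP as the plain mean of a list of AP values. The function names and the sample counts are hypothetical and not taken from any benchmark.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


def mean_average_precision(ap_values):
    """Mean of AP values, one per class or per threshold."""
    if not ap_values:
        raise ValueError("ap_values must not be empty")
    return sum(ap_values) / len(ap_values)


# Hypothetical error counts accumulated over a short clip.
clip_mota = mota(false_negatives=30, false_positives=15, id_switches=5,
                 num_ground_truth=500)

# Hypothetical AP values measured at three thresholds.
clip_map = mean_average_precision([0.80, 0.70, 0.60])
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects.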

Top-down model FastPose. Top: inference speed and MOTA performance. Bottom: inference speed and mAP performance on PoseTrack.
Bottom-up model OpenPose

When deciding between models, consider accuracy and speed jointly. Popular models are usually released with a lighter variant that enables higher inference speed. Inference speed also depends on the deployment target: web or mobile, and CPU, GPU or NPU (Neural Processing Unit). While many deep learning models have focused on accuracy, the growing number of mobile use cases is forcing a re-evaluation of what defines a “good” model. Some of the more recent models achieve higher accuracy at faster speeds.
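When comparing candidates on your own hardware, a small timing harness is usually enough to get a first FPS figure. The sketch below is a minimal example; `measure_fps` and `dummy_infer` are hypothetical names, and `dummy_infer` merely stands in for a real model call.

```python
import time


def measure_fps(infer, frames, warmup=2):
    """Return the average frames per second of `infer` over `frames`."""
    for frame in frames[:warmup]:
        infer(frame)                      # warm-up runs are not timed
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_infer(frame):
    # Stand-in for a real pose model; simulates about 2 ms of work.
    time.sleep(0.002)
    return []


fps = measure_fps(dummy_infer, frames=[None] * 20)
```

Measure on the target device (CPU, GPU or NPU) with representative input resolution, since the figure can differ widely between platforms.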

Errors caused by occlusion are also common, and some models are designed for higher accuracy under specific conditions. FastPose, for example, has Re-ID built in for tracking purposes.

ii. Platform Deployment and Scalability

Platform deployment and scalability are also critical, and not only because of inference speed. Several models can be deployed easily across multiple platforms and programming languages.

The ability to scale is another consideration. MediaPipe, for example, offers separate modules for different body parts that can be used across platforms. This enables scalability across platforms and use cases.

Google has recently been launching new real-time pose estimation models at a rapid pace, including MediaPipe and MoveNet (May 2021).

MediaPipe Deployment Possibilities

iii. Domain Specific

Depending on your use case, some models are trained on labelled datasets that cater more specifically to a domain, e.g. fitness. Such a model generally performs better within that domain.

After working through your priorities, model selection and experimentation can begin. There are two main established approaches to 2D pose estimation.

Top-down approaches leverage existing techniques for single-person pose estimation but suffer from early commitment: if the person detector fails, there is little chance of recovery. Their runtime is also directly proportional to the number of people.

Bottom-up approaches are more robust to early commitment and can be designed for optimal multi-person detection. However, earlier bottom-up methods required costly global inference. Newer approaches jointly label part detection candidates and associate them with individual people, but they are generally harder to process and limit the number of part proposals.

The following provides a quick review of the various popular models, their architectures and their specific strategies for handling pose estimation errors.

1. Bottom Up Models

i. OpenPose (multi-person)

Pros :

  1. High-quality results at a fraction of the computational cost.
  2. Part Affinity Fields preserve both location and orientation across limbs, making it easier to distinguish common cross-over cases, e.g. overlapping arms.
  3. Non-maximum suppression on the detection confidence maps yields a discrete set of part candidate locations.

Cons :

  1. Lower accuracy than the top-down methods on people of smaller scales

OpenPose is the first real-time multi-person system to jointly detect human body, hand, facial and foot keypoints (up to 135 keypoints) on single images. It uses Part Affinity Fields (PAFs), a bottom-up nonparametric representation of association, to “connect” the body joints found in an image and associate them with individual people.
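To make the idea of “connecting” joints concrete, the sketch below scores a candidate limb by sampling the PAF along the segment between two joint candidates and averaging its alignment with the segment direction. This is a simplified, pure-Python illustration; `paf_score`, the sampling count and the toy field are assumptions, not OpenPose code.

```python
import math


def paf_score(paf_x, paf_y, joint_a, joint_b, num_samples=10):
    """Average alignment between the PAF and the segment joint_a -> joint_b.

    paf_x, paf_y: 2D lists indexed [row][col] with the field's x and y parts.
    joint_a, joint_b: (x, y) candidate locations in pixel coordinates.
    num_samples must be at least 2.
    """
    dx, dy = joint_b[0] - joint_a[0], joint_b[1] - joint_a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length        # unit vector along the limb
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        col = int(round(joint_a[0] + t * dx))
        row = int(round(joint_a[1] + t * dy))
        # Dot product of the field vector with the limb direction.
        total += paf_x[row][col] * ux + paf_y[row][col] * uy
    return total / num_samples


# Toy 5x5 field: vectors point right along row 2, zero elsewhere.
size = 5
paf_x = [[1.0 if r == 2 else 0.0 for c in range(size)] for r in range(size)]
paf_y = [[0.0] * size for _ in range(size)]

aligned = paf_score(paf_x, paf_y, (0, 2), (4, 2))    # follows the field
crossing = paf_score(paf_x, paf_y, (2, 0), (2, 4))   # perpendicular to it
```

A pair of candidates lying along the field scores high, while a pair that cuts across it scores near zero, which is what lets the method separate overlapping limbs.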

The original OpenPose was built on a pre-trained VGG network using the Caffe framework.

Cao et al. [1] focused on refining the PAFs, removing body part confidence map refinement while increasing the network depth, resulting in a faster and more accurate model.

OpenPose: Architecture of multi-stage CNN

1. A feed-forward network concurrently predicts a set of 2D confidence maps (S) of body part locations and a set of 2D vector fields of part affinities.

2. The predictions, along with the image features, are then concatenated as input to the next stage.

3. In the last step, Non-Maximum Suppression (NMS) is applied to the confidence maps to obtain a discrete set of part locations, which are then assembled into the 2D keypoints for all people in the image.
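The NMS step can be sketched as a local-maximum search over a single confidence map. The example below is a minimal pure-Python version; the function name, the 8-neighbour rule and the threshold are assumptions made for illustration.

```python
def find_peaks(conf_map, threshold=0.1):
    """Return (x, y, score) for each local maximum above `threshold`.

    conf_map: 2D list indexed [row][col] for a single body part.
    A pixel is kept only if it is strictly greater than all of its
    (up to 8) neighbours.
    """
    rows, cols = len(conf_map), len(conf_map[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = conf_map[r][c]
            if value <= threshold:
                continue
            neighbours = [
                conf_map[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((c, r, value))
    return peaks


# Toy confidence map containing two blobs, e.g. the same joint on two people.
conf_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.3, 0.1],
    [0.0, 0.0, 0.3, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.0],
]
peaks = find_peaks(conf_map)
```

Each blob collapses to a single candidate, giving the discrete set of part locations that the association step works on.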

ii. PersonLab (single, multi-person)

PersonLab is a box-free bottom-up approach for multi-person pose estimation and instance segmentation based on an efficient single-shot model. Papandreou et al. [2] used both semantic-level reasoning and object-part associations to detect individual keypoints and predict their relative displacements, allowing keypoints to be grouped into person pose instances. A part-induced geometric embedding descriptor further associates semantic person pixels with their corresponding person instance, delivering instance-level person segmentations. The model is trained on COCO data, and inference is efficient, with a runtime essentially independent of the number of people.

PersonLab: Person Pose Estimation and Instance Segmentation

1. The pose estimation module predicts keypoint heatmaps, short-range offsets and mid-range pairwise offsets.

2. Instance Segmentation module then predicts person segmentation maps and long-range offsets.
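To show how a heatmap and short-range offsets work together, the sketch below decodes one keypoint by taking the strongest heatmap cell and adding the offset stored at that cell. This argmax-plus-offset decode is a deliberate simplification chosen for illustration and is not PersonLab's own aggregation procedure; the function name and the output stride of 8 are assumptions.

```python
def decode_keypoint(heatmap, offsets_x, offsets_y, output_stride=8):
    """Return (x, y, score) in image pixels for one keypoint type.

    heatmap, offsets_x, offsets_y: 2D lists indexed [row][col].
    Offsets are in image pixels, relative to the cell's image position.
    """
    best_row, best_col, best_score = 0, 0, heatmap[0][0]
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > best_score:
                best_row, best_col, best_score = r, c, score
    x = best_col * output_stride + offsets_x[best_row][best_col]
    y = best_row * output_stride + offsets_y[best_row][best_col]
    return x, y, best_score


heatmap = [[0.1, 0.2, 0.1],
           [0.1, 0.9, 0.3],
           [0.0, 0.2, 0.1]]
offsets_x = [[0.0] * 3, [0.0, 3.0, 0.0], [0.0] * 3]
offsets_y = [[0.0] * 3, [0.0, -2.0, 0.0], [0.0] * 3]

keypoint = decode_keypoint(heatmap, offsets_x, offsets_y)
```

The offset recovers the precision lost when the heatmap is predicted at a lower resolution than the input image.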

Google has also continued to release new models such as PoseNet and MoveNet, with APIs accessible via TensorFlow.

iii. MoveNet (multi-person)

Pros :

  1. Suppresses high-frequency noise (jitter) and outliers from the model while optimising throughput for quick motions.
  2. Four prediction heads are attached to the feature extractor.
  3. Two models that run faster than real time (30+ FPS) across platforms.
  4. Trained on fitness, dance and yoga poses in collaboration with IncludeHealth. Great for fitness use cases.

Released on 15 May 2021, MoveNet uses heatmaps to accurately localise human keypoints, combining a feature extractor with a set of prediction heads. The prediction scheme generally follows CenterNet [3], with a few key changes that improve both speed and accuracy.

All models are trained using the TensorFlow Object Detection API. MobileNetV2 is used with an attached feature pyramid network (FPN), allowing for a high-resolution (output stride 4), semantically rich feature map output. A temporal filter is tuned to simultaneously suppress high-frequency noise (jitter) and outliers from the model while optimising throughput for quick motions.
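The trade-off a temporal filter has to strike, smoothing jitter without lagging behind quick motions, can be illustrated with a speed-adaptive exponential filter. The class below is a generic stand-in written for this article and should not be read as MoveNet's actual filter; its name and parameters are hypothetical, and it does not attempt outlier rejection.

```python
class KeypointSmoother:
    """Exponential smoothing whose strength relaxes as motion gets faster."""

    def __init__(self, min_alpha=0.2, speed_scale=0.05):
        self.min_alpha = min_alpha        # blend factor when nearly still
        self.speed_scale = speed_scale    # how quickly alpha grows with speed
        self.state = None

    def update(self, x, y):
        if self.state is None:
            self.state = (x, y)
            return self.state
        px, py = self.state
        speed = abs(x - px) + abs(y - py)          # pixels per frame (L1)
        alpha = min(1.0, self.min_alpha + self.speed_scale * speed)
        self.state = (px + alpha * (x - px), py + alpha * (y - py))
        return self.state


# Small jitter is heavily damped...
slow = KeypointSmoother()
slow.update(100.0, 100.0)
jitter_out = slow.update(101.0, 99.0)

# ...while a large, fast movement passes through almost unchanged.
fast = KeypointSmoother()
fast.update(100.0, 100.0)
fast_out = fast.update(140.0, 100.0)
```

One smoother instance would be kept per keypoint and updated once per frame.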

MoveNet: Comparison of a traditional detector (top) vs MoveNet (bottom) on difficult poses
MoveNet: Architecture based on four prediction heads

Four prediction heads are attached to the feature extractor:

1. Person center heatmap, which predicts the geometric center of person instances.

2. Keypoint regression field, which predicts the full set of keypoints for a person and is used to group keypoints into instances.

3. Person keypoint heatmap, which predicts the location of all keypoints irrespective of person instances.

4. 2D per-keypoint offset field, which predicts local offsets from each output feature map pixel to the precise sub-pixel location of each keypoint.
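One plausible way to combine the four outputs into a single keypoint is sketched below: find the person center, regress a coarse keypoint location from it, use that location to down-weight distant heatmap responses, then refine the winner with the local offset. The inverse-distance weighting, the data layout and all names are assumptions made for illustration; only one keypoint type is decoded to keep the example short.

```python
import math


def argmax2d(grid):
    """Return (row, col) of the largest value in a 2D list."""
    best = (0, 0)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value > grid[best[0]][best[1]]:
                best = (r, c)
    return best


def decode_person_keypoint(center_heatmap, regression, keypoint_heatmap, offsets):
    """Decode one keypoint of the most prominent person (feature-map units).

    center_heatmap[r][c]   : person-center score
    regression[r][c]       : (d_row, d_col) from a center pixel to the keypoint
    keypoint_heatmap[r][c] : keypoint score, irrespective of person
    offsets[r][c]          : (d_row, d_col) sub-pixel refinement
    """
    # 1. Person center = peak of the center heatmap.
    cr, cc = argmax2d(center_heatmap)
    # 2. Coarse keypoint location regressed from the center pixel.
    d_row, d_col = regression[cr][cc]
    coarse_r, coarse_c = cr + d_row, cc + d_col
    # 3. Down-weight responses far from the coarse location so that
    #    keypoints belonging to other people are suppressed.
    weighted = [
        [score / (1.0 + math.hypot(r - coarse_r, c - coarse_c))
         for c, score in enumerate(row)]
        for r, row in enumerate(keypoint_heatmap)
    ]
    kr, kc = argmax2d(weighted)
    # 4. Refine with the local offset at the selected pixel.
    o_row, o_col = offsets[kr][kc]
    return kr + o_row, kc + o_col


n = 4
center_heatmap = [[0.0] * n for _ in range(n)]
center_heatmap[1][1] = 0.9
regression = [[(0.0, 0.0)] * n for _ in range(n)]
regression[1][1] = (1.0, 1.0)
keypoint_heatmap = [[0.0] * n for _ in range(n)]
keypoint_heatmap[2][2] = 0.7      # keypoint of the person at the center
keypoint_heatmap[0][3] = 0.9      # stronger response from someone else
offsets = [[(0.0, 0.0)] * n for _ in range(n)]
offsets[2][2] = (0.25, -0.5)

keypoint_rc = decode_person_keypoint(center_heatmap, regression,
                                     keypoint_heatmap, offsets)
```

In the toy data the stronger response at (0, 3) is ignored because it lies far from the location regressed from the person center.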

iv. PoseNet (multi-person)

PoseNet is designed as a lightweight model for mobile, cross-platform and web deployment in TensorFlow.js. It uses ResNet or MobileNet (faster) as the backbone.

2. Top Down Models

i. FastPose (single, multi-person)

Pros :

  1. Scale invariant
  2. Occlusion-aware Re-ID strategy is designed for articulated multi-person pose tracking in video

FastPose is a pose tracking framework that performs pose estimation and tracking at close to real-time speed. By performing detection, pose estimation and tracking concurrently, Zhang et al. [4] created a model that is scale invariant and occlusion-aware.

FastPose : end-to-end multi-task network (MTN)

1. Build a multi-task network (MTN) that optimises human detection, pose estimation and person Re-ID simultaneously.

2. Next, the three groups of outputs are used to perform pose tracking. A scale-normalized image and feature pyramid (SIFP) paradigm is proposed to alleviate the scale variation problem for the multi-task network.

3. Finally, an occlusion-aware Re-ID strategy is designed for articulated multi-person pose tracking in video. For a better utilisation of Re-ID features, the pose information is used to infer the occlusion state.
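One plausible reading of “using pose information to infer the occlusion state” is sketched below: a person is treated as occluded when too few keypoints are confident, and the stored Re-ID feature of a track is only refreshed when the person is visible. This is an illustrative assumption, not the procedure from the FastPose paper; the thresholds, the momentum update and all names are hypothetical.

```python
def is_occluded(keypoint_scores, score_threshold=0.5, min_visible_ratio=0.6):
    """Treat a person as occluded when too few keypoints are confident."""
    visible = sum(1 for s in keypoint_scores if s >= score_threshold)
    return visible / len(keypoint_scores) < min_visible_ratio


def update_track_feature(track_feature, new_feature, keypoint_scores,
                         momentum=0.9):
    """Blend a new Re-ID feature into the track only if the person is visible."""
    if is_occluded(keypoint_scores):
        return track_feature              # keep the last reliable appearance
    return [momentum * old + (1.0 - momentum) * new
            for old, new in zip(track_feature, new_feature)]


track_feature = [1.0, 0.0]
new_feature = [0.0, 1.0]

visible_update = update_track_feature(track_feature, new_feature,
                                      keypoint_scores=[0.9] * 5)
occluded_update = update_track_feature(track_feature, new_feature,
                                       keypoint_scores=[0.9, 0.2, 0.1, 0.3, 0.2])
```

Skipping the update during occlusion keeps appearance features of the occluding person from contaminating the track.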

ii. AlphaPose (single, multi-person)

AlphaPose is a regional multi-person pose estimation (RMPE) framework designed to handle inaccurate and redundant human bounding boxes. It is based on three key components: the Symmetric Spatial Transformer Network (SSTN), Parametric Pose Non-Maximum Suppression (NMS) and the Pose-Guided Proposals Generator (PGPG). With AlphaPose, Fang et al. [5] achieved 76.7 mAP on the MPII (multi-person) dataset.

Symmetric STN architecture and training strategy with parallel SPPE

1. Generate bounding boxes with a human detector and feed them into the “Symmetric STN + SPPE” (single-person pose estimator) module to obtain pose proposals.

2. The Parallel SPPE acts as an extra regularizer during the training phase.

3. Finally, parametric Pose NMS (p-Pose NMS) is used to eliminate redundant pose estimations.
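The idea behind pose-level NMS can be sketched with a simplified greedy version: keep the highest-scoring pose, discard poses that are nearly identical to it, and repeat. The real p-Pose NMS uses a parametric similarity measure; the mean keypoint distance, the fixed threshold and the function names below are simplifying assumptions.

```python
import math


def pose_distance(pose_a, pose_b):
    """Mean Euclidean distance between corresponding keypoints."""
    return sum(math.dist(p, q) for p, q in zip(pose_a, pose_b)) / len(pose_a)


def pose_nms(poses, scores, min_distance=10.0):
    """Greedy NMS: keep the best pose, drop near-duplicates, repeat.

    Returns the indices of the poses that survive, best first.
    """
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(pose_distance(poses[i], poses[k]) >= min_distance for k in kept):
            kept.append(i)
    return kept


# Three proposals with two keypoints each; the second duplicates the first.
poses = [
    [(10, 10), (10, 30)],
    [(12, 11), (11, 29)],
    [(80, 10), (80, 30)],
]
scores = [0.9, 0.6, 0.8]
kept = pose_nms(poses, scores)
```

Redundant proposals typically arise when the human detector returns several overlapping boxes for the same person.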

iii. MediaPipe (single-person)

Pros :

  1. Heavy occlusion is managed with substantial occlusion-simulating augmentation.
  2. Trained with 25K human-annotated images of a single person performing fitness exercises.
  3. Runs at 30 frames per second with 33 keypoints, with versions offering up to 543 keypoints for pose, face and hands.

MediaPipe (BlazePose) extends the idea from Stacked Hourglass [6], using an encoder-decoder network architecture to predict heatmaps for all joints, followed by another encoder that regresses directly to the coordinates of all joints.

This allows the heatmap branch to be discarded during inference, making the model lightweight enough to run on a mobile phone. Running at 30 frames per second with 33 keypoints, and with versions offering up to 543 keypoints for pose, face and hands, it is an ideal choice for cross-platform deployment, especially in fitness applications.

MediaPipe : Inference Pipeline
MediaPipe : 33 keypoints topology
MediaPipe : Network architecture

1. A detector-tracker pipeline, consisting of a lightweight body pose detector followed by a pose tracker network, is used for inference. The tracker predicts the keypoints, the presence of a person and the refined region of interest for the current frame. When the tracker indicates that no human is present, the detector network is re-run on the next frame.

2. Most models depend on the Non-Maximum Suppression (NMS) algorithm for the final post-processing step. However, highly articulated poses cause inaccuracies in NMS. MediaPipe instead detects the bounding box of a relatively rigid body part such as the human face or torso (high-contrast features and less variation).

To keep the detector lightweight, the single-person use case assumes that the head is always visible. The face detector then predicts additional person-specific alignment parameters: the middle point between the person’s hips, the size of the circle circumscribing the whole person, and the incline (the angle of the line connecting the mid-shoulder and mid-hip points).

MediaPipe : Vitruvian man aligned via detector vs face detection bounding box

3. A combined heatmap, offset and regression approach is used for training; the heatmap and offset losses are used only during training, and the corresponding output layers are removed before inference. The heatmap supervises the lightweight embedding, which feeds into the regression encoder network. Partially inspired by Stacked Hourglass, a tiny encoder-decoder heatmap-based network and a subsequent regression encoder network are stacked.
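The control flow of the detector-tracker pipeline from step 1 can be sketched as a loop in which the detector runs only when there is no region of interest to track. The callables `detect` and `track`, their return shapes and the stub implementations are hypothetical and serve only to show the flow.

```python
def run_pipeline(frames, detect, track):
    """Run `detect` only when there is no region of interest to track.

    detect(frame)     -> roi, or None if nobody is found
    track(frame, roi) -> (keypoints, person_present, refined_roi)
    Returns (per-frame results, number of detector calls).
    """
    roi = None
    results = []
    detector_calls = 0
    for frame in frames:
        if roi is None:
            roi = detect(frame)
            detector_calls += 1
            if roi is None:
                results.append(None)
                continue
        keypoints, present, refined_roi = track(frame, roi)
        if present:
            results.append(keypoints)
            roi = refined_roi          # reuse the tracker's ROI next frame
        else:
            results.append(None)
            roi = None                 # forces the detector to run again


    return results, detector_calls


# Stubs: each "frame" is just a flag saying whether a person is in view.
def detect(frame):
    return "roi" if frame else None


def track(frame, roi):
    return ["keypoints"], frame, roi


frames = [True, True, False, True, True]
results, detector_calls = run_pipeline(frames, detect, track)
```

Because the comparatively expensive detector is skipped on most frames, the average cost per frame is dominated by the tracker.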

MediaPipe requires an initial pose alignment, so the dataset is restricted to cases where either the whole person is visible or the hip and shoulder keypoints can be confidently annotated.
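The alignment parameters described earlier (mid-hip point, circumscribing circle and incline) can be derived from annotated keypoints roughly as follows. Centring the circle on the mid-hip point and measuring the incline relative to vertical are assumptions made for this sketch, as are the function and keypoint names.

```python
import math


def midpoint(p, q):
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def alignment_parameters(keypoints):
    """keypoints: dict of name -> (x, y) in image pixels (y grows downward)."""
    mid_hip = midpoint(keypoints["left_hip"], keypoints["right_hip"])
    mid_shoulder = midpoint(keypoints["left_shoulder"],
                            keypoints["right_shoulder"])
    # Radius of a circle centred on the mid-hip that encloses every keypoint.
    radius = max(math.dist(mid_hip, p) for p in keypoints.values())
    # Incline of the hip-to-shoulder line relative to vertical;
    # 0 degrees for an upright person.
    dx = mid_shoulder[0] - mid_hip[0]
    dy = mid_hip[1] - mid_shoulder[1]        # flip so that "up" is positive
    incline_deg = math.degrees(math.atan2(dx, dy))
    return mid_hip, radius, incline_deg


# Hypothetical annotation of an upright person.
keypoints = {
    "nose": (100, 70),
    "left_shoulder": (85, 100),
    "right_shoulder": (115, 100),
    "left_hip": (90, 200),
    "right_hip": (110, 200),
    "left_ankle": (100, 340),
}
mid_hip, radius, incline_deg = alignment_parameters(keypoints)
```

This also shows why the hip and shoulder keypoints must be reliably annotated: without them neither the center nor the incline can be computed.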

MediaPipe : Results on yoga and fitness poses.

Heavy occlusion is managed with substantial occlusion-simulating augmentation. The training dataset consists of 60K images with a single person or a few people in the scene in common poses, and 25K images with a single person in the scene performing fitness exercises, all annotated by humans.

Skip-connections between all stages of the network are used to balance high-level and low-level features. Gradients from the regression encoder are not propagated back to the heatmap-trained features, which improves the heatmap predictions and substantially increases the coordinate regression accuracy.

References :

[1] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, Yaser Sheikh. “OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”. In: arXiv:1812.08008v2 (2019).

[2] George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, Kevin Murphy. “PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model”. In: arXiv:1803.08225v1 (2018).

[3] Xingyi Zhou, Dequan Wang, Philipp Krahenb. “Objects as Points”. In: arXiv:1904.07850v2 [cs.CV] (2019).

[4] Jiabin Zhang, Zheng Zhu, Wei Zou, Peng Li, Yanwei Li, Hu Su, Guan Huang. “FastPose: Towards Real-time Pose Estimation and Tracking via Scale-normalized Multi-task Networks”. In: arXiv:1908.05593v1 (2019).

[5] Hao-Shu Fang et al. “RMPE: Regional Multi-person Pose Estimation”. In: ICCV (2017).

[6] Alejandro Newell, Kaiyu Yang, Jia Deng. “Stacked Hourglass Networks for Human Pose Estimation”. In: arXiv:1603.06937v2 (2016).