## Overview

Our multi-modal STCrowd dataset currently supports detection, tracking, and prediction tasks. We define evaluation metrics and provide benchmarks.

We use the Average Precision (AP) metric with a 3D center-distance threshold instead of IoU, because pedestrians are objects with small footprints, for which IoU may not be a suitable measure. AP is the normalized area under the precision-recall curve. For crowded scenes, the distance thresholds are chosen from $D = \{0.25, 0.5, 1\}$ meters, and the mean Average Precision (mAP) is calculated as: $mAP = \frac{1}{|D|} \sum_{d \in D} AP_d$.
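
The mAP formula can be illustrated with a minimal Python sketch. It is not the official STCrowd evaluation code: the function names, the input layout (lists of scores and 3D centers), the greedy nearest-center matching, and the step-wise integration of the precision-recall curve are all assumptions made for illustration. Only the thresholds $D$ and the averaging over $D$ come from the text.

```python
import math

# Center-distance thresholds in meters, taken from the text.
D = (0.25, 0.5, 1.0)


def average_precision(preds, gt_centers, threshold):
    """AP at one center-distance threshold (hypothetical helper).

    preds: list of (score, (x, y, z)) detections.
    gt_centers: list of (x, y, z) ground-truth centers.
    Assumption: detections are matched greedily, in descending score
    order, to the nearest unmatched ground truth within `threshold`.
    """
    if not gt_centers:
        return 0.0
    matched = set()
    tp_flags = []
    for _, center in sorted(preds, key=lambda p: p[0], reverse=True):
        best_idx, best_dist = None, threshold
        for idx, gt in enumerate(gt_centers):
            if idx in matched:
                continue
            dist = math.dist(center, gt)  # 3D Euclidean center distance
            if dist <= best_dist:
                best_idx, best_dist = idx, dist
        if best_idx is not None:
            matched.add(best_idx)
        tp_flags.append(best_idx is not None)

    # Step-wise area under the precision-recall curve: each true
    # positive raises recall by 1 / len(gt_centers).
    ap, true_positives = 0.0, 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            true_positives += 1
            ap += (true_positives / rank) / len(gt_centers)
    return ap


def mean_average_precision(preds, gt_centers, thresholds=D):
    """mAP = (1 / |D|) * sum of AP_d over d in D."""
    return sum(
        average_precision(preds, gt_centers, d) for d in thresholds
    ) / len(thresholds)
```

With two ground-truth pedestrians and three detections, one of which is 0.6 m off its target, the loosest threshold accepts that detection while the two tighter ones reject it, so the averaged score falls between the per-threshold values.
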
In addition to AP, performance on occluded instances is also considered for crowded scenes. We calculate the average recall (AR) over the center-distance thresholds $D = \{0.25, 0.5, 1\}$ meters for each occlusion level $i$: $AR_i = \frac{1}{|D|} \sum_{d \in D} Recall_{i,d}, i \in \{0,1,2\}$.
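
As a rough illustration of the per-occlusion average recall, the sketch below matches ground-truth centers to predicted centers once per threshold and then groups the matches by occlusion level. The function names, the input layout, and the greedy one-to-one matching are assumptions for illustration, not the dataset's official evaluation code. Only the thresholds $D$, the levels $\{0,1,2\}$, and the averaging over $D$ come from the text.

```python
import math

# Center-distance thresholds in meters, taken from the text.
D = (0.25, 0.5, 1.0)


def matched_ground_truths(pred_centers, gt_centers, threshold):
    """Return one bool per ground truth: was it matched? (hypothetical helper)

    Assumption: each ground truth greedily claims its nearest unused
    prediction whose 3D center distance is within `threshold`.
    """
    used = set()
    matched = []
    for gt in gt_centers:
        best_idx, best_dist = None, threshold
        for idx, pred in enumerate(pred_centers):
            if idx in used:
                continue
            dist = math.dist(pred, gt)
            if dist <= best_dist:
                best_idx, best_dist = idx, dist
        if best_idx is not None:
            used.add(best_idx)
        matched.append(best_idx is not None)
    return matched


def average_recall_by_occlusion(pred_centers, gt_centers, gt_levels,
                                thresholds=D, levels=(0, 1, 2)):
    """AR_i = (1 / |D|) * sum of Recall_{i,d} over d in D, per level i.

    gt_levels[k] is the occlusion level of gt_centers[k]. A level with
    no ground-truth instances is reported as 0.0 (an assumption).
    """
    recalls = {level: [] for level in levels}
    for d in thresholds:
        matched = matched_ground_truths(pred_centers, gt_centers, d)
        for level in levels:
            flags = [m for m, lvl in zip(matched, gt_levels) if lvl == level]
            recalls[level].append(sum(flags) / len(flags) if flags else 0.0)
    return {level: sum(vals) / len(vals) for level, vals in recalls.items()}
```

Calling `average_recall_by_occlusion(pred_centers, gt_centers, gt_levels)` returns a dictionary keyed by occlusion level, which makes it easy to see how recall changes from one level to the next.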