## Overview

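1. Given the environment context, estimate the likelihood of each candidate target and keep the high-probability ones. In the figure, diamonds mark the candidate targets and stars mark the selected ones.
2. For each selected target, estimate a trajectory (distribution) conditioned on that target.
3. Score and rank all the trajectories, then select the final set.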

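TNT achieves state-of-the-art results on the Argoverse Forecasting dataset and the INTERACTION dataset. For pedestrian prediction, it also achieves state-of-the-art results on the Stanford Drone dataset and the in-house Pedestrian-at-Intersection dataset.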

## Modeling

### Step 1. Target prediction

The distribution over targets can be written in the following discretized form: $p(\tau^n | \boldsymbol{x}) = \pi(\tau^n | \boldsymbol{x}) \cdot \mathcal{N}(\Delta x^n | v_x^n(\boldsymbol{x})) \cdot\mathcal{N}(\Delta y^n | v_y^n(\boldsymbol{x}))$

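$$f(\cdot)$$ and $$v(\cdot)$$ are implemented as 2-layer MLPs. They take the candidate coordinates $$(x^k, y^k)$$ and the environment context feature $$\boldsymbol{x}$$ as input, and predict the probability of each target and its most likely offset ($$f(\cdot)$$ produces the per-candidate scores that $$\pi$$ normalizes). The training loss is defined as: $\mathcal{L}_{s1} = \mathcal{L}_{cls}(\pi, u) + \mathcal{L}_{offset}(v_x, v_y, \Delta x^u, \Delta y^u)$ $$\mathcal{L}_{cls}$$ is a cross-entropy loss and $$\mathcal{L}_{offset}$$ is a Huber loss. $$u$$ is the candidate target closest to the ground-truth location, and $$\Delta x^u$$, $$\Delta y^u$$ are the offsets of $$u$$ from the ground truth.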
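The stage-one loss can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the softmax normalization of the scores into $$\pi$$, the sign convention of the offsets, and the Huber threshold `delta=1.0` are all assumptions made here.

```python
import math

def softmax(scores):
    """Normalize raw per-candidate scores into probabilities (assumed form of pi)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear beyond delta (delta is an assumption)."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def stage1_loss(scores, pred_offsets, candidates, gt):
    """Return (u, L_cls, L_offset) for one agent.

    scores       -- one raw score per candidate target
    pred_offsets -- predicted (dx, dy) per candidate target
    candidates   -- candidate target coordinates (x, y)
    gt           -- ground-truth final location (x, y)
    """
    # u is the candidate closest to the ground-truth location.
    u = min(
        range(len(candidates)),
        key=lambda n: (candidates[n][0] - gt[0]) ** 2 + (candidates[n][1] - gt[1]) ** 2,
    )
    # Cross entropy against the one-hot label u.
    l_cls = -math.log(softmax(scores)[u])
    # Offsets from candidate u to the ground truth (assumed sign convention).
    dx_u = gt[0] - candidates[u][0]
    dy_u = gt[1] - candidates[u][1]
    l_offset = huber(pred_offsets[u][0] - dx_u) + huber(pred_offsets[u][1] - dy_u)
    return u, l_cls, l_offset
```

Only the offsets of the closest candidate $$u$$ contribute to the regression term, which matches the loss definition above; the other candidates are supervised through the classification term alone.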

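• For vehicles: uniformly sample points along the lane centerlines as candidate targets (marked as yellow diamonds).
• For pedestrians: generate a virtual grid around the agent and use the grid points as candidate targets.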
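Both candidate-generation strategies can be illustrated with a short sketch. The helper names `sample_centerline` and `grid_targets`, the polyline representation of a centerline, and the spacing and grid-size parameters are hypothetical choices made for this example; the notes above do not specify them.

```python
import math

def sample_centerline(polyline, spacing):
    """Return points every `spacing` units of arc length along a polyline.

    polyline -- list of (x, y) vertices describing a lane centerline
    spacing  -- distance between consecutive samples (must be > 0)
    """
    points = [polyline[0]]
    dist_to_next = spacing  # arc length left before the next sample
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already consumed on this segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = spacing
        # Carry the unused remainder of this segment into the next one.
        dist_to_next -= seg_len - pos
    return points

def grid_targets(center, half_extent, step):
    """Return the points of a square grid centered on the agent."""
    cx, cy = center
    n = int(round(half_extent / step))
    return [
        (cx + i * step, cy + j * step)
        for i in range(-n, n + 1)
        for j in range(-n, n + 1)
    ]
```

For example, `sample_centerline([(0, 0), (2, 0), (2, 2)], 1.0)` walks an L-shaped centerline of length 4 and returns five evenly spaced points, while `grid_targets((0, 0), 1.0, 0.5)` returns a 5 × 5 grid around the origin.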

### Step 2. Target-conditioned motion estimation

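This step makes two assumptions:

1. Future time steps are conditionally independent, which makes the computation more efficient.
2. Given the target, the distribution over trajectories is unimodal.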
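
Taken together, the two points above imply that the target-conditioned trajectory likelihood factorizes over future time steps. The following is a sketch of that factorization; the horizon $$T$$ and the per-step state $$s_t$$ are notation assumed here:

$$p(\boldsymbol{s}_F | \tau, \boldsymbol{x}) = \prod_{t=1}^{T} p(s_t | \tau, \boldsymbol{x})$$

Each factor $$p(s_t | \tau, \boldsymbol{x})$$ is then a single-mode distribution (a Gaussian is one natural choice), so all $$T$$ steps can be predicted in one forward pass instead of being sampled step by step.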

### Training and inference details

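At inference time, TNT proceeds as follows:

1. Encode the context.
2. Sample N candidate targets and keep the top M according to $$\pi(\tau|\boldsymbol{x})$$.
3. Estimate a trajectory for each of the M targets with $$p(\boldsymbol{s}_F | \tau, \boldsymbol{x})$$.
4. Score the M trajectories with $$\phi(\boldsymbol{s}_F |\tau, \boldsymbol{x})$$ and keep the top K.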
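
The selection logic of steps 2 to 4 can be sketched as follows. This is an illustration only: `select_trajectories` and the three callables it takes are hypothetical stand-ins for the trained target predictor, motion estimator, and scorer, and the toy functions below are not part of TNT.

```python
def select_trajectories(candidates, target_prob, predict_trajectory,
                        score_trajectory, m, k):
    """Keep the top-M targets, predict one trajectory each, return the top-K.

    target_prob        -- stand-in for pi(tau | x)
    predict_trajectory -- stand-in for the target-conditioned motion estimator
    score_trajectory   -- stand-in for phi(s_F | tau, x)
    """
    # Step 2: rank the N candidates by target probability and keep M.
    top_targets = sorted(candidates, key=target_prob, reverse=True)[:m]
    # Step 3: estimate one trajectory per selected target.
    pairs = [(target, predict_trajectory(target)) for target in top_targets]
    # Step 4: score every trajectory and keep the K best.
    pairs.sort(key=lambda pair: score_trajectory(pair[1]), reverse=True)
    return pairs[:k]

# Toy stand-ins, purely for demonstration.
TOY_PROBS = {(0.0, 0.0): 0.1, (1.0, 0.0): 0.4, (2.0, 0.0): 0.3, (3.0, 0.0): 0.2}

def toy_predict(target, steps=2):
    """Straight line from the origin to the target."""
    return [(target[0] * i / steps, target[1] * i / steps) for i in range(steps + 1)]

def toy_score(trajectory):
    """Prefer trajectories that end near x = 2.2."""
    return -abs(trajectory[-1][0] - 2.2)

selected = select_trajectories(
    list(TOY_PROBS), TOY_PROBS.get, toy_predict, toy_score, m=3, k=2
)
```

With these toy inputs, the lowest-probability candidate `(0.0, 0.0)` is dropped in step 2, and the scorer then keeps the trajectories ending at `(2.0, 0.0)` and `(3.0, 0.0)`.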

## Experiments

### Datasets

1. Argoverse forecasting dataset [9] provides trajectory histories, context agents and lane centerline for future trajectory prediction. There are 333K 5-second long sequences in the dataset. The trajectories are sampled at 10Hz, with (0, 2] seconds for observation and (2, 5] seconds for future prediction.
2. INTERACTION dataset [10] focuses on vehicle behavior prediction in highly interactive driving scenarios. It provides 4 different categories of interactive driving scenarios: roundabout (10479 vehicles), un-signalized intersection (14867 vehicles), signalized intersection (10933 vehicles), merging and lane changing (3775 vehicles).
3. In-house Pedestrian-at-Intersection dataset (PAID) is an in-house pedestrian dataset collected around crosswalks and intersections. There are around 77K unique pedestrians for training and 12K unique pedestrians for testing. The trajectories are sampled at 10Hz, and a 1-second history trajectory is used to predict a 3-second future. Map features include crosswalks, lane boundaries and stop/yield signs.
4. Stanford Drone dataset (SDD) [11] is a video dataset with top-down recordings of college campus scenes, collected by drones. The RGB video frames provide context similar to road maps in other datasets. Following the practice of other literature [2, 16, 37], only pedestrian trajectories are considered: frames are sampled at 2.5 Hz, 2 seconds of history (5 frames) are used as model input, and 4.8 seconds (12 frames) are the future to be predicted.
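
The observation and prediction windows quoted above translate into step counts as sampling rate × duration. A small sketch of that arithmetic follows; the dictionary layout and the helper name `step_counts` are illustrative assumptions, and INTERACTION is omitted because no window lengths are given for it above.

```python
# (sampling rate in Hz, history seconds, future seconds), as listed above
WINDOWS = {
    "Argoverse": (10.0, 2.0, 3.0),
    "PAID": (10.0, 1.0, 3.0),
    "SDD": (2.5, 2.0, 4.8),
}

def step_counts(rate_hz, history_s, future_s):
    """Number of history and future steps for a given sampling rate."""
    return round(rate_hz * history_s), round(rate_hz * future_s)

for name, window in WINDOWS.items():
    print(name, step_counts(*window))
```

This gives 20 observed and 30 future steps for Argoverse, 10 and 30 for PAID, and 5 and 12 for SDD, consistent with the frame counts stated for SDD.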

### Comparison with state-of-the-art

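Table 7 shows that TNT achieves the best results on vehicle trajectory prediction.

Tables 8 and 9 show that TNT achieves the best results on pedestrian prediction on the PAID and SDD datasets.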

## References

1. Zhao H, Gao J, Lan T, et al. TNT: Target-driven trajectory prediction[J]. arXiv preprint arXiv:2008.08294, 2020.