PointCNN Paper Notes

Author: 陆语

ABSTRACT

The key to the success of CNNs is the convolution operator that is capable of leveraging spatially-local correlation in data represented densely in grids (e.g. images).

Convolution can exploit the spatially-local correlation of data that is densely represented on a grid.

Point clouds are irregular and unordered, thus a direct convolving of kernels against the features associated with the points will result in deserting the shape information while being variant to the orders.

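Because point clouds are unordered and irregular, direct convolution loses shape information and makes the output depend on the order of the points.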

To address these problems, we propose to learn a X-transformation from the input points, and then use it to simultaneously weight the input features associated with the points and permute them into latent potentially canonical order, before the element-wise product and sum operations are applied.

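Before the multiply-and-sum step, the paper learns an X-transformation from the input points and uses it to weight the point features and, at the same time, permute them into a latent, potentially canonical order.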

INTRODUCTION

Spatially-local correlation is a ubiquitous property of various types of data that is independent of the data representation.

The convolution operator has been shown to be quite effective in exploiting such correlation.

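Spatially-local correlation is a property shared by many kinds of data, regardless of how the data is represented.
In regular domains such as images, convolution is very effective at exploiting this correlation.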

Point cloud is irregular and unordered, rendering the convolution operator ill-suited for leveraging spatially-local correlation in the data.

The irregular, unordered nature of point clouds makes it hard for the convolution operator to exploit local correlation.

We illustrate the problems and challenges of applying convolution on point cloud with Figure 1. Suppose the unordered set of the C-dimensional input features is the same $F = \{f_a, f_b, f_c, f_d\}$ in all the cases in Figure 1, and we have one convolution kernel $\mathbf{K} = [k_\alpha, k_\beta, k_\gamma, k_\delta]^T$ of shape 4 × C. In (i), by following the canonical order given by the regular grid structure, the features in the local 2 × 2 patch can be cast into $[f_a, f_b, f_c, f_d]^T$ of shape 4 × C, for convolving with $\mathbf{K}$, yielding $f_i = \mathrm{Conv}(\mathbf{K}, [f_a, f_b, f_c, f_d]^T)$, where Conv(·, ·) is simply an element-wise product followed by a sum. In (ii), (iii), and (iv), the points are sampled from local neighborhoods, thus can be in arbitrary orders. By following the orders illustrated in the figure, the input feature set $F$ can be cast into $[f_a, f_b, f_c, f_d]^T$ in (ii) and (iii), and $[f_c, f_a, f_b, f_d]^T$ in (iv). Based on this, if the convolution operator is directly applied, the output features for the three cases could be computed as:

$$
\begin{aligned}
f_{ii} &= \mathrm{Conv}(\mathbf{K}, [f_a, f_b, f_c, f_d]^T), \\
f_{iii} &= \mathrm{Conv}(\mathbf{K}, [f_a, f_b, f_c, f_d]^T), \\
f_{iv} &= \mathrm{Conv}(\mathbf{K}, [f_c, f_a, f_b, f_d]^T).
\end{aligned}
\tag{1}
$$

Note that $f_{ii} \equiv f_{iii}$ holds for all cases, while $f_{iii} \neq f_{iv}$ holds for most cases. Now, it is clear that a direct convolving results in deserting the shape information (i.e., $f_{ii} \equiv f_{iii}$) while being variant to the orders (i.e., $f_{iii} \neq f_{iv}$).

In short: the grid in (i) fixes a canonical order for the features, whereas the points in (ii), (iii), and (iv) are sampled from local neighborhoods and can arrive in any order. Feeding them directly to the kernel gives the same output for (ii) and (iii) even though their shapes differ, so shape information is lost. In most cases it gives different outputs for (iii) and (iv) even though they contain the same points, so the result depends on point order.
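The two failure modes of direct convolution described for Figure 1 can be checked numerically. The sketch below is only an illustration: the channel count, the random features, and the helper name `conv` are assumptions made for the demo, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 3                                         # channel count, arbitrary for the demo
f_a, f_b, f_c, f_d = rng.normal(size=(4, C))  # four C-dimensional point features
K = rng.normal(size=(4, C))                   # one kernel of shape 4 x C

def conv(kernel, feats):
    """Element-wise product followed by a sum, as Conv is described in the text."""
    return float(np.sum(kernel * feats))

F_ii = np.stack([f_a, f_b, f_c, f_d])   # case (ii)
F_iii = np.stack([f_a, f_b, f_c, f_d])  # case (iii): other shape, same features and order
F_iv = np.stack([f_c, f_a, f_b, f_d])   # case (iv): same points as (iii), other order

f_ii, f_iii, f_iv = conv(K, F_ii), conv(K, F_iii), conv(K, F_iv)
print("f_ii == f_iii:", f_ii == f_iii)              # coordinates never enter the computation
print("f_iii ~= f_iv:", np.isclose(f_iii, f_iv))    # generally False for random inputs
```

The point coordinates never appear in `conv`, which is why cases (ii) and (iii) cannot be told apart, while reordering the rows changes which kernel row meets which feature.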

In this paper, we propose to learn a $K \times K$ X-transformation from the coordinates of $K$ input points $(p_1, p_2, \ldots, p_K)$ with a multilayer perceptron, i.e., $X = \mathrm{MLP}(p_1, p_2, \ldots, p_K)$, then use it to simultaneously weight and permute the input features, and finally apply the typical convolution on the transformed features. We call the process X-Conv, and it is the basic building block for our PointCNN. The X-Conv for (ii), (iii), and (iv) in Figure 1 can be depicted as:

$$
\begin{aligned}
f_{ii} &= \mathrm{Conv}(\mathbf{K}, X_{ii} \times [f_a, f_b, f_c, f_d]^T), \\
f_{iii} &= \mathrm{Conv}(\mathbf{K}, X_{iii} \times [f_a, f_b, f_c, f_d]^T), \\
f_{iv} &= \mathrm{Conv}(\mathbf{K}, X_{iv} \times [f_c, f_a, f_b, f_d]^T),
\end{aligned}
$$

where the $X$s are 4 × 4 matrices, as $K = 4$ in Figure 1. Note that since $X_{ii}$ and $X_{iii}$ are learnt from points in different shapes, they can differ so as to weight the input features accordingly, thus achieving $f_{ii} \neq f_{iii}$. For $X_{iii}$ and $X_{iv}$, if they are learnt to satisfy $X_{iii} = X_{iv} \times \Pi$, where $\Pi$ is the permutation matrix for permuting (c, a, b, d) into (a, b, c, d), then $f_{iii} \equiv f_{iv}$ can be achieved.

In short: a K × K X-transformation is learned from the coordinates of the K input points with a multilayer perceptron, used to weight and permute the input features at the same time, and followed by an ordinary convolution. This X-Conv process is the basic building block of PointCNN. Because the transformations for (ii) and (iii) are learned from different shapes, they can weight the features differently and give different outputs. If the transformations for (iii) and (iv) differ only by the permutation Π that maps (c, a, b, d) to (a, b, c, d), the two outputs coincide.
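The role of the permutation matrix Π can be illustrated with a small numerical sketch. Everything here is assumed for the demo: the features and X matrices are random rather than learned, and features are stacked as rows and multiplied by X from the left. Under that convention the condition is written as `X_iv = X_iii @ Pi`, with `Pi` reordering the rows (c, a, b, d) into (a, b, c, d). This is equivalent to the relation in the text up to whether Π or its transpose is used.

```python
import numpy as np

rng = np.random.default_rng(1)
C = 3
f_a, f_b, f_c, f_d = rng.normal(size=(4, C))
K = rng.normal(size=(4, C))                  # kernel of shape 4 x C

def conv(kernel, feats):
    """Element-wise product followed by a sum."""
    return float(np.sum(kernel * feats))

F_iii = np.stack([f_a, f_b, f_c, f_d])       # order (a, b, c, d)
F_iv = np.stack([f_c, f_a, f_b, f_d])        # order (c, a, b, d)

# Pi reorders rows (c, a, b, d) -> (a, b, c, d), i.e. Pi @ F_iv == F_iii
Pi = np.array([[0, 1, 0, 0],
               [0, 0, 1, 0],
               [1, 0, 0, 0],
               [0, 0, 0, 1]], dtype=float)

X_iii = rng.normal(size=(4, 4))              # stand-in for a learned X (assumption)
X_iv = X_iii @ Pi                            # the "ideal" relation between the two

f_iii = conv(K, X_iii @ F_iii)
f_iv = conv(K, X_iv @ F_iv)
print(np.isclose(f_iii, f_iv))               # the order dependence is cancelled
```

With an unrelated random `X_iv` the two outputs would generally differ again, which matches the paper's remark that the learned transformations only approximate this ideal.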

Figure 1: Convolution input from regular grids (i) and point cloud (ii, iii, and iv). In regular grids, each grid cell is associated with a feature. In point cloud, the points are sampled from local neighborhoods, in analogy to local patches in regular grids, and each point is associated with a feature, an order index, as well as its coordinates. However, the lack of regular grids poses the challenge of sorting the points into canonical orders.

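Figure 1 contrasts convolution input from a regular grid (i) with input from point clouds (ii, iii, and iv). Each grid cell carries a feature; each point carries a feature, an order index, and its coordinates. Without a regular grid, sorting the points into a canonical order is the hard part.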

From the analysis of the example in Figure 1, it is clear that, with ideal X-transformations, X-Conv is capable of taking the point shapes into consideration, while being independent of point orders. In practice, we found that the learnt X-transformations are far from ideal, especially in terms of the permutation equivalence aspect. Nevertheless, PointCNN built with X-Conv is still significantly better than a direct application of typical convolution on point cloud, and on par with or better than state-of-the-art non-convolutional neural networks designed for consuming point cloud data, such as PointNet++ [Qi et al. 2017b].

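With ideal X-transformations, X-Conv accounts for point shape while being independent of point order. In practice the learned X-transformations are far from ideal, especially with respect to permutation equivalence. Even so, PointCNN built with X-Conv is significantly better than applying typical convolution directly to point clouds, and on par with or better than state-of-the-art non-convolutional networks for point clouds such as PointNet++.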

We explain the detail of X-Conv, as well as PointCNN architectures in Section 3. We show our results on multiple challenging benchmark datasets and tasks in Section 4, together with ablation experiments and visualizations for better understanding of PointCNN.

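Section 3 covers the details of X-Conv and the PointCNN architectures. Section 4 reports results on several challenging benchmark datasets and tasks, together with ablation experiments and visualizations that help in understanding PointCNN.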

PointCNN

The hierarchical application of the convolution operator is essential to CNNs for learning hierarchical representations. PointCNN shares the same design, and generalizes it to point cloud. In this section, we first introduce hierarchical convolution in PointCNN, in analogy to that in image CNNs, then explain the core X-Conv operator in detail, and finally present PointCNN architectures for classification and segmentation tasks.

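Applying convolution hierarchically is what lets CNNs learn hierarchical representations. PointCNN keeps this design and generalizes it to point clouds. This section first introduces hierarchical convolution in PointCNN by analogy with image CNNs, then explains the core X-Conv operator in detail, and finally presents PointCNN architectures for classification and segmentation.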

Hierarchical Convolution

Before we introduce the hierarchical convolution in PointCNN, we briefly go through that in image CNNs, with the illustration of Figure 2 upper. The input to image CNNs is a feature map F1 in shape R1 × R1 × C1, where R1 is the spatial resolution, and C1 is the feature channel depth. The convolution of kernels K in shape K × K × C1 × C2 against local patches in shape K × K × C1 from F1 yields another feature map F2 in shape R2 × R2 × C2. Note that in Figure 2 upper, R1 = 4, K = 2, and R2 = 3. Compared with F1, F2 often is of lower resolution (R2 < R1) and deeper channels (C2 > C1), and encodes higher level information. This process is recursively applied, producing feature maps in less and less spatial resolution (4×4 → 3×3 → 2×2 in Figure 2 upper), but deeper and deeper channels (visualized by thicker and thicker dots in Figure 2 upper).

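In image CNNs (Figure 2, upper), the input is a feature map F1 of shape R1 × R1 × C1, where R1 is the spatial resolution and C1 the channel depth. Convolving kernels of shape K × K × C1 × C2 against K × K × C1 local patches of F1 gives a feature map F2 of shape R2 × R2 × C2; in the figure, R1 = 4, K = 2, and R2 = 3. F2 usually has lower resolution (R2 < R1) and more channels (C2 > C1), and encodes higher-level information. Applied recursively, this yields feature maps of decreasing spatial resolution (4×4 → 3×3 → 2×2) and increasing channel depth (drawn as thicker dots).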

The input to PointCNN is $F_1 = \{(p_{1,i}, f_{1,i}) : i = 1, 2, \ldots, N_1\}$, i.e., a set of points $\{p_{1,i} : p_{1,i} \in \mathbb{R}^{D}\}$, each associated with a feature $\{f_{1,i} : f_{1,i} \in \mathbb{R}^{C_1}\}$. Following the hierarchical construction of image CNNs, we would like to apply X-Conv on $F_1$ and get a higher level representation $F_2 = \{(p_{2,i}, f_{2,i}) : f_{2,i} \in \mathbb{R}^{C_2}, i = 1, 2, \ldots, N_2\}$, where $\{p_{2,i}\}$ is a set of representative points of $\{p_{1,i}\}$, with $N_2 < N_1$ and $C_2 > C_1$, so $F_2$ is of lower resolution and deeper feature channels than $F_1$. When the X-Conv process of turning $F_1$ into $F_2$ is recursively applied, the input points with features are “projected”, or “aggregated”, into fewer and fewer points (9 → 5 → 2 in Figure 2 lower), but each with richer and richer features (visualized by thicker and thicker dots in Figure 2 lower).

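PointCNN takes as input a set of N1 points in D dimensions, each with a C1-dimensional feature. Following the hierarchical construction of image CNNs, X-Conv maps F1 to a higher-level representation F2 with fewer points (N2 < N1) and deeper features (C2 > C1), where the points of F2 are representative points of the input. Applied recursively, the input points and their features are “projected”, or “aggregated”, onto fewer and fewer points (9 → 5 → 2 in Figure 2, lower), each with richer features (drawn as thicker dots).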

Note that $\{p_{2,i}\}$ is not necessarily a subset of $\{p_{1,i}\}$. The representative points can be at arbitrary locations in the space, whichever are beneficial for the information “projection” or “aggregation”. In our implementation, $\{p_{2,i}\}$ is simply a random down-sampling of $\{p_{1,i}\}$ for classification tasks, and farthest point sampling for segmentation tasks, as segmentation tasks are more demanding on a uniform point distribution. We suspect some special points which have shown promising performance in geometric processing, such as Deep Points [Wu et al. 2015a], could fit in here as well. However, we leave the exploration of better representative points generation methods as future work.

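The representative points need not be a subset of the input points; they can sit anywhere in space that helps the “projection” or “aggregation” of information. The implementation uses random down-sampling of the input points for classification and farthest point sampling for segmentation, because segmentation depends more on a uniform point distribution. The authors suspect that special points that work well in geometry processing, such as Deep Points [Wu et al. 2015a], could also serve here, and leave better ways of generating representative points to future work.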
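The two ways of choosing representative points mentioned above can be sketched as follows. This is a plain NumPy illustration under assumptions (Euclidean distance, a fixed start index, hypothetical function names), not the code from the PointCNN repository.

```python
import numpy as np

def random_downsample(points, n_out, rng):
    """Pick n_out distinct point indices uniformly at random."""
    return rng.choice(len(points), size=n_out, replace=False)

def farthest_point_sampling(points, n_out, start=0):
    """Greedy farthest point sampling; returns indices of the chosen points.

    Each step picks the point whose distance to the already chosen set is
    largest, which spreads the samples evenly over the shape.
    """
    chosen = [start]
    # distance from every point to its closest chosen point so far
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(n_out - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(chosen)

rng = np.random.default_rng(0)
cloud = rng.random((200, 3))                 # toy point cloud, N1 = 200, D = 3
rep_random = cloud[random_downsample(cloud, 20, rng)]
rep_fps = cloud[farthest_point_sampling(cloud, 20)]
print(rep_random.shape, rep_fps.shape)
```

Random down-sampling is cheaper, while farthest point sampling costs one distance pass per selected point in exchange for a more uniform coverage.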

X-Conv Operator

X-Conv is the core operator for turning $F_1$ into $F_2$. To leverage spatially-local correlation, similar to convolution in image CNNs, X-Conv works with local regions. Since the output features are supposed to be associated with the representative points $\{p_{2,i}\}$, X-Conv takes their neighborhood points in $\{p_{1,i}\}$, as well as the associated features, as input to convolve with.

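X-Conv is the core operator that turns F1 into F2. Like convolution in image CNNs, it works on local regions to exploit spatially-local correlation. Since the output features belong to the representative points, X-Conv takes each representative point's neighbors among the input points, together with their features, as the input to convolve.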

For simplicity, we denote a representative point in $\{p_{2,i}\}$ as $p$, and its $K$ neighbors in $\{p_{1,i}\}$ as $\mathbb{N}$, thus the X-Conv input for this specific $p$ is $\mathbb{S} = \{(p_i, f_i) : p_i \in \mathbb{N}\}$. Note that $\mathbb{S}$ is an unordered set. Without loss of generality, $\mathbb{S}$ can be cast into a $K \times D$ matrix $\mathbf{P} = (p_1, p_2, \ldots, p_K)^T$, and a $K \times C_1$ matrix $\mathbf{F} = (f_1, f_2, \ldots, f_K)^T$. The trainable parameters of X-Conv are a $K \times (C_1 + C_\delta) \times C_2$ tensor $\mathbf{K}$. With these inputs, we would like to compute the feature $\mathbf{F}_p$, which is the input features “projected”, or “aggregated”, into the representative point $p$. We depict the X-Conv operator in Algorithm 1, or maybe more concisely, it can be summarized as:

$$
\mathbf{F}_p = \mathrm{X\text{-}Conv}(\mathbf{K}, p, \mathbf{P}, \mathbf{F}) = \mathrm{Conv}\big(\mathbf{K},\ \mathrm{MLP}(\mathbf{P} - p) \times [\mathrm{MLP}_\delta(\mathbf{P} - p),\ \mathbf{F}]\big), \tag{3}
$$

where $\mathrm{MLP}_\delta(\cdot)$ is a multilayer perceptron applied individually on each point, the same as that in PointNet. Note that all the operations involved in building X-Conv, i.e., Conv(·, ·), MLP(·), matrix multiplication (·)×(·), and $\mathrm{MLP}_\delta(\cdot)$, are differentiable. In this case, clearly, X-Conv is differentiable, thus can be plugged into a neural network for training by back propagation.

In short: for a representative point p with K neighbors among the input points, the unordered neighbor set is written as a K × D coordinate matrix P and a K × C1 feature matrix F, and the trainable kernel is a K × (C1 + Cδ) × C2 tensor. X-Conv moves the neighbors into p's local frame, lifts the local coordinates to features with a per-point MLPδ (as in PointNet), concatenates them with F, multiplies by the X-transformation learned from the local coordinates, and applies a standard convolution. Every step is differentiable, so X-Conv can be trained by back propagation inside a neural network.
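Eq. (3) can be traced step by step in a forward-only NumPy sketch. Several things here are assumptions made for brevity and are not the paper's configuration: both MLPs are reduced to a single layer, the weights are random, the parameter names (`W_delta`, `W_x`, `kernel`) are hypothetical, and the final Conv is a dense contraction rather than a separable convolution.

```python
import numpy as np

def elu(x):
    """ELU activation, the nonlinearity the text says PointCNN uses."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def init_params(K, D, C1, C_delta, C2, rng, scale=0.1):
    """Hypothetical parameter container with single-layer MLPs (assumption)."""
    return {
        "W_delta": rng.normal(scale=scale, size=(D, C_delta)),     # per-point MLP_delta
        "b_delta": np.zeros(C_delta),
        "W_x": rng.normal(scale=scale, size=(K * D, K * K)),       # MLP that produces X
        "b_x": np.zeros(K * K),
        "kernel": rng.normal(scale=scale, size=(K, C_delta + C1, C2)),
    }

def x_conv(p, P, F, params):
    """Forward pass of Eq. (3) for one representative point p."""
    K = P.shape[0]
    P_local = P - p                                         # local coordinates, K x D
    F_delta = elu(P_local @ params["W_delta"] + params["b_delta"])  # lifted, K x C_delta
    F_star = np.concatenate([F_delta, F], axis=1)           # [F_delta, F], K x (C_delta + C1)
    # X is computed from all K local coordinates at once, so it sees the point order
    X = (P_local.reshape(-1) @ params["W_x"] + params["b_x"]).reshape(K, K)
    F_X = X @ F_star                                        # weight and permute
    return np.einsum("kc,kco->o", F_X, params["kernel"])    # product and sum, length C2

rng = np.random.default_rng(0)
K, D, C1, C_delta, C2 = 4, 3, 5, 2, 8
params = init_params(K, D, C1, C_delta, C2, rng)
p = rng.normal(size=D)                  # representative point
P = p + rng.normal(size=(K, D))         # its K neighbors
F = rng.normal(size=(K, C1))            # features of the neighbors
F_p = x_conv(p, P, F, params)
print(F_p.shape)                        # (8,)
```

Because only `P - p` enters the computation, translating the point and its neighbors together leaves `F_p` unchanged, which is the position independence the operator is designed for.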

In our implementation, K nearest neighbor search is applied for extracting the K neighboring points. This assumes a more or less uniform distribution of input points. For point cloud with nonuniform point distribution, a radius search can be applied first, and then randomly sample K points out of the radius search results.

Note that the trainable kernel $\mathbf{K}$ of X-Conv is a $K \times (C_1 + C_\delta) \times C_2$ tensor. The trainable parameter number is proportional to the number of neighboring points $K$, instead of being quadratic as in image CNNs, or cubic as in 3D CNNs. In this sense, we consider our PointCNN sparse in both the input representation and kernels, and it saves both memory and computation. The sparse kernels enable the coupling of long range information without dramatic growth of trainable parameter numbers.

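The implementation finds the K neighbors with a K-nearest-neighbor search, which assumes roughly uniformly distributed input points. For non-uniform point clouds, a radius search can be run first and K points sampled at random from its results.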

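The trainable kernel of X-Conv is a K × (C1 + Cδ) × C2 tensor, so the number of trainable parameters grows linearly with the number of neighbors K, rather than quadratically as in image CNNs or cubically as in 3D CNNs. In this sense PointCNN is sparse in both its input representation and its kernels, which saves memory and computation. Sparse kernels also allow long-range information to be coupled without a dramatic growth in trainable parameters.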
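The two neighbor-extraction strategies described above (K nearest neighbors, and radius search followed by random sampling) might look like this in brute-force NumPy. The function names are hypothetical, and the fallback to sampling with replacement when fewer than K points fall inside the radius is an assumption, since the text does not say how that case is handled.

```python
import numpy as np

def knn_indices(points, query, k):
    """Indices of the k points closest to `query` (brute force)."""
    dist = np.linalg.norm(points - query, axis=1)
    return np.argsort(dist)[:k]

def radius_then_sample(points, query, radius, k, rng):
    """Radius search first, then k random picks from the hits.

    Assumption: if fewer than k points lie within `radius`, sample with
    replacement. At least one point must lie within the radius.
    """
    dist = np.linalg.norm(points - query, axis=1)
    candidates = np.flatnonzero(dist <= radius)
    return rng.choice(candidates, size=k, replace=len(candidates) < k)

rng = np.random.default_rng(0)
cloud = rng.random((500, 3))
p = cloud[0]                                   # use an existing point as the query
nn = knn_indices(cloud, p, k=8)
ball = radius_then_sample(cloud, p, radius=0.2, k=8, rng=rng)
print(nn.shape, ball.shape)
```

A spatial index such as a k-d tree would replace the brute-force distance pass for larger clouds; the selection logic stays the same.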

Since Lines 4-6 of Algorithm 1 have been covered in the Introduction, here we explain the rationale behind Lines 1-3 of Algorithm 1 in detail. Since X-Conv is designed to work on local point regions, the output should not be dependent on the absolute position of $p$ and its neighboring points, but on their relative positions. We therefore build local coordinate systems at the representative points, and the neighboring points are translated to center around the origins, i.e., $\mathbf{P}' \leftarrow \mathbf{P} - p$ (Line 1 of Algorithm 1). Note that one point may be in the neighborhood of multiple representative points, for example, $p_{1,1}$ neighbors both $p_{2,1}$ and $p_{2,2}$ in Figure 3 a and b, thus one point can be at different relative positions in the local coordinate systems of different representative points.

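Lines 4-6 of Algorithm 1 were covered in the Introduction, so only Lines 1-3 are explained here. X-Conv works on local regions, so its output should depend on the relative positions of p and its neighbors, not their absolute positions. A local coordinate system is therefore set up at each representative point and the neighbors are translated so that they are centered on its origin (P′ ← P − p, Line 1). A point can be a neighbor of several representative points (p1,1 neighbors both p2,1 and p2,2 in Figure 3 a and b), so the same point can have different relative positions in different local coordinate systems.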

It is the local coordinates of neighboring points, together with their associated features, that define the output features. In other words, besides the associated features, the local coordinates themselves are part of the input features as well. However, the local coordinates are of quite different dimensionality and representation than the associated features. We first lift the coordinates into a higher dimensional and more abstract representation ($\mathbf{F}_\delta \leftarrow \mathrm{MLP}_\delta(\mathbf{P}')$, Line 2 of Algorithm 1), and then combine it with the associated features ($\mathbf{F}_* \leftarrow [\mathbf{F}_\delta, \mathbf{F}]$, Line 3 of Algorithm 1) for further processing (Figure 3 c).

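The output features are defined by the neighbors' local coordinates together with their associated features, so the local coordinates are themselves part of the input. Because coordinates differ greatly from the associated features in dimensionality and representation, they are first lifted to a higher-dimensional, more abstract representation (Fδ ← MLPδ(P′), Line 2) and then concatenated with the associated features (F∗ ← [Fδ, F], Line 3) for further processing (Figure 3 c).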

The lifting of coordinates into features is through a point-wise MLPδ (·), which is the same as that in PointNet and PointNet++. However, the lifted features are not processed by a symmetric function in PointCNN. Instead, they are weighted and potentially permuted, together with the associated features, by the learnt X-transformation. Note that, unlike MLPδ (·), MLP(·) is applied on the entire neighboring point coordinates. Thus the resulting X is dependent on the order of the points, and this is desired, as X is supposed to permute F∗ according to the input points, thus it has to be aware of the specific input order.

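Coordinates are lifted to features by a point-wise MLPδ(·), the same as in PointNet and PointNet++. Unlike in those networks, the lifted features are not passed through a symmetric function; they are weighted and potentially permuted, together with the associated features, by the learned X-transformation. MLP(·), unlike MLPδ(·), is applied to the coordinates of the whole neighborhood at once, so X depends on the order of the points. This is intended: X has to permute F∗ according to the input points, so it must be aware of the specific input order.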

One nice property of X-Conv is that it handles point cloud with or without additional features in a quite uniform fashion. For input point cloud without any additional features, i.e., F is empty, the first X-Conv layer uses only Fδ .

Note that, in theory, X-transformation can be applied on either the features, or the kernels. We opt to apply it on the features, in which way, the follow up operation is a standard Conv operation — an operation that is highly optimized by popular deep learning frameworks. Otherwise, it will result in a convolution between features and the kernels “spawned” by X, which is not common, thus probably not fully optimized.

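A nice property of X-Conv is that it treats point clouds with and without additional features uniformly. When the input has no additional features (F is empty), the first X-Conv layer uses only Fδ.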

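In theory the X-transformation could be applied to either the features or the kernels. Applying it to the features leaves a standard Conv as the follow-up operation, which popular deep learning frameworks optimize heavily. Applying it to the kernels would require a convolution between the features and kernels “spawned” by X, which is uncommon and so probably less well optimized.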

Figure 3: The process for converting point coordinates to features. The neighboring points of representative points are transformed to local coordinate systems of the representative points (a and b). Then the local coordinates of each point are individually lifted into features, and combined with the associated features (c).

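Figure 3 shows how point coordinates become features: the neighbors of each representative point are moved into that point's local coordinate system (a and b), then each point's local coordinates are lifted to features individually and combined with its associated features (c).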

PointCNN Architecture

From Figure 2, we can see that the convolution layers in image CNNs and X-Conv layers in PointCNN differ in only two aspects: the way the local regions are extracted (K × K patches in image CNNs vs. K neighboring points around representative points) and the way the information from local regions is learnt (Conv in image CNNs vs. X-Conv). Otherwise, there is not much difference between assembling a deep network with X-Conv layers and assembling one with convolution layers in image CNNs.

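Figure 2 shows that convolution layers in image CNNs and X-Conv layers in PointCNN differ in only two ways: how local regions are extracted (K × K patches versus K neighbors around representative points) and how information from those regions is learned (Conv versus X-Conv). Beyond that, assembling a deep network from X-Conv layers is much like assembling one from convolution layers.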

In Figure 4 (a), we show a simple PointCNN, with two X-Conv layers that gradually turn the input points (with or without features) into fewer representative points, but each with richer features. After the second X-Conv layer, there is only one representative point left, and it has received information from all the points of the previous layer. In PointCNN, we can roughly define the receptive field of each representative point as the ratio K/N, where K is the neighboring point number, and N is the point number in the previous layer. With this definition, the one remaining point “sees” all the points from the previous layer, thus has receptive field 1.0: it has a global view of the entire shape, so its features are informative for semantic understanding of the shape. We can add some fully connected layers on top of the last X-Conv layer output, followed by a loss for training the network.

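Figure 4 (a) shows a simple PointCNN with two X-Conv layers that gradually turn the input points (with or without features) into fewer representative points with richer features. After the second layer a single representative point remains, and it has received information from every point of the previous layer. The receptive field of a representative point can be roughly defined as the ratio K/N, where K is the number of neighbors and N the number of points in the previous layer. By this definition the last point has receptive field 1.0: it sees the entire shape, so its features are informative for semantic understanding. Fully connected layers followed by a loss can be added on top of the last X-Conv output to train the network.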

Figure 4: PointCNN architecture for classification (a and b) and segmentation (c), where N and C denote the output representative point number and feature dimensionality, K is the neighboring point number for each representative point, and D is the X-Conv dilation rate.

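Figure 4 shows PointCNN architectures for classification (a and b) and segmentation (c). N and C are the number of output representative points and the feature dimensionality, K is the number of neighbors per representative point, and D is the X-Conv dilation rate.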

Note that the number of training samples for the top X-Conv layers of PointCNN in Figure 4 (a) drops rapidly, making it inefficient to train the top X-Conv layers thoroughly. To address this problem, we propose the PointCNN in Figure 4 (b), where more representative points are kept in the X-Conv layers. However, we want to maintain the depth of the network, while keeping the receptive field growth rate, such that the deeper representative points “see” larger and larger portion of the entire shape. We achieve this goal by employing the dilated convolution idea from image CNNs into PointCNN. Instead of always taking the K neighboring points as input, we may uniformly sample K input points from K × D neighboring points, where D is the dilation rate. In this case, the receptive field increases from K/N to (K ×D)/N, without the increase of the actual neighboring point number, nor the kernel size.

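The number of training samples for the top X-Conv layers in Figure 4 (a) drops rapidly, so those layers cannot be trained thoroughly. The PointCNN in Figure 4 (b) addresses this by keeping more representative points in the X-Conv layers. To keep the network depth and the growth rate of the receptive field, so that deeper representative points see an ever larger part of the shape, PointCNN borrows dilated convolution from image CNNs: instead of always taking the K nearest neighbors, it uniformly samples K input points from the K × D nearest neighbors, where D is the dilation rate. The receptive field grows from K/N to (K × D)/N without increasing either the actual number of neighbors or the kernel size.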
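The dilation idea and the K/N receptive-field ratio can be made concrete with a short sketch. It assumes that "uniformly sample" means taking every D-th point of the K × D nearest neighbors sorted by distance; the text does not fix the sampling scheme, and the function names are hypothetical.

```python
import numpy as np

def dilated_knn_indices(points, query, k, dilation):
    """Pick k neighbors out of the k * dilation nearest ones.

    Assumption: the k * dilation nearest points are sorted by distance and
    every `dilation`-th one is kept.
    """
    dist = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(dist)[: k * dilation]
    return nearest[::dilation]

def receptive_field(k, dilation, n_prev):
    """Receptive field as the ratio (K x D) / N defined in the text."""
    return k * dilation / n_prev

rng = np.random.default_rng(0)
cloud = rng.random((64, 3))
plain = dilated_knn_indices(cloud, cloud[0], k=8, dilation=1)
dilated = dilated_knn_indices(cloud, cloud[0], k=8, dilation=2)
print(len(plain), len(dilated))              # both use 8 neighbors
print(receptive_field(8, 1, 64), receptive_field(8, 2, 64))
```

Both calls feed eight points to X-Conv, so the kernel size is unchanged, but the dilated neighborhood spans the sixteen nearest points and therefore doubles the receptive field.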

In the second X-Conv layer of PointCNN in Figure 4 (b), dilation rate D = 2 is used, thus all four remaining representative points “see” the entire shape, and all of them are suitable for making predictions. Note that, in this way, we can train the top X-Conv layers more thoroughly, as many more connections are involved in the network, compared with the PointCNN of Figure 4 (a). At testing time, the outputs from the multiple representative points are averaged right before the softmax to stabilize the prediction. This design is quite similar to that of Network in Network [Lin et al. 2014]. PointCNN in the denser style (Figure 4 (b)) is the one we used for classification tasks.

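The second X-Conv layer in Figure 4 (b) uses dilation rate D = 2, so all four remaining representative points see the entire shape and are all suitable for making predictions. The top X-Conv layers can then be trained more thoroughly, since many more connections are involved than in Figure 4 (a). At test time, the outputs of the representative points are averaged right before the softmax to stabilize the prediction. The design resembles Network in Network [Lin et al. 2014]. This denser style (Figure 4 (b)) is the one used for classification.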
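Averaging the per-point outputs right before the softmax can be sketched in a few lines. The logits below are made-up numbers for four representative points and three classes, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict(logits_per_point):
    """Average the logits of all representative points, then apply softmax.

    logits_per_point has shape (num_representative_points, num_classes).
    """
    mean_logits = logits_per_point.mean(axis=0)
    return softmax(mean_logits)

# made-up logits: 4 representative points, 3 classes
logits = np.array([[2.0, 0.1, -1.0],
                   [1.5, 0.4, -0.5],
                   [0.2, 0.9, -0.3],
                   [1.8, 0.0, -1.2]])
probs = predict(logits)
print(probs, int(np.argmax(probs)))
```

The third point on its own would vote for class 1, while the averaged prediction follows the majority, which is the stabilizing effect described in the text.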

For segmentation tasks, high resolution pointwise output is required, and this can be realized by building PointCNN following the Conv-DeConv [Noh et al. 2015] architecture, where the DeConv part is responsible for propagating global information into high resolution predictions (see Figure 4 (c)). Note that both the “Conv” and “DeConv” in the PointCNN segmentation network are the same X-Conv operator. For “DeConv” layers, the only difference from the “Conv” layers is that there are more points, but fewer feature channels, in the output than in the input. The higher resolution points for the “DeConv” layers are forwarded from earlier “Conv” layers, following the design of U-Net.

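Segmentation needs high-resolution point-wise output, which is obtained by building PointCNN in a Conv-DeConv [Noh et al. 2015] architecture, where the DeConv part propagates global information into high-resolution predictions (Figure 4 (c)). Both “Conv” and “DeConv” in the segmentation network are the same X-Conv operator. The only difference is that a “DeConv” layer outputs more points but fewer feature channels than it takes in. The higher-resolution points for the “DeConv” layers are forwarded from earlier “Conv” layers, following the design of U-Net.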

ELU is the nonlinear activation function used in PointCNN, as we found it is more stable and performs slightly better than ReLU. Batch normalization is applied on P′, Fp and the fully connected layer outputs (except for that of the last fully connected layer) for reducing internal covariate shift. It is important to note that batch normalization should not be applied in MLPδ and MLP, since F∗ and X, especially X, are supposed to be quite specific for a particular representative point. For the Conv in Line 6 of Algorithm 1, separable convolution is used, which reduces the parameter number and computation compared with typical convolution. We use the ADAM optimizer with initial learning rate 0.01 for the training of PointCNN.

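PointCNN uses ELU as its nonlinear activation, which the authors found more stable and slightly better than ReLU. Batch normalization is applied to P′, Fp, and the outputs of all fully connected layers except the last, to reduce internal covariate shift. It should not be applied inside MLPδ or MLP, because F∗ and especially X are meant to be specific to a particular representative point. The Conv in Line 6 of Algorithm 1 is a separable convolution, which needs fewer parameters and less computation than a typical convolution. Training uses the ADAM optimizer with an initial learning rate of 0.01.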

Dropout is applied before the last fully connected layer for reducing over-fitting. We also employed the “sub-volume supervision” idea from [Qi et al. 2016] for addressing the over-fitting problem. In the last X-Conv layers, the receptive field is set to be less than 1, such that only partial information is “seen” by the representative points in the last X-Conv layers. The network is pushed to learn harder from the partial information at training time, and performs better at testing time.

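Dropout is applied before the last fully connected layer to reduce over-fitting. The “sub-volume supervision” idea of [Qi et al. 2016] is also used against over-fitting: the receptive field of the last X-Conv layers is set below 1, so their representative points see only part of the shape. This pushes the network to learn harder from partial information during training, and it performs better at test time.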

In this paper, PointCNN is demonstrated with simple feed forward networks on classification tasks, and simple feed forward layers plus skip-links in segmentation network. However, since the interface X-Conv exposed to its input and output layers is quite similar to that of Conv, we think many advanced neural network techniques from image CNNs can be adopted to work with X-Conv, e.g., recurrent PointCNN. We leave the exploration along these directions as future work.

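The paper demonstrates PointCNN with simple feed-forward networks for classification, and with simple feed-forward layers plus skip links for segmentation. Because the interface X-Conv exposes to its input and output layers is very similar to that of Conv, many advanced techniques from image CNNs could probably be adapted to X-Conv, for example a recurrent PointCNN. The authors leave this to future work.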

Data augmentation. For the training of the parameters in X-Conv, clearly, it is not beneficial if the neighboring points are always the same set in the same order for a specific representative point. To improve the generalizability, we propose to randomly sample and shuffle the input points, such that both the neighboring point sets and order can be different from batch to batch. To train a model that takes $N$ points as input, $\mathcal{N}(N, (N/8)^2)$ points are used for the training, where $\mathcal{N}$ denotes the Gaussian distribution. We found this strategy is crucial for the training of PointCNN.

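Data augmentation. Training the X-Conv parameters does not benefit if the neighbors of a given representative point are always the same set in the same order. To improve generalization, the input points are randomly sampled and shuffled so that both the neighbor sets and their order vary from batch to batch. To train a model that takes N points as input, the number of points used in training is drawn from a Gaussian with mean N and standard deviation N/8. The authors found this strategy crucial for training PointCNN.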
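The sampling-and-shuffling augmentation can be sketched as below. Rounding the Gaussian draw to an integer, clamping it to at least one point, and sampling with replacement when the draw exceeds the pool size are all assumptions; the text only specifies the distribution of the point count. The function name is hypothetical.

```python
import numpy as np

def sample_training_points(points, n_target, rng):
    """Draw a point count from a Gaussian with mean n_target and standard
    deviation n_target / 8, then return that many points in random order."""
    n = int(round(rng.normal(loc=n_target, scale=n_target / 8)))
    n = max(1, n)                          # assumption: guard against non-positive draws
    replace = n > len(points)              # assumption: oversample small pools
    idx = rng.choice(len(points), size=n, replace=replace)
    rng.shuffle(idx)                       # explicit shuffle so the order varies too
    return points[idx]

rng = np.random.default_rng(0)
pool = rng.random((4096, 3))               # all points available for one shape
batch_a = sample_training_points(pool, 1024, rng)
batch_b = sample_training_points(pool, 1024, rng)
print(batch_a.shape, batch_b.shape)        # point counts vary around 1024
```

Calling the function once per shape and per batch means that both the neighbor sets found by the later search and their order change between batches.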

CONCLUSION

We proposed PointCNN, which is a generalization of CNN into leveraging spatially-local correlation from data represented as point cloud. We demonstrated its strong performance on multiple challenging benchmark datasets and tasks. The core of PointCNN is the X-Conv operator that weights and permutes input points and features before they are processed by a typical convolution.

As point cloud data is becoming more accessible, we envision it is of great importance to develop methods that can effectively leverage spatially-local correlation from such data, and our method is just a starting point in the important undertaking. We open source our code at https://github.com/yangyanli/PointCNN for encouraging future developments.

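PointCNN generalizes CNNs to exploit spatially-local correlation in data represented as point clouds, and it performs strongly on several challenging benchmark datasets and tasks. Its core is the X-Conv operator, which weights and permutes input points and features before a typical convolution processes them.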

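As point cloud data becomes easier to obtain, methods that effectively exploit its spatially-local correlation matter more and more, and this method is only a starting point. The code is open source at https://github.com/yangyanli/PointCNN to encourage future development.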