“we propose RIU-Net (for Range-Image U-Net), the adaptation of U-Net […] to the semantic segmentation of 3D LiDAR point clouds.”

### 2. Related works
“Recently, Wu et al. proposed SqueezeSeg, a novel approach for the semantic segmentation of a LiDAR point cloud represented as a spherical range-image.”

### 3. Methodology
“The method consists in feeding the U-Net architectures with 2-channels [range-images] encoding range and elevation.”

#### 3.A Input of the Network
“we use a range-image named u of 512 × 64px with two channels: the depth towards the sensor and the elevation.”

#### 3.B Architecture
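Building such a range image amounts to a spherical projection of the point cloud. The sketch below is illustrative, not the authors' code: the function name, the default vertical field of view (typical Velodyne HDL-64 values), and the choice of raw `z` as the elevation channel are all assumptions.

```python
import numpy as np

def project_to_range_image(points, width=512, height=64,
                           fov_up=2.0, fov_down=-24.8):
    """Project an (N, 3) point cloud to a 2-channel range image.

    Channel 0: range (depth towards the sensor); channel 1: elevation.
    The vertical field of view defaults to typical Velodyne HDL-64
    values; these, like the function itself, are illustrative assumptions.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)           # range to the sensor
    yaw = np.arctan2(y, x)                       # horizontal angle
    pitch = np.arcsin(z / np.maximum(r, 1e-8))   # vertical angle

    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    # Map angles to pixel coordinates
    u = ((yaw + np.pi) / (2 * np.pi) * width).astype(int) % width
    v = (fov_up_r - pitch) / (fov_up_r - fov_down_r) * height
    v = np.clip(v.astype(int), 0, height - 1)

    image = np.zeros((2, height, width), dtype=np.float32)
    image[0, v, u] = r   # depth channel
    image[1, v, u] = z   # elevation channel
    return image
```

Pixels with no LiDAR return simply stay at zero, which is what the loss mask `m(x)` later filters out.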
“the encoder consists in the repeated application of two 3×3 convolutions followed by a rectified linear unit (ReLU) and a 2×2 max-pooling layer that downsamples the input by a factor 2. Each time a downsampling is done, the number of features is doubled.”
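The encoder described above can be sketched as follows. The `conv_block` helper, the `Encoder` class, and the channel progression starting from the 2 input channels are my reconstruction of the standard U-Net layout, not code from the paper.

```python
import torch
from torch import nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by a ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class Encoder(nn.Module):
    """Repeated conv blocks; a 2x2 max-pooling halves the resolution
    while the number of features doubles at each stage."""
    def __init__(self, channels=(2, 64, 128, 256, 512, 1024)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [conv_block(channels[i], channels[i + 1])
             for i in range(len(channels) - 1)]
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        features = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            features.append(x)  # saved for the decoder's skip connections
            if i < len(self.blocks) - 1:
                x = self.pool(x)
        return features
```

With same-padding 3×3 convolutions, each stage's output keeps its spatial size, so the saved feature maps line up exactly with the decoder's upsampled maps.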
“The decoder consists in upsampling blocks where the input is upsampled using 2 × 2 up-convolutions. Then, concatenation is done between the upsampled feature map and the corresponding feature map of the encoder. After that, two 3 × 3 convolutions are applied followed by a ReLU. This block is repeated until the output of the network matches the dimension of the input.”
“the last layer consists in a 1x1 convolution that outputs as many features as the wanted number of possible labels K 1-hot encoded.”

#### 3.C Loss function
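In PyTorch, that final layer is a single 1×1 convolution producing per-pixel class scores. The value `K = 4` below is only illustrative (SqueezeSeg-style car/pedestrian/cyclist/background classes), as is the assumption that the decoder ends with 64 feature channels.

```python
import torch
from torch import nn

K = 4  # illustrative number of labels, e.g. car/pedestrian/cyclist/background
head = nn.Conv2d(64, K, kernel_size=1)  # 64 decoder features -> K class scores

# (batch, K, H, W) logits, one score per label per pixel
scores = head(torch.randn(1, 64, 64, 512))
```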
“is defined as the cross-entropy of the softmax of the output of the network.”
“[in the equation shown in the paper] m(x) > 0 are the valid pixels and w(x) is a weighting function introduced to give more importance to pixels that are close to a separation between two labels, as defined in U-Net.”

#### 3.D Training
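A minimal sketch of this masked, weighted cross-entropy, assuming `w(x)` is precomputed (the paper follows U-Net's boundary weighting, which is not reproduced here) and averaging over valid pixels:

```python
import torch
import torch.nn.functional as F

def riu_net_loss(logits, labels, mask, weights):
    """Weighted cross-entropy of the softmax, over valid pixels only.

    logits:  (B, K, H, W) raw network output
    labels:  (B, H, W) integer labels
    mask:    (B, H, W) boolean, True where m(x) > 0 (a LiDAR return exists)
    weights: (B, H, W) per-pixel w(x), larger near boundaries between labels
    """
    # cross_entropy applies the softmax internally; keep per-pixel values
    per_pixel = F.cross_entropy(logits, labels, reduction="none")
    per_pixel = per_pixel * weights * mask.float()
    return per_pixel.sum() / mask.float().sum().clamp(min=1)
```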
“We train the network with the Adam stochastic gradient optimizer and a learning rate set to 0.001. We also use batch normalization with a momentum of 0.99 to ensure good convergence of the model. Finally, the batch size is set to 8 and the training is stopped after 10 epochs.”

### 4. Experiments
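A training loop with these reported settings might look like the sketch below; the model, the data loader, and the use of a plain cross-entropy stand-in for the paper's weighted loss are all placeholders.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=10, lr=1e-3):
    """Adam with lr = 0.001 and 10 epochs, as reported in the paper.
    The batch size of 8 is set when building the DataLoader."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):  # training is stopped after 10 epochs
        for images, labels in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```

One caveat when reproducing this in PyTorch: its BatchNorm `momentum` convention is the complement of the Keras/TensorFlow one, so a momentum of 0.99 corresponds to `nn.BatchNorm2d(..., momentum=0.01)`.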
“we follow the experimental setup of the SqueezeSeg approach […] which contains 8057 samples for training and 2791 for validation.”
“we use the intersection-over-union metric”
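For reference, per-class intersection-over-union can be computed as below; the function name and the NaN convention for absent classes are my choices.

```python
import numpy as np

def iou_per_class(pred, target, num_classes):
    """IoU_c = |pred==c AND target==c| / |pred==c OR target==c|
    for each class c; NaN when the class appears in neither array."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious
```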
“we advocate that the proposed model can operate with a frame-rate of 90 frames per second on a single GPU”
For the decoder, the paper mentions the application of “up-convolutions”, which were first defined in U-Net as:
“an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”)”
Our implementation changes the 2x2 convolution to a 3x3 one (with padding 1) so that the upsampled feature maps match the encoder feature maps at the skip connections, avoiding the cropping step used in the original U-Net.
```python
import torch
from torch.nn import Module, ModuleList

class Decoder(Module):
    """RIU-Net decoder architecture."""
    def __init__(self, channels=(1024, 512, 256, 128, 64)):
        super().__init__()
        self.upconvs = ModuleList(
            [UpConv(channels[i], channels[i + 1])
             for i in range(len(channels) - 1)]
        )
        self.blocks = ModuleList(
            [Block(channels[i], channels[i + 1])
             for i in range(len(channels) - 1)]
        )

    def forward(self, enc_features):
        x = enc_features[-1]
        for i, (upconv, block) in enumerate(zip(self.upconvs, self.blocks)):
            x = upconv(x)
            # Concatenate with the matching encoder feature map (skip connection)
            x = torch.cat([x, enc_features[-(i + 2)]], dim=1)
            x = block(x)
        return x
```
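The `Decoder` class assumes two helper modules, `UpConv` and `Block`. The sketch below shows one plausible definition consistent with the description above: `UpConv` as an upsampling followed by the 3×3 (rather than 2×2) convolution, and `Block` as the usual pair of 3×3 convolutions with ReLUs. The upsampling mode and exact layer ordering are assumptions.

```python
import torch
from torch import nn

class UpConv(nn.Module):
    """Up-convolution: 2x upsampling followed by a convolution. Using a
    3x3 convolution with padding 1 (instead of U-Net's 2x2) keeps the
    spatial size, so no cropping is needed at the skip connections."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))

class Block(nn.Module):
    """Two 3x3 convolutions, each followed by a ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)
```

Note how the channel tuple makes the concatenation work out: after `UpConv(1024, 512)`, concatenating the 512-channel encoder map restores 1024 channels, exactly what `Block(1024, 512)` expects.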