The idea is that at each layer of your convnet, you have to make a choice: use a pooling operation or a convolution, and if it is a convolution, decide whether it is a 1x1, a 3x3, or a 5x5. So why choose? Instead of having a single convolution, you use a composition of several branches applied in parallel to the same input: average pooling, a 1x1 convolution, a 1x1 followed by a 3x3, and a 1x1 followed by a 5x5. And at the top, you simply concatenate the output of each of them.
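
To see why the outputs can simply be concatenated, here is a minimal shape-only sketch of such a module. It does no real arithmetic and has no weights; it only tracks `(height, width, channels)`. It assumes every operation uses stride 1 and "same" padding, so all branches keep the same height and width. The function names, the input size, and the filter counts (`n_1x1`, `n_reduce`, `n_3x3`, `n_5x5`) are hypothetical values chosen for illustration.

```python
# Shape-only sketch of the module described above (no weights, no real math).
# Shapes are (height, width, channels). ASSUMPTION: stride 1 and "same"
# padding everywhere, so height and width never change.

def conv(shape, kernel, out_channels):
    """Output shape of a kernel x kernel convolution (stride 1, 'same' padding).

    Under these assumptions the kernel size does not affect the output shape;
    it is kept as a parameter only to make the branches readable.
    """
    h, w, _ = shape
    return (h, w, out_channels)

def avg_pool(shape, kernel):
    """Output shape of average pooling (stride 1, 'same' padding): unchanged."""
    h, w, c = shape
    return (h, w, c)

def concat_channels(shapes):
    """Concatenate along the channel axis; spatial sizes must all match."""
    h, w, _ = shapes[0]
    assert all(s[:2] == (h, w) for s in shapes), "branches differ in height/width"
    return (h, w, sum(s[2] for s in shapes))

def inception_module(shape, n_1x1=64, n_reduce=16, n_3x3=128, n_5x5=32):
    # Filter counts above are hypothetical, not taken from the text.
    branch_pool = avg_pool(shape, 3)                         # average pooling
    branch_1x1 = conv(shape, 1, n_1x1)                       # 1x1 convolution
    branch_3x3 = conv(conv(shape, 1, n_reduce), 3, n_3x3)    # 1x1 then 3x3
    branch_5x5 = conv(conv(shape, 1, n_reduce), 5, n_5x5)    # 1x1 then 5x5
    return concat_channels([branch_pool, branch_1x1, branch_3x3, branch_5x5])

out_shape = inception_module((28, 28, 192))  # hypothetical 28x28 input, 192 channels
print(out_shape)  # (28, 28, 416): 192 + 64 + 128 + 32 channels
```

Because each branch preserves height and width, the concatenation only stacks channels. In a real framework the same structure would be four branches applied to one input tensor and joined with a channel-wise concatenation.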