First of all, it is not necessary to have square images. Secondly, bigger images mean more computation per layer as well as higher memory requirements. However, image size doesn't impact the conv layers themselves, as a conv layer doesn't work with the full image as one input, but rather slides a fixed-size window over the image (the convolution operation), so its weights are independent of the input dimensions.
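A minimal sketch of this point (using PyTorch as an assumed framework): the same conv layer, with the same parameter count, runs on inputs of different and non-square sizes; only the output volume changes.

```python
import torch
import torch.nn as nn

# 16 filters of size 3x3 over 3 input channels; padding=1 keeps spatial size
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
n_params = sum(p.numel() for p in conv.parameters())
print(f"parameters: {n_params}")  # same number regardless of image size

small = torch.randn(1, 3, 64, 64)    # 64x64 image
large = torch.randn(1, 3, 480, 640)  # non-square 480x640 image
print(conv(small).shape)  # torch.Size([1, 16, 64, 64])
print(conv(large).shape)  # torch.Size([1, 16, 480, 640])
```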
In training we need to work with batches, that is, a set of data represented as a tensor. That tensor is passed through the conv layer operations of convolution followed by ReLU to get the activation volume (the input to the following layer). All of this has to be done in memory, so for good performance the tensor should fit fully in memory, whether RAM or GPU memory.
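A back-of-the-envelope estimate of what that means; the batch size and image dimensions below are assumptions for illustration, not recommendations:

```python
# One batch of 32 RGB images at 224x224 in float32
batch_size, channels, height, width = 32, 3, 224, 224
bytes_per_float32 = 4

input_bytes = batch_size * channels * height * width * bytes_per_float32
print(f"input batch:       {input_bytes / 2**20:.1f} MiB")  # ~18.4 MiB

# A conv layer with 64 output channels (same spatial size) produces an
# activation volume that must also be held in memory for the next layer
activation_bytes = batch_size * 64 * height * width * bytes_per_float32
print(f"activation volume: {activation_bytes / 2**20:.1f} MiB")  # ~392 MiB
```

Doubling the image side length quadruples both numbers, which is why large images force smaller batches.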
If you are scaling down the images, keep the aspect ratio, as distorting it will change the structural relations in your data. Moreover, if your data is a scaled-down version of high-resolution images, your network will be able to pick up key features in the initial layers. If your images are large, those key features might only be learned later, towards the end of the network.
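One way to do the aspect-ratio-preserving downscaling, sketched with Pillow; the target of 256 pixels on the longer side is an assumption, not a universal rule:

```python
from PIL import Image

def resize_keep_aspect(path, longest_side=256):
    """Scale an image so its longer side equals longest_side,
    preserving the original aspect ratio."""
    img = Image.open(path)
    scale = longest_side / max(img.size)
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)

# e.g. a 1280x960 image becomes 256x192, keeping the 4:3 aspect ratio
```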
Also see Konstantin's answer.