Switch Net 4 - reducing the cost of a neural network layer.
Width of a neural network.
The layers in a fully connected artificial neural network don't scale nicely with width. For width n the number of multiply-add operations required per layer is n squared. For a layer of width 256 that is 65536 multiply-adds (256*256). Even with modern hardware, layers cannot be made much wider than that. Or can they? A layer of width 2 only requires 4 operations, width 4 only 16, width 8 only 64. If you could combine k width-n layers (with n small) into a new, much wider layer, you would gain a computational advantage. For example, 64 width-4 layers combined into a single width-256 layer would cost 64*16 = 1024 multiply-add operations, plus the cost of combining them.

A combining algorithm.
The fast Walsh-Hadamard transform can be used as the combiner, because a change in any single input causes all of its outputs to vary. The combining cost is n*log2(n) add-subtract operations. For a layer of width 256 the combining cost is 256*log2(256) = 2048 add-subtract operations. The full combined layer then costs 1024 multiply-adds plus 2048 add-subtracts, compared with 65536 multiply-adds for a conventional fully connected layer of the same width.
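As a rough illustration of the arithmetic above, here is a minimal NumPy sketch (Python). It assumes the 64 small weight matrices are stored as a (64, 4, 4) array; the function names, the weight layout and the final scaling are choices made for this sketch, not a fixed specification.

import numpy as np

def fwht(x):
    # In-place fast Walsh-Hadamard transform.
    # Uses only n*log2(n) additions and subtractions (n must be a power of 2).
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j] = a + b        # butterfly: sum
                x[j + h] = a - b    # butterfly: difference
        h *= 2
    return x

def combined_layer(x, block_weights):
    # One width-256 layer built from 64 independent width-4 layers.
    # Cost: 64*16 = 1024 multiply-adds for the blocks,
    # plus 256*log2(256) = 2048 add-subtracts for the combiner.
    n = len(x)                           # 256
    segments = x.reshape(64, 4)          # split the input into 64 width-4 segments
    y = np.einsum('bij,bj->bi', block_weights, segments).reshape(n)
    fwht(y)                              # spread every block's output across the full width
    return y / np.sqrt(n)                # optional scaling to keep the output magnitude stable

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
w = rng.standard_normal((64, 4, 4))
out = combined_layer(x, w)               # width-256 output from ~1024 multiply-adds

The sketch only covers the block-diagonal multiply and the Walsh-Hadamard combining step described above, to make the operation counts concrete; whatever nonlinearity the full layer would use is omitted.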