Posts

The case for Fast Transforms as Averaging Machines

Are Fast Transforms like the FFT and the Walsh Hadamard transform averaging machines? Certainly there is an all-sum term such as A+B+C+D+E+F+G+H in the diagram, which can be considered an average, especially after vector length renormalization, where all the output terms would typically be multiplied by 1/sqrt(width), 1/sqrt(8) in this case, to leave the vector magnitude unchanged by the transform. For a different input, say {+1,+1,-1,-1,-1,-1,+1,+1}, the output term A+B-C-D-E-F+G+H becomes 1+1+1+1+1+1+1+1. For that input pattern that particular output term becomes an averaging operation.

Degrees of freedom

For standard averaging of n values many different combinations of values can result in the same average. There are many degrees of freedom. As you demand that more and more output terms of a fast transform take particular values, the degrees of freedom in the input shrink. Until, when you demand that all the outputs take particular values, only one exact input vector will give that result. To get a…
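As a concrete check of the all-sum and averaging behaviour described above, here is a minimal sketch (not the post's own code) of an in-place fast Walsh Hadamard transform applied to the {+1,+1,-1,-1,-1,-1,+1,+1} example, with the 1/sqrt(8) renormalization:

```python
import math

def fwht(vec):
    """In-place fast Walsh Hadamard transform; length must be a power of 2."""
    h = 1
    while h < len(vec):
        for i in range(0, len(vec), 2 * h):
            for j in range(i, i + h):
                a, b = vec[j], vec[j + h]
                vec[j], vec[j + h] = a + b, a - b
        h *= 2
    return vec

x = [+1, +1, -1, -1, -1, -1, +1, +1]
y = fwht(list(x))
scale = 1.0 / math.sqrt(len(x))          # renormalize so vector magnitude is preserved
print([v * scale for v in y])            # one output term collects a +1 from every input
```

The output term whose sign pattern matches the input collects +1 from every element (8 before scaling), which is n times the plain average of the sign-corrected inputs; every other term is zero here because the rows of the transform are orthogonal.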

Switch Net 4 (Switch Net N): combine multiple low-width neural layers with a fast transform.

Random sign flipping before a fast Walsh Hadamard transform results in a Random Projection. For almost any input the result is a Gaussian distributed output vector where each output element contains knowledge of all the input elements. With Switch Net 4 the output elements are 2-way switched using (x<0?) as a switching predicate, where x is the input to the switch.

The switches are grouped together in units of 4, which together form a small width-4 neural layer. When a particular x>=0, the selected pattern of weights is forward projected with intensity x. When x<0 a different selected pattern of weights is projected, again with intensity x (x being negative this time). If nothing was projected when x<0 the situation would be identical to using a ReLU. You could view the situation in Switch Net 4 as using 2 ReLUs, one with input x and one with input -x.

The reason to 2-way switch (or +-ReLU) is to avoid early information loss…
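A small sketch of the "2-way switch equals two ReLUs" view described above; the weight names w_pos and w_neg are illustrative and not taken from the original code:

```python
import numpy as np

def two_way_switch(x, w_pos, w_neg):
    """Project w_pos with intensity x when x >= 0, otherwise project w_neg."""
    return (w_pos if x >= 0 else w_neg) * x

def relu(x):
    return max(x, 0.0)

rng = np.random.default_rng(0)
w_pos, w_neg = rng.normal(size=4), rng.normal(size=4)   # one width-4 switch unit
for x in (-1.7, 0.0, 2.3):
    switched = two_way_switch(x, w_pos, w_neg)
    two_relus = w_pos * relu(x) - w_neg * relu(-x)       # two ReLUs, inputs x and -x
    assert np.allclose(switched, two_relus)
```

Dropping the w_neg branch (projecting nothing when x<0) reduces this to an ordinary ReLU feeding a weight vector, which is exactly the information-losing case the 2-way switch avoids.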

Switch Net

Switch Net is a neural network based on using the fast Walsh Hadamard transform (WHT) as a fixed set of layer weights with a parametric, switching-based activation function. The cost of the WHT is n*log2(n) add/subtracts, as opposed to n squared fused multiply-adds for a conventional dense neural layer's weight calculations. A fixed random pattern of sign flips is applied to the input data so that the first WHT fairly distributes information, with a Gaussian distribution, across its outputs. Then a switching decision is applied to each output of the WHT and the output is multiplied by one weight or another.

The section between the red lines is basically a layer and can be repeated a number of times. A final WHT is done to combine the last lot of switched weight outputs. Some example code is here: https://editor.p5js.org/congchuatocmaydangyeu7/sketches/7ekZTwQMF

One slight issue is if the switching weights for a particular output have opposite signs then the result…
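Below is a minimal numpy sketch of a Switch Net style forward pass as described above, assuming one pair of scalar weights per WHT output (one used when the element is >= 0, the other when it is < 0); the linked p5.js sketch is the author's reference implementation, and details here may differ from it:

```python
import numpy as np

def fwht(x):
    """Fast Walsh Hadamard transform: n*log2(n) add/subtracts, magnitude preserving."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

rng = np.random.default_rng(1)
n, layers = 16, 3
signs = rng.choice([-1.0, 1.0], size=n)                 # fixed random sign-flip pattern
w_on = rng.normal(size=(layers, n))                     # weight used when an element is >= 0
w_off = rng.normal(size=(layers, n))                    # weight used when an element is < 0

def switch_net(v):
    h = fwht(signs * v)                                 # spread input information across all elements
    for k in range(layers):                             # the repeatable section between the red lines
        h = np.where(h >= 0, w_on[k] * h, w_off[k] * h) # switched parametric activation
        h = fwht(h)                                     # next (or final) WHT recombines the outputs
    return h

print(switch_net(rng.normal(size=n)))
```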

ReLU as a Switch

The conventional view is ReLU as a function. A slight confusion in computer science is that switches are thought of as purely digital, from switching gates and such. That appears to be the case because the supply voltage is fixed and that is the only thing switched. However a light switch in your house is binary on/off, yet it connects or disconnects an AC sine wave voltage.

A switch therefore is a mixed digital-analog device. When on, the input signal goes through 1 to 1. When off, the output is zero. You can then view ReLU as a switch that is 1 to 1 when on and outputs zero when off. Of course there is a switching decision to make.

In your house you make the switching decision for a light. With ReLU the switching decision is based on the predicate (x>=0), where x is the input to the switch. You could supply other predicates for the switching decision if you wanted, but switching at x=0 is very helpful for optimization. A ReLU network with all the…
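A tiny illustration of the two views; the function names are just labels for the sake of the example:

```python
def relu_as_function(x):
    return max(x, 0.0)        # the conventional definition

def relu_as_switch(x):
    on = (x >= 0)             # the switching decision (the predicate)
    return x if on else 0.0   # on: the signal passes 1 to 1; off: the output is zero

for x in (-2.5, 0.0, 3.0):
    assert relu_as_function(x) == relu_as_switch(x)
```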

Subrandom Projections

Performing random sign flips on natural data like images before many types of fast transform, such as the FFT or WHT, gives Gaussian noise outputs. Essentially the sign flips convert the input data to a collection of random variables that then get weighted, summed and differenced by the fast transform, giving Gaussian noise by the Central Limit Theorem. Another way of stating the situation is that random sign flips combined with fast transforms create Random Projections (RPs) of the input data.

While such RPs have many uses there are some disadvantages, such as the complete loss of direct rotation, scaling and translation (RST) information about objects in the original image. It would be nice if some of that RST information was kept along with some of the nice information distribution properties of Random Projections. One solution to the problem is to use subrandom (low discrepancy) numbers to decide the sign flips. For example, Additive Recurrence based Subrandom Numbers…
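A hedged sketch of the idea: sign flips decided by an additive recurrence (low discrepancy) sequence rather than a random number generator, followed by a WHT. The golden-ratio step and the threshold at 0.5 are illustrative choices, not necessarily those used in the post, and a dense Hadamard matrix stands in for the fast transform for brevity:

```python
import numpy as np
from scipy.linalg import hadamard

n = 64
alpha = (np.sqrt(5.0) - 1.0) / 2.0            # irrational step of the additive recurrence
s = (0.5 + alpha * np.arange(n)) % 1.0        # subrandom sequence s_k = frac(0.5 + k*alpha)
signs = np.where(s < 0.5, 1.0, -1.0)          # low-discrepancy sign flips

H = hadamard(n).astype(float) / np.sqrt(n)    # orthonormal WHT matrix
x = np.sin(np.linspace(0.0, 4.0 * np.pi, n))  # stand-in for smooth "natural" data
print(H @ (signs * x))                        # a subrandom projection of the input
```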

Associative Memory using a Locality Sensitive Hash (Random Projection + Binarization)

Hashing and Associative Memory

Suppose you have a hash algorithm with d (+1,-1) binary outputs and you apply a weighted sum to those d binary outputs. Then you can store up to d <input, scalar> associations, where the input goes to the hash function and the scalar value is the output of the weighted sum. Only if by chance the hash algorithm outputs 2 binary patterns (for particular inputs) that are not linearly separable will the storage capacity be less than d. As d increases the probability of such a collision decreases.

The training algorithm simply involves adding or subtracting the recall error (divided by d) to the weights, according to whether the hash bit for the particular weight is valued -1 or +1. The training algorithm is applied to the training data a number of times to drive the recall error to a low value, solving a set of d linear simultaneous equations. An input that was not in the training set results in a comple…
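A hedged sketch of the scheme just described: the hash is random sign flips plus a Walsh Hadamard transform (a dense Hadamard matrix here for brevity) followed by binarization, and a single weight vector is trained with the error-correction rule from the text. The number of training sweeps and other details are assumptions:

```python
import numpy as np
from scipy.linalg import hadamard

d = 64                                            # hash width = number of weights
rng = np.random.default_rng(0)
H = hadamard(d).astype(float)
signs = rng.choice([-1.0, 1.0], size=d)           # fixed random sign flips

def lsh(x):
    """Random projection (sign flips + WHT) binarized to d (+1,-1) outputs."""
    return np.where(H @ (signs * x) >= 0.0, 1.0, -1.0)

weights = np.zeros(d)

def recall(x):
    return weights @ lsh(x)

def train(pairs, sweeps=20):
    global weights
    for _ in range(sweeps):
        for x, target in pairs:
            bits = lsh(x)
            error = target - weights @ bits
            weights += (error / d) * bits         # add/subtract error/d per hash bit sign

# store a few <input, scalar> associations
pairs = [(rng.normal(size=d), t) for t in (1.0, -2.0, 0.5)]
train(pairs)
print([round(float(recall(x)), 3) for x, _ in pairs])   # close to 1.0, -2.0, 0.5
```

A single update makes recall exact for one stored item (the hash bits are ±1, so bits·bits = d); repeated sweeps settle the weights for all stored items together.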

Random Projections for Neural Networks

Variance and the CLT

For linear combinations of random variables, a negative sign before a particular variable has no effect on the variance (the coefficients enter the variance formula squared, so Var(aX + bY) = a²Var(X) + b²Var(Y) for independent X and Y). And of course the Central Limit Theorem is in operation. The Walsh Hadamard transform (WHT) is a set of orthogonal weighted sums (where the weights are +1 or -1, or some constant multiple of those). And so the variance equation for linear combinations of random variables applies (minus signs not invalidating that), as does the Central Limit Theorem.

8-Point Walsh Hadamard transform.

Therefore applying the WHT to a sequence of random numbers from the Uniform distribution results in a sequence of numbers from the Gaussian (Normal) distribution. Example code is here: https://editor.p5js.org/siobhan.491/sketches/WhoxMA7pH

If you want to use that behavior to generate Gaussians you should remember that the WHT leaves vector magnitude (length) unchanged (except by a constant c). The result is all the Gaussian…
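A small numerical check of the claim (again using a dense Hadamard matrix rather than the fast transform, for brevity):

```python
import numpy as np
from scipy.linalg import hadamard

n = 1024
rng = np.random.default_rng(0)
H = hadamard(n).astype(float) / np.sqrt(n)   # orthonormal WHT: magnitude preserving

u = rng.uniform(-1.0, 1.0, size=n)           # uniform random input vector
g = H @ u                                    # each output is a weighted sum of all inputs,
                                             # so by the CLT it looks Gaussian

print(np.linalg.norm(u), np.linalg.norm(g))  # identical vector magnitudes
print(g.mean(), g.std())                     # mean near 0, std near 1/sqrt(3), the uniform's std
```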