for all $m$, then the largest product increases exponentially with $q$. That is, the error blows up, and conflicting error signals arriving at unit $v$ can lead to oscillating weights and unstable learning (for error blow-ups or bifurcations see also [19,2,8]). On the other hand, if
\[
\bigl| f'_{l_m}\bigl(\mathrm{net}_{l_m}(t-m)\bigr)\, w_{l_m l_{m-1}} \bigr| < 1.0
\]
for all $m$, then the largest product decreases exponentially with $q$; that is, the error vanishes, and nothing can be learned in acceptable time. If $f_{l_m}$ is the logistic sigmoid function, then the maximal value of $f'_{l_m}$ is $0.25$. If $y^{l_{m-1}}$ is constant and not equal to zero, then $\bigl| f'_{l_m}(\mathrm{net}_{l_m})\, w_{l_m l_{m-1}} \bigr|$ takes on maximal values where
\[
w_{l_m l_{m-1}} = \frac{1}{y^{l_{m-1}}} \coth\!\left(\tfrac{1}{2}\,\mathrm{net}_{l_m}\right);
\]
the size of this product goes to zero for $|w_{l_m l_{m-1}}| \to \infty$, and it is less than $1.0$ for $|w_{l_m l_{m-1}}| < 4.0$ (e.g., if the absolute maximal weight value $w_{\max}$ is smaller than $4.0$): since $f'_{l_m} \le 0.25$, each factor is at most $0.25\,|w_{l_m l_{m-1}}|$. Hence with conventional logistic sigmoid transfer functions, the error flow tends to vanish as long as the weights have absolute values below $4.0$, especially at the beginning of the training phase. In general, the use of larger initial weights does not help, though -- as seen above, for $|w_{l_m l_{m-1}}| \to \infty$ the relevant derivative goes to zero ``faster'' than the absolute weight can grow (also, some weights may have to change their signs by crossing zero). Likewise, increasing the learning rate does not help either -- it does not change the ratio of long-range error flow to short-range error flow. BPTT is too sensitive to recent distractions. Note that since the summation terms in equation (2) may have different signs, increasing the number of units does not necessarily increase the error flow.
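The scaling argument above is easy to check numerically. The following Python sketch is not part of the original analysis; the weight values, the choice $\mathrm{net} = 0$ (which maximizes $f'$), the lag $q = 50$, and the helper names \texttt{sigmoid\_prime} and \texttt{error\_scaling} are illustrative assumptions. It compounds a single factor $|f'(\mathrm{net})\, w|$ over $q$ time steps:

\begin{verbatim}
import math

def sigmoid_prime(net: float) -> float:
    """Derivative of the logistic sigmoid f(x) = 1/(1 + exp(-x))."""
    f = 1.0 / (1.0 + math.exp(-net))
    return f * (1.0 - f)  # maximal value 0.25, attained at net = 0

def error_scaling(w: float, net: float, q: int) -> float:
    """One factor |f'(net) * w| compounded over a lag of q steps."""
    return abs(sigmoid_prime(net) * w) ** q

# Illustrative weights: well below, just below, and above the 4.0 threshold.
for w in (0.5, 3.9, 6.0):
    factor = abs(sigmoid_prime(0.0) * w)  # net = 0: best case for error flow
    print(f"w = {w:4.1f}: per-step factor = {factor:.3f}, "
          f"after q = 50 steps: {error_scaling(w, 0.0, 50):.3e}")
\end{verbatim}

With $|w| = 0.5$ the per-step factor is $0.125$ and the product decays to roughly $10^{-45}$ after $50$ steps; with $|w| = 3.9$ it is at most $0.975$ and still decays; with $|w| = 6.0$ the factor can reach $1.5$ and the product blows up -- matching the $4.0$ threshold derived above. Since $\mathrm{net} = 0$ is the most favorable case for error flow, the decay shown is an upper bound along a single path.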