I've found that $rand has two minor kinds of bias.

One is a division-remainder (mod) bias, due to the fact that random number generators have a fixed range of possible input values, and the size of that range is often not an exact multiple of the size of the requested output range.

The other is an uneven frequency distribution of outputs as the range gets larger. Some outputs appear much more often than they should, while other outputs appear to not be possible at all.

1. MOD bias

This is a very small bias which is hard to detect for small ranges, but becomes more apparent at larger range sizes. Because of the way the bias fluctuates as the range size changes, I don't believe the huge ranges in this example cause the bias; the larger ranges just make its presence more obvious.

The max value I've been able to get from $rand(0,99999999999999) is approximately 6 million short of the string of 14 9's, so I'm assuming the intent is to allow returning values from 0 up to 14 9's. The minimum non-zero value I found from this same range was also approximately 6 million above zero, but that's related to issue #2.

Issue #1 appears to be a modulo bias, caused when (the range-size of input values received from the random generator) MOD (the range-size of requested output values) is not zero. The bias favors values toward the low end of the range. For smaller ranges the bias is harder to detect, but it's technically there.

For a simplistic explanation, it's similar to what would happen if you dealt out an entire deck of cards to different group sizes of players. If the number of players is a factor of 52, such as 52, 26, 13, 4, 2, or 1, everyone gets the same number of cards, but for all other numbers of players at least 1 player gets an extra card. If there are 12 players, most receive 1 fewer card. If there are 14 players, most receive 1 extra card.

For another example, assume a number generator calculates a perfectly random value from 0-255, and you're trying to get a random number from 0-9. There are 256 possible inputs but only 10 possible outputs, so some outputs map to 25 inputs and others map to 26. If you take the 0-255 random input MOD 10 and return (low_value + remainder 0-9), then outputs 0-5 have 26 inputs mapped to each of them, but outputs 6-9 have 25 inputs mapped to each of them. This would cause 0-5 to appear 4% more often than 6-9.
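
To check those counts, here is a minimal sketch that runs every input from 0 to 255 through MOD 10 and tallies how many inputs land on each output. The alias name modbiasdemo is made up for illustration, and the script doesn't call $rand at all.

alias modbiasdemo {
  ; Hypothetical demo: one counter per output 0-9, stored as space-separated tokens
  var %counts 0 0 0 0 0 0 0 0 0 0
  var %in 0
  while (%in <= 255) {
    ; token position is remainder + 1 because tokens are numbered from 1
    var %t $calc(1 + (%in % 10))
    var %a $calc($gettok(%counts,%t,32) + 1)
    %counts = $puttok(%counts,%a,%t,32)
    inc %in
  }
  echo -a inputs per output 0-9: %counts
}

Running /modbiasdemo should echo 26 for each of the outputs 0-5 and 25 for each of the outputs 6-9.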

If the input is a 16-bit value from 0-65535, there's still a bias toward the lowest values of a requested range of 10, but it's a much smaller bias, because the difference between 6553 and 6554 inputs is a much smaller percent change. However, if the requested range were increased to 0-49151, the lowest 16384 outputs would have 2 inputs each while the other 32768 outputs would have only 1, causing some outputs to appear 100% more often than others. The number of outputs being favored changes depending on how close the output range is to 1/2, 1/4, 1/8, etc. of 65536.
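
The same counts can be worked out without looping. Below is a rough sketch of a helper (modbiasinfo is a hypothetical name, not a built-in identifier) that takes the size of the input range and the size of the output range, and reports how many outputs receive the extra input.

alias modbiasinfo {
  ; $modbiasinfo(in_range,out_range): hypothetical helper for illustration only
  ; %low is the number of inputs every output gets
  var %low $int($calc($1 / $2))
  ; %extra is how many of the lowest outputs get one additional input
  var %extra $calc($1 % $2)
  return %extra outputs get $calc(%low + 1) inputs, $calc($2 - %extra) outputs get %low
}

For example, //echo -a $modbiasinfo(65536,10) should report 6 outputs getting 6554 inputs and 4 getting 6553, while $modbiasinfo(65536,49152) should report 16384 outputs getting 2 inputs and 32768 getting 1.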

The solution to avoiding this type of bias is to determine a value, based on the sizes of the input and output ranges, above which any input value is going to be the extra input given to only a portion of the outputs. If the random generator returns a number above this value, it should be discarded, and a replacement random value should be requested from the RNG until the RNG returns a value within the range.

Assuming the input value comes from a range from zero to %in_max, the line below should work to determine the range of inputs that need to be discarded. This could also be combined with my suggestion to allow the output range to be either partly or entirely below zero.

//var -s %in_max 255 , %in_min $calc(must_remain_zero) , %out_min 1 , %out_max 10 , %out_range $calc(%out_max + 1 - %out_min) , %throwaway_above $calc(%in_max - ((%in_max % %out_range) + 1) % %out_range)

If you substitute an output range size that's a factor of 256, such as 128 or 64 or even 1, the %throwaway_above value is always going to be the same as %in_max, so nothing is ever thrown away. Throwing away these extra highest values eliminates the bias described above. The number of values thrown away can never be larger than half the values the RNG can output, so a discard should rarely be needed in most cases.
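
Putting the discard rule together, a minimal sketch might look like the alias below. It is only an illustration: unbiasedrand is a made-up name, $rand(0,255) is used as a stand-in for a raw 8-bit generator (purely an assumption about the source), and the requested range must contain no more than 256 values.

alias unbiasedrand {
  ; $unbiasedrand(low,high): hypothetical helper, range size must be 256 or less
  var %in_max 255
  var %out_range $calc($2 + 1 - $1)
  var %throwaway_above $calc(%in_max - ((%in_max % %out_range) + 1) % %out_range)
  ; stand-in source: assumes $rand(0,255) behaves like an unbiased 8-bit generator
  var %in $rand(0,%in_max)
  ; discard any input above the cutoff and ask the source again
  while (%in > %throwaway_above) { %in = $rand(0,%in_max) }
  return $calc($1 + (%in % %out_range))
}

With $unbiasedrand(1,10) the cutoff works out to 249, so inputs 250-255 are discarded and each of the 10 outputs is left with exactly 25 inputs.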

--

In this next example, you can see the bias change as you change the 0.996. At 1.00 the bias is not very visible, nor at 0.50, where the range is half of 10^14. However, at 0.996 the 1st pocket attracts twice as many random numbers as the other pockets, and changing 0.996 to 0.66 spreads this effect to half the pockets. The bias gets smaller if you repeat the tests after changing the 14 into 13, but the effect is still there, at around 50% more outputs instead of 100% more outputs.

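alias randbiasdemo {
  ; Usage: /randbiasdemo [pct] where 0 < pct <= 1 (defaults to 0.996)
  var %pct 0.996 | if (($1 > 0) && ($1 <= 1)) var %pct $1
  ; %array holds one counter per pocket and %max is 14 9's scaled by %pct
  var -s %pockets 256 , %array $str(0 $chr(32),%pockets) , %i 25600 , %max $str(9,14) * %pct , %div %max / %pockets
  while (%i) {
    ; find the pocket this random number falls into and add 1 to its counter
    var %t $calc(1+ ($r(1,%max) // %div)) , %a $gettok(%array,%t,32) + 1 , %array $puttok(%array,%a,%t,32)
    dec %i
  }
  ; show the per-pocket counts followed by their total
  echo -a %array = $calc($replace(%array,$chr(32),+))
}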

2. Uneven frequency distribution

The above dealt only with how the output from the random number generator is translated into an output value, not with the quality of the random values returned by the generator itself. Even at the largest ranges, each region of the output appeared to receive a similar total number of outputs.

However at much smaller ranges than that, the random generator has gaps where some outputs are rarely if ever returned, while other outputs appear much more often than they should. 0-16777215 is a large range, but it's small enough to be a range used by scripts, such as choosing a random RGB color index.

$rand(0,16777215) should return a number in the 0-255 range an average of 1 time per approx. 65536 random numbers, and appears to be close to doing so. However, while random numbers should not have a completely smooth distribution, the number of times each number is returned should be closer to the mean than what $rand is returning. For the first 2560 times that numbers 0-255 were returned by $rand(0,16777215), the number of times each number was returned, listed in sequential order from 0 to 255, was:

39 00 08 13 12 05 00 12 00 20 00 18 00 09 11 05 09 00 14 00 23 00 06 05 10 10 15 14 00 20 00 13 00 10 08 07 11 00 12 00 15 00 28 00 07 14 12 09 00 29 00 28 00 24 00 11 14 10 09 00 30 00 18 00 18 00 07 08 11 08 00 23 00 20 08 10 21 08 27 11 23 00 27 11 19 10 09 15 14 16 13 19 00 08 11 08 08 00 28 00 24 00 23 00 09 06 15 10 00 22 00 11 00 10 14 10 09 14 05 00 16 00 22 00 12 11 08 07 10 10 00 19 00 22 00 07 18 19 12 00 21 00 24 00 20 00 14 05 15 10 00 22 00 19 00 27 00 10 09 09 08 00 22 00 26 07 26 08 09 24 09 22 00 23 11 17 07 09 20 14 19 09 21 00 10 12 10 07 00 14 00 14 00 19 00 11 10 07 10 00 11 00 23 00 20 00 07 16 10 15 00 26 00 19 00 11 05 04 07 08 12 00 21 00 25 00 05 05 13 12 00 26 00 22 00 16 00 13 11 08 03 00 16 00 27 00 21 00 12 07 13 10 00 29 00 20

The most frequent value was '0' appearing 39 times, nearly 4 times the mean of 10. 76 of the 256 output numbers did not appear at all in the first 2560 random values which were in the 0-255 range, which is very unlikely in a random sample of this size.
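
For a rough sense of how unlikely that is, assume each of the 2560 draws is independent and equally likely to be any of the 256 values. The chance that one particular value never appears is then (255/256)^2560, and the expected number of missing values is 256 times that. This one-liner is just a sketch of that arithmetic:

//echo -a $calc(256 * ((255/256)^2560))

It should print a value of roughly 0.01, meaning that under those assumptions you would normally expect no missing values at all, rather than 76.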

As the output range increases, the possible returned values are increasingly spaced apart, but the low value continues to appear much more frequently than the others. When the range is $rand(0,99999999999999) which is greater than 2^46, the '0' value appears slightly more frequently than once every 2^24 numbers, which is not significantly less often than '0' appeared in $rand(0,16777215).