#264732  08/01/19 02:07 AM
$rand bias

Hoopy frood
Registered: 12/01/04
Posts: 1036

I've found $rand having two minor kinds of bias. One is a divisionremainder (mod) bias due to the fact that random number generators have a fixed range of possible input values, which are often unrelated to the size of the requested output range. The other is an uneven frequency distribution of outputs as the range gets larger. Some outputs appear much more often than they should, while other outputs appear to not be possible at all. 1. MOD bias This is a very small bias which is hard to detect for small ranges, but becomes more apparent at larger range sizes. Because of the way the bias fluctuates as the range size changes, I don't believe this bias is caused due to the huge ranges in this example, but instead the larger ranges make its presence more obvious. The max value I've been able to return from $rand(0,999999999) is approximately 6 million short of the string of 14 9's, so I'm assuming the intent is to allow returning values from 0 to 14 9's. The minimum nonzero value I found from this same range was also approximately 6 million greater than zero, but that's related to issue#2. This issue#1 bias appears to be a modulo bias that's caused when (the rangesize of input values received from the random generator) MOD (the rangesize of requested output values) is not zero. The bias favors values toward the low end of the range. For smaller size ranges, the bias can be harder to detect, but it's technically there. For a simplistic explanation, it's similar to what would happen if you dealt out an entire deck of cards to different group sizes of players. If the number of players is a factor of 52 like 52261342or1, everyone gets the same number of cards, but for all other numbers of players at least 1 player gets an extra card. If there are 12 players, most receive 1 fewer card. If there are 14 players, most receive 1 extra card. For another example, assume a number generator calculates a perfectly random value from 0255, and you're trying to get a random number from 09. There's 256 possible inputs but only 10 possible outputs, so some outputs map to 25 inputs but some outputs map to 26 inputs. If taking the 0255 random input MOD 10 and returning (low_value + remainder 09), then outputs 05 have 26 inputs mapped to each of them, but outputs 69 have 25 inputs mapped to each of them. This would cause 05 to appear 4% more often than 69. If the input is a 16bit value from 065535, there's still a bias toward the lowest values of a requested range of 10, but it's a much smaller bias due to the difference between having 6553 vs 6554 inputs, which is a much smaller percent change. However if the range were increased to be 049151, the lowest 16384 outputs have 2 inputs while the other 32768 outputs only had 1 input, causing some outputs to appear 100% more often than others. The number of outputs being favored changes depending on how close the output range is to 1/2 1/4th 1/8th etc of the 65536. The solution to avoiding this type of bias is to determine a value, based on the sizes of the input and output ranges, above which any input value is going to be the extra input given to only a portion of the outputs. If the random generator returns a number above this value, it should be discarded, and a replacement random value should be requested from the RNG until the RNG returns a value within the range. Assuming the input value comes from a range from zero to %in_max, below should work to determine the range of inputs that need to be discarded. This could also be combined with my suggestion to allow the output range to be either partly or entirely below zero. //var s %in_max 255 , %in_min $calc(must_remain_zero) , %out_min 1 , %out_max 10 , %out_range $calc(%out_max + 1  %out_min) , %throwaway_above $calc(%in_max  ((%in_max % %out_range) + 1) % %out_range)
If you substitute an output range size that's a factor of 256, such as 128 or 64 or even 1, the %throwaway_above value is always going to be the same as %in_max, so nothing is ever thrown away. By throwing away these extra highest values, it eliminates the bias described above. The number of values thrown away can never be larger than half the values output by the RNG, so throwing away RNG output should be rarely needed in most cases.  In this next example, you can see the bias changing as you change the 0.996. At 1.00 any bias is not very visible, nor at 0.50 where it's half of 10^15. However at 0.996 the 1st pocket attracts twice as many random numbers as the other pockets, and at changing 0.996 to 0.66 this effects spreads to half the pockets. The bias gets smaller as you repeat your tests after changing the 14 into 13, but the effect is still there at around 50% more outputs instead of 100% more outputs. alias randbiasdemo {
var %pct 0.996  if (($1 > 0) && ($1 <= 1)) var %pct $1
var s %pockets 256 , %array $str(0 $chr(32),%pockets) , %i 25600 , %max $str(9,14) * %pct , %div %max / %pockets
while (%i) {
var %t $calc(1+ ($r(1,%max) // %div)) , %a $gettok(%array,%t,32) + 1 , %array $puttok(%array,%a,%t,32)
dec %i
}
echo a %array = $calc($replace(%array,$chr(32),+))
} 2. Uneven frequency distribution The above dealt only with how the output from the random number generator is translated into an output value, and not about the quality of random values returned by the generator itself. Even at the largest ranges, each of the regions of the output appeared to have similar total number of outputs. However at much smaller ranges than that, the random generator has gaps where some outputs are rarely if ever returned, while other outputs appear much more often than they should. 016777215 is a large range, but it's small enough to be a range used by scripts, such as choosing a random RGB color index. $rand(0,16777215) should return numbers 0255 an average of 1 time per approx. 65536 random numbers, and appears to be close to doing so. However, while random numbers should not have a completely smooth distribution of numbers, the number of times each numbers is returned should be closer to the mean than $rand is returning. The first 2560 times that numbers 0255 were returned by $rand(0,16777215), in sequential order of 0 to 255, the number of times each number was returned by $rand was: 39 00 08 13 12 05 00 12 00 20 00 18 00 09 11 05 09 00 14 00 23 00 06 05 10 10 15 14 00 20 00 13 00 10 08 07 11 00 12 00 15 00 28 00 07 14 12 09 00 29 00 28 00 24 00 11 14 10 09 00 30 00 18 00 18 00 07 08 11 08 00 23 00 20 08 10 21 08 27 11 23 00 27 11 19 10 09 15 14 16 13 19 00 08 11 08 08 00 28 00 24 00 23 00 09 06 15 10 00 22 00 11 00 10 14 10 09 14 05 00 16 00 22 00 12 11 08 07 10 10 00 19 00 22 00 07 18 19 12 00 21 00 24 00 20 00 14 05 15 10 00 22 00 19 00 27 00 10 09 09 08 00 22 00 26 07 26 08 09 24 09 22 00 23 11 17 07 09 20 14 19 09 21 00 10 12 10 07 00 14 00 14 00 19 00 11 10 07 10 00 11 00 23 00 20 00 07 16 10 15 00 26 00 19 00 11 05 04 07 08 12 00 21 00 25 00 05 05 13 12 00 26 00 22 00 16 00 13 11 08 03 00 16 00 27 00 21 00 12 07 13 10 00 29 00 20 The most frequent value was '0' appearing 39 times, nearly 4 times the mean of 10. 76 of the 256 output numbers did not appear at all in the first 2560 random values which were in the 0255 range, which is very unlikely in a random sample of this size. As the output range increases, the possible returned values are increasingly spaced apart, but the low value continues to appear much more frequently than the others. When the range is $rand(0,99999999999999) which is greater than 2^46, the '0' value appears slightly more frequently than once every 2^24 numbers, which is not significantly less often than '0' appeared in $rand(0,16777215).

Top




#264733  08/01/19 04:44 AM
Re: $rand bias
[Re: maroon]

Hoopy frood
Registered: 18/02/03
Posts: 2515

indeed. I've been aware of scaling artifacts in $rand(), but these samples are pretty wild. I wouldn't have expected 24 bits (2^241) to be so profound with this effect. I do wonder if there might be a better, faster even, pseudorandom generator mIRC might use. AutoHotkey is using Mersenne Twister MT19937 by (C) 1997  2002, Makoto Matsumoto and Takuji Nishimura (which only needs a copyright notice in the help file; steal it from AutoHotkey's help file.) http://www.math.sci.hiroshimau.ac.jp/~mmat/MT/emt.html ((Oh, looks like they don't have any license requirements anymore.))
_________________________
At least I won lunch. Good philosophy, see good in bad, I like!

Top




#264735  08/01/19 03:19 PM
Re: $rand bias
[Re: maroon]

Hoopy frood
Registered: 18/11/04
Posts: 798
Loc: I live inside your computer. S...

I couldn't help but remember this. That said, why not introduce some kind of .prop which allows the user to select another variant, be it Mersenne or Wells and just default back to the current one if it's not used?
_________________________
This signature is currently out of order. We apologize for the inconvenience.

Top




#264837  Yesterday at 09:06 AM
Re: $rand bias
[Re: Khaled]

Hoopy frood
Registered: 12/01/04
Posts: 1036

In the new rand, i'm seeing evidence of uneven bias at huge ranges, but not the same kind as before. The bias seems identical for both $rand and $rands, so it's more likely caused by identical postprocessing of the random inputs. So far I've tested only the lowest bit, as it's easiest to test odd/even since $isbit doesn't work above 2^32. At largest ranges, there can be a significant difference depending on whether the range size itself is odd or even. The Jenkins PRNG is 64bit output, and the SystemFunction036 allows the app to determine the size of the random input, but it appears to allow 64bits and even more. $rand now returns values as large as the 2^53 doubles range instead of chopping the range near 2^46.5, however the odd/even range size can determine whether all or mostly even numbers are returned. In all these examples, the low end of the range is zero, and if $2 is 's' it uses $rands instead of $rand, showing the effects appear in both identifiers. At the largest range, it can return numbers very close to the top of the range. However while this example shows an even split between odd/even numbers, the odd numbers are at the top of the range and the even numbers are all at the bottom of the range. /randbiasdemo3 2^531  For 2^522 and all other range maxes in both directions where the 2 is replaced by a negative or positive even number, all returned values are even. /randbiasdemo3 2^522  The odd/even bias comesandgoes as the range size shrinks, like the previous modulo bias did. In this example and other evennumbered range maxes in both directions, 2/3rds of returned numbers are even. /randbiasdemo3 0.67*2^512  60% evens: /randbiasdemo3 0.80*2^501 53% evens: /randbiasdemo3 0.53*2^491 52% evens: /randbiasdemo3 0.70*2^481 The above post's calculation of %throwaway_above should enable chopping a large input range down to a smaller output range without introducing bias. alias randbiasdemo3 {
if ($calc($1) isnum 19007199254740991) var s %max $gettok($v1,1,46)  else var s %max $calc(2^522)
var %pockets 256 , %array_odd $str(0 $chr(32),%pockets) , %array_even %array_odd
var %i 25600 , %div %max / %pockets , %odd.count 0 , %even.count 0
if ($2 == s) echo a using $ $+ rands
while (%i) {
if ($2 == s) var %rr $rands(0,%max)  else var %rr $rand(0,%max)
if ($right(%rr,1) isin 13579) {
var %t $calc(1+ (%rr // %div)) , %a $gettok(%array_odd,%t,32) + 1 , %array_odd $puttok(%array_odd,%a,%t,32)  inc %odd.count
}
else {
var %t $calc(1+ (%rr // %div)) , %a $gettok(%array_even,%t,32) + 1 , %array_even $puttok(%array_even,%a,%t,32)  inc %even.count
}
dec %i
}
echo 3 a odd: %array_odd = $calc($replace(%array_odd ,$chr(32),+)) $chr(22) $calc(%odd.count *100/(%odd.count + %even.count)) $+ % $chr(22) . $stddev(%array_odd)
echo 4 a eve: %array_even = $calc($replace(%array_even,$chr(32),+)) $chr(22) $calc(%even.count *100/(%odd.count + %even.count)) $+ % $chr(22) . $stddev(%array_even)
}
alias l stddev {
tokenize 32 $1  if ($0 < 2) return  var %sum 0 , %sum.of.squares 0
var %j $0  while (%j) { inc %sum $gettok($1,%j,32)  dec %j }
var %mean %sum / $0  var %j $0  while (%j) {
inc %sum.of.squares $calc( ($gettok($1,%j,32)  %mean)^2 )  dec %j }
var %pop.variance $calc(%sum.of.squares / ($0 ))
var %sample.variance $calc(%sum.of.squares / ($0 1))
var %pop.std.dev $calc((%sum.of.squares / ($0 ))^.5)
var %sample.std.dev $calc((%sum.of.squares / ($0 1))^.5)
var %std.err $calc(%sample.std.dev / ($0 ^ .5) )
return count: $0 sum: %sum mean: %mean popvar: %pop.variance pop.stddev: %pop.std.dev sample.var: %sample.variance sample.std.dev: %sample.std.dev std.err: %std.err
}

Top




#264838  Yesterday at 11:41 AM
Re: $rand bias
[Re: maroon]

Planetary brain
Registered: 04/12/02
Posts: 4326
Loc: London, UK

In the new rand, i'm seeing evidence of uneven bias at huge ranges, but not the same kind as before. The bias seems identical for both $rand and $rands, so it's more likely caused by identical postprocessing of the random inputs. That is very likely the reason for it. In both cases, mIRC needs to round an int64 to a double, as doubles are used in most calculations, eg. $calc(). On the other hand, are you saying that when $rand() is used with values in a reasonable range, ie. not huge values that are going to overflow due to internal conversions, that it is working well? Regarding %throwaway_above, I am not convinced that this is something mIRC should be doing. The scripter can easily cause calcuations to overflow throughout the scripting language, if they choose large enough numbers. Or are you saying that $rand() and $rands() should be mapping all int64 results to a specific maximum value?

Top




#264844  Yesterday at 10:40 PM
Re: $rand bias
[Re: Khaled]

Hoopy frood
Registered: 12/01/04
Posts: 1036

I had this written up already, so I'll go ahead an post it.
With random numbers, it's hard to state that something is 'working well' just because you can't easily see problems. When problems are visible at huge ranges but aren't at small ranges doesn't mean the problem isn't there at the smaller ranges.
If it's not reasonable to try to obtain a uint64 %throwaway_above where the range value could be a 53bit integer, a better method than trying to round to a double, would be something similar to the XOR folding mentioned in the FNV site. If trying to reduce a 64bit data input to a number in the 02^531 range where $calc can accurately do math, it can first split the 64 bits into group #1 having 53 bits and group #2 having 11 bits. The 11 bits of group#2 could be XOR'ed against the lowest 11 bits of group#1. This would create a 53 bit value where each of the 01FFFFFFFFFFFF values have the same 2^11 number of possible inputs.
Once you have the 53 bit value, then $calc can accurately calculate the %throwaway_above value needed for the $rand(N1,N2) range as long as none of %range or N1 or N2 is outside the accuracy range.
Once the 64bit number has been narrowed to a 53bit number, using the %throwaway_above value ensures that all outputs have an equal number of outputs, because it throws away extra inputs only available to some of the outputs. Where %in_max is a 52/53 bit number, the number of values being thrown away can never exceed half of %in_max and can never exceed (%range 1). Worst case scenario is throwing away nearly half the first random number, and for smaller ranges it should rarely come into play.
Any method of taking a 2^N size input and applying it against an output range whose size is not a 2^N size is going to have some bias in it, even if it's not easily detectable at 'normal' ranges. Other uses like the 64bit salt/iv used by $encode, or the 128bit salt used by $zip, could use the raw input used by $rands().

Top





