cat /etc/os-release
Random file contents are meaningless, you can put there whatever you want.
WSL isn't archlinux.
Which is not archlinux at all…
[root@blob Arch]# cat /etc/os-release
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
LOGO=archlinux-logo
IMAGE_ID=archlinux
IMAGE_VERSION=2022.10.01
Another neat test is compressing the output with any modern general-purpose algorithm:
How about this time?
$ exam.password 32 10000 | xz -9 | wc -c
274724
$ pwgen --capitalize --numerals --symbols --secure 32 10000 | xz -9 | wc -c
275684
Here is the new bash function:
exam.password ()
{
local len=${1:-8};
local amount=${2:-1};
local i str index j k;
local str='01234567';
declare -a Data=([0]='0123456789' [1]='abcdefghijklm' [2]='nopqrstuvwxyz'
[3]='ABCDEFGHIJKLM' [4]='NOPQRSTUVWXYZ' [5]="'!?%(*)+,-" [6]='.:;<=>$&[]'
[7]='\`@"#^_{|}/~');
[[ $len -lt 8 ]] && {
\builtin echo "Length must be greater then 7." 1>&2;
return
};
for ((j=0,i=0; i<$[amount * len]; i++,j++))
do
index=${str:$[$SRANDOM%${#str}]:1};
if [[ ${#str} -gt 1 ]]; then
str=${str//"$index"};
else
str='01234567';
fi;
\builtin printf "%b" "${Data[${index}]:$[$SRANDOM%${#Data[$index]}]:1}";
[[ $j -lt $len ]] && continue;
j=0;
\builtin echo;
done;
\builtin echo;
set +x
}
If the thread is not about Arch, I suspect it’s going to be nuked soon.
I really hope not, because I am running ArchWSL.
]]>Bash makes no strong promises regarding `$SRANDOM` being a CSPRNG. It’s a best-effort attempt, but not a guarantee.⁽¹⁾ Until now the assumption was that we’re talking about Arch Linux. Because this distribution uses a recent Linux kernel, it provides getrandom. That means on Arch Linux `$SRANDOM` is expected to be a CSPRNG.⁽²⁾ That assumption doesn’t hold for other platforms.
On other systems that may be BSD-flavoured arc4random, if available, or a PRNG not meant for security purposes at all.⁽³⁾ With a note that `arc4random` itself may be a placeholder using a proper CSPRNG⁽⁴⁾ or a vulnerable RC4-based generator.
This should underline, how many traps there is to consider. If this is not Arch Linux, even my “/dev/urandom” solution may be invalid.
____
⁽¹⁾ “The random number generator is not linear on systems that support /dev/urandom or arc4random, so each returned number has no relationship to the numbers preceding it.” — bash: $SRANDOM (emphasis added by me)
⁽²⁾ Even that is not strictly true, because `gerandom` is used in non-blocking mode.
⁽³⁾ bash-5.2: lib/sh/random.c, lines 223–240
⁽⁴⁾ Nowadays usually ChaCha20 or AES.
academic assignment?
A BASH exercise yes. but not an academic assignment. In an constrained working environment
e.g: Windows Subsystem for Linux, only limit set tools are available. of cause there
are coreutils and pipes, but they are slower then pure BASH.
As for the OP, I hold my position. But it may be a great exercise in general programming
to understand where the above issues come from.
I agree this only suitable as an exercise not for production use. And I am not searching for a direct answer here,
all the advises I have got so far are valuable.
For security purposes one does not verify the distribution. One calculates the distribution. That’s supposed to be deductive reasoning, not inductive. People are making mistakes, so running a test is always nice — but that serves only as a safety net, never the proof. That is because a test, that does not produce a failure, is meaningless.⁽¹⁾
However: a test, that produces a failure, is meaningful. That property may be used to avoid wasting time on doing unnecessary work. And with a right test this can be done here. To even start asking if the distribution of symbols is uniform, the events of choosing a symbol must be mutually independent. A trivial and commonly employed visualization can be used here:
while true; do exam.password 64; done
Let it scroll. Do you see the dark bars repeating each 8 columns? There shouldn’t be there, just like any other easily recognizable pattern. That image saved me looking into the code, not to mention calculating anything.⁽²⁾
Another neat test is compressing the output with any modern general-purpose algorithm:
$ exam.password 32 10000 | xz -9 | wc -c
192220
$ for ((i=0; i<10000; ++i)); do keepassxc-cli generate -L 32; done | xz -9 | wc -c
242944
$ for ((i=0; i<10000; ++i)); do cat /dev/urandom | tr -cd '[:alnum:]' | head -c 32; echo ; done | xz -9 | wc -c
251900
Since those algorithms work by maximizing entropy-per-bit, they may be used for a quick, very rough estimation of how random was the compressed data. Perfectly random data would be incompressible at all. The smaller the result is, the more predictable is the data. As we can see, `exam.password` scores noticeably worse than proven alternatives. And it does that despite… two other options are not even using punctuation. In fact it is worse than a version using just 26 uppercase letters.
Finally, but that is not relevant to password generation for the reason given in the second paragraph: even if testing the distribution this way would have some value,⁽³⁾ it must be done without ignoring it’s probabilistic nature. Just eyeballing values will not do. If the underlying distribution may produce all values,⁽⁴⁾ there is always a non-zero chance a sample will show any other arbitrarily chosen distribution. Sometimes it’s fine to just take a look. Not here: there is too many similar distributions, which are supposed to be ruled out, to simply accept those values as uniform because they look flateish. There is too many flateish things around. Even a back of the envelope test, like Kolmogorov-Smirnov,⁽⁵⁾ makes it clear the look is deceiving. But let me propose another distribution:
P(x is letter) = 1/52 · 4/8
P(x is digit) = 1/10 · 1/8
P(x is special) = 1/31 · 3/8
As for the OP, I hold my position. But it may be a great exercise in general programming to understand where the above issues come from.
____
⁽¹⁾ At least in terms of binary logic.
⁽²⁾ As it happens, using `tr` to replace some character categories in this image with a single character points to the source of the problem.
⁽³⁾ In many areas it has.
⁽⁴⁾ That is: for any possible outcome, probability for that outcome is non-zero.
⁽⁵⁾ Before someone starts building a stake for applying KS to categorical data: the reference is a uniform distribution, so it works the same for continuous and discrete cases, and whichever total order is used.
exam.password.validate ()
{
declare -A Res;
local len=${1:-8};
local i c format=8 total amount=${2:-1};
local fun=${3:-'exam.password'};
[[ $len -lt 8 ]] && {
\builtin echo "Length must be greater then 7." 1>&2;
return
};
[[ $fun == 'pwgen' ]] && fun="/usr/sbin/pwgen --capitalize --numerals --symbols
--secure";
while read line; do
for ((i=0; i<${#line}; i++))
do
c="${line:$i:1}";
Res["$c"]=$[${Res["$c"]} + 1];
done;
done <<< \$"$($fun $len $amount)";
total=$[len*amount];
for i in "${!Res[@]}";
do
\builtin printf "%b" "${i}:";
\builtin printf '%1.2f' "$[${Res[$i]}*10000/total]e-2";
\builtin printf "%s" "% ";
if [[ $format -le 0 ]]; then
format=8;
\builtin echo;
else
format=$[format - 1];
fi;
done
}
I got this result:
$ exam.password.validate 8 100000 pwgen >/tmp/pwgen
?:1.25% >:1.27% =:1.25% <:1.25% ;:1.24% ::1.21% 9:1.24% 8:1.25% 7:1.25%
6:1.25% 5:1.24% 4:1.24% 3:1.25% 2:1.26% 1:1.24% 0:1.25% /:1.03% .:1.26%
-:1.24% ,:1.27% +:1.24% *:1.24% ):1.24% (:1.25% ':1.25% &:1.28% %:1.23%
$:1.24% #:1.04% ":1.04% !:1.24% _:1.04% ^:1.04% ]:1.23% [:1.23% Z:0.97%
Y:0.94% X:0.96% W:0.95% V:0.96% U:0.96% T:0.96% S:0.95% R:0.96% Q:0.97%
P:0.95% O:0.94% N:0.94% M:0.95% L:0.96% K:0.95% J:0.95% I:0.95% H:0.97%
G:0.96% F:0.94% E:0.96% D:0.97% C:0.96% B:0.96% A:0.95% @:1.05% ~:1.05%
}:1.04% |:1.04% {:1.02% z:0.97% y:0.94% x:0.96% w:0.95% v:0.98% u:0.95%
t:0.95% s:0.97% r:0.95% q:0.95% p:0.96% o:0.95% n:0.96% m:0.96% l:0.95%
k:0.97% j:0.97% i:0.97% h:0.95% g:0.97% f:0.95% e:0.94% d:0.95% c:0.96%
b:0.95% a:0.95% `:1.02%
Then I do the same verification with improved bash function:
exam.password ()
{
declare -a Data;
declare -a Res;
declare -A Type=([1]=10 [2]=13 [3]=13 [4]=13 [5]=13 [6]=10 [7]=10 [8]=12);
local len=${1:-8};
local amount=${2:-1};
local i num lower upper punct type type_index=8 j k;
Data[1]='0123456789';
Data[2]='abcdefghijklm';
Data[3]='nopqrstuvwxyz';
Data[4]='ABCDEFGHIJKLM';
Data[5]='NOPQRSTUVWXYZ';
Data[6]="'!?%(*)+,-";
Data[7]='.:;<=>$&[]';
Data[8]='\`@"#^_{|}/~';
[[ $len -lt 8 ]] && {
\builtin echo "Length must be greater then 7." 1>&2;
return
};
for ((j=0; j<$amount; j++))
do
for ((i=0; i<$len; i++))
do
type=$[$SRANDOM%$type_index];
if [[ $type_index > 1 ]]; then
type_index=$[type_index - 1];
else
type_index=8;
fi;
Res["$i"]=""$type_index":$[$SRANDOM%${Type[$type_index]}]";
done;
for ((i=0; i<$len; i++))
do
k=${Res[$i]#*:};
\builtin printf "%b" "${Data["${Res[$i]%:*}"]:$k:1}";
done;
\builtin echo;
done;
set +x
}
I got this result:
$ exam.password.validate 8 100000 >/tmp/pass
?:1.23% >:1.27% =:1.24% <:1.24% ;:1.23% ::1.23% 9:1.26% 8:1.24% 7:1.24%
6:1.25% 5:1.22% 4:1.23% 3:1.25% 2:1.26% 1:1.26% 0:1.25% /:1.05% .:1.23%
-:1.25% ,:1.23% +:1.23% *:1.24% ):1.24% (:1.25% ':1.24% &:1.24% %:1.27%
$:1.25% #:1.03% ":1.04% !:1.27% _:1.03% ^:1.06% ]:1.26% [:1.24% Z:0.96%
Y:0.94% X:0.95% W:0.94% V:0.96% U:0.97% T:0.96% S:0.95% R:0.98% Q:0.94%
P:0.96% O:0.94% N:0.98% M:0.95% L:0.96% K:0.96% J:0.97% I:0.96% H:0.95%
G:0.96% F:0.96% E:0.96% D:0.94% C:0.96% B:0.96% A:0.96% @:1.02% ~:1.04%
}:1.03% |:1.03% {:1.03% z:0.96% y:0.97% x:0.96% w:0.96% v:0.95% u:0.94%
t:0.97% s:0.97% r:0.95% q:0.95% p:0.95% o:0.95% n:0.95% m:0.97% l:0.95%
k:0.97% j:0.96% i:0.96% h:0.94% g:0.94% f:0.96% e:0.96% d:0.95% c:0.95%
b:0.96% a:0.96% `:1.03%
The bias in this case is mainly introduced by the algorithm itself. It picks alpha and digit at roughly 1/3 probability, but alpha is after all a-z A-Z 2*26 = 52 characters whereas digit is 0-9 = 10 characters. So you end up giving 10 characters the same probability as 52 characters. As a result, this algorithm is heavily biased towards having many digits in the password.
Normally when you generate random passwords, you make it meet minimum requirements (at least one lower, upper, digit, punct as that is unfortunately enforced on many sites) and then pick the rest at completely equal probabilities and not continue giving digits so much weight.
You can still generate passwords this way, if you somehow like passwords with many digits in them but you should be aware that this is what your algorithm is doing after all. Now if the password is long enough, it doesn't matter since entropy adds up anyway, but in a short password, you end up having less effective entropy than expected. I was not sure, from the original code and later reply, that this behavior is intended.
]]>If you are a beginner programmer⁽¹⁾ and want to play with code, it is great and everybody here appreciates that. I hope you will progress and one day contribute to the community. If that is just a program to generate random stuff for no particular purpose, it’s a nice exercise. But, if that is for actual password generation, from all the territories to explore as a rookie you have chosen one of the few areas, that you should absolutely not travel to at this skill level. Using an RPG analogy, you are armed with a knife and a wooden shield, but you want to jump into a monster lair where lvl 99 players have trouble surviving.⁽²⁾ If that goes to production, it’s even worse, as you are luring those monsters onto other players.
It’s not a matter of us pointing to a few specific vulnerabilities. Like a biased RNG you use⁽³⁾ or inadequate validation⁽⁴⁾. Sure, you will fix that… and what next? If we can no longer notice any it’s not because your code is fine — it’s because we can spot only so many mistakes, because we are not professional cryptographers, as cryptographers we would not spend days of otherwise well paid time on listing shortcomings of an obviously wrong approach, and — most importantly — it’s not a matter of whether you or us can break it, but potential adversaries.
No, the problem is not with the particular code. The problem is in what you are armed with. To approach that opponent, you must have a decent grasp of at least information theory, probabilistics, combinatorics, and knowledge of existing vulnerabilities and attacks. I can’t even find a way to express, how large is the gap between what you want to do and what you can do, because you first must be on the other side of this gap to have the clear view on it.
So I urge you, please refrain from touching this topic now. In particular for production. Study how key generation is done in existing software, understand why specific approaches were taken, see how existing algorithms were broken, spend some time observing the computer security playground. Then you may try doing that again. But I doubt you will want to.
____
⁽¹⁾ You may not be, but I can only judge what I see. And so far in this thread you gave exactly that impression.
⁽²⁾ With 13 years of professional programming under my belt, much more in general, and a long standing interest in security, this is still a tricky minefield to me.
⁽³⁾ $SRANDOM itself may be unbiased, but you unconsciously disturb the distribution by the use of operator `%`.
⁽⁴⁾ That validation gives some numbers, that are not answering any important question. Not only about cryptographic properties, but even about basic statistical ones.
For RANDOM you have to consider, it's pseudorandom and the number range is small (0-32767). In bash you can use SRANDOM instead.
I rewrote the code with SRANDOM.
exam.password ()
{
local len=${1:-8};
local amount=${2:-1};
local tmp res res1 res2 i j l num='0123456789';
local alpha='abcdefghijgklmnopqrstuvwxyzABCDEFGHIGKLMNOPQRSTUWXYZ';
local punct="'!\"#$%&()+,-.:;<=>?@[*]^_\`{|}/\~";
[[ $len -lt 8 ]] && {
\builtin echo "Length must be greater then 7.";
return
};
for ((j=0; j<$amount; j++))
do
res="${num:$[$SRANDOM%10]:1}${alpha:$[$SRANDOM%52]:1}${punct:$[$SRANDOM%32]:1}";
res1=${res:$[$SRANDOM%3]:1};
res=${res/"$res1"/};
res2=${res:$[$SRANDOM%2]:1};
res=${res/"$res2"/};
res+="$res1$res2";
l=$[len-3];
for ((i=0; i<$l; i++))
do
tmp="${num:$[$SRANDOM%10]:1}${alpha:$[$SRANDOM%52]:1}${punct:$[$SRANDOM%32]:1}";
res+="${tmp:$[$SRANDOM%3]:1}";
done;
\builtin echo "${res}";
done;
}
exam.password.validate ()
{
declare -A Res;
local len=${1:-8};
local i total amount=${2:-1};
[[ $len -lt 8 ]] && {
\builtin echo "Length must be greater then 7.";
return
};
while read line; do
for ((i=0; i<${#line}; i++))
do
[[ "${line:$i:1}" =~ [[:digit:]] ]] && {
Res[digit]=$[${Res[digit]} + 1];
continue
};
[[ "${line:$i:1}" =~ [[:alpha:]] ]] && {
Res[alpha]=$[${Res[alpha]} + 1];
continue
};
[[ "${line:$i:1}" =~ [[:punct:]] ]] && {
Res[punct]=$[${Res[punct]} + 1]
};
done;
done <<< "$(exam.password $len $amount)";
total=$[${Res[digit]} + ${Res[alpha]} + ${Res[punct]}];
for i in "${!Res[@]}";
do
\builtin printf "${i}:";
\builtin printf '%8.2f' "$[${Res[$i]}*10000/total]e-2";
\builtin echo "%";
done;
}
Tested probability.
$: time exam.password.validate 8 100000
alpha: 33.64%
punct: 32.63%
digit: 33.72%
real 0m21.147s
user 0m21.004s
sys 0m0.930s
And compared with pwgen:
$: time pwgen --capitalize --numerals --symbols --secure 100000 8 >/dev/null
real 0m0.223s
user 0m0.014s
sys 0m0.208s
$:time exam.password 8 100000 >/dev/null
real 0m7.154s
user 0m6.699s
sys 0m0.455s
Would this new version be considered to generate thousands of passwords for production use?
]]>doesn't matter if you create one password. might matter if you do thousands...
]]>head -c $(((len+4)/5*4)) /dev/urandom | basenc -w 0 --z85 | head -c $len
(editted a couple times, man does shell arithmetic suck... apparently the order of operations matters as flooring the values happens between operations. So '4 * (len+4) / 5' gives different results from '(len+4)/5 * 4'.
EDIT: perhaps it doesn't really matter. I'm not sure I fully understand it, but it seems /dev/urandom will just pull an initial seed from the actual entropy pool regardless of how much data is requested; then that seed is used to initialize the generator which can provide an effectively endless stream of (pseudo) random bytes. So perhaps a read of hundreds of kb from /dev/urandom has no more detriment to the entropy pool than a read of just a handful of bytes.
]]>basenc -w 0 --z85 /dev/urandom | head -c 29