Why is this method better than /aline -n? The main use of /filter here is just to loop through the lines in the file. /filter could still be used with /aline -n for the same purpose. The actual differences here are:
- instead of using a hidden window as workspace, you are using a file (tempfile.txt)
- instead of /aline -n, you are using $read() to check for lines already encountered
So the $read()-based method is essentially the same, only much slower: for each line in the original file, it uses $read() on a temp file to check whether that line has already been written (which means opening the temp file, examining each line until a match is found, and closing the file) and then, if no match was found, calls /write (which means opening the file again, appending a line at the end, and closing it).
/aline -n uses the same (fundamentally slow) approach of scanning all lines in the workspace (until a match is found) for each line in the source. However, since everything happens in a hidden window rather than on disk, the process is much faster.
A different approach that is much faster in theory (and in practice) is to use hash tables to store the lines encountered so far. Due to the way hash tables work, checking if a new line in the source has been encountered is very fast. Here are all 3 methods:
rmduplines1 {
write -c temp.txt
filter -k $1- rmduplines1-callback ?*
}
rmduplines1-callback {
; $1- is the current line passed by /filter; write it only if it's not already in temp.txt
if (* !iswm $read(temp.txt,nw,$1-)) write temp.txt $1-
}
rmduplines2 {
window -h @a
filter -k $1- rmduplines2-callback ?*
savebuf @a temp.txt
close -@ @a
}
rmduplines2-callback {
; aline -n adds the line only if it is not already in the window
if (* iswm $1-) aline -n @a $1-
}
rmduplines3 {
window -h @a
filter -k $1- rmduplines3-callback ?*
savebuf @a temp.txt
close -@ @a
hfree -w lines
}
rmduplines3-callback {
if (* iswm $1-) {
; $crc produces a space-free item name that can be used as a hash table key
if (!$hget(lines,$crc($1-,0))) aline @a $1-
hadd -m lines $crc($1-,0) 1
}
}
I made some minor modifications to your original code, like adding the n switch in $read and removing the asterisk from $1-: you want to scan for lines equal to $1-, not for lines starting with $1-. Another minor modification was replacing the if (!$read(...)) check with if (* !iswm $read(...)), which is equivalent to if ($read(...) != $null): this way the check won't incorrectly fail when the matched line happens to be "$false" or "0" etc.
On a side note, using $read with the w switch would break if $1- itself contained wildcard characters: a way to avoid this is to use the s switch instead and also check $readn.
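To make that concrete, here is a sketch of that s/$readn variant of the first callback (untested, and with one caveat: the s switch matches any line that starts with the given text, not only lines exactly equal to it):

rmduplines1-callback {
; the s switch does a plain-text scan, so wildcards inside $1- are harmless;
; with s, $read returns the text following the match, which is $null for an
; exact match, so check $readn instead ($readn is 0 when no line matched)
noop $read(temp.txt,ns,$1-)
if (!$readn) write temp.txt $1-
}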
All 3 rmduplinesN aliases accept a filename and create the file temp.txt with duplicate lines removed.
I timed the 3 methods using the full
versions.txt (more than 11,000 lines) and got the following results from one run (subsequent runs produced similar numbers):
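The timings can be reproduced with something like the following $ticks-based wrapper (a sketch; the exact harness wasn't shown, and the alias name rmduplines-time is made up):

rmduplines-time {
; calls one of the three aliases on the given file and reports elapsed ms
var %t = $ticks
rmduplines3 $1-
echo -a rmduplines3 time: $calc($ticks - %t) ms
}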
rmduplines1 time: 31949 ms
rmduplines2 time: 3588 ms
rmduplines3 time: 873 ms
The $read()/write method is by far the slowest, while the hash table method is the fastest: about 36 times faster than $read()/write and about 4 times faster than /aline -n.