Babel fish (OP) - Joined: Dec 2002, Posts: 99
Greetings,
I'm seeking help writing an alias to split up a huge text file. It contains 100,000 entries, one entry per line, 100,000 lines in length, and I need it split into files of 5,000 entries each, one entry per line, 5,000 lines in length.
Ideally, the alias would also incorporate a generated number in the output filename for each 5,000 entries stored, ie text1.txt, text2.txt.
Here is a snippet of one of my crude attempts to just pull off the first 5,000 entries, but it's not working:
alias TextSplitter {
set %Textnumb 1
set %Textnum 5000
:next
if (%Textnumb > %Textnum) { /goto finish }
set %info $read(C:\1Report\text.txt,%Textnumb)
.write C:\1Report\text1.txt %info
set %info $null
inc %Textnumb
goto next
:finish
}
Again, this crude attempt of mine doesn't include any updating of the output filename and was just my attempt to peel off the top 5,000 entries as an example here.
Thanks to each for your comments and suggestions, MDA

Hoopy frood - Joined: Nov 2003, Posts: 2,327
Use /filter and make use of the n-n2 range properties; see /help /filter
New username: hixxy

Babel fish (OP) - Joined: Dec 2002, Posts: 99
Greetings Matt,
I'm not filtering lines; I'm trying to rewrite the entire whopper 100,000-line file into several manageable 5,000-line text files.
Thanks anyway, MDA

Hoopy frood - Joined: Nov 2003, Posts: 2,327
I understand what you want to do and my suggestion stays the same. /filter will be (one of) the quickest way(s) to do that:

alias hugefiltertest {
var %i = 1, %ticks = $ticks
if (!$isdir(test)) { mkdir test }
.fopen -no bigfile test\bigfile.txt
while (%i <= 100000) {
.fwrite -n bigfile %i
inc %i
}
.fclose bigfile
filter -cffr 1-10000 test\bigfile.txt test\file1.txt
filter -cffr 10001-20000 test\bigfile.txt test\file2.txt
filter -cffr 20001-30000 test\bigfile.txt test\file3.txt
filter -cffr 30001-40000 test\bigfile.txt test\file4.txt
filter -cffr 40001-50000 test\bigfile.txt test\file5.txt
filter -cffr 50001-60000 test\bigfile.txt test\file6.txt
filter -cffr 60001-70000 test\bigfile.txt test\file7.txt
filter -cffr 70001-80000 test\bigfile.txt test\file8.txt
filter -cffr 80001-90000 test\bigfile.txt test\file9.txt
filter -cffr 90001-100000 test\bigfile.txt test\file10.txt
echo -a Finished in $calc(($ticks - %ticks) / 1000) seconds.
}
Last edited by tidy_trax; 03/07/05 08:55 PM.
New username: hixxy

Babel fish (OP) - Joined: Dec 2002, Posts: 99
Matt,
I'm not understanding what you posted here. It looks like you are pulling off the top 5,000 entries each time, meaning the same 5,000 would be copied to files 1, 2, 3.
MDA

Hoopy frood - Joined: Nov 2003, Posts: 2,327
I was, I made a mistake. I've edited the above post.
New username: hixxy

Babel fish (OP) - Joined: Dec 2002, Posts: 99
Matt,
That routine took a while to run, and produced 20 files with nothing but the 5,000 line numbers in each of them; no data whatsoever from the original file was transferred over. I need the data transferred intact, with no extra information added such as line numbers etc.
alias hugefiltertest {
var %i = 1, %ticks = $ticks
if (!$isdir) { mkdir test }
.fopen -no bigfile test\bigfile.txt
while (%i <= 1000000) {
.fwrite -n bigfile %i
inc %i
}
.fclose bigfile
filter -cffr 1-5000 test\bigfile.txt test\file1.txt
filter -cffr 5001-10000 test\bigfile.txt test\file2.txt
filter -cffr 10001-15000 test\bigfile.txt test\file3.txt
filter -cffr 15001-20000 test\bigfile.txt test\file4.txt
filter -cffr 20001-25000 test\bigfile.txt test\file5.txt
filter -cffr 25001-30000 test\bigfile.txt test\file6.txt
filter -cffr 30001-35000 test\bigfile.txt test\file7.txt
filter -cffr 35001-40000 test\bigfile.txt test\file8.txt
filter -cffr 40001-45000 test\bigfile.txt test\file9.txt
filter -cffr 45001-50000 test\bigfile.txt test\file10.txt
filter -cffr 50001-55000 test\bigfile.txt test\file11.txt
filter -cffr 55001-60000 test\bigfile.txt test\file12.txt
filter -cffr 60001-65000 test\bigfile.txt test\file13.txt
filter -cffr 65001-70000 test\bigfile.txt test\file14.txt
filter -cffr 70001-75000 test\bigfile.txt test\file15.txt
filter -cffr 75001-80000 test\bigfile.txt test\file16.txt
filter -cffr 80001-85000 test\bigfile.txt test\file17.txt
filter -cffr 85001-90000 test\bigfile.txt test\file18.txt
filter -cffr 90001-95000 test\bigfile.txt test\file19.txt
filter -cffr 95001-100000 test\bigfile.txt test\file20.txt
echo -a Finished in $calc(($ticks - %ticks) / 1000) seconds.
}
Regards, MDA

Hoopy frood - Joined: Dec 2002, Posts: 1,245
Did you change test\bigfile.txt to C:\1Report\text.txt?
Last edited by MikeChat; 03/07/05 09:38 PM.

Babel fish (OP) - Joined: Dec 2002, Posts: 99
Greetings MikeChat,
No, I created a test subdirectory in the mIRC folder, copied over that huge text file to the test subdirectory, then renamed it to bigfile.txt to fit the script alias you wrote.
MDA

Hoopy frood - Joined: Feb 2004, Posts: 2,019
EDIT: Had a version, but I like DaveC's version better, get that one.
Gone.

Hoopy frood - Joined: Sep 2003, Posts: 4,230
Here mr lazy is ya alias!

;
;usage $splitup(source,destination folder/file) * Do not add .txt!
;
alias Splitup {
if (!$isfile($1)) { return -1 } | ; invalid source file
if (($nofile($2)) && (!$isdir($v1))) { return -2 } | ; invalid destination folder
if (!$mkfn($nopath($2))) { return -3 } | ; invalid destination file
var %i = 1 | filter -cffr 1-5000 $+(",$1,") $+(",$nofile($2),$mkfn($nopath($2)),%i,.txt")
while ($filtered) {
inc %i | filter -cffr $+($calc(1 + %i * 5000),-,$calc(5000 + %i * 5000)) $+(",$1,") $+(",$nofile($2),$mkfn($nopath($2)),%i,.txt")
}
remove $+(",$nofile($2),$mkfn($nopath($2)),%i,.txt") | dec %i
return %i
}

Example: //echo -a i made $splitup(C:\1Report\text.txt,otherfilename) files

Hoopy frood - Joined: Nov 2003, Posts: 2,327
Mine was just an example, I didn't actually write the code for you 
New username: hixxy

Hoopy frood - Joined: Sep 2003, Posts: 4,230
> EDIT: Had a version, but I like DaveC's version better, get that one.

lol, I almost fell off my chair on that one! Actually I thought mine was a bit dodgy in how it ends up making the extra file, then deletes it at the end. I was gonna say it was a feature: in case some old filenameNNN.txt files existed, you could see where the new ones ended by the missing one. But I thought that would be stretching the fact it's a fixup at the end just a little bit.

Babel fish (OP) - Joined: Dec 2002, Posts: 99
Thanks Matt, DaveC, but I'll pass on using the Filter command and just go back to the read/write format until I can work it through.

I don't know enough about the filter command to trust any code that is listed with "professional" comments like "here mr lazy". That reminds me very much of some jack that posted to another user, whose RAM problems were dumping his kernel32.exe stack, telling him to delete his kernel32 file.

I just haven't used the filter command enough to spot intentional bad code. Obviously, after following and trying the first suggested alias, which generated 20 crap files filled with nothing but line numbers, I'll study up on the Filter command and in the meanwhile work it out with the $read and /write commands.
Thanks anyway, MDA

Hoopy frood - Joined: Sep 2003, Posts: 4,230
lol, dude, calm down, I was being sarcastic; you did ask for the alias, remember. And you really should use the filter command OR at least switch to the /fopen $fread /fwrite /fclose commands.

I say this because when you read a line in a txt file using $read, it reads from the beginning of the file to the line number you requested. So reading a 100,000 line file means that line 1 takes a read of 1 line, line 2 takes a read of 2 lines, line 3 takes a read of 3 lines, etc. By the time you have read to line 100,000 you have read a total of over 5,000,000,000 lines, a large number indeed!

The first codes posted were examples of how to use the filter command. You received files with line numbers in them because that was the 100,000 line file it was using as the source; the alias actually made that file itself. I think you reported a bug in it that caused it to produce duplicate files, and that was corrected, but even then it was still just going to show you how to use filter, not actually let you use it on your file.

I'll explain quickly what the filter command does as I and others used it.
/FILTER -cffr 1-5000 source destination matchtext

Filter is designed to search the source file for matching lines. If no matchtext is present then it matches everything, so it's a straight copy; however, read on.
- the -c option tells it to clear the destination file before writing to it
- the -ff options are needed to tell /filter it's using files, since it can also use windows (-w) and other things as the source/destination locations
- the -r option is the big key item: it tells /filter to read a line range after the options, the starting line and ending line to search between

So we have: /FILTER, clear destination, file source, file destination, read lines 1-5000, source file, destination file. As I said above, since there is no matchtext it just copies all them lines.

I'll document my code below to let you see what it's doing.

***** I HAVE CORRECTED A BUG IN THE CODE ALSO ***** The line numbers in the while loop filter were wrong! ooops!
;usage $splitup(source,destination folder/file) * Do not add .txt!
;
alias Splitup {
if (!$isfile($1)) { return -1 } | ; invalid source file
;^ if $1 is not a file, exit with a value -1 to indicate bad source filename
;
if (($nofile($2)) && (!$isdir($v1))) { return -2 } | ; invalid destination folder
;^ if $2 minus the filename exists (there's a DIR there) then check if the DIR is valid, and if it's not then exit with a value -2 to tell you this.
;
if (!$mkfn($nopath($2))) { return -3 } | ; invalid destination file
;^ if after replacing illegal file characters in the destination filename there turns out to be no filename, then exit with a value -3 to tell you this
;
;* If we reach here we can proceed to make the files. *
;
var %i = 1
;^set destination file # fileNNN.txt NNN being 1
;
filter -cffr 1-5000 $+(",$1,") $+(",$nofile($2),$mkfn($nopath($2)),%i,.txt")
;^ do the first filter lines 1 to 5000
;^ I add " " around the source filename since /filter doesnt like the names if spaces are in it
;^ I also do the same to the destination file, but i also get the path $nofile(),
; & then remove illegal filename characters from the filename $mkfn($nopath())
; & finally add the file number %i, and the .txt
;
;A while loop repeats until the condition becomes false/zero; $filtered is a value saying how many lines went through the filter
;on first encountering this below, unless the source was empty, $filtered will have some value (5000 on a 100,000 line file)
while ($filtered) {
;^ enter here if some or all of the 5000 lines of the last /filter were copied
;
inc %i
;^ add 1 to the destination file number counter
;
filter -cffr $+($calc(%i * 5000 - 4999),-,$calc(%i * 5000)) $+(",$1,") $+(",$nofile($2),$mkfn($nopath($2)),%i,.txt")
;^ do a filter just like the one above, but this time use (%i * 5000 - 4999) for the start line and (%i * 5000) for the end line
;^ ie: if %i was 2 then its (2x5000-4999) to (2x5000) aka lines 5,001 to 10,000
;^ ie: if %i was 3 then its (3x5000-4999) to (3x5000) aka lines 10,001 to 15,000
;
}
remove $+(",$nofile($2),$mkfn($nopath($2)),%i,.txt")
;^ since the while loop exits only when the last filter had no lines copied (ie its line numbers are beyond the end of the file)
;^ I must delete this last file I created, as its empty anyway
;
dec %i
;^ lastly reduce the destination file number counter to be the last file actually written
;
return %i
;^ exit with this value, so u know how many files were made
;
}
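The corrected range arithmetic above maps chunk number %i to lines (%i * 5000 - 4999) through (%i * 5000). That mapping is easy to sanity-check outside mIRC; a small Python helper (the name is made up for illustration):

```python
def chunk_range(i, size=5000):
    """1-based inclusive (start, end) line numbers of chunk i.

    Mirrors $calc(%i * 5000 - 4999) and $calc(%i * 5000) in the alias.
    """
    return (i * size - (size - 1), i * size)
```

For example, chunk 2 covers lines 5001-10000 and chunk 3 covers lines 10001-15000, exactly as the comments in the alias state.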

Babel fish (OP) - Joined: Dec 2002, Posts: 99
Greetings,
The code 'looks' fine, however mIRC tells me it's missing a bracket. The re-editing is going slowly, trying to remove all the comments, and there are some piping symbols which I'm not quite sure why you added; they'll probably also have to be removed where they sit next to a comment, ie: | ; invalid destination folder
What is the correct command line entry for your alias, if it becomes functional, for a mIRC folder named Test and the original filename Fatext.txt in that folder to split up?
is it
/Splitup Test Fatext
??
MDA

Babel fish (OP) - Joined: Dec 2002, Posts: 99
Greetings DaveC,
You wrote: "And you really should use the filter command OR at least switch to the /fopen $fread /fwrite /fclose commands. I say this because, when you read a line in a txt file using $read, it reads from the beginning of the file to the line number you requested, so reading a 100,000 line file means that line 1 takes a read of 1 line, line 2 takes a read of 2 lines, line 3 takes a read of 3 lines, etc; by the time you have read to line 100,000 you have read a total of over 5,000,000,000 lines, a large number indeed!"

DaveC, the $read and /write commands also include the ability to remove a specific line of text after that variable has been read and written. Only one line would ever need to be read at any time: that line is written to another text file, a counter variable is increased, and then the original top line in the fat text file is removed. It's also likely FASTER than using the Filter command.
A Simple Working Example:
alias TextFileSplitter {
set %textnumb 1
set %textnum 5000
:next
if (%textnumb > %textnum) { /goto finish }
set %textee $read($mircdirtest\fatext.txt,1)
if (%textee == $null) { echo 4 Data Textee is null | /goto finish }
.write $mircdirtest\text1.txt %textee
.write -dl $+ 1 $mircdirtest\fatext.txt
set %textee $null
inc %textnumb
goto next
:finish
}
MDA
Last edited by MDA; 05/07/05 08:55 PM.

Hoopy frood - Joined: Sep 2003, Posts: 4,230
> | ; invalid destination folder

If you want to add a comment to the end of a line you need to use a | before it. The | is a command separator (rather than an output pipe director), ex:

//echo hello | echo blah

displays "hello" then "blah". So yes, you just remove the | as well. I have included an undocumented copy here. ** one small addition: I placed a . in front of the /remove command so it doesn't display the file removed message **

;usage $splitup(source,destination folder/file) * Do not add .txt!
;
alias Splitup {
if (!$isfile($1)) { return -1 }
if (($nofile($2)) && (!$isdir($v1))) { return -2 }
if (!$mkfn($nopath($2))) { return -3 }
var %i = 1
filter -cffr 1-5000 $+(",$1,") $+(",$nofile($2),$mkfn($nopath($2)),%i,.txt")
while ($filtered) {
inc %i
filter -cffr $+($calc(%i * 5000 - 4999),-,$calc(%i * 5000)) $+(",$1,") $+(",$nofile($2),$mkfn($nopath($2)),%i,.txt")
}
.remove $+(",$nofile($2),$mkfn($nopath($2)),%i,.txt")
dec %i
return %i
}

PS: in case you're having trouble cutting and pasting this into mirc: copy it, paste into wordpad, then copy it from there and paste into mirc. This is a problem with this forum and the way explorer (I think) copies text from it.

> What is the correct command line entry for your alias if it becomes functioning for a mIRC folder named Test and the original Filename Fatext.txt in that folder to split up?

Method 1: /Splitup sourcefile destinationfile
sourcefile is the exact filename, ex: blahblah.txt or c:\wobble\blah.txt etc
destinationfile is the file to split into, minus the .txt, ex: blob or c:\stats\blob etc

Method 2: var %filecount = $Splitup(sourcefile,destinationfile)
sourcefile is the same as in method one, but can now also be big file.txt or c:\my folder\blah file.txt
destinationfile is the same, but may also contain spaces like sourcefile

You need to do it the second way for names with spaces, because mirc doesn't handle spaces in filenames passed to the routine well. Imagine /splitup text file results: is it text & file results, or text file & results, or even text & file & ignore the word results!

Hoopy frood - Joined: Sep 2003, Posts: 4,230
> DaveC, $read and /write commands also include the ability to remove a specific line of text after that variable has been read and written. ... It's also likely FASTER than using the Filter command.

It will be faster, but still hugely disk intensive. When you remove the first line of the file, the file is completely rewritten minus the first line; I believe, however, it is rewritten in large blocks (64kb or 1meg etc) rather than a line at a time. Doing so would be incredibly wasteful, as it's a process that is not needed.

The fastest method of all would be to use the /fopen $fread /fwrite /fclose commands; then the source file will be read a total of 1 time, and each result file will be written a total of 1 time. However, I just ran a test on my alias and it took 2 seconds; 2 seconds for 100,006 lines isn't too shabby.
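The $read cost DaveC describes is the triangular number 1 + 2 + ... + 100,000; a small Python helper makes the figure easy to verify (the exact total is 5,000,050,000, which the posts round to 5,000,000,000):

```python
def total_lines_scanned(n):
    """Lines walked if each $read(file, i) rescans from the top of an n-line file.

    Reading line i costs i line scans, so the total is 1 + 2 + ... + n.
    """
    return n * (n + 1) // 2
```

For n = 100,000 this returns 5,000,050,000, versus exactly 100,000 line reads for the sequential /fopen $fread approach.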

Hoopy frood - Joined: Apr 2003, Posts: 701
I really doubt it, but feel free to actually test them both (or all three) on such a file of 100,000 lines.
The point remains, $read and /write do the following:
- open file
- read characters and count $crlf until the specified line is reached
- read the line in and return it
- close file

This means those files are opened and closed 100,000 times, and you still have to search 1+2+3+4+...+5000 lines for the /write in the smaller files...

Using /fopen, $fread, /fwrite and /fclose, you bring that number back to 1+20 times. $fread reads in sequence (it remembers the last position it read), and /fwrite just writes at the end, no searching needed. This makes it very likely to be faster, a lot faster even.
Now do the same in native compiled code instead of a script language like mIRC script and you get the performance of /filter.
* Kelder goes for the /filter if possible
Since you'll probably not believe me, try these 2 scripts:

alias test1 {
var %i = 1, %time = $ticks
fopen -no blub delme.txt
while (%i < 100000) {
.fwrite blub look! this is line number %i !
inc %i
}
fclose blub
echo -s time taken: $calc($ticks - %time) ms
}

alias test2 {
var %i = 1, %time = $ticks
while (%i < 100000) {
write delme.txt look! this is line number %i !
inc %i
}
echo -s time taken: $calc($ticks - %time) ms
}
Test1 runs in 10500 ms, test2 in 84200, and this test is only half the requirements...
ps: Look up /while, it might not be faster, but it has more chance of being readable and correct. While you're at it, /var is nice too!