How can I shuffle the lines of a text file on the Unix command line or in a shell script?
23-09-2019
Question
I want to shuffle the lines of a text file randomly and write the result to a new file. The file may have several thousand lines.
How can I do that with cat, awk, cut, etc.?
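Solution
You can use shuf. On some systems at least (it doesn't appear to be in POSIX).
As jleedev pointed out, sort -R might also be an option. On some systems at least; well, you get the picture. It has been pointed out that sort -R doesn't really shuffle but instead sorts items according to their hash value.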
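[Editor's note: sort -R almost shuffles, except that duplicate lines / sort keys always end up next to each other. In other words, only with unique input lines / keys is it a true shuffle. While it's true that the output order is determined by hash values, the randomness comes from choosing a random hash function; see the manual.]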
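The editor's note can be illustrated with a small Python sketch. It is not how any particular sort implementation works internally; the function name hash_order and the choice of SHA-256 with a random salt are assumptions made for illustration. It only shows why ordering by a randomly salted hash keeps identical lines next to each other.

```python
import hashlib
import os


def hash_order(lines, salt=None):
    """Order lines by a salted hash (illustrative stand-in for hash-based sorting)."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt gives a different order per run

    def key(line):
        # Identical lines always get identical keys, so they stay adjacent.
        return hashlib.sha256(salt + line.encode("utf-8")).digest()

    return sorted(lines, key=key)


if __name__ == "__main__":
    # The three "a" entries and the two "b" entries always come out grouped.
    print(hash_order(["a", "b", "a", "c", "b", "a"]))
```

With unique input lines the result behaves like a shuffle; with duplicates it does not, which is the limitation described above.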
Other tips
A Perl one-liner would be a simpler version of Maxim's solution:
perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < myfile
This answer complements the many great existing answers in the following ways:
- The existing answers are packaged into flexible shell functions:
  - The functions take not only stdin input, but alternatively also filename arguments.
  - The functions take extra steps to handle SIGPIPE in the usual way (quiet termination with exit code 141), as opposed to breaking noisily. This is important when piping the function output to a pipe that is closed early, such as when piping to head.
- A performance comparison is made.
- POSIX-compliant function based on awk, sort, and cut, adapted from the OP's own answer:
shuf() { awk 'BEGIN {srand(); OFMT="%.17f"} {print rand(), $0}' "$@" |
sort -k1,1n | cut -d ' ' -f2-; }
- Perl-based function - adapted from Moonyoung Kang's answer:
shuf() { perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@"; }
- Python-based function, adapted from scai's answer:
shuf() { python -c '
import sys, random, fileinput; from signal import signal, SIGPIPE, SIG_DFL;
signal(SIGPIPE, SIG_DFL); lines=[line for line in fileinput.input()];
random.shuffle(lines); sys.stdout.write("".join(lines))
' "$@"; }
- Ruby-based function, adapted from hoffmanc's answer:
shuf() { ruby -e 'Signal.trap("SIGPIPE", "SYSTEM_DEFAULT");
puts ARGF.readlines.shuffle' "$@"; }
Performance comparison:
Note: These numbers were obtained on a late-2012 iMac with 3.2 GHz Intel Core i5 and a Fusion Drive, running OSX 10.10.3. While timings will vary with OS used, machine specs, and awk implementation used (e.g., the BSD awk version used on OSX is usually slower than GNU awk and especially mawk), this should provide a general sense of relative performance.
Input file is a 1-million-lines file produced with seq -f 'line %.0f' 1000000.
Times are listed in ascending order (fastest first):
- shuf: 0.090s
- Ruby 2.0.0: 0.289s
- Perl 5.18.2: 0.589s
- Python: 1.342s with Python 2.7.6; 2.407s (!) with Python 3.4.2
- awk + sort + cut: 3.003s with BSD awk; 2.388s with GNU awk (4.1.1); 1.811s with mawk (1.3.4)
For further comparison, the solutions not packaged as functions above:
- sort -R (not a true shuffle if there are duplicate input lines): 10.661s - allocating more memory doesn't seem to make a difference
- Scala: 24.229s
- bash loops + sort: 32.593s
Conclusions:
- Use shuf, if you can - it's the fastest by far.
- Ruby does well, followed by Perl.
- Python is noticeably slower than Ruby and Perl, and, comparing Python versions, 2.7.6 is quite a bit faster than 3.4.1
- Use the POSIX-compliant awk + sort + cut combo as a last resort; which awk implementation you use matters (mawk is faster than GNU awk, BSD awk is slowest).
- Stay away from sort -R, bash loops, and Scala.
I use a tiny perl script, which I call "unsort":
#!/usr/bin/perl
use List::Util 'shuffle';
@list = <STDIN>;
print shuffle(@list);
I've also got a NULL-delimited version, called "unsort0" ... handy for use with find -print0 and so on.
PS: Voted up 'shuf' too, I had no idea that was there in coreutils these days ... the above may still be useful if your system doesn't have 'shuf'.
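The NULL-delimited variant is mentioned above but not shown. As a rough idea of what such a tool does, here is a minimal Python sketch; it is not the author's "unsort0" script, and the function name shuffle_nul_delimited is hypothetical.

```python
import random


def shuffle_nul_delimited(data, rng=random):
    """Shuffle NUL-terminated records held in a bytes object."""
    records = data.split(b"\0")
    if records and records[-1] == b"":
        records.pop()  # drop the empty piece after the final NUL terminator
    rng.shuffle(records)
    # Re-terminate every record with NUL so the output has the same framing.
    return b"".join(record + b"\0" for record in records)


if __name__ == "__main__":
    # Records may contain spaces or newlines, as file names from find -print0 can.
    print(shuffle_nul_delimited(b"a b\0c\nd\0e\0"))
```

To use this as a filter, one would pass it the contents of sys.stdin.buffer and write the result to sys.stdout.buffer.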
Here is a first try that's easy on the coder but hard on the CPU which prepends a random number to each line, sorts them and then strips the random number from each line. In effect, the lines are sorted randomly:
cat myfile | awk 'BEGIN{srand();}{print rand()"\t"$0}' | sort -k1 -n | cut -f2- > myfile.shuffled
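The same three steps (prepend a random number, sort on it, strip it again) can be sketched in Python to make the idea explicit. The function name shuffle_by_random_keys is made up for this illustration.

```python
import random


def shuffle_by_random_keys(lines, rng=random):
    """Shuffle by decorating each line with a random key, sorting, then undecorating."""
    decorated = [(rng.random(), line) for line in lines]  # step 1: prepend a random key
    decorated.sort(key=lambda pair: pair[0])              # step 2: sort on the key only
    return [line for _, line in decorated]                # step 3: strip the key


if __name__ == "__main__":
    print(shuffle_by_random_keys(["one", "two", "three", "four"]))
```

The sort makes this O(n log n), which is part of why the pipeline is "hard on the CPU" compared with a direct in-memory shuffle.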
here's an awk script
awk 'BEGIN { srand() }
{ lines[++d] = $0 }
END {
    while (1) {
        if (e == d) { break }
        RANDOM = int(1 + rand() * d)
        if (RANDOM in lines) {
            print lines[RANDOM]
            delete lines[RANDOM]
            ++e
        }
    }
}' file
output
$ cat file
1
2
3
4
5
6
7
8
9
10
$ ./shell.sh
7
5
10
9
6
8
2
1
3
4
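The awk script above draws random indices and retries whenever an index has already been printed, so the last few lines can take many draws. A different technique, the Fisher-Yates shuffle, avoids the retries by swapping elements in place. The sketch below is that alternative, not a translation of the awk script, and the name fisher_yates is chosen for illustration.

```python
import random


def fisher_yates(lines, rng=random):
    """Return a shuffled copy of lines using the Fisher-Yates algorithm."""
    result = list(lines)  # work on a copy so the input is left untouched
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # randint includes both endpoints
        result[i], result[j] = result[j], result[i]
    return result


if __name__ == "__main__":
    print(fisher_yates([str(n) for n in range(1, 11)]))
```

Each element is touched once, so the running time is linear in the number of lines.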
A one-liner for python:
python -c "import random, sys; lines = open(sys.argv[1]).readlines(); random.shuffle(lines); print ''.join(lines)," myFile
And for printing just a single random line:
python -c "import random, sys; print random.choice(open(sys.argv[1]).readlines())," myFile
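But see this post for the drawbacks of Python's random.shuffle(). With more than 2080 elements, the number of possible permutations exceeds the period of the underlying random number generator, so not every ordering can be produced.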
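The figure of 2080 can be checked with a few lines of Python, assuming the generator is the Mersenne Twister with period 2**19937 - 1 (the one CPython's random module uses). The function name largest_fully_shufflable is made up for this sketch.

```python
MT_PERIOD = 2**19937 - 1  # assumed period: Mersenne Twister (MT19937)


def largest_fully_shufflable(period=MT_PERIOD):
    """Return the largest n such that n! does not exceed the generator period."""
    n, factorial = 1, 1
    while factorial * (n + 1) <= period:
        n += 1
        factorial *= n
    return n


if __name__ == "__main__":
    print(largest_fully_shufflable())  # prints 2080
```

This is a limit on which permutations are reachable, not on correctness: longer lists are still shuffled, just without covering every possible ordering.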
Simple awk-based function will do the job:
shuffle() {
awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
}
usage:
any_command | shuffle
This should work on almost any UNIX. Tested on Linux, Solaris and HP-UX.
Update:
Note that the leading zeros (%06d) and the rand() multiplication make it work properly even on systems where sort does not understand numbers. The fixed-width keys sort correctly in lexicographical order (a.k.a. normal string compare).
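The point about leading zeros can be demonstrated in a short Python sketch: fixed-width, zero-padded keys sort the same way as strings and as numbers, while unpadded keys do not. The helper name string_sort_matches_numeric is hypothetical.

```python
def string_sort_matches_numeric(values, width=None):
    """Check whether sorting the keys as strings gives the numeric order.

    If width is given, keys are zero-padded to that width (like %06d in awk).
    """
    if width is None:
        keys = [str(v) for v in values]
    else:
        keys = [f"{v:0{width}d}" for v in values]
    by_string = [int(k) for k in sorted(keys)]  # plain string comparison
    return by_string == sorted(values)          # compare with numeric order


if __name__ == "__main__":
    sample = [5, 123456, 42, 999, 70000]
    print(string_sort_matches_numeric(sample, width=6))  # padded keys
    print(string_sort_matches_numeric(sample))           # unpadded keys
```

Without padding, "123456" sorts before "42" because the comparison stops at the first differing character.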
Ruby FTW:
ls | ruby -e 'puts STDIN.readlines.shuffle'
One liner for Python based on scai's answer, but a) takes stdin, b) makes the result repeatable with seed, c) picks out only 200 of all lines.
$ cat file | python -c "import random, sys;
random.seed(100); print ''.join(random.sample(sys.stdin.readlines(), 200))," \
> 200lines.txt
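The one-liner above uses Python 2 print syntax. A version-neutral sketch of the same idea (a fixed seed for repeatability, sampling k lines without replacement) might look like this; the function name sample_lines is an assumption for illustration.

```python
import random


def sample_lines(lines, k, seed):
    """Pick k lines at random, repeatably for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    # sample() raises ValueError if k is larger than the number of lines.
    return rng.sample(list(lines), k)


if __name__ == "__main__":
    lines = ["line %d\n" % n for n in range(1, 1001)]
    picked = sample_lines(lines, 200, seed=100)
    print(len(picked))  # prints 200
```

The same seed gives the same selection on the same Python version; the exact lines chosen may differ between Python versions.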
A simple and intuitive way would be to use shuf.
Example:
Assume words.txt as:
the
an
linux
ubuntu
life
good
breeze
To shuffle the lines, do:
$ shuf words.txt
which writes the shuffled lines to standard output. To save them, redirect the output to a file:
$ shuf words.txt > shuffled_words.txt
One such shuffle run could yield:
breeze
the
linux
an
ubuntu
good
life
We have a package to do the very job:
sudo apt-get install randomize-lines
Example:
Create an ordered list of numbers, and save it to 1000.txt:
seq 1000 > 1000.txt
to shuffle it, simply use
rl 1000.txt
This is a python script that I saved as rand.py in my home folder:
#!/usr/bin/env python
import sys
import random

if __name__ == '__main__':
    # Read all lines of the file named on the command line.
    with open(sys.argv[1], 'r') as f:
        flist = f.readlines()
    random.shuffle(flist)
    for line in flist:
        print(line.strip())
On Mac OS X, sort -R and shuf are not available, so you can alias this in your bash_profile as:
alias shuf='python ~/rand.py'
If, like me, you came here looking for an alternative to shuf for macOS, then use randomize-lines.
Install the randomize-lines (homebrew) package, which has an rl command with similar functionality to shuf.
brew install randomize-lines
Usage: rl [OPTION]... [FILE]...
Randomize the lines of a file (or stdin).
  -c, --count=N          select N lines from the file
  -r, --reselect         lines may be selected multiple times
  -o, --output=FILE      send output to file
  -d, --delimiter=DELIM  specify line delimiter (one character)
  -0, --null             set line delimiter to null character
                         (useful with find -print0)
  -n, --line-number      print line number with output lines
  -q, --quiet, --silent  do not output any errors or warnings
  -h, --help             display this help and exit
  -V, --version          output version information and exit
If you have Scala installed, here's a one-liner to shuffle the input:
ls -1 | scala -e 'for (l <- util.Random.shuffle(io.Source.stdin.getLines.toList)) println(l)'
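This bash function has minimal dependencies (only sort and bash):
shuf() {
  # Prefix each line with $RANDOM and a unit-separator byte (0x1f), sort on
  # that prefix, then strip it again. Quoting and IFS= keep whitespace and
  # glob characters in the lines intact.
  while IFS= read -r x; do
    printf '%s\n' "$RANDOM"$'\x1f'"$x"
  done | sort |
  while IFS=$'\x1f' read -r x y; do
    printf '%s\n' "$y"
  done
}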
On Windows, you may try this batch file to shuffle your data.txt. The usage of the batch code is:
C:\> type list.txt | shuffle.bat > maclist_temp.txt
After issuing this command, maclist_temp.txt will contain a randomized list of lines.
Hope this helps.
Not mentioned as of yet:
- The unsort util. Syntax (somewhat playlist oriented):
unsort [-hvrpncmMsz0l] [--help] [--version] [--random] [--heuristic] [--identity] [--filenames[=profile]] [--separator sep] [--concatenate] [--merge] [--merge-random] [--seed integer] [--zero-terminated] [--null] [--linefeed] [file ...]
- msort can shuffle by line, but it's usually overkill:
seq 10 | msort -jq -b -l -n 1 -c r
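Another awk variant:
#!/usr/bin/awk -f
# usage:
# awk -f randomize_lines.awk lines.txt
# usage after "chmod +x randomize_lines.awk":
# randomize_lines.awk lines.txt
# Caveat: if rand() returns the same value twice, the earlier line is
# overwritten, and the order of "for (k in lines)" is implementation-defined.
BEGIN {
    FS = "\n";
    srand();
}
{
    lines[rand()] = $0;
}
END {
    for (k in lines) {
        print lines[k];
    }
}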