Question
I have been reading the cookbook for Linux to get the hang of it. I am fairly new to it.
I came across a topic called Concordance of text. Now I understand what it is, but I am not able to work out a sequence of commands using tr, sort and uniq (that's what the cookbook says to use) that would generate the concordance.
Can someone tell me how to create a basic concordance? i.e. just sort and display word frequency for each unique word.
The idea presented in the cookbook is to use tr to translate all spaces to newline characters so that each word goes onto its own line, pass that to sort, and then pass the sorted list to uniq with the -c flag to count the unique terms.
I am not able to figure out the correct parameters though. Can someone please show the commands and explain what each parameter does?
I have googled this but could not find a clearly defined answer to my problem.
Any help is much appreciated!
Solution
tr ' ' '\n' <input | sort | uniq -c
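Since the question asks what each part does, here is a small sketch of the same pipeline with comments. The sample text and the temporary file are made up for illustration; in the command above, input is simply the name of the file to read.

```shell
# Hypothetical sample text in a temporary file (stands in for "input").
sample=$(mktemp)
printf 'the cat saw the dog\nthe dog ran\n' > "$sample"

# tr ' ' '\n' : translate every space into a newline, so each word is on its own line
# < "$sample" : feed the file to tr on standard input (tr does not take file names)
# sort        : order the words so that identical words end up next to each other
# uniq -c     : collapse adjacent identical lines and prefix each with its count
concordance=$(tr ' ' '\n' < "$sample" | sort | uniq -c)
printf '%s\n' "$concordance"

rm "$sample"
```

With this sample, "the" should be counted three times and "dog" twice. The sort step matters because uniq only merges lines that are adjacent.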
If I understand your comment correctly, you want the total of all words over all files in a directory. You can do that like this:
find mydir -type f -exec cat {} + | tr ' ' '\n' | sort | uniq -c
find will recursively search mydir for files that match its arguments: -type f tells it to only keep normal files (as opposed to directories or a couple other types you shouldn't have to worry about yet), then find will execute cat, giving it all the file names as arguments; cat concatenates files, printing all their contents as if it were one big file. That output then goes through the same tr/sort/uniq pipeline to actually calculate the concordance.
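To try this out without touching real files, something like the following should work. The scratch directory, file names and contents are invented for the example; mydir here is a shell variable standing in for your own directory.

```shell
# Hypothetical scratch directory with one nested subdirectory.
mydir=$(mktemp -d)
mkdir "$mydir/sub"
printf 'apple banana apple\n' > "$mydir/a.txt"
printf 'banana cherry\n' > "$mydir/sub/b.txt"

# -type f        : keep regular files only
# -exec cat {} + : run cat with all the found file names appended as arguments
totals=$(find "$mydir" -type f -exec cat {} + | tr ' ' '\n' | sort | uniq -c)
printf '%s\n' "$totals"

rm -r "$mydir"
```

The counts are totals over both files, so apple and banana should each show 2 and cherry 1, no matter which file find visits first.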
Other tips
There are many ways to do this, but this is my solution. It uses different commands than you mention, but, through the use of sed and a final sort, it may produce more desirable output.
find . -type f -print0 | xargs -0 cat | sed 's/[[:punct:]]//g' | sed -r 's/\s+/\n/g' | sort | uniq -c | sort -n
find . -type f -print0 will recursively search all the folders and files from your current directory downwards. -type f will return only files. -print0 will use the special \0 character to end file names so that spaces aren't confusing to the next command in the pipe.
xargs takes input and turns it into arguments for a command, in this case cat. cat will print the contents of all files given to it as arguments. The -0 tells xargs that its input is delimited by the special \0 character, not by spaces.
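As a quick illustration of why -print0 and -0 go together, here is a sketch using a file whose name contains a space. The directory and the file name are made up for the example.

```shell
# Hypothetical directory holding one file with a space in its name.
spaced=$(mktemp -d)
printf 'one two\n' > "$spaced/my notes.txt"

# find ends each name with \0 and xargs -0 splits only on \0,
# so "my notes.txt" reaches cat as a single argument.
joined=$(find "$spaced" -type f -print0 | xargs -0 cat)
printf '%s\n' "$joined"

rm -r "$spaced"
```

Without the -print0/-0 pair, xargs would split the name at the space and cat would be asked to open two files that do not exist.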
sed is a pattern-matching stream editor. The first sed command substitutes (s) all punctuation, matched by the [[:punct:]] pattern, with nothing. It replaces every such match in each line given to it (g).
The second sed command turns every run of one or more whitespace characters (\s+) into a newline (\n) throughout each input line (g).
sort organizes the words alphabetically.
uniq -c eliminates adjacent duplicates in the output list while counting how many there were.
sort -n sorts this output numerically, yielding a list of words sorted by word frequency.
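Putting the stages together on some throwaway data might look like the sketch below. The directory, file names and text are invented, and the \s and \n escapes in the second sed assume GNU sed (other sed implementations may not accept them).

```shell
# Hypothetical scratch directory with two small files.
workdir=$(mktemp -d)
printf 'hello, world! hello   again.\n' > "$workdir/one.txt"
printf 'world; hello?\n' > "$workdir/two.txt"

ranked=$(find "$workdir" -type f -print0 | xargs -0 cat |
  sed 's/[[:punct:]]//g' |   # drop punctuation
  sed -r 's/\s+/\n/g' |      # one word per line (GNU sed assumed)
  sort | uniq -c |           # count each distinct word
  sort -n)                   # least frequent first, most frequent last
printf '%s\n' "$ranked"

rm -r "$workdir"
```

Here hello should end up last with a count of 3, after world with 2 and again with 1. Use sort -rn instead if you would rather see the most frequent words at the top.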
sed and xargs are very powerful commands, especially if used in conjunction. But, as another poster has noted, find also has almost unbridled power. tr is useful, but is more specific than sed.
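For comparison, tr can cover the punctuation and whitespace steps by itself if you prefer to stay with the tools from the cookbook. This is only a sketch: the sample sentence is made up, and lower-casing the text first is my own addition so that Hello and hello are counted together.

```shell
# Hypothetical sample sentence.
text='Hello, world! hello   again.'

# tr '[:upper:]' '[:lower:]' : fold everything to lower case
# tr -cs '[:alpha:]' '\n'    : -c selects every character that is NOT a letter,
#                              replaces it with a newline, and -s squeezes
#                              runs of newlines into a single one
by_tr=$(printf '%s\n' "$text" |
  tr '[:upper:]' '[:lower:]' |
  tr -cs '[:alpha:]' '\n' |
  sort | uniq -c | sort -n)
printf '%s\n' "$by_tr"
```

Note that this treats digits and apostrophes as separators too, which may or may not be what you want for a concordance.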