Concordance of text

https://stackoverflow.com/questions/9056829

linux
tr

04-12-2019

Question

I have been reading the Linux cookbook to get the hang of it. I am fairly new to Linux.

I came across a topic called "Concordance of text". I understand what it is, but I am not able to work out a sequence of commands using tr, sort and uniq (that's what the cookbook says to use) that would generate the concordance.

Can someone tell me how to create a basic concordance? i.e. just sort and display word frequency for each unique word.

The idea presented in the cookbook is to use tr to translate all spaces to newline characters so that each word goes onto its own line. That output is passed to sort, and then to uniq with the -c flag to count each unique term.

I am not able to figure out the correct parameters though. Can someone please show the commands and explain what each parameter does?

I have searched for this but could not find a clearly defined answer to my problem.

Any help is much appreciated!

Solution

tr ' ' '\n' <input | sort | uniq -c
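Since the question asks what each parameter does, here is a minimal annotated sketch of the same pipeline. The function name concordance and the sample sentence are made up for illustration; the commands and flags are the ones used above.

```shell
# Hypothetical helper: reads text on stdin, prints one "count word" line per unique word.
concordance() {
    tr ' ' '\n' |   # first argument: characters to replace (space); second: replacement (newline)
        sort |      # put identical words on adjacent lines
        uniq -c     # collapse adjacent duplicates; -c prefixes each line with its count
}

# Sample input, made up for this example.
printf 'the cat saw the dog\n' | concordance
```

Note that sort has to come before uniq, because uniq only compares neighbouring lines. Runs of several spaces produce empty lines, which show up as a count for the empty word; tr -s ' ' '\n' would squeeze them.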

If I understand your comment correctly, you want the total of all words over all files in a directory. You can do that like this:

find mydir -type f -exec cat {} + | tr ' ' '\n' | sort | uniq -c

find recursively searches mydir for files that match its arguments. -type f tells it to keep only regular files (as opposed to directories or a couple of other types you shouldn't have to worry about yet). -exec cat {} + then makes find run cat with the matching file names as arguments; cat concatenates the files, printing all their contents as if they were one big file. That output goes through the same tr/sort/uniq pipeline to calculate the concordance.
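As a rough sketch of how this behaves on a small tree, the following builds a throwaway directory and runs the pipeline over it. The helper name dir_concordance, the temporary directory, and the file names and contents are all assumptions made up for the demonstration.

```shell
# Hypothetical helper: word counts over every regular file under directory $1.
dir_concordance() {
    find "$1" -type f -exec cat {} + |   # {} + hands many file names to a single cat
        tr ' ' '\n' | sort | uniq -c
}

# Throwaway demo data (names and contents are invented).
demo=$(mktemp -d)
mkdir "$demo/sub"
printf 'red green\n'  > "$demo/a.txt"
printf 'green blue\n' > "$demo/sub/b.txt"

dir_concordance "$demo"   # "green" is counted across both files
rm -r "$demo"
```

Because cat merges everything into one stream before counting, the totals are per directory tree, not per file.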

Other tips

There are many ways to do this, but this is my solution. It uses different commands than the ones you mention but, through the use of sed and a final sort, it may produce more desirable output.

find . -type f -print0 | xargs -0 cat | sed 's/[[:punct:]]//g' | sed -r 's/\s+/\n/g' | sort | uniq -c | sort -n

find . -type f -print0 recursively searches all folders and files from your current directory downwards. -type f returns only files. -print0 ends each file name with the special \0 (NUL) character, so that spaces in file names do not confuse the next command in the pipe.

xargs takes input and turns it into arguments for a command, in this case cat. cat will print the contents of all files given to it as arguments. The -0 tells xargs that its input is delimited by the special \0 character, not by spaces.
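To see why the \0 delimiter matters, here is a small sketch using a file whose name contains a space. The helper name cat_all and the file name are made up; -print0 and -0 are supported by the GNU and BSD versions of find and xargs, which is an assumption about your system.

```shell
# Hypothetical helper: print the contents of every regular file under $1.
cat_all() {
    # NUL-delimited names, so "my notes.txt" reaches cat as ONE argument.
    find "$1" -type f -print0 | xargs -0 cat
}

demo=$(mktemp -d)
printf 'alpha beta\n' > "$demo/my notes.txt"   # file name with a space (invented)

cat_all "$demo"
rm -r "$demo"
```

With plain find | xargs cat, the name would be split at the space and cat would be asked to open two files that do not exist.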

sed is a pattern-matching stream editor. The first sed command substitutes (s) all punctuation, matched with the [[:punct:]] pattern, with nothing. It replaces every match in each line given to it (g).

The second sed command turns every run of one or more whitespace characters (\s+) into a newline (\n), again for every match on the line (g). The -r flag enables the extended regular expressions needed for +.

sort organizes the words alphabetically.

uniq -c eliminates adjacent duplicates in the output list while counting how many there were.

sort -n sorts this output numerically, yielding a list of words ordered by frequency with the most frequent words last.

sed and xargs are very powerful commands, especially if used in conjunction. But, as another poster has noted, find also has almost unbridled power. tr is useful, but is more specific than sed.
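For comparison, the two sed steps can be approximated with tr alone. This is a sketch rather than an exact equivalent: the helper name words and the sample sentence are made up, and it relies on tr accepting the [:punct:] and [:space:] character classes.

```shell
# Hypothetical helper: split stdin into one word per line, dropping punctuation.
words() {
    tr -d '[:punct:]' |          # -d deletes every punctuation character
        tr -s '[:space:]' '\n'   # whitespace becomes newline; -s squeezes runs into one
}

# Sample input, made up for this example.
printf 'Hello, world!  Hello again.\n' | words | sort | uniq -c | sort -n
```

Here the double space after "world!" does not create an empty word, because -s squeezes the repeated newlines.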

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow