Question

First of all, I am a bash noob, so please be gentle :)

I am trying to sum the sizes of folders that are in different places but have the same name. It looks like this:

root
--- directory 1
------ folder 1
--------- subfolder 1
--------- subfolder 2
------ folder 2
--------- subfolder 3
--------- subfolder 4
------ folder 3
--------- subfolder 5
--------- subfolder 6
--- directory 2
------ folder 1
--------- subfolder 1
--------- subfolder 2
------ folder 2
--------- subfolder 3
--------- subfolder 4
------ folder 3
--------- subfolder 5
--------- subfolder 6

I am trying to sum the sizes of subfolders 1 to 6 and output the result to a .csv file.

At the moment I am simply outputting the sizes of the subfolders into two separate CSV files: one for directory 1 and one for directory 2.

Right now I use this command, run from wherever I need it, to output the sizes of the subfolders:

du -h --max-depth=1 --block-size=GB * | grep "[\/]" | sort -n -r > ~/lists/disks/RC_job.csv

The output looks like this:

40GB folder1/subfolder1
15GB folder1/subfolder2
10GB folder2/subfolder3
...

I have one output for directory 1 and one for directory 2. I would like to sum the sizes of the subfolders from directories 1 and 2 and get an output that looks like this:

60GB subfolder1
25GB subfolder2
10GB subfolder3

Where subfolder1 is directory1/folder1/subfolder1 + directory2/folder1/subfolder1

This is my first post here, so I do not know if this is enough info. I would be pleased to provide more if necessary. I am pretty sure this can be done with awk, but I haven't really used it yet.

Cheers!

Edit to answer a question in the comments:

Part of the output of du -h /net/rcq-rp/job/rcq/vault/image/film /net/rcq-rp/job/rcq/film --max-depth=1 --block-size=GB * is:

1GB /net/rcq-rp/job/rcq/vault/image/film/nr106/nr106_0010
1GB /net/rcq-rp/job/rcq/vault/image/film/nr106/nr106_0020
1GB /net/rcq-rp/job/rcq/vault/image/film/nr106/nr106_0030
1GB /net/rcq-rp/job/rcq/vault/image/film/nr106/nr106_0035
1GB /net/rcq-rp/job/rcq/vault/image/film/nr106/nr106_0040
1GB /net/rcq-rp/job/rcq/vault/image/film/nr106/nr106_0045
2GB /net/rcq-rp/job/rcq/vault/image/film/nr106/nr106_0050
1GB /net/rcq-rp/job/rcq/vault/image/film/nr106/nr106_0060
1GB /net/rcq-rp/job/rcq/film/nr106/nr106_0010
1GB /net/rcq-rp/job/rcq/film/nr106/nr106_0020
1GB /net/rcq-rp/job/rcq/film/nr106/nr106_0030
1GB /net/rcq-rp/job/rcq/film/nr106/nr106_0035
1GB /net/rcq-rp/job/rcq/film/nr106/nr106_0040
1GB /net/rcq-rp/job/rcq/film/nr106/nr106_0045
1GB /net/rcq-rp/job/rcq/film/nr106/nr106_0050
1GB /net/rcq-rp/job/rcq/film/nr106/nr106_0060

Ideally, the final output would be:

2GB nr106_0010
etc...

Solution

One way to do this is with an associative array. An associative array maps a series of keys to values, for example:

directory1 -> 10 GB
directory2 -> 12 MB
directory3 -> 40 KB

The keys in an associative array must be unique. That's great! The paths to our directories are also unique. Let's put them in an associative array. I will show how to do this in awk but plenty of other languages have associative arrays (like Perl, which calls them hashes).

du | awk '{ val = $1; dir = $2; sizes[dir] = val }'

(I took out the arguments you pass to du for simplicity)

What does this do? awk reads the output of du line by line; for each line, it adds an element to the associative array sizes with the directory name as the index and the size as the value. If our original input looked like this

40GB folder1/subfolder1
15GB folder1/subfolder2
10GB folder2/subfolder1

our array would look like this:

sizes[folder1/subfolder1] -> 40GB
sizes[folder1/subfolder2] -> 15GB
sizes[folder2/subfolder1] -> 10GB

But in our final output we just want to see values for the subdirectories. awk has functions for string manipulation, so let's tweak our code to strip off leading directories:

du | awk '{ val = $1; dir = $2; sub(/^.*\//, "", dir); sizes[dir] = val }'

The sub function strips off everything from the beginning of the path up to and including the last / (a quick standalone test is shown below the array). Now our array looks like this:

sizes[subfolder2] -> 15GB
sizes[subfolder1] -> 10GB
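
If you want to check what sub does on its own, here is a tiny standalone test using one of the made-up paths from above (without an explicit target, sub works on the whole input line):

echo "folder2/subfolder1" | awk '{ sub(/^.*\//, ""); print }'

which prints just subfolder1.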

Great! Now we only have values for the subdirectories. There's just one little problem. The values aren't totals. Since we had more than one subdirectory named subfolder1, we overwrote the first value (40GB) with the second one (10GB). When we run into an index that already exists in our array, what we really want to do is add its value to the existing value:

du | awk '{ val = $1; dir = $2; sub(/^.*\//, "", dir); sizes[dir] += val }'

(I changed sizes[dir] = val, which uses assignment, to sizes[dir] += val, which adds val to whatever is already in sizes[dir])
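
You can see that accumulation on its own by feeding awk two of the made-up lines from above instead of real du output:

printf '40GB folder1/subfolder1\n10GB folder2/subfolder1\n' |
  awk '{ val = $1; dir = $2; sub(/^.*\//, "", dir); sizes[dir] += val } END { print sizes["subfolder1"] }'

which prints 50.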

awk magically takes care of some things for us, like converting 15GB to the number 15. Now our array looks like this:

sizes[subfolder2] -> 15
sizes[subfolder1] -> 50

which shows us the totals we're looking for. Now, how do we display this? We can loop through the array and print out the keys and values like this:

du | awk '{ val = $1; dir = $2; sub(/^.*\//, "", dir); sizes[dir] += val } \
          END { for (dir in sizes) print dir, sizes[dir], "GB" }'

and our results are

subfolder1 50 GB
subfolder2 15 GB

EDIT: Here are the results I get using the du output in the updated question.

nr106_0060 2 GB
nr106_0050 3 GB
nr106_0045 2 GB
nr106_0040 2 GB
nr106_0035 2 GB
nr106_0030 2 GB
nr106_0020 2 GB
nr106_0010 2 GB
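
Since you mentioned writing the result to a .csv, one way to finish up is to have the END block print comma-separated fields and pipe the output through sort (the for (dir in sizes) loop visits the keys in no particular order, which is why the list above is unsorted). This is only a sketch, reusing the du invocation and the RC_job.csv path from your question; adjust both to your setup:

du -h --max-depth=1 --block-size=GB * |
  awk '{ val = $1; dir = $2; sub(/^.*\//, "", dir); sizes[dir] += val }
       END { for (dir in sizes) printf "%dGB,%s\n", sizes[dir], dir }' |
  sort -t, -k1,1 -n -r > ~/lists/disks/RC_job.csv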

Other tips

I am not sure how many csv files you will need in the end, but maybe this can help:

 du -h --block-size=GB ./* | awk -F "[: \t/]" '{size[$NF] += $1} END {for (dir in size) print size[dir], dir}' | sort -n -r

The expression size[$NF] += $1 sums the sizes (the first field), storing the result in the associative array indexed by the directory name (the last field).
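
To see how that -F "[: \t/]" separator splits one of the du lines from the question into fields, here is a small illustration:

printf '1GB\t/net/rcq-rp/job/rcq/vault/image/film/nr106/nr106_0010\n' |
  awk -F "[: \t/]" '{ print "first field: " $1; print "last field: " $NF }'

which prints 1GB as the first field and nr106_0010 as the last, so $1 is the size and $NF is the innermost directory name.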
