Question

I use gsutil in a Linux environment for managing files in GCS. I enjoy being able to use the command

gsutil -m cp -I gs://...

preceded by some other command that pipes a list of files to gsutil on stdin for uploading; in doing so, I can maintain a local list of files that have been uploaded, or generate specific patterns to upload and hand them off.

I would like to be able to run a similar command, such as

gsutil -m rm -I gs://...

to scrub files similarly. Presently, I build a big list of files to remove and run it with the following code:

while read line
do
gsutil rm gs://...
done < "$myfile.txt"

This is extraordinarily slow compared to the multithreaded "gsutil -m rm..." command, and enabling the -m flag has no effect when you have to process files one at a time from a list. I also experimented with just running

gsutil -m rm gs://.../* # remove everything
<my command> | gsutil -m cp -I gs://.../ # put back the pieces that I want

but this involves recopying a lot of data and wastes a lot of time; the data is already there and just needs to have some of it removed. Any thoughts would be appreciated. Also, I don't have a lot of flexibility on either end with renaming files; otherwise, a quick rename before uploading would handle all of this.


Solution

As an interim solution, since we don't have a -I option for rm right now, how about building a string of all the objects you want to delete in your loop and then passing it to a single gsutil -m rm call? You could also do this with a simple Python script that invokes the gsutil command as a separate process.

Expanding on your earlier example, maybe something like the following (disclaimer: my bash-fu isn't the greatest, and I haven't tested this):

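# read object paths (without the gs:// prefix) from stdin into an array,
# so names containing spaces survive as single arguments
objects=()
while IFS= read -r line
do
  objects+=("gs://$line")
done
# one multithreaded call for the whole list
gsutil -m rm "${objects[@]}"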
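
The approach above hands every object to a single gsutil call. With a very long list, that can eventually run into the operating system's limit on command-line length ("Argument list too long"). Below is a minimal sketch of one way around that, deleting in batches. The function name rm_in_batches and the default batch size of 1000 are my own assumptions for illustration, not anything gsutil defines. Also note that later gsutil releases document an -I option for rm that reads the object list from stdin; if your version has it, that is probably the simpler route.

```shell
#!/bin/bash
# Hypothetical helper: read full object URLs (one per line) from stdin and
# delete them with "gsutil -m rm" in batches.
# $1 is the batch size; 1000 is an arbitrary default, not a gsutil limit.
rm_in_batches() {
    local batch_size="${1:-1000}"
    local batch=()
    local url
    while IFS= read -r url
    do
        [ -n "$url" ] || continue          # skip blank lines
        batch+=("$url")
        if [ "${#batch[@]}" -ge "$batch_size" ]; then
            # stdin is redirected so gsutil cannot consume the list being read
            gsutil -m rm "${batch[@]}" < /dev/null
            batch=()
        fi
    done
    # flush whatever is left over after the last full batch
    if [ "${#batch[@]}" -gt 0 ]; then
        gsutil -m rm "${batch[@]}" < /dev/null
    fi
}
```

Usage would look something like rm_in_batches 500 < list-of-urls.txt, where list-of-urls.txt is a placeholder for whatever file holds your gs:// URLs.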

OTHER TIPS

For anyone wondering, I wound up doing what Zach Wilt suggested above. For reference, I was removing on the order of a couple thousand files from each of 5 directories, so roughly 10,000 files. Doing this without the "-m" switch was taking upwards of 30 minutes; with the "-m" switch, it takes less than 30 seconds. Zoom!

For a robust example: I am using this to update Google Cloud Storage files to match local files. On the current day, I have a program that dumps lots of files that are incremental, and also a handful that are "rolled up". After a week, the incremental files get scrubbed locally automatically, but the same should happen in GCS to save the space. Here's how to do this:

#!/bin/bash

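# get the full date strings for touch
# (%x is locale-dependent; this assumes a locale whose %x output
# touch --date can parse)
start=$(date --date='-9 days' +%x)
end=$(date --date='-8 days' +%x)

# other vars: lowercase month abbreviation and two-digit day of month
# (the tr sets are quoted so the shell cannot glob-expand them)
mon=$(date --date='-9 days' +%b | tr 'A-Z' 'a-z')
day=$(date --date='-9 days' +%d)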

# display the day being cleaned
echo "Cleaning files from $start"

# create marker files whose modification times bound the date range
touch --date="$start" /tmp/start1
touch --date="$end" /tmp/end1

# repeat for all servers
for dr in "dir1" "dir2" "dir3" ... 
do

    # list local files in the date range; these are the ones to keep
    find "/local/path/$dr/" -newer /tmp/start1 ! -newer /tmp/end1 > "$dr-local.txt"

    # get list of all files from appropriate folder on GCS
    gsutil ls "gs://gcs_path/$mon/$dr/$day/" > "$dr-gcs.txt"

    # rewrite the GCS names as local paths so the two lists can be compared
    sed -i "s|gs://gcs_path/$mon/$dr/$day/|/local/path/$dr/|" "$dr-gcs.txt"

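    # build sed command file to delete matches (start from an empty file so a
    # stale or missing command file cannot break the sed call below)
    : > "$dr-del.txt"
    while IFS= read -r line
    do
        echo "\|$line|d" >> "$dr-del.txt"
    done < "$dr-local.txt"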

    # run command file to strip lines for files that need to remain
    sed -f "$dr-del.txt" <"$dr-gcs.txt" >"$dr-out.txt"

    # convert local names to GCS names
    sed -i "s|/local/path/$dr/|gs://gcs_path/$mon/$dr/$day/|" "$dr-out.txt"

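    # new array to hold the GCS names to delete
    del=()

    # convert newline separated file to an array, one name per element
    while IFS= read -r line
    do
        del+=("$line")
    done < "$dr-out.txt"

    # remove all files matching the final output (skip if nothing is left,
    # since gsutil rm with no arguments is an error)
    if [ "${#del[@]}" -gt 0 ]; then
        gsutil -m rm "${del[@]}"
    fi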

    # cleanup files
    rm "$dr-local.txt"
    rm "$dr-gcs.txt"
    rm "$dr-del.txt"
    rm "$dr-out.txt"

done

You'll need to modify this to fit your needs, but it is a concrete, working method for deleting files locally and then synchronizing the change to Google Cloud Storage. Thanks again to @Zach Wilt.
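
Two steps in the script above could be written more compactly. This is only a sketch, not what I actually ran. It assumes GNU find (for -newermt) and GNU grep, and the function names files_between and lines_not_in are made up for illustration. The first replaces the touch marker files with a direct date comparison. It also adds -type f, which the original find does not have, so directories are left out of the list. The second replaces the generated sed command file with a fixed-string, whole-line grep, so a dot in a file name is not treated as a wildcard and a name cannot match as a prefix of a longer name.

```shell
#!/bin/bash
# Hypothetical helpers; names and argument order are illustrative only.

# List regular files under directory $1 modified after date $2 and not
# after date $3. Dates can be anything GNU find's -newermt accepts,
# e.g. YYYY-MM-DD.
files_between() {
    find "$1" -type f -newermt "$2" ! -newermt "$3"
}

# Print the lines of file $2 that do not appear in file $1.
# -F: treat each pattern as a fixed string, -x: match whole lines only,
# -v: invert the selection, -f: read the patterns from a file.
lines_not_in() {
    local status=0
    grep -v -x -F -f "$1" "$2" || status=$?
    [ "$status" -le 1 ]    # 1 just means no lines were left; 2 is a real error
}
```

In the loop, the calls would look roughly like files_between "/local/path/$dr/" "$start" "$end" > "$dr-local.txt" and lines_not_in "$dr-local.txt" "$dr-gcs.txt" > "$dr-out.txt", with the two sed conversions between local and GCS names left as they are.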

Licensed under: CC-BY-SA with attribution