Question

I'm using Python to parse a WordPress site downloaded via wget. All the HTML files are nested inside a complicated folder structure (thanks to WordPress and its long URLs), like site_dump/2010/03/11/post-title/index.html.

However, within the post-title directory there are other directories for the feed and for Google News-esque number-based indexes:

site_dump/2010/03/11/post-title/index.html  # I want this
site_dump/2010/03/11/post-title/feed/index.html  # Not these
site_dump/2010/03/11/post-title/115232/site.com/2010/03/11/post-title/index.html

I only want to access the index.html files that are at the 5th nested level (site_dump/2010/03/11/post-title/index.html), and not beyond. Right now I split the root variable by a slash (/) in the os.walk loop and only deal with the file if it is inside 5 levels of folders:

import os

for root, dirs, files in os.walk('site_dump'):
  nested_levels = root.split('/')
  if len(nested_levels) == 5:
    print(nested_levels)  # Eventually do stuff with the file here

However, this seems kind of inefficient, since os.walk is still traversing those really deep folders. Is there a way to limit how deep os.walk goes when traversing a directory tree?


Solution

You can modify dirs in place to prevent further traversal into the directory structure.

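import os

for root, dirs, files in os.walk('site_dump'):
  nested_levels = root.split(os.sep)  # os.sep is '/' on Unix and '\\' on Windows
  if len(nested_levels) == 5:
    del dirs[:]  # Empty the list in place so os.walk goes no deeper from here
    # Eventually do stuff with the file here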

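del dirs[:] empties the existing list rather than rebinding dirs to a new list. Modifying the list in place is essential: os.walk keeps a reference to that same list and reads it to decide which subdirectories to visit next, so an assignment such as dirs = [] would have no effect on the traversal.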
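
To see why the in-place part matters, here is a small sketch that compares the two forms on a throwaway directory tree. The helper name visited_dirs and its prune_in_place flag are made up for this illustration; they are not part of os.walk.

```python
import os
import tempfile


def visited_dirs(top, prune_in_place):
    """Count the directories os.walk visits under top.

    Hypothetical helper for illustration: it tries to stop the walk at
    top itself, either by emptying dirs in place or by rebinding the name.
    """
    count = 0
    for root, dirs, files in os.walk(top):
        count += 1
        if prune_in_place:
            del dirs[:]  # same list object os.walk holds, so the walk stops
        else:
            dirs = []    # only rebinds the local name; os.walk never sees it
    return count


with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, 'a', 'b'))
    print(visited_dirs(top, prune_in_place=True))   # only top is visited
    print(visited_dirs(top, prune_in_place=False))  # top, a and a/b are visited
```

Slice assignment (dirs[:] = []) would behave like del dirs[:] here, since it also changes the existing list instead of creating a new one.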

From the docs (topdown is an optional parameter of os.walk that you omitted; it defaults to True):

When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again. Modifying dirnames when topdown is False is ineffective, because in bottom-up mode the directories in dirnames are generated before dirpath itself is generated.
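
Building on the documented behaviour, the depth check can be wrapped in a small generator. This is only a sketch: walk_limited and max_depth are hypothetical names, and depth is measured relative to the starting directory with os.path.relpath instead of splitting the full path, on the assumption that the starting path may itself contain separators.

```python
import os


def walk_limited(top, max_depth):
    """Yield (root, dirs, files) like os.walk, going at most max_depth levels below top.

    Hypothetical helper; max_depth=0 yields only top itself.
    """
    for root, dirs, files in os.walk(top):  # topdown defaults to True
        rel = os.path.relpath(root, top)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        yield root, dirs, files
        if depth >= max_depth:
            # Prune after the caller has seen this level; os.walk reads
            # this same list when it resumes, so it descends no further.
            del dirs[:]
```

For the layout in the question, the post directories sit four levels below site_dump, so walk_limited('site_dump', 4) should reach them without entering feed/ or the numbered directories. It still yields the shallower directories on the way down, so the caller needs its own check to pick out only the post-level index.html files.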

Licensed under: CC-BY-SA with attribution