Searching for filename potentially deep in nested archives of heterogeneous types

Gamall · November 27, 2023, 11:08am

Hello,

I’m looking for old files that may be in a .rar in a .zip in a .7z…

I know the filename.

ugrep can search patterns https://github.com/Genivia/ugrep inside files inside archives, but I just want to search for the filename, not the contents, which ugrep doesn’t do, so far as I can see.

Basically I need a version of find or fd with the same kind of options regarding archives as ugrep does.

Is there such a thing?

DromundKaas · November 27, 2023, 11:26am

If you’re looking for a solution you can implement on your own, my approach would be something like this:

Read from a find command all files matching in subdirectories
Depending on the extension run an uncompressor with a list files option

Then search if any of those uncompressor listed files matches. It’s kind of the Unix way. String together small tools that each implement one step very well.

I’d probably use Python or Perl.
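For illustration, here is a minimal Python sketch of those steps, not a tested tool. It assumes `.zip` files can be listed with the standard library and that a `7z` binary on PATH handles `.7z` and `.rar` through the same `7z l` call used elsewhere in this thread. The function names `list_lines` and `find_in_archives` are made up for this example.

```python
import os
import subprocess
import zipfile


def list_lines(path):
    """Return one text line per archive member (raw tool output for 7z/rar)."""
    lower = path.lower()
    if lower.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            return zf.namelist()
    if lower.endswith((".7z", ".rar")):
        # Assumption: 7z is installed. It is called once per archive because,
        # as far as I know, 7z reads arguments after the archive name as
        # member-name filters rather than as further archives.
        result = subprocess.run(["7z", "l", path], capture_output=True, text=True)
        return result.stdout.splitlines()
    return []  # not a format this sketch knows how to list


def find_in_archives(root, wanted):
    """Walk `root` and return (archive_path, listing_line) pairs containing `wanted`."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                lines = list_lines(path)
            except (OSError, zipfile.BadZipFile):
                continue  # unreadable file, corrupt zip, or missing 7z binary
            hits.extend((path, line) for line in lines if wanted in line)
    return hits
```

Calling `find_in_archives("/path/to/search", "your_filename")` returns the archive path next to each matching line, which a plain `find ... | grep` pipeline does not show. Like the shell versions, it only looks one level deep.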

BS86 · November 27, 2023, 11:37am

I fed your post to ChatGPT and this was its answer. Untested, just fyi, might give you some hints:

As of my last knowledge update in January 2022, there might not be a single tool that combines the file search capabilities of find or fd with the archive inspection capabilities of ugrep. However, you can create a combination of tools to achieve what you’re looking for.

One way to approach this is to use find or fd to locate files with a specific name and then use a tool like 7z or unzip along with grep to search within archives. Here’s an example using find and grep:

find /path/to/search -name "*.zip" -exec unzip -l {} \; | grep "your_filename"
find /path/to/search -name "*.rar" -exec unrar l {} \; | grep "your_filename"
find /path/to/search -name "*.7z" -exec 7z l {} \; | grep "your_filename"
find /path/to/search -name "your_filename" # for searching regular files

Explanation:

Use find to locate files with specific extensions (*.zip, *.rar, *.7z).
For each found archive, use unzip -l, unrar l, or 7z l to list the contents of the archive.
Pipe the output to grep to search for your specific filename.

This approach allows you to search for filenames within archives. Keep in mind that this may not be as efficient as a tool specifically designed for this purpose, but it can get the job done using common command-line tools.

Additionally, since my knowledge is based on information available up to January 2022, there might be new tools or updates that provide more integrated solutions for searching within archives. I recommend checking the documentation or release notes of relevant tools for the latest features and options.

emk2203 · November 27, 2023, 11:51am

Nice approach.

You could do everything in one line with a broader find and only using 7z:

find /path/to/search \( -iname "*.7z" -o -iname "*.rar" -o -iname "*.zip" \) -exec 7z l {} + | grep "your_filename"

Using + instead of \; is faster by a factor of 3 for me.

Gamall · November 27, 2023, 1:08pm

In summary, the answer to my, very specific, question (“is there such a tool?”) is “No, not to our knowledge”.

Which is fine.

------- Note about the scope of the question, as I intended it ----------------

A tool implementing what I described in a sane, fast, and reliable way would be much more complex than the proposed scripts; the recursive/deeply nested aspect – which I tried to emphasise in the OP, because that’s why I asked the question at all – is not at all addressed.

You need more than grepping the file list from all archives, which is trivial – and which I did before asking this question (with fd instead of find, same thing) – you need to identify archives therein, extract them, and repeat recursively. Given large archives, the memory management would not be trivial at all if you want your search to be done in parallel. Things get more difficult if you need to rely on magic numbers rather than extensions to identify archives. Then you need to extract everything (or just the first few bytes of each file if the archive format supports it). Whenever you extract anything, you need to be mindful of where you put it (tmpfs, preferably), lest you exceed the capacity of the current drive / wear out an SSD, etc, because you may need to extract GiBs of data. Those are just the obvious things if you want the tool to actually work in practice.
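To make the magic-number point concrete, here is a small Python sketch that identifies a few formats from their leading bytes instead of their extension. The signature table is deliberately incomplete, and `sniff` and `sniff_file` are hypothetical names for this example.

```python
# Leading bytes ("magic numbers") of some common archive/compression formats.
# Not covered here: tar (its "ustar" marker sits at offset 257, not at the
# start) and empty zip files (which start with PK\x05\x06).
MAGIC = [
    (b"PK\x03\x04", "zip"),
    (b"Rar!\x1a\x07", "rar"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
]


def sniff(head):
    """Return a format name for the given leading bytes, or None if unknown."""
    for magic, kind in MAGIC:
        if head.startswith(magic):
            return kind
    return None


def sniff_file(path):
    """Identify a file on disk by reading only its first 8 bytes."""
    with open(path, "rb") as fh:
        return sniff(fh.read(8))
```

Only the first few bytes of each candidate are needed, but for a file stored inside another archive those bytes still have to be decompressed first, which is where the extraction cost comes from.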

It’s a specialised problem that calls for a specialised tool with some actual thought and engineering put into it.

The fact that few search / grepping tools even attempt to support that kind of thing (ugrep is the only one that does it, to my knowledge) is an indication that it’s probably not trivial to get right.

At some point, I’ll probably patch together a kludge in Python to see whether I get lucky (no parallel search, no magic numbers).
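A kludge along those lines could start from something like the following sketch. It is only an illustration under strong assumptions: it handles zip and tar variants through the Python standard library, detects nested archives by extension, runs sequentially, and keeps each nested archive fully in memory (so it does nothing about the memory and tmpfs concerns above). rar and 7z would need an external tool or a third-party module and are not covered. The names `open_archive` and `search` are invented for this example.

```python
import fnmatch
import io
import posixpath
import tarfile
import zipfile

TAR_SUFFIXES = (".tar", ".tgz", ".tar.gz", ".tar.bz2", ".tar.xz")
ARCHIVE_SUFFIXES = (".zip",) + TAR_SUFFIXES


def open_archive(data, name):
    """Return (member_names, read_member) for archive bytes, or None if unusable."""
    lower = name.lower()
    try:
        if lower.endswith(".zip"):
            zf = zipfile.ZipFile(io.BytesIO(data))
            names = [i.filename for i in zf.infolist() if not i.is_dir()]
            return names, zf.read
        if lower.endswith(TAR_SUFFIXES):
            # Default mode detects gzip/bzip2/xz compression automatically.
            tf = tarfile.open(fileobj=io.BytesIO(data))
            names = [m.name for m in tf.getmembers() if m.isfile()]
            return names, lambda member: tf.extractfile(member).read()
    except (zipfile.BadZipFile, tarfile.TarError):
        pass  # corrupt or mislabeled archive: skip it
    return None


def search(data, name, wanted, trail=(), max_depth=5):
    """Yield 'outer!inner!member' paths whose basename matches the glob `wanted`.

    `max_depth` is how many nesting levels below `name` may be opened.
    """
    opened = open_archive(data, name)
    if opened is None:
        return
    names, read_member = opened
    trail = trail + (name,)
    for member in names:
        if fnmatch.fnmatchcase(posixpath.basename(member), wanted):
            yield "!".join(trail + (member,))
        if max_depth > 0 and member.lower().endswith(ARCHIVE_SUFFIXES):
            # The nested archive is decompressed into memory here.
            inner = read_member(member)
            yield from search(inner, member, wanted, trail, max_depth - 1)
```

To try it on a file on disk, read the file's bytes and call `list(search(data, path, "your_filename"))`. Each result shows the chain of archives leading to the match, separated by `!`.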

In the meantime, for my immediate need, I’ll just email the original author of the files I’m looking for, and mark this as “solved”

BluishHumility · November 28, 2023, 5:03pm

Sorry I’m a little late to this thread.

ugrep does support searching by filename, with the -g flag. See here: https://github.com/Genivia/ugrep#archives

Searching compressed files and archives with -z

-z, --decompress
        Search compressed files and archives.  Archives (.cpio, .pax, .tar)
        and compressed archives (e.g. .zip, .taz, .tgz, .tpz, .tbz, .tbz2,
        .tb2, .tz2, .tlz, .txz, .tzst) are searched and matching pathnames
        of files in archives are output in braces.  When used with option
        --zmax=NUM, searches the contents of compressed files and archives
        stored within archives up to NUM levels.  If -g, -O, -M, or -t is
        specified, searches files stored in archives whose filenames match
        globs, match filename extensions, match file signature magic bytes,
        or match file types, respectively.

If you read through the examples it shows a few different ways the -g flag can be used. Plain old “by filename” is definitely supported.

Here is the full description of the flag:

-g GLOBS, --glob=GLOBS
        Search only files whose name matches the specified comma-separated
        list of GLOBS, same as --include='glob' for each `glob' in GLOBS.
        When a `glob' is preceded by a `!' or a `^', skip files whose name
        matches `glob', same as --exclude='glob'.  When `glob' contains a
        `/', full pathnames are matched.  Otherwise basenames are matched.
        When `glob' ends with a `/', directories are matched, same as
        --include-dir='glob' and --exclude-dir='glob'.  A leading `/'
        matches the working directory.  This option may be repeated and may
        be combined with options -M, -O and -t to expand searches.  See
        `ugrep --help globs' and `man ugrep' section GLOBBING for details.

I hope that helps, happy searching.

Gamall · November 28, 2023, 5:28pm

Thanks, I kinda dismissed ugrep entirely because I obsessed over the fact that it looks for a pattern inside the files, and is not meant as a replacement to find, but you can pass an empty pattern or --match to ignore that aspect completely (I wonder whether it still extracts the contents in that case, but that’s not a big deal if so).

Making ugrep work for this is probably a better idea than cobbling together a find-based script. I’ll try that.

Gamall · November 28, 2023, 5:41pm

ugrep has a fatal flaw, though:

it does not support 7z or rar:

Supported compression formats: gzip (.gz), compress (.Z), zip, bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2), lzma and xz (requires suffix .lzma, .tlz, .xz, .txz), lz4 (requires suffix .lz4), zstd (requires suffix .zst, .zstd, .tzst), brotli (requires suffix .br), bzip3 (requires suffix .bz3).

That may be the real reason why I dismissed ugrep when I looked at it seriously (that was a while ago); most of my old archives are rar, the newer ones are 7z, so… it may implement the logic needed to do the task, but not the support for the required formats.

BluishHumility · November 28, 2023, 9:26pm

Ah, too bad. Yes, I do see there is some kind of issue with 7z support:

I don’t really understand the workaround mentioned toward the end of the thread, to be honest. I am not sure what that person is doing with the command they mentioned.

Anyway, sorry I was not able to be more helpful. Good luck!

Gamall · November 28, 2023, 10:45pm

(I don’t know what key I pressed but the post went out in the middle of typing)

ugrep --filter='7z:7z x -so %' pattern filename.7z

--filter transforms files before searching for the pattern; the typical example is extracting the text from a pdf, with: --filter='pdf:pdftotext % -'. The argument expects a Unix filter (stdin to stdout).

In that case, 7z x -so % extracts everything, and -so sends all that to stdout.

So, using this, you can find patterns in files in 7z archives.

This is fine, but does not replace 7z support for the purpose of recursive search, because it’s non-recursive. Actually it doesn’t do anything with file names either. Of course you could use 7z l -so instead for that; still non-recursive. Also won’t work with globbing, because that counts as file contents as far as ugrep is concerned, not file names.

It was helpful, in that it forced me to take a better look at ugrep (I had forgotten the real reasons why it wasn’t suitable for that task). I’m sure it will come in handy for something else, someday…

Cheers
